, which points into a contiguous subset of a Java array.

Deterministic Parallel Java. Fig. Code using an index-parameterized array

<region P>void quicksort(DPJArray<P> A) writes P:* {
    /* Ordinary quicksort partition */
    int p = quicksortPartition(A);
    /* Split array into two disjoint pieces */
    final DPJPartition<P> segs = new DPJPartition<P>(A, p);
    cobegin {
        /* writes segs:[0]:* */
        quicksort(segs.get(0));
        /* writes segs:[1]:* */
        quicksort(segs.get(1));
    }
}

Deterministic Parallel Java. Fig. Parallel quicksort

Deterministic Parallel Java. Fig. Runtime representation of the index-parameterized array

Deterministic Parallel Java. Fig. Runtime representation of the dynamically partitioned array segs

In lines –, the array A is
split at index p, creating a DPJPartition object holding references to two disjoint subarrays. The cobegin in lines – calls quicksort recursively on these two subarrays. Figure shows what this partitioning looks like at runtime, for the root of the tree and its children. A
DPJPartition object referred to by the final local variable segs points to two DPJArray objects, each
of which points to a segment of the original array. The DPJArray objects are instantiated with owner RPLs segs:[0]:* and segs:[1]:* in their types. An owner RPL is like an RPL, except that it begins with a final local reference variable instead of Root. This allows different partitions of the same array to be represented. Again because of the tree structure of DPJ regions, segs:[0]:* and segs:[1]:* are disjoint region sets, and so the effects in lines and are noninterfering. Also, as in ownership type systems, region segs is under the region P bound to the first parameter of its type, so the effect summary writes P:* in line covers the effects of the method body.
Commutativity Annotations Sometimes, a parallel computation is deterministic even if it has interfering reads and writes. For example, two inserts to a concurrent set can go in either order and preserve determinism, even though there are interfering writes to the set object. DPJ supports this kind of computation with a commutativity annotation on methods (typically API methods for a concurrent data structure).
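For example, a minimal sketch of such an interface, using DPJ's region parameters, method effect summaries, and the commutative qualifier (the exact signatures are illustrative assumptions rather than code taken from a released library), might read:

/* Sketch only: a set whose add method is declared commutative.
   The region parameter R and the effect summaries follow the DPJ
   constructs described in this entry; exact syntax may differ. */
interface Set<region R> {
    commutative void add(int e) writes R;
    int size() reads R;
}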
This annotation is provided by a library or framework programmer and trusted by the compiler (it is not checked by the DPJ effect system). It means that cobegin { add(e1); add(e2); } is equivalent to doing the add operations in sequence (isolation) and that both sequence orders are equivalent (commutativity). When the compiler encounters a method invocation, it generates an invocation effect that records both the method that was invoked and the read-write effects of that method:

foreach (int i in 0, n) {
    /* invokes Set.add with writes R */
    set.add(A[i]);
}
The compiler uses the commutativity annotations to check interference of effects. For example, the effect invokes Set.add with writes R is noninterfering with itself (because add is declared commutative). However, the same effect interferes with invokes Set.size with reads R.
Expressivity and Limitations DPJ can express a wide variety of algorithms that use a fork-join style of parallelism and access data through a combination of (1) read-only sharing; (2) disjoint updates to arrays and trees; (3) commutative operations on concurrent data structures like sets and maps; and (4) accesses to task-private heap objects. DPJ has been used to write the following parallel algorithms, among others: IDEA encryption, sorting, k-means clustering, an n-body force computation, a collision detection algorithm from a game engine, and a Monte Carlo financial simulation. The main advantages of the DPJ approach are (1) a compile-time guarantee of determinism in all executions; and (2) no overhead from checking determinism at runtime, as the region information is erased before parallel code generation. Further, the effect annotations provide checkable documentation and enable modular analysis, and the defaults allow incremental transformation of sequential to parallel programs. DPJ does have some limitations. One limitation, mentioned above, is that currently only fork-join parallelism can be expressed. A second limitation is that
to ensure soundness, the language disallows some patterns that are in fact deterministic, but cannot be proved deterministic by the type system. One example is that the type system disallows reshuffling of the elements in an index-parameterized array. This limitation poses a problem for the Barnes-Hut n-body force computation, as rearranging the array of bodies is required for good locality. A workaround for this limitation is to recopy the bodies on reinsertion into the array, but this is somewhat clumsy and could degrade performance. A third limitation is that some programmer effort is required to write the DPJ annotations, although this effort is repaid by the benefits discussed above.
Ongoing Work The foregoing describes the state of the public release of DPJ as of October . The following subsections describe ongoing work that is expected to be part of future releases.
Controlled Nondeterminism DPJ is being extended to support controlled nondeterminism [], for algorithms such as branch and bound search, where several correct answers are equally acceptable, and it would unduly constrain the schedule to require the same answer on every execution. The key new language features for controlled nondeterminism are (1) foreach_nd and cobegin_nd statements that operate like foreach and cobegin, except they allow interference among their parallel tasks; (2) atomic statements supported by software transactional memory (STM); and (3) atomic regions and atomic effects for reasoning about which memory regions may have interfering effects, and where effects occur inside atomic statements. Together, these features provide several strong guarantees for nondeterministic programs. First, the extended language is deterministic by default: the program is guaranteed to be deterministic unless the programmer explicitly introduces nondeterminism with foreach_nd or cobegin_nd. Second, the extended language provides both strong isolation (i.e., the program behaves as if it is a sequential interleaving of atomic statements and unguarded memory accesses) and data race freedom. This is true even if
the underlying STM provides only weak isolation (i.e., it allows interleavings between unguarded accesses and atomic statements). Third, foreach and cobegin are guaranteed to behave as isolated statements (as if they are enclosed in an atomic statement), even inside a foreach_nd or cobegin_nd. Finally, the extended type and effect system allows the compiler to boost the STM performance by removing unnecessary synchronization for memory accesses that can never cause interference.
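As an illustrative sketch (the region and field declarations follow base DPJ syntax, and foreach_nd and atomic are used as named above; the exact syntax of the extension is an assumption here), a nondeterministic search that publishes a best bound might be written as:

region BestR;                      /* region holding the shared bound */
int best in BestR;                 /* updates to this field may interfere */

void explore(int n) {
    foreach_nd (int i in 0, n) {   /* parallel tasks are allowed to interfere */
        int bound = evaluate(i);   /* assumed pure: no heap effects */
        atomic {                   /* writes to BestR are isolated by the STM */
            if (bound < best) best = bound;
        }
    }
}

Which task performs the final write may differ from run to run, but every execution is an interleaving of isolated atomic updates, which is exactly the controlled form of nondeterminism described above.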
Object-Oriented Parallel Frameworks DPJ is being extended to support object-oriented parallel frameworks. Current support focuses on two kinds of frameworks: (1) collection frameworks like Java’s ParallelArray [] that provide parallel operations such as map, reduce, filter, and scan on their elements via user-defined methods; and (2) a framework for pipeline parallelism, similar to the pipeline framework in Intel’s Threading Building Blocks []. The key challenge is to prevent the user’s methods from causing interference when invoked by the parallel framework. For example, a user-defined map function must not do an unsynchronized write to a global variable. The OO framework support incorporates several extensions to the DPJ type and effect system for expressing generic types and effects in framework APIs, with appropriate constraints on the effects of user-defined methods. Frameworks represent one possible solution to the problem of array reshuffling and other operations that the type system cannot express. Such operations can be encapsulated in a framework. The type and effect system can then check the uses of the framework, while the framework internals are checked by more flexible (but more complex and/or weaker) methods such as program logic or testing. This approach separates concerns between framework designer and user, and it fosters modular checking. It also enables different forms of verification with different trade-offs in complexity and power to work together. In addition, frameworks provide a natural way to extend the fork-join model of DPJ by adding other parallel control abstractions, including finer-grain synchronization (e.g., pipelined loops) or domain-specific abstractions (e.g., kd-tree querying and sparse matrix computations).
Inferring DPJ Annotations Research is underway to develop an interactive tool that helps the user write DPJ programs by partially inferring annotations. A tool called DPJizer has been developed [] that uses interprocedural constraint solving to infer method effect summaries for DPJ programs annotated with the other type, region, and effect information. DPJizer simplifies DPJ development by automating one of the more tedious, yet straightforward, aspects of writing DPJ annotations. The tool also reports the effect of each program statement, helping the programmer to isolate and eliminate effects that are causing unwanted interference. The next step is to infer region information, given a program annotated with parallel constructs. This is a more challenging goal, because there are many more degrees of freedom. The current strategy is to have the programmer provide partial information (e.g., partition the heap into regions and put in the parallel constructs), and then have the tool infer types and effects that guarantee noninterference.
Runtime Checks Research is underway to add runtime checking as a complementary mode for guaranteeing determinism, in addition to the compile-time type and effect checking. The advantage of runtime checking is that it is more precise. The disadvantages are (1) it adds overhead, so it is probably useful only for testing and debugging; and (2) it provides only a fail-stop check, instead of a compile-time guarantee of correctness. One place where runtime checking could be particularly useful is in relaxing the prohibition on inconsistent type assignments (e.g., in reshuffling an index-parameterized array). In particular, if the runtime can guarantee that a reference is unique, then a safe cast to a different type can be performed at runtime. Using both static and runtime checks, and providing good integration between the two, would offer the user a powerful set of options for managing the trade-offs among the expressivity, performance, and correctness guarantees of the different approaches.
Related Entries
Determinacy
Formal Methods–Based Tools for Race, Deadlock, and Other Errors
Transactional Memories
Bibliographic Notes and Further Reading For further information on the DPJ project and its goals, see the DPJ web site [] and the position paper presented at HotPar []. The primary reference for the DPJ type and effect system is the paper presented at OOPSLA []. A paper on the nondeterminism work will appear at POPL []. The Ph.D. thesis of Robert Bocchino [] presents the full language, together with a formal core language and effect system, soundness results for the core system, and proofs. A paper presented at ASE [] describes the work on inferring method effect summaries. The main influence on DPJ has been the prior work on type and effect systems, including FX [] and the various effect systems based on ownership types [, ]. Other influences include work on design-by-contract for framework APIs [], work on type and effect inference [], and the vast body of literature on transactional memory [].
Bibliography
. http://gee.cs.oswego.edu/dl/jsr/dist/extraydocs/index.html?extray/package-tree.html
. http://dpj.cs.uiuc.edu
. Bocchino RL Jr () An effect system and language for deterministic-by-default parallel programming. PhD thesis, University of Illinois, Urbana-Champaign
. Bocchino RL Jr, Adve VS, Adve SV, Snir M () Parallel programming must be deterministic by default. In: HOTPAR ’: USENIX workshop on hot topics in parallelism, Berkeley, March
. Bocchino RL Jr, Adve VS, Dig D, Adve SV, Heumann S, Komuravelli R, Overbey J, Simmons P, Sung H, Vakilian M () A type and effect system for deterministic parallel Java. In: OOPSLA ’: Proceedings of the th ACM SIGPLAN conference on object-oriented programming systems, languages, and applications, New York, pp –
. Bocchino RL Jr, Heumann S, Honarmand N, Adve S, Adve V, Welc A, Shpeisman T, Ni Y () Safe nondeterminism in a deterministic-by-default parallel language. In: POPL ’: Proceedings of the th ACM SIGACT-SIGPLAN symposium on principles of programming languages, New York
. Cameron NR, Drossopoulou S, Noble J, Smith MJ () Multiple ownership. In: OOPSLA ’: Proceedings of the nd ACM SIGPLAN conference on object-oriented programming systems and applications, New York, pp –
. Clarke D, Drossopoulou S () Ownership, encapsulation and the disjointness of type and effect. In: OOPSLA ’: Proceedings of the th ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications, New York, pp –
. Harris T, Larus J, Rajwar R () Transactional memory (Synthesis lectures on computer architecture), nd edn. Morgan & Claypool Publishers, San Rafael
. Lucassen JM, Gifford DK () Polymorphic effect systems. In: POPL ’: Proceedings of the th ACM SIGPLAN-SIGACT symposium on principles of programming languages, New York, pp –
. Reinders J () Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O’Reilly Media, Sebastopol
. Talpin J-P, Jouvelot P () Polymorphic type, region and effect inference. J Funct Program :–
. Thomas P, Weedon R () Object-oriented programming in Eiffel, nd edn. Addison-Wesley Longman Publishing Co., Inc., Boston
. Vakilian M, Dig D, Bocchino R, Overbey J, Adve V, Johnson R () Inferring method effect summaries for nested heap regions. In: ASE ’: Proceedings of the th IEEE/ACM International Conference on Automated Software Engineering, Auckland, pp –
Direct Schemes Dense Linear System Solvers
Distributed Computer Clusters Distributed-Memory Multiprocessor Hypercubes and Meshes
Distributed Hash Table (DHT) Peer-to-Peer
Distributed Logic Languages Logic Languages
Distributed Memory Computers Clusters Distributed-Memory Multiprocessor Hypercubes and Meshes
Distributed Process Management Single System Image
Distributed Switched Networks Networks, Direct Networks, Multistage
Distributed-Memory Multiprocessor Marc Snir University of Illinois at Urbana-Champaign, Urbana, IL, USA
Synonyms Distributed computer; Distributed Memory Computers; Massively parallel processor (MPP); Multicomputers; Multiprocessors; Server farm
Definition A computer system consisting of a multiplicity of processors, each with its own local memory, connected via a network. Load or store instructions issued by a processor can only address the local memory, and different mechanisms are provided for global communication.
Discussion Terminology A Distributed-Memory Multiprocessor (DMM) is built by connecting nodes, which consist of uniprocessors or of shared memory multiprocessors (SMPs), via a network, also called Interconnection Network (IN) or Switch. While the terminology is fuzzy, Cluster generally refers to a DMM mostly built of commodity components, while Massively Parallel Processor (MPP) generally refers to a DMM built of more specialized components that can scale to a larger number of nodes and is used for large, compute-intensive tasks – in particular, in scientific computing. A server farm is a cluster
used as a server, in order to provide more performance or better cost-performance than a shared memory multiprocessor server. A high-availability (HA) cluster is a cluster with software and possibly hardware that supports fast failover so as to tolerate component failures. Finally, a cloud is a (generally very large) cluster that supports dynamic provisioning of computing resources, usually virtualized, to remote users, via the Internet.
Historical Background High-performance computing was focused until the late s on SMPs and vector computers built of bipolar gates. The fast evolution of VLSI technology indicated the possibility of achieving much better cost-performance by assembling a larger number of less performing, but much cheaper, microprocessors. The node Cosmic Cube built at Caltech in the early s was an early demonstration of this technology []. Several companies started manufacturing such systems in the mid-s; a , node nCUBE system was installed at Sandia National Lab in the late s and a Transputer node system was running at Edinburgh in . The success of these systems, which was documented in the influential talk "The Attack of the Killer Micros" by Eugene D. Brooks III at the Supercomputing Conference, led to a flurry of commercial and governmental activities. Companies such as Thinking Machines (CM-), Intel (iPSC, Paragon), and Meiko (CS-) started offering Distributed-Memory Multiprocessor (DMM) products in the mid-s, followed by Fujitsu (AP), IBM (SP, SP), and Cray (TD, TE) in the early s. DMMs have dominated the TOP list of fastest supercomputers since the early s. The use of clusters evolved in parallel, and largely independently, in the commercial world, driven by several trends: Clusters were used in the early s for server consolidation, replacing multiple, distinct servers with one cluster, thus reducing operating costs. The explosive growth of the Internet in the late s led to a need for large clusters with a small footprint (to fit in downtown buildings) that can be quickly expanded, to serve Internet companies. Web requests can easily be served on a DMM, as each TCP/IP request is a small, independent action and request failures can be tolerated.
DMMs were also used to extend SMP architectures beyond the point where hardware could conveniently support coherent shared memory. Thus, IBM introduced in the parallel sysplex architecture that allowed coupling of multiple S/ mainframes into a larger cluster. The use of such clusters became more widespread as software vendors developed distributed computing firmware and adapted applications to run on clusters. Clusters have become increasingly used since the late s for providing fault tolerance. Rather than using very specialized hardware, such as in the early Tandem computers, High Availability (HA) Clusters provide fault tolerance through hardware replication that eliminates single points of failure, and suitable software. Many vendors (IBM, Microsoft, HP, Sun, etc.) offer HA cluster software solutions. Utility computing, namely the provisioning of computing resources (CPU time and storage) using a pay-per-use model, has been pursued by multiple companies over decades. Changes in technology (low-cost, high-bandwidth networking, decline in hardware cost and relative increase in operation cost, increased use of off-the-shelf software and improved support for hardware virtualization) and in market needs (very rapid and unpredictable growth of the compute needs of Internet companies) have led to the recent success of this model, under the name of cloud computing. Cloud computing platforms are large clusters with software that supports the dynamic provisioning of virtualized compute and storage resources (hardware as a service – HaaS) and, possibly, firmware and application software (software as a service – SaaS).
Architecture At its simplest, a DMM can be assembled by connecting personal computers or workstations with a commercial network such as Ethernet, as shown in Fig. . The Beowulf system built in is an early example []. A more compact and more maintainable cluster can be assembled from rack-mounted servers, as shown in Fig. . In addition to the network that is used for communication between nodes, such a cluster will have a control network that is used for controlling and monitoring each node from a centralized console. The network of a DMM can be (a) a commodity Local Area Network (LAN), such as Ethernet;
Distributed-Memory Multiprocessor. Fig. Beowulf cluster
(b) a higher performance, third-party switched network, such as Infiniband or Myrinet; or (c) a vendor-specific network, such as available on IBM and Cray systems. Communication over Ethernet will typically use the TCP/IP protocols; communication over non-Ethernet fabrics can support lower latency protocols. Adapters for networks of type (a) and (b) will connect to a standard I/O bus, such as PCIe; vendor-specific networks can connect to the (nonstandard) memory bus, thus achieving higher bandwidth and lower latency. The basic communication protocol supported by any of these networks is message passing: Two nodes communicate when one node sends a message to the second node and the second node receives the message. Networks of type (b) and (c) often support remote direct memory access (rDMA): A node can directly access data in the memory of another node, using a get operation, or update the remote memory, using a put operation.
Support for such operations in user space requires an adapter that can translate virtual addresses to physical addresses. Oftentimes, such an adapter will also support read-modify-write operations on operands in remote memory, such as increment; this avoids the need for a round-trip for such operations and facilitates their atomic execution. Networks of type (c) may have hardware support for collective operations, such as barrier, broadcast, and reduce. The topology of the interconnection network impacts its performance: Networks with a low-dimension mesh topology (D or D) can be packaged so that only short wires are needed; this enables the use of cheaper copper cables while maintaining high bandwidth. Radix network topologies, such as butterfly or fat-tree, reduce the average number of hops between nodes, but require longer wires []. Optical cables are needed for high-speed transmission over long distances. Commercial clusters (server farms and clouds) typically use Ethernet networking technology and communicate using TCP/IP protocols. The nodes often share storage, using Storage Area Networks (SAN) or Network Attached Storage (NAS) technologies. Some commercial clusters, such as the IBM Parallel Sysplex, provide clustering hardware that accelerates data sharing and synchronization across nodes.

Distributed-Memory Multiprocessor. Fig. Rack-mounted cluster

Software DMMs reuse, on each node, software designed for node-sized uniprocessors or SMPs. This includes the operating system, programming languages, compilers, libraries, and tools. An additional layer of parallel software is built atop the node software, in order to integrate the multiple nodes into one system. This includes, for compute clusters:
● Software for system initialization, monitoring and control. This enables the control of a large number of nodes from one console; concurrent booting of a large number of nodes; network firmware initialization; error detection; etc.
● Scheduler and resource manager for allocating resources to parallel jobs, loading them and starting them. Parallel jobs often run on dedicated partitions; the scheduler has to allocate nodes in large chunks, while ensuring high node utilization and reducing waiting time for batch jobs. The resource manager has to reduce job-loading time (e.g., by using a parallel tree broadcast algorithm) and ensure that all nodes of a job start correctly.
● Parallel file system. Such a system integrates a large number of disks into one system and provides scalable file coherence protocols. Files are striped across many nodes so as to support parallel access to different parts of the same file and ensure load balancing. Software RAID schemes may be used to enhance the reliability of such a large file system.
● Checkpoint/restart facilities. On large DMMs, the mean time between failures is often shorter than the running time of large jobs. To ensure job completion, jobs periodically checkpoint their state; if a failure occurs, then job execution is restarted from the last checkpoint. A checkpoint facility supports coordinated checkpoint or restart by many nodes.
● Message passing libraries for supporting communication across nodes. MPI is the de facto standard for message passing on large DMMs.
● Tools for debugging and performance monitoring of parallel jobs. These are often built atop node
level debugging and performance monitoring; they enable users to control many nodes in debug mode, and to aggregate information from many nodes in debug or monitoring modes. The use of conventional software on the nodes of DMMs is limiting; in particular, the use of a conventional operating system on each node results in jitter: Background OS activities may slow down nodes at random times; applications where barriers are used frequently can slow down significantly as all threads wait at each barrier for the slowest thread. Some systems, e.g., the IBM Blue Gene system, are using on their compute nodes specialized light-weight kernels (LWK) that provide only critical services and offload to remote server nodes heavier or more asynchronous system services. Commercial clusters use TCP/IP for inter-node communication, and distributed file systems for file sharing. They leverage firmware and application software designed for distributed systems, such as support for remote procedure calls (or remote method invocations), distributed transaction processing monitors, messaging frameworks, and "shared-nothing" databases. A critical component of server farms that handle Web requests is a load balancer that distributes TCP/IP requests coming on one port to distinct nodes. HA clusters have software that monitors the health of the system components and its applications (heartbeat) and reports failures; software to achieve consensus after failures and to initiate recovery actions; and application-specific recovery software to restart each application in a consistent state. Disk state is preserved, either through mirroring, or through the sharing of highly reliable RAID storage via multiple connections. Clouds often use virtual machine software to virtualize processors and storage, thus facilitating their dynamic allocation. They need a resource management infrastructure that can allocate virtual machines on demand.
Research
Early DMMs had directly connected networks; the network topology was part of the programming model. Theoretical research focused on the mapping of parallel algorithms to specific topologies; in particular, hypercubes and related networks (the primary conference on DMMs in the late s was the conference on Hypercube Concurrent Computers and Applications). The abstract model used was of unit time communication on each edge, with communications either being limited to one incident edge per node at a time, or occurring concurrently on all edges incident to a node. For example, the problem of sorting on a hypercube is treated by several tens of publications. As indirect networks became more prevalent, a significant amount of research focused on the topology of routing networks and cost-performance tradeoffs, and routing protocols for different topologies that are efficient and deadlock free, and that can handle failures. Modern DMMs largely hide the network topology from the user and may not provide mechanisms for controlling the mapping of processes and communications onto the network nodes and edges. Therefore, the topology is not part of the programming model. The abstract programming models used to represent such systems assume that communication time between two nodes is a function of the size of the message communicated. Two of the most frequently used models are the postal model [] and the LogP model []. System research on DMMs has focused on communication protocols, operating systems, file systems, programming environments, and applications. Some of the larger efforts in this area are the Network of Workstations (NOW) effort at Berkeley [], and the SHRIMP project at Princeton [].

Related Entries
Checkpointing
Clusters
Connection Machine
Cray TE
Distributed-Memory Multiprocessor
Ethernet
Hypercubes and Meshes
IBM RS/ SP
InfiniBand
Meiko
Moore's Law
MPI (Message Passing Interface)
Myrinet
Network Interfaces
Network of Workstations
Networks, Direct
Networks, Multistage
Parallel Computing
PCI Express
Routing (Including Deadlock Avoidance)
Bibliographic Notes and Further Reading The book of Fox et al. [] covers the early history of MPPs, their hardware and software, and the applications enabled on them. The book of Culler and Singh [] is a good introduction to MPP architectures and their networks, with a detailed description of the Cray TD and IBM SP networks. The book of Pfister [] is a general introduction to clusters, their design, and their use, while the two books of Buyya [, ] provide in-depth coverage of cluster hardware, software, and applications. The book of Dally and Towles [] provides a very good coverage of IN theory and practice. Finally, the book of Leighton [] is an excellent introduction to the theory results that are relevant to DMMs and INs.
Bibliography
. Anderson TE, Culler DE, Patterson DA, the NOW team () A case for NOW (Networks of Workstations). IEEE Micro ():–
. Bar-Noy A, Kipnis S () Designing broadcasting algorithms in the postal model for message-passing systems. In: Proceedings of the fourth annual ACM symposium on parallel algorithms and architectures SPAA ’. San Diego, California, United States, June –July , . ACM, New York, pp –
. Blumrich MA, Li K, Alpert R, Dubnicki C, Felten EW, Sandberg J () Virtual memory mapped network interface for the SHRIMP multicomputer. In: Sohi GS (ed) years of the international symposia on computer architecture (selected papers) (ISCA ’). ACM Press, New York, pp –
. Buyya R () High performance cluster computing: architectures and systems, vol . Prentice-Hall, New Jersey
. Buyya R () High performance cluster computing: programming and applications, vol . Prentice-Hall, New Jersey
. Culler DF, Singh JP () Parallel computer architecture: a hardware/software approach. Morgan Kaufmann, San Francisco
. Culler D, Karp R, Patterson D, Sahay A, Schauser EK, Santos E, Subramonian R, von Eicken T () LogP: towards a realistic model of parallel computation. In: Proceedings of the fourth ACM SIGPLAN symposium on principles and practice of parallel programming, San Diego, California, United States, – May , pp –
. Dally WJ, Towles BP () Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco
. Fox GC, Williams RD, Messina PC () Parallel computing works. Morgan Kaufmann, San Francisco
. Leighton FT () Introduction to parallel algorithms and architectures: arrays, trees, hypercubes. Morgan Kaufmann, San Francisco
. Pfister GF () In search of clusters, nd edn. Prentice Hall, New Jersey
. Sterling T, Becker DJ, Savarese D, Dorband JE, Ranawake UA, Packer CV () BEOWULF: a parallel workstation for scientific computation. In: Proceedings of the th international conference on parallel processing, Oconomowoc, Wisconsin, pp –
Ditonic Sorting Bitonic Sort
DLPAR Dynamic Logical Partitioning for POWER Systems
Doall Loops Loops, Parallel
Domain Decomposition
Thomas George, Vivek Sarin
IBM Research, Delhi, India
Texas A&M University, College Station, TX, USA
Synonyms Functional decomposition; Grid partitioning
Definition Domain decomposition, in the context of parallel computing, refers to partitioning of computational work among multiple processors by distributing the computational domain of a problem, in other words, data associated with the problem. In the scientific computing literature, domain decomposition mainly refers to techniques for solving partial differential equations (PDE) by iteratively solving subproblems corresponding to smaller subdomains. Although the evolution of these
techniques is motivated by PDE-based computational simulations, the general methodology is applicable in a number of scientific domains not dominated by PDEs.
Introduction One of the key steps in parallel computing is partitioning the problem to be solved into multiple smaller components that can be addressed simultaneously by separate processors with minimal communication. There are two main approaches for dividing the computational work. The first one is domain decomposition, which involves partitioning the data associated with the problem into smaller chunks and each parallel processor works on a portion of the data. The second approach, functional decomposition, on the other hand, focuses on the computational steps rather than the data and involves processors executing different independent portions of the same program concurrently. Most parallel program designs involve a combination of these two complementary approaches. In this entry, we focus on domain decomposition techniques, which are based on data parallelism. Figure shows a -D finite element mesh model of the human heart [] used for simulation of the human cardiovascular system and in particular, the flow of blood through the various chambers and valves (http://www.andrew.cmu.edu/user/jessicaz/ medical_data/Heart_Valve_new.htm). Computation is performed repeatedly for each point on the mesh using Darcy’s law [], which describes the flow of a fluid through a porous medium and is widely used in blood flow simulations. Figure also illustrates decomposition of this mesh model into subdomains (indicated by different colors) that are relatively easier to model due to their simpler geometry. Domain decomposition is vital for high-performance scientific computing. In addition to providing enhanced parallelism, it also offers a practical framework for solving computational problems over complex, heterogeneous and irregular domains by decomposing them into homogeneous subdomains with regular geometries. The subdomains are each modeled separately using faster, simpler computational algorithms, and the solutions are then carefully merged to ensure consistency at the boundary cases. Such decomposition is also useful when the modeling equations are different across subdomains as in the case of singularities and
Domain Decomposition. Fig. Tetrahedral meshing of the heart. The figure on the right shows the mesh partitioned into subdomains indicated with different colors
anomalous regions. Lastly, it is also useful for developing out-of-core (or external memory) solution techniques for large computational problems with billions of unknowns by facilitating the partitioning of the problem domain into smaller chunks, each of which can fit into main memory.
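As a minimal, runnable illustration of the data-parallel view described in the Introduction (plain Java with hypothetical names; it is not tied to any particular solver), the index range of an array can be split into chunks that are processed by independent tasks:

import java.util.Arrays;
import java.util.stream.IntStream;

/** Toy domain (data) decomposition: each chunk of the index range is
 *  summed by a separate parallel task and the partial results are combined. */
public class DataDecompositionDemo {
    public static void main(String[] args) {
        final int n = 1_000_000, chunks = 8;
        final double[] u = new double[n];
        Arrays.setAll(u, i -> Math.sin(i * 1e-3));

        double total = IntStream.range(0, chunks).parallel()
                .mapToDouble(c -> {
                    int lo = (int) ((long) c * n / chunks);       // chunk c owns [lo, hi)
                    int hi = (int) ((long) (c + 1) * n / chunks);
                    double s = 0.0;
                    for (int i = lo; i < hi; i++) s += u[i];
                    return s;
                })
                .sum();

        System.out.println("sum = " + total);
    }
}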
Domain Decomposition for PDEs Scientific computing applications such as weather modeling, heat transfer, and fluid mechanics involve modeling complex physical systems via partial differential equations. Though there exist many different classes of PDEs, the most widely used ones are second-order elliptic equations, for example, Poisson's equation, which manifests in many different forms in a number of applications. For example, the Fourier law for thermal conduction, Darcy's law for flows in porous media, and Ohm's law for electrical networks all arise as special cases of Poisson's equation. Example . For the sake of illustration, we consider a problem that requires solving the Poisson equation over a complex domain $\Omega$ with $\Gamma$ denoting the boundary: $\nabla^2 u = f$ in $\Omega$, $u = u_\Gamma$ on $\Gamma$.
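For concreteness, one standard choice of discretization (assumed here for illustration; the entry does not prescribe a particular scheme) is the five-point finite-difference approximation on a uniform grid with spacing $h$:

$$\nabla^2 u(x_i, y_j) \approx \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j}}{h^2}.$$

Imposing this relation at every interior grid point, with known boundary values moved to the right-hand side, produces the sparse linear system $Au = f$ discussed next.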
In practice, the domain is discretized using either finite difference or finite element methods, which yields a large sparse symmetric positive definite linear system, $Au = f$, where $A$ is determined by the discretization and $u$ is a vector of unknowns. Domain decomposition for this problem involves partitioning the domain $\Omega$ into subdomains $\{\Omega_i\}_{i=1}^{s}$, which may be overlapping or nonoverlapping depending on the nature of the problem. The partitioning into subdomains is sometimes referred to as the coarse grid, with the original discretization being described as the fine grid. Given a discretized computational problem as in the example, there are four main steps that need to be addressed in order to obtain an efficient parallel solution.
● The first step is to identify the preferred characteristics of subdomains for the given problem. This includes choosing the number of subdomains, the type of partitioning (element/edge/vertex-based), and the level of overlap between the subdomains (with disjoint subdomains being the extreme case). Each of these choices depends on a variety of factors such as size, type, and geometry of the problem domain, the number of parallel processors, and the PDE/linear system being solved.
● The next step is to partition the problem domain into the desired subdomains. This can be achieved by using a variety of geometric and graph partitioning algorithms. For certain applications involving adaptive mesh refinement, one may choose to iteratively refine the partitioning based on the load imbalance.
● Given the subdomains, the next step involves solving the PDE for the entire problem domain. This consists of:
– Solving the PDE over each subdomain, that is, $\nabla^2 u = f$ in $\Omega_i$, for $1 \le i \le s$, or, equivalently, the linear systems $A_i u_i = f_i$, for $1 \le i \le s$, where $A_i$, $u_i$, and $f_i$ are restrictions of $A$, $u$, and $f$ to $\Omega_i$, respectively
– Ensuring consistency at the interfaces (or the overlap in case of overlapping regions) of the subdomains, which is referred to as the coarse solve
● Depending on the overlap between the subdomains, there are two classes of solution approaches. Specifically, Schur complement techniques are suitable for subdomains with no overlap and Schwarz alternating techniques are suited for overlapping subdomains. Since these solution techniques iteratively solve the constituent subproblems, these subproblems need not be solved accurately in each step, and different choices of approximation over the subproblems lead to variants of the overall algorithms.
● Given a parallel architecture, the final step is to map the subdomains and the interfaces to the individual processors and translate the domain-decomposition-based solution approach to a parallel program that can be implemented efficiently on this architecture. This mapping depends on the relative computational and communication requirements for solving the interfaces.
Each of the above steps will be discussed in the following sections.
Nature of Subdomains The effectiveness of any domain decomposition approach critically depends on the relative suitability of the subdomains both with respect to the computational problem (PDE/linear system) being solved and the parallel architecture being employed. Below, we discuss certain key characteristics of subdomains that influence the overall performance as well as some general guidelines proposed in literature for choosing them for a given problem. Number of Subdomains
The choice of the number of subdomains has a significant effect on both the computational time and the quality of the final solution. In general, a large number of subdomains, or alternately a small size for the coarse grid, results in a better, but more expensive solution since a large number of subdomain problems need to be solved and there is much more communication. On the other hand, choosing a small number of subdomains has the opposite effect. Given a fine grid, that is, a discretized problem, it has been observed empirically [] that there exists an optimal choice for the number of subdomains that minimizes the overall computational time. However, theoretical results [] exist only for fairly simple scenarios
where the same solver is used for all the subdomain problems as well as the original problem, and the convergence rate is independent of the coarse grid size. For example, for a uni-processor setting on a structured $n^d$ grid with fine grid scale $h$ and a solver with complexity $O(n^\alpha)$, the optimal choice for the coarse grid size is given by

$$H_{opt} = \left(\frac{\alpha}{\alpha - d}\right)^{\frac{1}{2\alpha - d}} h^{\frac{\alpha}{2\alpha - d}}.$$

Practical scientific computing problems, however, often involve fairly complex domains as in Fig. , and this requires taking into account a number of other considerations. Firstly, it is preferable to partition the domain into subdomains with regular geometry on which faster solvers can be deployed. Secondly, in a parallel setting, an ideal choice is to assign each subdomain to an individual processor to minimize communication costs. Such a choice might, however, lead to load imbalance in case the number of subdomains is less than that of the available processors or there is a vast difference in the relative sizes of the subdomains. For a parallel scenario, there exist limited theoretical results. For example, for a structured $n^d$ grid where all the subdomain solves are performed in parallel followed by an integration phase performed by one of the solvers, it can be shown that the optimal number of processors is $n^{d/2}$, but this analysis only considers load distribution and ignores the communication costs. For a real-world application, one needs to empirically determine an appropriate partitioning size that optimizes both the load balance as well as the communication costs. Type of Partitioning
There are three main types of partitioning depending on the constraints imposed on subdomain interfaces. The first type is element-based partitioning, common in case of finite-element discretizations, where each element is solely assigned to a single subdomain. The second type is edge-based partitioning, more suitable for finite-volume methods, where each edge is solely assigned to a single subdomain so as to simplify the computation of fluxes across edges. The last type is vertex-based partitioning, which is the least restrictive and only requires each vertex to be assigned to a unique subdomain. The choice of partitioning type determines how the subdomain interface values are computed and can have a significant effect on the complexity and
convergence of the algorithm, for example, Schur complement method can be implemented much more efficiently for edge-based partitionings than the other two types []. For most problems, the appropriate type of partitioning is determined by the nature of the computational problem (PDE) and the discretization itself, but where there is flexibility, it is important to make a judicious choice in order to optimize the computation time for the coarse solve. Level of Overlap
Overlap between the subdomains is another important characteristic that determines the solution approach. In particular, Schur complement–based approaches are applicable only in case of nonoverlapping subdomains whereas Schwarz alternating procedures (SAP) are applicable in case of overlapping subdomains. The appropriate level of overlap is often dependent on the geometry of the problem domain and the PDE to be solved. A decomposition into subdomains with regular geometry (e.g., circles, rectangles) is often preferable irrespective of the level of overlap. In general, SAP-based techniques are easier to implement in addition to providing close to optimal convergence rates and being more robust. However, extra computational effort is required for the regions of overlap and they are not suitable for handling discontinuities.
Partitioning Algorithms For scientific computing tasks, the problem domain is invariably discretized, making it natural to use a graph (or in certain cases hypergraph) representation. Decomposing the problem domain into subdomains, therefore, becomes a graph partitioning problem. For parallel computing with homogeneous processors, it is desirable to have a partitioning such that the resulting subdomains require comparable amounts of computational effort, that is, balanced load, and there are minimal communication costs for handling the subdomain interfaces. In other words, the subdomains need to be of similar size with few edges between them. In case of structured and quasi-uniformly refined grids, such a partitioning can be obtained with relative ease. However, for unstructured grids resulting from most real-world applications, it is an NP-complete problem []. A number of algorithms have been proposed in literature to address this problem and [, , ] provides
a detailed survey of these algorithms. Some of the key approaches are discussed below. Geometric Approaches
Geometric partitioning approaches rely mainly on the physical mesh coordinates to partition the problem domain into subdomains corresponding to regions of space. Such algorithms are preferable for applications such as crash and particle simulations, where geometric proximity between grid points is much more critical than the graph connectivity. Recursive bisection methods form an important class of geometric approaches that involve computing a separator that partitions a region into two subregions in a desired ratio (usually equal). This bisection process is performed recursively on the subregions until the desired number of partitions are obtained. The simplest of these techniques is the Recursive Coordinate Bisection (RCB) [], which uses separators orthogonal to coordinate axes. A more effective technique is Recursive Inertial Bisection (RIB) [] that employs separators orthogonal to principal inertial axes of the geometry. Over the last decades, a number of extensions [] have been proposed, as well as hybrid techniques [] that combine RIB with local search methods such as the Kernighan-Lin method in order to simultaneously minimize the edge cut between the subdomains. Space-filling curve (SFC)-based partitioning approaches [] are another class of geometric techniques. In these methods, a space-filling curve maps an n-dimensional space to a single dimension, which provides a linear ordering of all the mesh points. The mesh is then partitioned by dividing the ordered points into contiguous sets of desired sizes, which correspond to the subdomains. The explicit use of geometric information and the incremental nature of the above techniques make them highly beneficial for applications where geometric locality is critical and there are frequent changes in the geometry requiring dynamic load balancing. However, these techniques have limited applicability for computational problems where one needs to consider graph connectivity and communication costs.
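A compact sketch of recursive coordinate bisection in plain Java follows (class and method names such as RecursiveCoordinateBisection, rcb, and widestAxis are hypothetical; the variant shown cuts at the median along the axis of largest extent, one common choice):

import java.util.Arrays;
import java.util.Comparator;

/** Minimal sketch of Recursive Coordinate Bisection (RCB) for point sets. */
public class RecursiveCoordinateBisection {

    /** Assigns partition ids firstPart .. firstPart+numParts-1 to the points idx[lo..hi). */
    public static void rcb(double[][] pts, Integer[] idx, int lo, int hi,
                           int numParts, int firstPart, int[] part) {
        if (numParts == 1) {
            for (int k = lo; k < hi; k++) part[idx[k]] = firstPart;
            return;
        }
        // Choose the coordinate axis with the largest extent over this region.
        final int axis = widestAxis(pts, idx, lo, hi);
        // Order the points of this region along that axis and split near the median.
        Arrays.sort(idx, lo, hi, Comparator.comparingDouble(i -> pts[i][axis]));
        int leftParts = numParts / 2;
        int mid = lo + (int) Math.round((hi - lo) * (double) leftParts / numParts);
        rcb(pts, idx, lo, mid, leftParts, firstPart, part);
        rcb(pts, idx, mid, hi, numParts - leftParts, firstPart + leftParts, part);
    }

    private static int widestAxis(double[][] pts, Integer[] idx, int lo, int hi) {
        int best = 0;
        double bestExtent = -1;
        for (int d = 0; d < pts[0].length; d++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (int k = lo; k < hi; k++) {
                min = Math.min(min, pts[idx[k]][d]);
                max = Math.max(max, pts[idx[k]][d]);
            }
            if (max - min > bestExtent) { bestExtent = max - min; best = d; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] pts = { {0.1, 0.2}, {0.9, 0.8}, {0.2, 0.9}, {0.8, 0.1}, {0.5, 0.5}, {0.3, 0.4} };
        Integer[] idx = new Integer[pts.length];
        for (int i = 0; i < pts.length; i++) idx[i] = i;
        int[] part = new int[pts.length];
        rcb(pts, idx, 0, pts.length, 4, 0, part);
        System.out.println(Arrays.toString(part));
    }
}

Recursive inertial bisection follows the same recursion but replaces the coordinate axis by the principal inertial axis of the point set.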
Coordinate-Free Approaches
Coordinate-free partitioning approaches include spectral and graph-theory-based techniques that rely only on the connectivity structure of the problem domain. These techniques are widely used in a number of scientific computing applications since they can provide better control of communication costs and do not require an embedding of the problem in a geometric space. There currently exist a large number of such algorithms [], but most of them have been derived from combinations of three basic techniques: (a) Recursive Graph Bisection, (b) Recursive Spectral Bisection, and (c) the Greedy approach. Of these, the recursive graph bisection method [] begins by finding a pseudoperipheral vertex (i.e., one of a pair of vertices on the longest graph geodesic) and computing the graph distance to all the other vertices from this vertex. This distance is then used to sort the vertices, which are then split into two groups in the desired ratio (usually equal) such that one group is closer to the peripheral vertex and the other one is away, and this process is repeated recursively till the desired number of partitions are obtained. The second technique, that is, the recursive spectral bisection method [], makes use of the Fiedler vector, which is defined as the eigenvector corresponding to the second lowest eigenvalue of the Laplacian matrix of the graph. The size of this vector is the same as the number of vertices in the graph, and the difference between the Fiedler coordinates is closely related to the graph distance between the corresponding vertices. The Fiedler vector, thus, provides a linear ordering of the vertices, which can be used to divide the vertices into two groups and recursively deployed as in the case of SFC-based partitioning algorithms. The greedy approach typically uses a breadth-first search and uses graph growing heuristics to create partitions []. There are also a number of fast multilevel techniques that are especially suitable for large problem domains. These techniques typically consist of three phases: coarsening, partitioning, and refining. In the first phase, a sequence of increasingly coarse graphs is constructed by collapsing vertices that are tightly connected to each other. The coarsened graph is then partitioned into the desired number of subdomains in the second phase. The graph is then progressively uncoarsened and refined using local search methods such as the Kernighan-Lin method [], which improves the edge cut via a series of vertex-pair exchanges, in the third phase.
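In the standard notation (the usual definitions, nothing specific to this entry), spectral bisection works with the graph Laplacian

$$L = D - W, \qquad D_{ii} = \sum_{j} w_{ij},$$

where $W$ is the (weighted) adjacency matrix of the mesh graph. The Fiedler vector is the eigenvector $v_2$ associated with the second-smallest eigenvalue of $L$; bisection assigns vertex $i$ to one side or the other according to whether $v_2(i)$ lies below or above the median entry of $v_2$ (or, in a simpler variant, below or above zero).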
Dynamic Approaches
When the computational structure is relatively invariant throughout a simulation, it is sufficient to consider a
static partitioning of the problem domain. However, for applications where the structure changes dynamically (e.g., in crash simulations with geometric deformations or in case of adaptive mesh refinement methods), it is critical to dynamically repartition the problem domain to ensure load balance. Such a repartitioning needs to account for not only the load distribution and communication costs, but also the costs of data redistribution across the partitionings []. Furthermore, in case of large problems, since it is expensive to load the entire mesh onto a single processor, these partitionings themselves need to be computed in parallel. Currently, there exist a number of dynamic partitioning algorithms such as [, , ] which can be viewed as extensions of existing static partitioning algorithms that incorporate an additional objective term based on the data redistribution costs. These techniques often involve a clever remapping of subdomain labels to reduce the redistribution costs. In case of multilevel methods such as Locally-Matched Scratch-Remap (LMSR) [], the coarsening phase is similar to that in the static case, the partitioning (i.e., repartitioning) phase leads to tentative new subdomains, and the final uncoarsening/refinement phase is adapted so as to additionally optimize the data redistribution costs.
PDE Solution Techniques There are two main classes of approaches for solving PDEs over a partitioned problem domain [] depending on the overlap, namely, (a) Schur complement–based approaches and (b) Schwarz alternating procedures, which we discuss below.

Schur Complement–Based Approaches
This class of approaches is based on the notion of a special matrix called the Schur complement, which enables one to solve for the subdomain interface values independent of the interior unknowns. The computation of the unknowns in the interior of each of the subdomains is then performed in parallel independently. Consider a computational problem corresponding to a second-order elliptic PDE as in Example . As mentioned earlier, a discretization leads to the linear system $Au = f$, $u \in \Omega$. In case of nonoverlapping subdomains, that is, $\Omega_i \cap \Omega_j = \emptyset$, and edge-based partitioning on a finite difference mesh, this linear system can be expressed as

$$\begin{pmatrix}
B_1 &     &        &     & D_1 \\
    & B_2 &        &     & D_2 \\
    &     & \ddots &     & \vdots \\
    &     &        & B_s & D_s \\
E_1 & E_2 & \cdots & E_s & C
\end{pmatrix}
\begin{pmatrix}
u_1^{int} \\ u_2^{int} \\ \vdots \\ u_s^{int} \\ u^{bnd}
\end{pmatrix}
=
\begin{pmatrix}
f_1^{int} \\ f_2^{int} \\ \vdots \\ f_s^{int} \\ f^{bnd}
\end{pmatrix}, \qquad ()$$

where each $u_i^{int}$ denotes the vector of unknowns interior to subdomain $\Omega_i$ and $u^{bnd}$ denotes the vector of interface unknowns. Figure  shows the nonzero structure for such a linear system. This can be equivalently expressed in a block form as

$$\begin{pmatrix} B & D \\ E & C \end{pmatrix}
\begin{pmatrix} u^{int} \\ u^{bnd} \end{pmatrix}
=
\begin{pmatrix} f^{int} \\ f^{bnd} \end{pmatrix}, \qquad ()$$

where $B$, $D$, $E$, $f^{int}$, and $u^{int}$ are aggregations over the corresponding subdomain-specific variables. From Eq. , by substituting for $u^{int}$, we obtain a reduced system

$$(C - EB^{-1}D)\, u^{bnd} = f^{bnd} - EB^{-1} f^{int}.$$

Domain Decomposition. Fig. Nonzero structure of the resulting linear system
Algorithm Schur complement method
1. Form the reduced system based on the Schur complement matrix
2. Solve the reduced system for the interface unknowns
3. Solve the decoupled systems for interior unknowns by back-substituting the interface unknowns
The matrix $S = (C - EB^{-1}D)$ is called the Schur complement matrix. This matrix can be computed directly from the original coefficient matrix $A$ given the partitioning and can be used to compute the interface values $u^{bnd}$. Back-substitution of $u^{bnd}$ is then used to solve the smaller decoupled linear systems $B_i u_i^{int} = f_i^{int} - D_i u^{bnd}$. In the case of vertex-based and element-based partitionings, the coefficient matrix $A$ exhibits a more complex block structure unlike in Eq. . However, even for these cases, one can construct a reduced system based on a Schur complement matrix assembled from local (subdomain-specific) Schur complements and the interface coupling matrices, which can be used to solve for the interface unknowns. Algorithm  describes the key steps for this class of techniques irrespective of the partitioning type. Depending on the choice of the methods (e.g., direct vs. iterative methods, exact vs. approximate solve) used for forming the Schur complement and solving the reduced systems, there are multiple possible variants. Schwarz Alternating Procedures
Schwarz alternating procedures refer to techniques that involve alternating between overlapping subdomains and in each case, performing the corresponding subdomain solve assuming boundary conditions based on the most recent solution. Consider the discretized elliptic PDE problem in Example for the case where each subdomain Ω i overlaps with its neighboring subdomains (as in Fig. ). Let Γij denote the boundary of Ω i that is included in Ω j . The original solution method proposed by Schwarz [] in was a sequential method where each step consists of solving the restriction of Au = f over the subdomain Ω i assuming the boundary conditions over Γij , ∀j. Since the approach is iterative and sweeps through each subdomain multiple times, it suffices to perform
Domain Decomposition. Fig. An L-shaped domain showing overlapping subdomains
an approximate subdomain solve or, in the extreme case, one iteration of the subdomain solve. Thus, each step can be viewed as an additive update of $u$. This update step can be concretely specified using the notion of a restriction operator. Let $R_i \in \{0,1\}^{|\Omega_i| \times |\Omega|}$ be a binary matrix that denotes the restriction of $\Omega$ to $\Omega_i$, that is, the $j$th component of $R_i$ is 1 iff the $j$th unknown belongs to $\Omega_i$. Given $\{R_i\}_{i=1}^{s}$, one can define the restriction of $A$ to $\Omega_i$ as the matrix $A_i = R_i A R_i^T$, and the update is then given by $u = u + R_i^T A_i^{-1} R_i (f - Au)$. Since the updates over each subdomain are applied sequentially, it is equivalent to a multiplicative composition and this procedure is referred to as Multiplicative Schwarz Alternation. To make this approach amenable to parallelization, an alternate procedure, Additive Schwarz Alternation [], was proposed, which involves using the same residual $(f - Au)$ to compute the incremental updates over each subdomain in a single sweep; the incremental updates are then aggregated to obtain the new solution. Algorithm  shows the main steps for the basic additive Schwarz alternation, and each of the $\delta_i$ computations in a single iteration are independent of each other and can be performed in parallel. Algorithm  corresponds to the most basic version of additive Schwarz. Real applications, however
Algorithm Additive Schwarz method
Choose an initial guess for $u$
repeat
  for $i = 1, \ldots, s$ do
    Compute $\delta_i = R_i^T A_i^{-1} R_i (f - Au)$
  $u_{new} = u + \sum_{i=1}^{s} \delta_i$
until convergence
employ variants of this approach with similar independent additive updates, but are based on more complex preconditioners [].
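In the notation above, one sweep of the additive method applies the one-level additive Schwarz operator, stated here in its usual textbook form:

$$M_{AS}^{-1} = \sum_{i=1}^{s} R_i^T A_i^{-1} R_i,$$

so that the update in Algorithm  is exactly $u \leftarrow u + M_{AS}^{-1}(f - Au)$. In the preconditioned variants referred to above, $M_{AS}^{-1}$ (often augmented with a coarse-grid correction $R_0^T A_0^{-1} R_0$) is applied inside a Krylov method such as the conjugate gradient method rather than iterated as a stationary scheme.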
Mapping to Parallel Processors A natural way to map a domain decomposition algorithm to a parallel architecture is to assign each subdomain to a separate processor. In this case, the subdomain solves can be performed independently in parallel, but the coarse solve (handling of interface values) is nontrivial since the relevant interface data is scattered among all the processors. For example, in Schur complement approaches, solving the reduced system based on the Schur complement, in the general case, requires access to data from all the subdomains. There are three common ways to address the above problem. The first is to keep the data in place and perform the solve in parallel with the necessary communication. Such an approach would require an intelligent grouping of the interface unknowns with minimal dependencies across groups, which can then be assigned to different subdomains so as to minimize the communication cost. The second option is to collect all the data on a single processor, perform the coarse solve, and broadcast the output while the third approach is to replicate the relevant data on all the processors and perform the coarse solve in parallel. There have been empirical studies [] that evaluated the relative benefits of these approaches, but the appropriate choice typically depends heavily on the characteristics of the specific problem and the parallel architecture.
Advanced Methods More recent work on domain decomposition has focused on efficient techniques for specific topics. These include advanced methods for graph partitioning and for solving PDEs.
In graph partitioning, researchers have developed hypergraph algorithms [, ] that can be used for graphs with edges that involve more than two nodes. Such graphs arise naturally in many applications such as higher-order finite elements, and can also be constructed from regular graphs. Empirical evidence points to improvements in the efficiency of partitioning algorithms as well as in the quality of partitions. It has also been recognized for a while that large graph problems that require domain decomposition techniques often arise in a parallel setting, and as a result, work has focused on developing parallel partitioning methods. These methods are, in general, straightforward parallelization schemes, with additional care taken to ensure the quality of the partitions. Another important area of work involves adaptive partitioning and refining [, ] for applications where the graph changes frequently during the simulation. A key component of these algorithms is the ability to repartition quickly and efficiently starting from an initial partition. The new partitioning must ensure good load balance without increasing communication requirements or introducing large overhead.
The domain decomposition step has implications for the numerical procedure used to solve the PDE. In particular, the rate of convergence of iterative solvers can be improved significantly by considering domain decomposition as a "preconditioning" step that reduces the number of iterations. In this regard, there has been some work on improving the Schur-complement approach by using Schwarz-type methods for the reduced linear system. This can be useful when the Schur complement is complex and large enough to require a parallel solver. Another approach that is motivated by solving PDEs in parallel is the Finite Element Tearing and Interconnecting (FETI) [, ] method, in which splitting of the domain into subdomains is accompanied by modification of the linear operator to achieve faster convergence of the iterative solver. After splitting the domain, one needs to enforce continuity across subdomain boundaries using Lagrange multipliers. The solution process determines these Lagrange multipliers via an iterative method that requires repeated subdomain solves that can be done in parallel.
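As a hedged sketch of the idea, the standard dual formulation found in the FETI literature (not equations reproduced from this entry) couples the subdomain problems through Lagrange multipliers:

K^{(i)} u^{(i)} = f^{(i)} + B^{(i)\,T} \lambda, \qquad i = 1, \ldots, s, \qquad \sum_{i=1}^{s} B^{(i)} u^{(i)} = 0

Here the signed Boolean matrices B^{(i)} express the interface continuity constraints and λ enforces them. Eliminating the u^{(i)} (using pseudo-inverses for floating subdomains) yields a dual interface problem in λ that is solved iteratively, with each iteration requiring independent subdomain solves that can run in parallel.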
Application Areas Domain decomposition has been used in a wide variety of disciplines. These include aerospace, nuclear,
mechanics, electrical, materials, petroleum, and others. The benefits of the method are most apparent when used for applications that require large systems to be solved. In such cases, the solver phase can consume up to % or more of the total running time of the application. The domain decomposition concept can be applied in a fairly straightforward manner once the subdomains are identified. However, it may be important to fine-tune the parameters for each application in order to get the best performance.
Future Directions The advent of multicore processors and the availability of large multiprocessor machines pose significant challenges to researchers and developers in the area of domain decomposition. Challenges that arise from large-scale compute capability involve the ability to use a large number of cores concurrently and efficiently. It will be much harder to achieve good partitioning and load balance on such platforms. The ability to perform asynchronous computation and to adapt to changing computational structure is equally important. In addition, it is imperative to build fault tolerance into the algorithmic structure to protect against inevitable architectural instabilities. One must also consider relaxing the numerical requirements on the computations in favor of obtaining an approximate but acceptable solution quickly. Another area of interest is the automatic tuning of parameters required by the algorithms to ensure good performance. It is also important to understand that domain decomposition is often used as a low-level computational tool in applications where the main goal is to optimize within a design space or to conduct sensitivity studies. A tighter integration of the domain decomposition step with other computational modules is necessary to realize the full benefit of this technique for an application.
Related Entries Graph Partitioning
Bibliography . Berger MJ, Bokhari SH () A partitioning strategy for nonuniform problems on multiprocessors. IEEE Trans Comput ():– . Çatalyürek UV, Boman EG, Devine KD, Bozdag D, Heaphy RT, Riesen LA () A repartitioning hypergraph model for dynamic load balancing. J Parallel Distrib Comput ():–
. Chan TF, Shao JP () Parallel complexity of domain decomposition methods and optimal coarse grid size. Parallel Comput ():– . Devine KD, Boman EG, Heaphy RT, Hendrickson BA, Teresco JD, Faik J, Flaherty JE, Gervasio LG () New challenges in dynamic load balancing. Appl Numer Math (–):– . Devine KD, Boman EG, Heaphy RT, Bisseling RH, Çatalyürek UV () Parallel hypergraph partitioning for scientific computing. In: Proceedings of the th parallel and distributed processing symposium , Rhodes Island, Greece, pp . Dryja M, Widlund OB () Domain decomposition algorithms with small overlap. SIAM J Sci Comput ():– . Farhat C, Lesoinne M () Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics. Int J Numer Methods Eng ():– . Farhat C () A simple and efficient automatic fem domain decomposer. Comput Struct ():– . Farhat C, Pierson K, Lesoinne M () The second generation feti methods and their application to the parallel solution of large-scale linear and geometrically non-linear structural analysis problems. Comput Methods Appl Mech Eng (–): – . Farhat C, Roux F-X () A method of finite element tearing and interconnecting and its parallel solution algorithm. Int J Numer Meth Eng :– . Fjällström P-O () Algorithms for graph partitioning: a survey, vol , part . Linköping University Electronic Press, Sweden . Gropp WD, Keyes DE () Domain decomposition on parallel computers. IMPACT Comput Sci Eng ():– . Gropp WD, Smith BF () Experiences with domain decomposition in three dimensions: overlapping schwarz methods. In: Proceedings of the th international symposium on domain decomposition, AMS, Providence, , pp – . Hendrickson B, Kolda TG () Graph partitioning models for parallel computing. Parallel Comput ():– . Karypis G, Kumar V () Parallel multilevel k-way partitioning scheme for irregular graphs. In: Proceedings of the th ACM/IEEE conference on supercomputing (CDROM) (Super computing ’), Article , Pittsburgh, PA, USA . Karypis G, Kumar V () Multilevel -way hypergraph partitioning. In: Proceedings of the th design automation conference, New Orleans, LA, USA, pp – . Kernighan BW, Lin S () An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J ():– . Leland R, Hendrickson B () An empirical study of static load balancing. In: Proceedings of scalable high performance computing conference, pp – . Neuman SP () Theoretical derivation of Darcy’s law. Acta Mech ():– . Oliker L, Biswas R () Plum: parallel load balancing for adaptive unstructured meshes. J Parallel Distrib Comput ():– . Pothen A, Simon HD, Liou K () Partitioning sparse matrices with eigenvectors of graphs. SIAM J Matrix Anal Appl (): – . Saad Y () Iterative methods for sparse linear systems, nd edn. SIAM, Philadelphia, PA
. Schamberger S, Wierum J () Partitioning finite element meshes using space-filling curves. Future Gen Comp Syst ():– . Schloegel K, Karypis G, Kumar V () Wavefront diffusion and LMSR: Algorithms for dynamic repartitioning of adaptive meshes. IEEE Transactions on Parallel and Distributed Systems ():– . Schwarz HA () Uber einen Grenzübergang durch alternierendes Verfahren. Vierteljahrsschrift der Naturforschenden Gesellschaft in Zürich, :– . Simon HD () Partitioning of unstructured problems for parallel processing. Comput Syst Eng (–):– . Sohn A, Simon H () Jove: a dynamic load balancing framework for adaptive computations on an sp- distributed multiprocessor. Technical report, CIS, New Jersey Institute of Technology, NJ . Sun K, Zhou Q, Mohanram K, Sorensen DC () Parallel domain decomposition for simulation of large-scale power grids. In: ICCAD, pp – . Zhang Y, Bajaj C () Finite element meshing for cardiac analysis. Technical Report –, ICES, University of Texas at Austin, TX
DPJ Deterministic Parallel Java
DR Dynamic Logical Partitioning for POWER Systems
Dynamic Logical Partitioning for POWER Systems
Joefon Jann
T. J. Watson Research Center, IBM Corp., Yorktown Heights, NY, USA
Synonyms DLPAR; DR; Dynamic LPAR; Dynamic reconfiguration
Definition Dynamic Logical Partitioning (DLPAR) is the nondisruptive reconfiguration of the real or virtual resources made available to the operating system (OS) running in a logical partition (LPAR). Nondisruptive means that no OS reboot is needed.
Discussion
Introduction
A Logical PARtition (LPAR) is a subset of hardware resources of a computer upon which an operating system instance can run. One can define one or more LPARs in a physical server, and an OS instance running in an LPAR has no access or visibility to the other OS instances running in other LPARs of the same server. In October , IBM introduced the DLPAR technology on POWER-based servers with the general availability of AIX .. This DLPAR technology enables resources such as memory, processors, disks, network adaptors, and CD/DVD/tape drives to be moved nondisruptively (i.e., without requiring a reboot of the source and target operating systems) between LPARs in the same physical server. In simple terms, one can perform the following basic nondisruptive dynamic operations with DLPAR:
● Remove a unit of resource from an LPAR.
● Add a unit of resource to an LPAR.
● Move a unit of resource from one LPAR to another in the same physical server.
For POWER systems, the granularity/unit of resource for a DLPAR operation was CPU for processors, Logical Memory Block (LMB) of size MB for memory, and IO-slot for I/O (disks, network, and media) adapters. Those units of processor, memory, and I/Oslots that are not assigned to any LPAR reside in a “free pool” of the physical server, and can be allocated to a new or existing LPARs via DLPAR operations later on. Similarly, when a unit of resource is DLPAR-removed from an LPAR, it goes into the free pool. The fundamentals of DLPAR are described in []. AIX has supported DLPAR removal of memory since /; however, DLPAR removal of memory was not available for Linux on POWER until , with the availability of SUSE SLES . On POWER and follow-on POWER systems, the minimum size of an LMB (unit of DLPAR removal or add) can be as small as MB. This minimum size is a function of the total amount of physical memory in the physical server. Since the availability of POWER, all POWER servers are partitioned into one or more LPARs. These LPARs are defined via the POWER HYPervisor firmware (PHYP). An LPAR can be a dedicatedprocessor LPAR or a Shared-Processor LPAR (SPLPAR). A dedicated-processor LPAR is an LPAR which contains
an integral number of CPUs. SPLPAR was introduced with AIX . on POWER systems in /. One can define a pool of CPUs to be shared by a set of LPARs, which will be called shared-processor LPARs (SPLPARs). The PHYP controls the time-slicing and allocation of CPU resources across the SPLPARs sharing the pool. An SPLPAR has the following attributes:
(a) Minimum, assigned, and maximum number of Virtual Processors (VPs). The number of VPs is the degree of processor concurrency made available to the OS instance.
(b) Minimum, assigned, and maximum number of processor-units of Capacity. Capacity is the amount of cumulative real-CPU-equivalent resources allocated to the OS instance in a reasonable interval of time (e.g.,  ms).
(c) Capped or uncapped mode. A capped SPLPAR cannot use excess CPU resources in the pool beyond its maximum capacity.
(d) If uncapped, an SPLPAR has an uncapped-weight value (currently, an integer between  and ), which is used for prioritizing the SPLPARs in a pool.
The DLPAR technology was then enhanced to be able to dynamically move units of an SPLPAR's CPU resources nondisruptively from one SPLPAR to another in the same SPLPAR pool. More precisely, DLPAR technology can nondisruptively and dynamically:
● Remove or add an integral number of Virtual Processors from or to an SPLPAR, or move such from one SPLPAR to another
● Remove or add Capacity (in processor-units) from or to an SPLPAR, or move capacity from one SPLPAR to another
● Change the processor Mode of an SPLPAR to Capped or Uncapped
● Change the Weight of an Uncapped SPLPAR (between a value of  and )
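The attribute set described above can be pictured as a small record. The following Python sketch is purely illustrative; the class and field names, and the default weight, are assumptions for exposition and not an IBM API.

from dataclasses import dataclass

@dataclass
class SPLPARProfile:
    # Virtual processors: degree of processor concurrency seen by the OS instance.
    min_vps: int
    assigned_vps: int
    max_vps: int
    # Entitled capacity in processor-units (cumulative real-CPU-equivalent share).
    min_capacity: float
    assigned_capacity: float
    max_capacity: float
    capped: bool                  # a capped SPLPAR cannot exceed its maximum capacity
    uncapped_weight: int = 128    # illustrative default; only meaningful when capped is False

# Example: an uncapped SPLPAR entitled to 1.5 processor-units spread over up to 4 VPs.
web_lpar = SPLPARProfile(1, 2, 4, 0.5, 1.5, 4.0, capped=False, uncapped_weight=192)

A DLPAR operation on an SPLPAR then amounts to changing one of these fields (VPs, capacity, mode, or weight) at run time rather than rebooting with a new profile.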
DLPAR Technology
DLPAR Removal of a CPU
DLPAR enables the dynamic removal of a CPU from a running OS instance in one LPAR, without requiring a reboot of the OS instance.
At a very high level, the following sequence of operations initiated from the OS is used to accomplish a DLPAR removal of a CPU:
1. Notify all applications and kernel extensions that have registered to be notified of CPU DLPAR events, so that they may remove dependencies on the CPU to be removed. This typically involves unbinding their threads that are bound to the CPU being removed.
2. Migrate threads that are bound to the CPU being removed to another CPU in the same LPAR.
3. Reroute existing and future hardware interrupts that are directed to the CPU being removed. This involves changing the interrupt controller data structures.
4. Migrate the timers and threads from the CPU being removed to another CPU in the same LPAR.
5. Notify the hypervisor/firmware to complete the removal task.
DLPAR Addition of a CPU
Dynamic addition of a CPU involves the following tasks initiated from the OS:
1. Create a process (waitproc) to run an idle loop on the incoming CPU, before it starts to do real work.
2. Set up various hardware registers (e.g., GPR for the kernel stack, GPR for the kernel Table Of Contents) of the incoming CPU.
3. Allocate and/or initialize the processor-specific kernel data structures (e.g., runqueue, per-processor data area, interrupt stack) for the incoming CPU.
4. Add support for the incoming CPU to the interrupt subsystem.
5. Notify the DLPAR-registered applications and kernel extensions that a new CPU has been added.
Dynamic addition and removal of processors to and from an LPAR allow it to adapt to varying workloads. Figure  shows that the throughput (in requests-per-second) of the WebSphere Trade benchmark scales almost linearly with the number of CPUs in a dedicated-processor LPAR, as CPUs were removed and added to the LPAR with DLPAR operations [].
Dynamic Logical Partitioning for POWER Systems. Fig. WebSphere Trade benchmark: throughput (in requests-per-second) as a function of the number of CPUs (the upper x-axis labels), and as a function of elapsed time (the lower x-axis labels)
DLPAR Addition of a Memory Block (LMB)
When a DLPAR memory-add request arrives at the OS, the latter has to perform two tasks:
1. Allocate and initialize software page descriptors that will hold metadata for the incoming memory.
2. Distribute the incoming memory among the framesets of a mempool (mempools are described two paragraphs below).
The challenges encountered in implementing these two tasks in AIX and how they were resolved are described as follows. Prior to implementing DLPAR, software page descriptors could be accessed in translation-off mode (where the virtual address is used as the real address) while trying to reload a page mapping into the hardware page table. If page descriptors are allowed to be accessed in translation-off mode, then the memory allocated for the incoming new descriptors has to be physically contiguous to the memory for the existing descriptors, which implies that memory has to be reserved at boot time for enough descriptors for the maximal amount
of memory that the OS instance can grow into. This reservation can incur much memory wastage and performance degradation, particularly if not utilized. We avoided this wastage by changing the AIX kernel so that software page descriptor data structures are always accessed in translation-on mode. The challenge of distributing the incoming memory across different page replacement daemons, so that each daemon handles a roughly equal load was resolved as follows: In AIX, memory is hierarchically represented by the data structures vmpool, mempool, and frameset. A vmpool represents an affinity domain of memory. A vmpool is divided into multiple mempools, and each mempool is managed by a single page replacement LRU (least recently used) daemon. Each mempool is further subdivided into one or more framesets that contain the free-frame lists, so as to improve the scalability of free-frame allocators. When new memory is added to AIX, the vmpool that it should belong to is defined by the physical location of the memory chip. Within that vmpool, we wanted to distribute the
memory across all the available mempools to balance the load on page replacement daemons. However, the kernel assumed that a mempool consisted of physically contiguous memory. Thus, to be able to break up the new memory (LMB) into several parts and distribute them across different mempools, the kernel was modified to allow mempools to be made up of discontiguous sections of memory.
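The vmpool/mempool/frameset hierarchy and the round-robin distribution of an incoming LMB can be illustrated with a short Python sketch. The class and function names are assumptions chosen for exposition; they are not AIX kernel interfaces.

from dataclasses import dataclass, field

@dataclass
class Frameset:                    # holds a free-frame list
    free_frames: list = field(default_factory=list)

@dataclass
class Mempool:                     # managed by one LRU page-replacement daemon
    framesets: list

@dataclass
class Vmpool:                      # one memory affinity domain
    mempools: list

def add_lmb(vmpool, lmb_frames):
    """Distribute the frames of an incoming LMB round-robin across the mempools of
    the affinity domain, so the page-replacement daemons stay roughly equally loaded
    (which requires mempools to accept discontiguous memory)."""
    n = len(vmpool.mempools)
    for i, frame in enumerate(lmb_frames):
        pool = vmpool.mempools[i % n]
        pool.framesets[i % len(pool.framesets)].free_frames.append(frame)

# Example: one affinity domain with three mempools of two framesets each.
domain = Vmpool([Mempool([Frameset(), Frameset()]) for _ in range(3)])
add_lmb(domain, list(range(256)))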
DLPAR Removal of a Memory Block (LMB)
This is by far the hardest-to-implement of all the DLPAR operations, particularly when there are ongoing DMA operations in the OS instance. For the purpose of DLPAR memory removal, we classify the memory page-frames in AIX into five categories: unused, page-able, pinned, DMA-mapped, and translation-off memory. The approach taken to remove a page-frame (, bytes) in each of these categories is as follows:
1. A page-frame containing a free/unused page is simply removed from its free-list.
2. A page-frame containing a page-able page can be made either to page out its contents to disk or to migrate its contents to a different free page-frame.
3. A page-frame containing a pinned page will have its contents migrated to a different page-frame; in addition, the page-fault reload handler had to be made to spin during the migration.
4. A page-frame containing a DMA-mapped page cannot be removed, nor have its contents migrated, until all accesses to the page are blocked. Here, the term "DMA-mapped page-frame" is used generically to mean a page-frame whose physical address is subject to read or write by an entity external to the kernel, such as a DMA engine. The contents of a DMA-mapped page-frame are migrated to a different page-frame with a new hypervisor call (h_migrate_dma) that selectively suspends the bus traffic while it modifies the TCEs (translation control entries) in system memory used by the bus unit controller for DMA accesses. Other page-frames whose physical addresses are exposed to external entities are handled by invoking preregistered DLPAR callback routines and then waiting for completion of the removal of their dependencies on the page.
5. A page-frame containing translation-off pages will not be removed by DLPAR in AIX .. Fortunately, there is just a small amount of these page-frames, and they are usually collocated in low memory.
The design and implementation of DLPAR memory removal has manifested itself as three modular functions, such that one can mix and match these functions in several possible ways, adapting to the state of the system at the time of memory removal, thus achieving the desired end result along the most efficient path. These three modular functions perform, respectively, the following three tasks on the memory (LMB) being removed:
(a) Remove its free and clean pages from the regular use of the VMM (Virtual Memory Manager)
(b) Page out its page-able dirty pages
(c) Migrate the contents of each remaining page-frame in the LMB to a free page-frame outside the LMB
Memory removal can be implemented with any one of the sequences abc, ac, bc, or just c, as sketched below. For example, the decision either to invoke page-out (task b) or to migrate (task c) all the pages depends on the load in the LPAR at that particular time. If the LMB being removed contains many dirty pages that belong to highly active threads, then it does not make sense to invoke task b, because these pages will be paged back in almost immediately, negatively impacting the efficiency of the system. As a second example, if there are not enough free frames in other LMBs to migrate the pages to, then the memory removal procedure can invoke task b before invoking task c, so that far fewer pages are left to be migrated in task c.
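The choice among the sequences can be illustrated with a minimal Python sketch; the decision predicates are hypothetical stand-ins for the load and free-frame checks described above, not actual AIX interfaces.

def removal_sequence(pages_highly_active, low_on_free_frames):
    """Illustrative composition of tasks (a), (b), (c) into one of abc, ac, bc, c."""
    steps = ["a"]                              # (a) reclaim free and clean pages
    if not pages_highly_active or low_on_free_frames:
        steps.append("b")                      # (b) page out dirty page-able pages
    steps.append("c")                          # (c) migrate whatever remains
    return steps

# A busy LMB with plenty of free frames elsewhere skips the page-out step.
print(removal_sequence(pages_highly_active=True, low_on_free_frames=False))   # ['a', 'c']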
DLPAR of I/O Slots
The methods to dynamically configure or unconfigure a device have been available as early as AIX version . The changes required in the kernel design for the DLPAR of I/O slots were not on the same scale as those for the DLPAR of processors, and particularly not as those for the DLPAR of memory. The reason is that the onus of configuring or unconfiguring a device lies with the device driver software, which operates in the kernel extension environment. The kernel just acts as a provider of serialization mechanisms for devices accessing common resources, and as an intermediary between the applications and the device drivers.
Values and Uses of DLPAR
DLPAR is the foundation of many advanced technologies for ensuring scalability, reliability, and full utilization of the resources in POWER Systems servers. It opens up a whole set of possibilities for great flexibility in dealing with dynamically changing workload demands, server consolidations of workloads from different time zones, and high availability for applications.
Some Obvious Values Enabled by DLPAR
● Ability to dynamically move processors and memory from a test LPAR to a production LPAR in periods of peak production workload demand, then move them back as demand decreases.
● Ability to programmatically and dynamically move processors and memory from less loaded LPARs to busy LPARs, whenever the need arises []. This is particularly useful for economically providing resources to LPARs that service workloads in different time zones or even different continents.
● Ability to dynamically move an infrequently used I/O device between LPARs, such as CD/DVD/tape drives for installations, applying updates, and for backups. Occasionally, one may also want to move Ethernet adaptors and disk drives dynamically from one LPAR to another.
● Ability to dynamically release a set of processor, memory, and I/O resources into the "free pool," so that a new LPAR can be created using the resources in the free pool.
● Providing high availability for the applications in an LPAR. For example, you can configure a set of minimal LPARs on a single system to act as "failover" backup LPARs to a set of primary LPARs, and also keep some set of resources free. If one of the associated primaries fails, then its backup LPAR can pick up the workload, and resources can be dynamically added to the backup LPAR as needed.
Some Non-Obvious Values Enabled by DLPAR
● Dynamic CPU Guard, which is the automatic and graceful de-configuration of a processor which has intermittent errors.
● Dynamic CPU Sparing, which is the Dynamic CPU Guard feature enhanced with automatic enablement of a spare CPU to replace a defective CPU.
● CUoD (Capacity Upgrade on Demand), which allows a customer to activate preinstalled but inactive and unpaid-for processors as resource needs arise, as soon as IBM is notified and enables the corresponding license keys. Many enterprises with seasonally varying business activities extensively utilize this feature in their IT infrastructure.
● Hot repair of PCI devices, and possibly future hot repair of MCMs (Multi-Chip Modules).
● The failover of an LPAR to another LPAR in another physical server, used with the IBM HACMP LPP (Licensed Program Product). This is detailed in the IBM Redpaper titled "HACMP v., Dynamic LPAR, and Virtualization" []. HACMP is an acronym for High Availability Cluster Multiprocessing. It is IBM's solution for high-availability clusters on the AIX and Linux POWER platforms.
DLPAR-Safe, DLPAR-Aware, and DLPAR-Friendly Programs/Applications A DLPAR-safe program is one that does not fail as a result of DLPAR operations. Its performance may suffer when resources are taken away, and performance may not scale up when new resources are added. However, by default, most applications are DLPAR-safe, and only a few applications are expected to be impacted by DLPAR. A DLPAR-aware program is one that has code that is designed to adjust its use of system resources commensurate with the actual resources of the LPAR, which is expected to vary over time. This may be accomplished in two ways. The first way is by regularly polling the system to discover changes in its LPAR resources. The second way is by registering a set of DLPAR code that is designed to be executed in the context of a DLPAR operation. At a minimum, DLPAR-aware programs are designed to avoid introducing conditions that may cause DLPAR operations to fail. They are not necessarily concerned with the impact that DLPAR may have on their performance. A DLPAR-friendly program is a DLPAR-aware application or middleware that automatically tunes itself in response to the changing LPAR resources. AIX provides a set of DLPAR scripts and APIs for applications to dynamically resize. Notable vendor applications
that are DLPAR-friendly include IBM DB [], Lotus Domino [], Oracle [], etc. More information about these scripts and APIs can be found in chapter (Dynamic Logical Partitioning) of the online IBM manual “AIX Version . General Programming Concepts” []. To maintain expected levels of application performance when memory is removed, buffers may need to be drained and resized. Similarly when memory is added, buffers may need to be dynamically increased to gain performance. Similar treatment needs to be applied to threads, whose numbers, at least in theory, need to be dynamically adjusted to the changes in the number of online CPUs. However, thread-based adjustments are not necessarily limited to CPU-based decisions; for example, the best way to reduce memory consumption in Java programs may be to reduce the number of threads, since this should reduce the number of active objects that need to be preserved by the Java Virtual Machine’s garbage collector.
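A minimal sketch of the polling approach to DLPAR awareness mentioned above is shown below in Python. It is an illustrative stand-in, not the AIX DLPAR script/API mechanism; the resize callback is a hypothetical application hook.

import os
import time

def online_cpus():
    # On Unix systems this sysconf value reflects the current number of online CPUs,
    # which changes when processors are DLPAR-added or DLPAR-removed.
    return os.sysconf("SC_NPROCESSORS_ONLN")

def run_dlpar_aware(resize_pool, poll_seconds=30):
    current = online_cpus()
    resize_pool(current)
    while True:
        time.sleep(poll_seconds)
        seen = online_cpus()
        if seen != current:            # a DLPAR add or remove has happened
            current = seen
            resize_pool(current)

# Example hook: keep one worker thread per online CPU.
# run_dlpar_aware(lambda n: print(f"resizing worker pool to {n} threads"))

The registration-based alternative described in the text avoids polling entirely by running application-supplied code in the context of the DLPAR operation itself.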
Future Directions
Future uses of DLPAR are already being explored; among these are:
● Power/energy conservation for the server by dynamic DLPAR removal of unused CPUs and memory
● Dynamic hot repair of server components, even MCMs (Multi-Chip Modules)
Related Entries
IBM Power Architecture Multi-Threaded Processors
Bibliography
. Jann J, Browning L, Burugula RS () Dynamic reconfiguration: basic building blocks for autonomic computing in IBM pSeries servers. IBM Syst J (Spec Issue Auton Comput) () . Jann J, Dubey N, Pattnaik P, Burugula RS () Dynamic reconfiguration of CPU and WebSphere on IBM pSeries servers. Softw Pract Exp J (SPE Issue) ():– . Lynch J Dynamic LPAR – the way to the future. http://www.ibmsystemsmag.com/aix/junejuly/administrator/p.aspx
. Shah P () DB and dynamic logical partitioning. http://www. ibm.com/developerworks/eserver/articles/db_dlpar.html . Bassemir R, Faurot G () Lotus Domino and AIX DLPAR. http://www.ibm.com/developerworks/aix/library/ au-DominoandDLPARver.html . Shanmugam R () Oracle database and Oracle RAC gR on IBM AIX. http://www.ibm.com/servers/enable/site/peducation/ wp/a/a.pdf . IBM manual “AIX Version .: General Programming Concepts: Writing and Debugging Programs” SC--, / pp – http://publib.boulder.ibm.com/infocenter/aix/vr/ topic/com.ibm.aix.genprogc/doc/genprogc/genprogc.pdf . Quintero D, Bodily S, Pothier P, Lascu O () HACMP v., dynamic LPAR, and virtualization. IBM Redpaper, http://www. redbooks.ibm.com/redpapers/pdfs/redp.pdf . Bodily S, Killeen R, Rosca L () PowerHA for AIX cookbook. IBM Redbook, http://www.redbooks.ibm.com/redbooks/ pdfs/sg.pdf . Matsubara K, Guérin N, Reimbold S, Niijima T () The complete partitioning guide for IBM eServer pSeries servers. IBM Redbook, SG--, http://portal.acm.org/citation. cfm?id= ISBN: . Irving N, Jenner M, Kortesniemi A () Partitioning implementation for IBM eServer p servers. IBM Redbook, SG-, http://www.redbooks.ibm.com/redbooks/pdfs/sg.pdf . Hales C, Milsted C, Stadler O, Vågmo M () PowerVM virtualization on IBM system p: introduction and configuration, th edn. IBM Redbook, http://www.redbooks.ibm.com/ redbooks/pdfs/sg.pdf . Dimmer I, Haug V, Huché T, Singh AK, Vågmo M, Venkataraman AK () Ch : DLPAR. In: PowerVM virtualization managing & monitoring. IBM RedBook, http://www.redbooks.ibm. com/redbooks/pdfs/sg.pdf . Jann J, Burugula RS, Dubey N IBM DLPAR tool set for pSeries for P & P systems. http://www.alphaworks.ibm.com/tech/dlpar (Contact [email protected]) . Wikipedia entry on DLPAR: http://en.wikipedia.org/wiki/ DLPAR
Dynamic LPAR Dynamic Logical Partitioning for POWER Systems
Dynamic Reconfiguration Dynamic Logical Partitioning for POWER Systems
E
Earth Simulator
Ken’ichi Itakura
Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokohama, Japan
Synonyms ES
Definition The Earth Simulator is a highly parallel vector supercomputer system consisting of processor nodes and an interconnection network.
Discussion
The Earth Simulator was developed as a national project, led by Dr. Hajime Miyoshi, by three governmental agencies: the National Space Development Agency of Japan (NASDA), the Japan Atomic Energy Research Institute (JAERI), and the Japan Marine Science and Technology Center (JAMSTEC). The ES is housed in the Earth Simulator Building (approximately  m ×  m ×  m). The fabrication and installation of the ES at the Earth Simulator Center of JAMSTEC by NEC was completed at the end of February . It was first on the Top list for two and a half years starting in June  (Fig. ).
Hardware
The ES is a highly parallel vector supercomputer system of the distributed-memory type, consisting of  processor nodes (PNs) connected by  ×  single-stage crossbar switches. Each PN is a system with a shared memory, consisting of eight vector-type arithmetic processors (APs), a -GB main memory system (MS), a remote access control unit (RCU), and an I/O processor. The peak performance of each AP is  Gflops.
The ES as a whole thus consists of , APs with TB of main memory and the theoretical performance of Tflops. Each AP consists of a four-way super-scalar unit (SU), a vector unit (VU), and main memory access control unit on a single LSI chip. The AP operates at a clock frequency of MHz with some circuits operating at GHz. Each SU is a super-scalar processor with KB instruction caches, KB data caches, and general-purpose scalar registers. Branch prediction, data prefetching, and out-of-order instruction execution are all employed. Each VU has vector registers, each of which has vector elements, along with eight sets of six different types of vector pipelines: addition/shifting, multiplication, division, logical operations, masking, and load/store. The same type of vector pipelines works together by a single vector instruction and pipelines of different types can operate concurrently. The VU and SU support the IEEE floatingpoint data format. This LSI chip technology is nm CMOS and eight layers copper interconnection. The die size is . mm × . mm. It has million transistors and , pins. The maximum power consumption is W. The overall MS is divided into , banks and the sequence of bank numbers corresponds to increasing addresses of locations in memory. Therefore, the peak throughput is obtained by accessing contiguous data which are assigned to locations in increasing order of memory address. The RCU is directly connected to the crossbar switches and controls internode data communications at . GB/s bidirectional transfer rate for both sending and receiving data. Thus the total bandwidth of internode network is about TB/s. Several data-transfer modes, including access to three-dimensional (D) sub-arrays and indirect access modes, are realized in hardware. In an operation that involves access to the data of a sub-array, the data is moved from one PN to
Earth Simulator. Fig. Bird’s-eye view of the earth simulator system
another in a single hardware operation, and relatively little time is consumed for this processing. Each Interconnection Network (IN) cabinet has two  ×  switches. The signal from a PN undergoes serial-to-parallel conversion before entering a switching circuit, and the signal leaving the switching circuit undergoes parallel-to-serial conversion before being sent on to a PN cabinet. The number of cables connecting the PN cabinets and the IN cabinets is  ×  = , , and their total length is , km.
MDPS
In October , the Mass Data Processing System (MDPS) was installed as a mass data storage system replacing the tape library system. It consists of four file service processors (FSPs),  TB of hard disk drives, and the currently used . PB cartridge tape library (CTL). MDPS was adopted with the aim of improving data transfer throughput and accessibility:
1. The transfer speed for saving and retrieving data between the ES and the storage became two to five times faster. This improvement is achieved by expanding the transfer cable capacity and replacing the tape archive with disks.
2. From the user's view, MDPS is one very large disk, and the usual UNIX commands and tools can access the MDPS data on the login server, so the data I/O procedure becomes easy.
3. MDPS enables users to access the results computed by the ES remotely, because it can transfer the data to a dedicated server outside the ES LAN (the ES-Network FTP Server).
Operating System
The operating system running on the ES is an enhanced version of NEC's UNIX-based OS "SUPER-UX," developed for NEC's SX Series supercomputers. To support ultra-scale scientific computations, SUPER-UX was enhanced mainly in two respects:
1. Extending scalability
2. Providing special features for the ES
Extending scalability up to the whole system ( PNs) is the major requirement for the OS of the ES. All functions of the OS, such as process management, memory management, and file management, are fully optimized to fulfill this requirement. For example, any OS task costing order n, such as scattering data in sequence over all PNs, is replaced where possible with an equivalent one costing order log n, such as a binary-tree copy, where n is the number of PNs. An illustration of this idea appears below.
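The following Python sketch is illustrative only; it counts communication rounds rather than modeling the actual SUPER-UX implementation, and the node count used in the example is an assumption.

def sequential_rounds(n):
    # Node 0 sends the data to nodes 1..n-1 one after another: n-1 rounds.
    return max(n - 1, 0)

def tree_rounds(n):
    # In each round every node that already holds the data forwards it to one
    # more node, doubling the number of holders: ceil(log2(n)) rounds.
    rounds, holders = 0, 1
    while holders < n:
        holders *= 2
        rounds += 1
    return rounds

# Example with 640 nodes used purely as an illustrative count.
print(sequential_rounds(640), tree_rounds(640))   # 639 versus 10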
Earth Simulator. Fig. Super cluster system of the ES: a hierarchical management system is introduced to control the ES. Every  nodes are collected into a cluster, and therefore there are  sets of clusters in total. One set, called the "S-cluster," is dedicated to interactive processing and small-scale batch jobs; a job within one node can be processed on the S-cluster. The other sets, called "L-clusters," are for medium-scale and large-scale batch jobs; parallel jobs spanning several nodes are executed on some of these clusters. Each cluster has a cluster control station (CCS), which monitors the state of the nodes and controls the power of the nodes belonging to the cluster. A super cluster control station (SCCS) integrates and coordinates all the CCS operations.
Earth Simulator. Fig. Parallel file system (PFS): A parallel file, i.e., a file on the PFS, is striped and stored cyclically with the specified blocking size on the disk of each PN. When a program accesses the file, the File Access Library (FAL) sends a request for I/O via the IN to the File Server on the node that owns the data to be accessed
On the other hand, the OS provides some special features which aim for efficient use or administration of such a large system. The features include internode high-speed communication via IN, global address space among PNs, super cluster system (Fig. ), batch job environment, etc.
Parallel File System
If a large parallel job running on PNs reads from/writes to one disk installed in a PN, each PN accesses the disk in sequence, and performance degrades terribly. Although local I/O, in which each PN reads from or writes to its own disk, solves the problem, it is very hard work to manage such a large number of partial files. Therefore, parallel I/O is greatly demanded in the ES from the point of view of both performance and usability. The parallel file system (PFS) provides the parallel I/O features to the ES (Fig. ). It enables handling multiple files, which are located on separate disks of multiple PNs, as logically one large file. Each process of a parallel program can read/write distributed data from/to the parallel file concurrently with one I/O statement to achieve high performance and usability of I/O.
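The cyclic striping described in the figure caption can be illustrated with a short Python sketch; the block size and node count are illustrative assumptions, not ES parameters. It maps a byte offset in the logical parallel file to the PN and local block that hold it.

def locate_block(offset, block_size=1024 * 1024, num_pns=4):
    """Map a byte offset in the logical file to (owning PN, local block index)
    under round-robin (cyclic) striping."""
    block = offset // block_size
    return block % num_pns, block // num_pns

# With 1 MB blocks over 4 PNs, the 5th megabyte lives on PN 0 as its second local block.
print(locate_block(4 * 1024 * 1024))   # (0, 1)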
Job Scheduling
The ES is basically a batch-job system. Network Queuing System II (NQSII) is introduced to manage the batch jobs (Fig. ). There are two types of queues: the L batch queue and the S batch queue. The S batch queue is aimed at pre-runs and post-runs for large-scale batch jobs (making initial data, processing the results of a simulation, and other processes), and the L batch queue is for production runs. Users choose an appropriate queue for their jobs. The S batch queue is designed for single-node batch jobs. In the S batch queue, the Enhanced Resource Scheduler (ERSII) is used as the scheduler and performs job scheduling based
Earth Simulator. Fig. Shows the queue configuration of ES
Earth Simulator. Fig. Job execution. The user writes a batch script and submits the batch script to ES. The node scheduling, the file staging, and other processing are automatically processed by the system
on CPU time. On the other hand, the L batch queue is for multi-node batch jobs. In this queue, a scheduler customized for the ES is used. We have been developing this scheduler with the following strategies:
1. The nodes allocated to a batch job are used exclusively for that batch job.
2. The batch job is scheduled based on elapsed time instead of CPU time.
Strategy () makes it possible to estimate the job termination time and to allocate nodes for the next batch jobs in advance. Strategy () contributes to efficient
Earth Simulator. Table  Programming model on ES
            Hybrid                    Flat
Inter-PN    HPF/MPI                   HPF/MPI
Intra-PN    Microtasking/OpenMP       HPF/MPI
AP          Automatic vectorization   Automatic vectorization
job execution. The job can use the nodes exclusively, and the processes in each node can be executed simultaneously. As a result, large-scale parallel programs can be executed efficiently. PNs of the L-system are prohibited from accessing the user disk to ensure sufficient disk I/O performance. Therefore, the files used by a batch job are copied from the user disk to the work disk before job execution. This process is called "stage-in." It is important to hide this staging time from the job scheduling. The main steps of the job scheduling are summarized as follows:
1. Node allocation
2. Stage-in (copies files from the user disk to the work disk automatically)
3. Job escalation (rescheduling for an earlier estimated start time if possible)
4. Job execution
5. Stage-out (copies files from the work disk to the user disk automatically)
Earth Simulator. Fig. Two types of parallelization of ES
When a new batch job is submitted, the scheduler searches for available nodes (Step 1). After the nodes and the estimated start time are allocated to the batch job, the stage-in process starts (Step 2). The job waits until the estimated start time after the stage-in process is finished. If the scheduler finds an earlier start time than the estimated start time, it allocates the new start time to the batch job; this process is called "Job Escalation" (Step 3). When the estimated start time has arrived, the scheduler executes the batch job (Step 4). The scheduler terminates the batch job and starts the stage-out process after the job execution is finished or the declared elapsed time is over (Step 5) (Fig. ).
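The job life cycle above can be summarized in a small, purely illustrative Python sketch; the function names are assumptions standing in for the scheduler's internal actions and are not an NQSII interface.

import time

def allocate_nodes(job):             # Step 1: pick nodes and an estimated start time
    return ["PN#%d" % i for i in range(job["nodes"])], time.time() + 1.0

def stage_in(job, nodes):            # Step 2: copy user-disk files to the work disk
    print("stage-in", job["name"], "->", nodes)

def escalate(job, start):            # Step 3: reschedule earlier if a slot opens up
    return start - 0.5

def run(job, nodes):                 # Step 4: exclusive execution on the allocated nodes
    print("running", job["name"], "on", nodes)

def stage_out(job, nodes):           # Step 5: copy results back to the user disk
    print("stage-out", job["name"])

def schedule(job):
    nodes, start = allocate_nodes(job)
    stage_in(job, nodes)
    start = escalate(job, start)
    time.sleep(max(0.0, start - time.time()))
    run(job, nodes)
    stage_out(job, nodes)

schedule({"name": "climate-run", "nodes": 4})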
Programming Model in ES
The ES hardware has a three-level hierarchy of parallelism: vector processing in an AP, parallel processing with shared memory in a PN, and parallel processing among PNs via the IN. To bring out the full performance of the ES, one must develop parallel programs that make the most of this parallelism. As shown in Table , the three-level hierarchy of parallelism of the ES can be used in two manners, called hybrid and flat parallelization (Fig. ). In the hybrid parallelization, the internode parallelism is expressed by HPF or MPI and the intranode parallelism by microtasking or OpenMP, so the programmer must consider the hierarchical parallelism when writing programs. In the flat parallelization, both inter- and intranode parallelism can be expressed by HPF or MPI, and it is not necessary to consider such complicated parallelism. Generally speaking, the hybrid parallelization is superior to the flat in performance and vice versa in ease of programming. Note that the MPI libraries and the HPF runtimes are optimized to perform as well as possible in both the hybrid and the flat parallelization.
Eden
Rita Loogen
Philipps-Universität Marburg, Marburg, Germany
Definition Eden is a parallel functional programming language that extends the non-strict functional language Haskell with constructs for the definition and instantiation of parallel processes. The programmer is freed from managing synchronization and data exchange between processes while keeping full control over process granularity, data distribution, and communication topology. Eden is geared toward distributed settings, that is, processes do not share any data. Common and sophisticated parallel communication patterns and topologies, that is, algorithmic skeletons, are provided as higher-order functions in a skeleton library written in Eden. Eden is implemented on the basis of the Glasgow Haskell Compiler GHC [], a mature and efficient Haskell implementation. While the compiler frontend is almost unchanged, the backend is extended with a parallel runtime system (PRTS) []. This PRTS uses suitable middleware (currently PVM or MPI) to manage parallel execution.
Discussion Introduction Functional languages are promising candidates for parallel programming, because of their high level of abstraction and, in particular, because of their referential transparency. In principle, any sub-expression could be evaluated in parallel. As this implicit parallelism would lead to too much overhead, modern parallel functional languages allow the programmers to specify parallelism explicitly. The underlying idea of Eden is to enable programmers to specify process networks in a declarative way. Processes map input to output values. Inputs and outputs are transferred via unidirectional one-to-one channels. A comprehensive definition of Eden including a discussion of its formal semantics and its implementation can be found in the journal paper []. We describe only the essentials here.
Basic Eden Constructs
The basic Eden coordination constructs are process abstraction and instantiation:
process :: (Trans a, Trans b) ⇒ (a → b) → Process a b
( # )   :: (Trans a, Trans b) ⇒ Process a b → a → b
The function process embeds functions of type a→b into process abstractions of type Process a b, while the instantiation operator ( # ) takes such a process abstraction and input of type a to create a new process which consumes the input and produces output of type b. Thus, the instantiation operator leads to a function application, with the side effect that the function is evaluated by a remote child process. Its argument is concurrently, that is, by a new thread, evaluated in the parent process and sent to the child process which, in turn, fully evaluates and sends back the result of the function application. Both processes are using implicit : communication channels established between child and parent process on process instantiation. The type context (Trans a, Trans b) states that both a and b must be types belonging to the type class Trans of transmissible values. In general, Haskell type classes provide a structured way to define overloaded functions. Trans defines implicitly used communication functions which by default transmit normal form values in a single piece. The overloading is used twofold. Lists are communicated element-wise as streams and tuple components are evaluated concurrently. An independent thread will be created for each component of an output tuple and the result of its evaluation will be sent on a separate channel. The connection points of channels to processes are called inports on the receiver side and outports on the sender side. There is a oneto-one correspondence between the threads and the outports of a process while data that is received via the inports is shared by all threads of a process. Analogously, several threads will be created in a parent process for tuple inputs of a child process. During its lifetime, an Eden process can thus contain a variable number of threads. The demand-driven (lazy) evaluation of Haskell is an obstacle for parallelism. A completely demand-
driven evaluation creates a parallel process only when its result is already needed to continue the main evaluation. This suppresses any real parallel evaluation. Although Eden overrules the lazy evaluation, for example, by evaluating inputs and outputs of processes eagerly and by sending those values immediately to the corresponding recipient processes, it is often necessary to use explicit demand control in order to start processes speculatively before their result values are needed by the main computation. Evaluation strategies [] are used for that purpose. We will not go into further details. In the following, we will use the (predefined) Eden function spawn to eagerly and immediately instantiate a finite list of process abstractions with their corresponding inputs. Appropriate demand control is incorporated in this function. Neglecting demand control, spawn would be defined as follows:

spawn :: (Trans a, Trans b) ⇒ [Process a b] → [a] → [b]
-- definition without demand control
spawn = zipWith (#)
The variant spawnAt additionally locates the created processes on given processor elements (identified by their number).

spawnAt :: (Trans a, Trans b) ⇒ [Int] → [Process a b] → [a] → [b]
Example: A parallel variant of the function map :: (a→b) → [a] → [b], which creates a process for each application of the parameter function to an element of the input list, can be defined as follows:

parMap :: (Trans a, Trans b) ⇒ (a → b) → [a] → [b]
parMap f = spawn (repeat (process f))

The Haskell prelude function repeat :: a → [a] yields a list infinitely repeating the parameter element.
To model many-to-one communication, Eden provides a nondeterministic function merge that merges a list of streams into a single stream.
E
E
Eden
m e r g e :: T r a n s a ⇒
[[ a ]]
→
[a]
This function can, for example, be used by the master in a master–worker system to receive the results of the workers as a single stream in the time-dependent order in which they are provided. Although merge is of great worth, because it is the key to specify many reactive systems, one has to be aware that functional purity and its benefits are lost when merge is being used in a program. Fortunately, functional purity can be preserved in most portions of an Eden program. In particular, it is, for example, possible to use sorting in order to force a particular order of the results returned by a merge application and thus to encapsulate merge in a deterministic context. Example: A simple master–worker system with the following functionality can easily be defined in Eden as shown in Fig. . The master process distributes tasks to worker processes, which solve the tasks and return the results to the master: tasks
results master
masterWorker np pf f toWs
tag_map f worker 1
fromWs
...
tag_map f worker np
The Eden function masterWorker (evaluated by the “master” process) takes four parameters: np specifies the number of worker processes that will be spawned, prefetch determines how many tasks will initially be sent by the master to each worker process, the function f describes how tasks have to be solved by the workers, and the final parameter tasks is the list of tasks that have to be solved by the whole system. The auxiliary pure Haskell function distribute :: Int -> [a] -> [Int] -> [[a]] (code not shown) is used to distribute the tasks to the workers. Its first parameter determines the number of output lists, which become the input streams for the worker processes. The third parameter is the request list reqs which guides the task distribution. The same list is used to sort
the results according to the original task order (function orderBy :: [[b]] -> [Int] -> [b], code not shown). Initially, the master sends as many tasks as specified by the parameter prefetch in a round-robin manner to the workers (see definition of initReqs). Further tasks are sent to workers which have delivered a result. The list newReqs is extracted from the worker results tagged with the corresponding worker id which are merged according to the arrival of the worker results. This simple master–worker definition has the advantage that the tasks need not be numbered to reestablish the original task order on the results. Moreover, worker processes need not send explicit requests for new work together with the result values.
Defining Non-hierarchical Process Networks in Eden With the Eden constructs introduced up to now, communication channels are only established between parent and child processes during process creation. This results in purely hierarchical process topologies. Eden further provides functions to create and use explicit channel connections between arbitrary processes. These functions are rather low level and it is not easy to use them appropriately. Therefore, we will not show them here, but explain two more abstract approaches to define non-hierarchical process networks in Eden: remote data and graph specifications. Remote data of type a is represented by a handle of type RD a with the following interface functions []: release :: a -> RD a fetch :: RD a -> a. The function release yields a remote data handle that can be passed to other processes, which will in turn use the function fetch to access the remote data. The data transmission occurs automatically from the process that releases the data to the process which uses the handle to fetch the remote data. Example: We show a small example where the remote data concept is used to establish a direct channel connection between sibling processes. Given functions f and g, one can calculate (g ○ f) a in parallel creating a process for each function.
Eden
E
m a s t e r W o r k e r :: ( T r a n s a , T r a n s b ) ⇒ Int → Int → ( a → b ) → [ a ] → [ b ] m a s t e r W o r k e r np p r e f e t c h f t a s k s = o r d e r B y f r o m W s reqs where fromWs = s p a w n w o r k e r s toWs workers = [ p r o c e s s ( t a g _ m a p f ) ∣ n ← [1.. np ]] toWs = d i s t r i b u t e np t a s k s reqs newReqs = m e r g e [[ i ∣ r ← rs ] ∣ (i , rs ) ← zip [1.. np ] f r o m W s ] reqs = i n i t R e q s ++ n e w R e q s i n i t R e q s = c o n c a t ( r e p l i c a t e p r e f e t c h [1.. np ])
Eden. Fig. Eden definition of a simple master–worker system
main
Eden. Fig. A simple process graph
Simply replacing the function calls by process instantiations (process g # (process f # inp))
leads to the process network in Fig. a where the process evaluating the above expression is called main. Process main instantiates a first child process calculating g, thereby creating a concurrent thread to evaluate the input for the new child process. This thread instantiates the second child process for calculating f. It receives the remotely calculated result of the f process and passes it to the g process. The drawback of this approach is that the result of the f process will not be sent directly to the g process. This causes unnecessary communication costs. In the second implementation, we use remote data to establish a direct channel connection between the child processes: process (g o fetch) # (process (release of) # inp) It uses function release to produce a handle of type RD a for data of type a. Calling fetch with remote data returns the value released before. The use of remote
data leads to a direct communication of the actual data between the processes evaluating f and g (see Fig. b). The remote data handle is treated like the original data in the first version, that is, it is passed via the main process from the process computing f to the one computing g. ⊲ The remote data concept enables the programmer to easily build complex topologies by combining simpler ones [].
Grace
(Graph-based communication in Eden) is a library that allows a programmer to specify a network of processes as a graph, where the graph nodes represent processes and the edges of communication channels []. The graph is described as a Haskell data structure ProcessNetwork a, where a is the type of the result computed by the process network. A function build is used to define a process network, while a function start instantiates such a network and automatically sets up the corresponding process topology, that is, the processes are created and the necessary communication channels are installed.
E
E
Eden
b u i l d :: f o r a l l f a g r p n e . -- type c o n t e x t ( Placeable f a g r p, Ord e , Eq n ) ⇒ -- main node (n , f ) → [ Node n ] → -- node list [ Edge n e ] → -- edge list ProcessNetwork r s t a r t :: ( T r a n s a ) ⇒ ProcessNetwork a → a
The function build transforms a graph specification into a value of type ProcessNetwork r. The graph is defined by a list of nodes of type [Node n] and a list of edges of type [Edge n e]. Type variables n and e represent the types of node and edge labels, respectively. To allow functions with different types to be associated with graph nodes, the type f of node functions is existentially quantified and only explicitly given for the main node, because the function placed on the main node determines the result type r of the process network. This is also the reason why the main node is not a member of the node list but provided as a separate parameter. The function build uses multiparameter classes with functional dependencies and explicit quantification to achieve flexibility in the specification of graphs, for example, by placing arbitrary functions on nodes and nevertheless using standard Haskell lists to represent node lists. The type context Ord e and Eq n ensures that edges can be ordered by their label and that nodes can be identified by their label. Placeable is a multiparameter type class with dependent types, which is used by the Grace implementation to partition user supplied function types into their parts, for example,
parameter and result types. This is needed to create individual channels for these parts. Suitable instances will be derived automatically for every possible function. In the type context Placeable f a g r p the function’s type f determines the other types: a is the type of the function’s first argument, g is the remaining part of the function’s type without the first argument. The final result type of the function after applying all parameters is r. Finally, p is a type level list of all the parameters. The main benefit of Grace is a clean separation between coordination and computation. The network specification encapsulates the coordination aspects. The graph nodes are annotated with functions describing the computations of the corresponding processes. Example: We consider again a simple process network which is similar to the one in Fig. b and show its specification in Grace. The behavior of the three processes is specified by functions f, g, and root. The communication topology is defined by a graph consisting of three nodes (a root node and two additional nodes) and three edges. The specification is shown in Fig. . ⊲
Algorithmic Skeletons Algorithmic skeletons [] capture common patterns of parallel evaluation like task farms, pipelines, divide-and-conquer schemes, etc. The application programmer only needs to instantiate a skeleton appropriately, thereby concentrating on the problem-specific matters and relying on the skeleton with respect to all parallel details. The small introductory example functions parMap and masterWorker shown above are examples of skeleton definitions in Eden. In Eden and other parallel functional languages, skeletons are no more than polymorphic higher-order functions which can be applied
example = start network
  where
    -- fct. spec. of process behaviour
    f, g, root :: (Trans a) => [a] -> [a]
    ...
    -- process graph specification
    network = build ("root", root) nodes edges
    nodes   = [N "nd_f" f, N "nd_g" g]
    edges   = [E "nd_f" "nd_g" 0 nothing,
               E "nd_g" "root" 0 nothing,
               E "root" "nd_f" 0 nothing]
Eden. Fig. Grace specification of a simple process graph
with different types and parameters. Thus, programming with skeletons follows the same principle as programming with higher-order functions. An important point is that skeletons can be both used and implemented in Eden. In other skeletal approaches, the creation of new skeletons is considered a system programming task, or even a compiler construction task. Skeletons are implemented using imperative languages and/or parallel libraries. Such systems therefore offer a closed collection of skeletons which the application programmer can use, but without the possibility of creating new ones, so that adding a new skeleton usually implies considerable effort. The Eden skeleton library contains a large collection of different skeletons. Various kinds of skeleton definitions in Eden, together with cost models to predict the execution time of skeleton instantiations, have been presented in a book chapter []. Parallel map implementations have been analyzed in []. An Eden implementation of the large-scale map-and-reduce programming model proposed by Google [] has been investigated in []. Hierarchical master–worker schemes with several layers of masters and submasters can elegantly be created by nesting a simple single-layer master–worker scheme. More elaborate hierarchical master–worker systems have been discussed in []. Example: (A regular fixed-branching divide-and-conquer skeleton) As a nontrivial example of a skeleton definition, we present a dynamically unfolding divide-and-conquer skeleton []. The essence of a divide-and-conquer algorithm is to decide whether the input is trivial and, in this case, to solve it, or else to decompose nontrivial input into a number of subproblems, which are solved recursively, and to combine the output. A general skeleton takes parameter functions for this functionality, as shown here:
type DivideConquer a b =
       (a → Bool)  →   -- trivial?
       (a → b)     →   -- solve
       (a → [a])   →   -- split
       ([b] → b)   →   -- combine
       a → b           -- input / result
The resulting structure is a tree of task nodes in which the successor nodes are the subproblems and the leaves represent trivial tasks. A fundamental Eden skeleton which specifies a general divide-and-conquer algorithm structure can be found in []. Here, we show a version where every nontrivial task is split into a fixed number of subtasks. Moreover, the process tree is created in a distributed fashion: one of the tree branches is processed locally, while the others are instantiated as new processes, as long as PEs are available. These branches will recursively produce new parallel subtasks, resulting in a distributed expansion of the computation. In the following binary tree of task nodes, the boxes indicate which task nodes will be evaluated by the eight processes:
In this setting, explicit placement of processes is essential to ensure that processes are not placed on the same processor element (PE) while others remain unused. In [], this kind of skeleton is compared with a flat expansion version, where the main process unfolds the tree up to a given depth, usually with more branches than available PEs. The resulting subtrees can then be evaluated by a farm of parallel worker processes, and the main process combines the results of the subprocesses. A master–worker system (as shown above) can be used to implement this version. Figure shows the distributed expansion divide-and-conquer skeleton for k-ary task trees. Besides the standard parameter functions, the skeleton takes the branching degree and a ticket list with PE numbers to place newly created processes. The left-most branch of the task tree is solved locally; the other branches are instantiated using the Eden function spawnAt, which instantiates a collection of processes (given as a list) with respective inputs on explicitly specified PEs. Results are combined by the combine function. Explicit Placement via Tickets: The ticket list is used to control the placement of newly created processes. First, the PE numbers for placing the immediate child
dcN :: (Trans a, Trans b) ⇒ Int → [Int] →   -- branch degree / tickets
       DivideConquer a b
dcN k tickets trivial solve split combine x
  | null tickets = seqDC x
  | trivial x    = solve x
  | otherwise    = ...                        -- code for demand control omitted
                   combine (myRes : childRes ++ localRess)
  where
    -- sequential computation
    seqDC x = if trivial x then solve x else combine (map seqDC (split x))
    -- child process generation
    childRes   = spawnAt childTickets childProcs procIns
    childProcs = map (process ◦ rec_dcN) theirTs
    rec_dcN ts = dcN k ts trivial solve split combine
    -- ticket distribution
    (childTickets, restTickets) = splitAt (k-1) tickets
    (myTs : theirTs)            = unshuffle k restTickets
    -- input splitting
    (myIn : theirIn)    = split x
    (procIns, localIns) = splitAt (length childTickets) theirIn
    -- local computations
    myRes     = ticketF myTs myIn
    localRess = map seqDC localIns
Eden. Fig. Distributed expansion divide and conquer skeleton for k-ary task trees
processes are taken from the ticket list. Then, the remaining tickets are distributed to the children in a round-robin manner using the function unshuffle :: Int -> [a] -> [[a]], which unshuffles a given list into as many lists as the first parameter specifies. Child computations are performed locally when no more tickets are available. The explicit process placement via ticket lists is a simple and flexible way to control the distribution of processes as well as the recursive unfolding of the task tree. If too few tickets are available, computations are performed locally. Duplicate tickets can be used to allocate several child processes on the same PE. ⊲
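For illustration, a minimal round-robin unshuffle with the behavior described above might be written as follows; this is a sketch rather than the Eden library's actual definition, and the helper every is an assumed name.

-- Distribute a list round-robin over n sublists, e.g.,
-- unshuffle 3 [1..8] == [[1,4,7],[2,5,8],[3,6]]
unshuffle :: Int -> [a] -> [[a]]
unshuffle n xs = [ every n (drop i xs) | i <- [0 .. n-1] ]
  where
    every _ []     = []
    every k (y:ys) = y : every k (drop (k-1) ys)

With k = 3 and restTickets = [2..10], for example, the three children of the root would receive the ticket lists [2,5,8], [3,6,9], and [4,7,10] (after the skeleton has reserved its own share), which is exactly the round-robin distribution described above.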
Implementation Eden’s implementation follows a microkernel approach. The kernel implements a few general control mechanisms and provides them as primitive operations (see Fig. ). The more complex high-level language constructs are implemented in libraries using these primitive base constructs. Apart from simplifying the implementation, this approach has important advantages with respect to the productivity and maintainability of the system.
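As an illustration of this layering, library-level skeletons can be expressed entirely in terms of the lower-level process abstraction, which is itself implemented on the kernel primitives. The following naive parMap sketch (the name naiveParMap is hypothetical) uses only the process abstraction and the instantiation operator (#) discussed above; it ignores the demand-control details of the real library definition and is therefore only a sketch.

-- One process per list element; the Trans contexts are indicative.
naiveParMap :: (Trans a, Trans b) => (a -> b) -> [a] -> [b]
naiveParMap f = map (\x -> process f # x)

In the actual Eden library, additional eagerness ensures that all processes are created immediately; the sketch above relies on the caller demanding the result list.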
Details on Eden’s implementation, especially on primitives provided by the interface of the parallel runtime system (PRTS) and the implementation of the Eden module can be found in [, ].
EdenTV – The Eden Trace Viewer Tool The Eden trace viewer tool (EdenTV) [] provides a post-mortem analysis of program executions on the level of the computational units of the PRTS. The latter is instrumented with special trace generation commands activated by a runtime option. In the space-time diagrams generated by EdenTV, machines (i.e., processor elements), processes, and threads are represented by horizontal bars, with time on the x-axis. The diagram bars have segments in different colors, which indicate the activities of the respective logical unit during a period of the execution. Bars are grey when the logical unit is running, light grey when it is runnable but currently not running, and dark grey when the unit is blocked. Messages between processes or machines are optionally shown by grey arrows which start from the sending unit bar and point at the receiving unit bar. The representation of messages is very important for programmers, since they can observe
Eden. Fig. Implementation layers (Eden program – Eden module and Haskell libraries at the language level – Eden primitives in the kernel – sequential and parallel RTS on suitable middleware at the system level)
Zoom of the final communication phase of the hyperquicksort algorithm when sorting 5M elements on 17 machines of a Beowulf cluster. Messages are shown as black arrows.
Eden. Fig. Trace of hypercube communication topology of dimension
hot spots and inefficiencies in the communication during the execution, as well as check communication topologies. Figure shows, in grayscale (light grey = yellow, grey = green, dark grey = red), the trace of a recursive parallel quicksort in a hypercube of dimension with message traffic, exposing the well-known butterfly communication pattern imposed by the hypercube structure.
Future Directions Lines of future study include the following: ●
It is planned for the future to maintain a common parallel runtime environment for Eden, Glasgow parallel Haskell (GpH), and other parallel Haskells. Such a common environment is highly desirable and
would improve the availability and acceptance of parallel Haskells. ● Recent results [] show that the Eden system, although tailored for distributed memory machines, behaves equally well on workstation clusters and on multi-core machines. It needs to be investigated whether the Eden system can be optimized for multi-core machines by exploiting the shared memory and thereby avoiding unnecessary communications. ● We plan to design and investigate further skeletons. The flexibility of defining individual skeletons for several algorithm classes has not been completely exploited yet. Much more experience in the design and use of skeletons is necessary. ●
Recent studies on remote data and the Grace system have shown that high-level parallel programming notations considerably simplify parallel programming. More
research is necessary to find even more abstract parallel programming constructs. ● Supporting tools like profiling, analysis, visualization, and debugging tools are essential for real-world parallel program development. Much more work is needed to further develop existing tools, and to design new powerful tools that help programmers analyze and tune parallel programs.
Glasgow parallel Haskell (GpH) is another parallel Haskell extension that is less explicit about parallelism than Eden. It provides a simple parallel combinator par to annotate sub-expressions that should be evaluated in parallel if free processing elements are available. In contrast to Eden, GpH assumes a global shared memory (at least virtually), and the parallel runtime system decides whether an annotated sub-expression will be evaluated in parallel or not. NESL is a strict data-parallel functional language that supports nested parallel array comprehensions. Parallelism is hidden and concentrated in data-parallel operations. SISAL (Streams and Iteration in a Single Assignment Language) and MultiLisp are early strict parallel functional languages. SISAL is a first-order functional language with implicit parallelism, a rich set of array operations, and stream support. It has been designed for numerical applications. MultiLisp extends the LISP dialect Scheme with explicit parallel constructs, in particular so-called futures. Futures initiate parallel evaluations without forcing the main computation to wait for the result. In this respect, futures lead to a special form of laziness within a strict computation language. Concurrent ML extends the strict functional language SML (Standard ML) with concurrency primitives like thread creation and synchronous message passing over explicit channels. Blocking send and receive functions are provided for channel communication. First-class synchronous operations, called events, however, make it possible to hide channels and complex communication and synchronization protocols behind appropriate abstractions. In contrast to Eden and other parallel functional languages, Concurrent ML has been designed for concurrent programming, that is, with a focus on structuring software and not with the goal of speeding up computations. Recently, however, a shared-memory parallel implementation has been developed []. Manticore is a strict parallel functional language that supports different kinds of parallelism. In particular, it combines explicit concurrency in the style of Concurrent ML with various implicitly parallel constructs which provide fine-grain data parallelism.
Related Entries
Concurrent ML
Functional Languages
Glasgow Parallel Haskell (GpH)
Multilisp
NESL
Parallel Skeletons
Sisal
Bibliographic Notes and Further Reading The seminal book on research directions in parallel functional programming [] edited by Hammond and Michaelson covers not only fundamental issues but also provides summaries of selected research areas. Various parallel and distributed variants of the functional programming language Haskell are discussed in a journal paper by Trinder et al. []. A comprehensive overview of patterns and skeletons for parallel and distributed programming can be found in a book edited by Gorlatch and Rabhi []. The three parallel functional languages Glasgow parallel Haskell, PMLS (a parallel version of ML), and Eden have been compared with respect to programming methodology and performance in []. Comprehensive and up-to-date information on Eden is provided on its web site http://www.mathematik.uni-marburg.de/∼eden. Basic information on its design, semantics, and implementation as well as the underlying programming methodology can be found in []. Details on the parallel runtime system and Eden's implementation concept can best be found in [, ]. The technique of layered parallel runtime environments has been further developed and generalized by Berthold et al. [, ]. The Eden trace viewer tool EdenTV is available on Eden's web site. A short introductory description is given in []. Another tool for analyzing the behavior of Eden programs has been developed by de la Encina et al. [, ]
by extending the tool Hood (Haskell Object Observation Debugger) for Eden. Extensive work has been done on skeletal programming in Eden. An overview of various skeleton types has been presented as a chapter in the aforementioned book edited by Gorlatch and Rabhi []. Definitions and applications of specific skeletons can, for example, be found in the following papers: parallel map [], topology skeletons [], adaptive skeletons [], Google map-reduce [], hierarchical master–worker systems [], divide-and-conquer schemes []. Special skeletons for computer algebra algorithms are being developed with the goal of defining the kernel of a computer algebra system in Eden []. An operational and a denotational semantics for Eden have been defined by Ortega-Mallén and Hidalgo-Herrero [, ]. These semantics have been used to analyze Eden skeletons [, ]. A non-determinism analysis has been presented by Segura and Peña []. Eden is being used intensively in teaching parallel programming and algorithms and as a platform for investigating high-level parallel programming concepts, techniques, methodology, and parallel runtime systems. Up to now, its main use in practice has been the implementation of parallel computer algebra components [, ].
Acknowledgments The author is grateful to Yolanda Ortega-Mallén, Jost Berthold, Mischa Dieterle, Thomas Horstmeyer, and Oleg Lobachev for their helpful comments on previous versions of this essay.
Bibliography . Berthold J () Towards a generalised runtime environment for parallel Haskells. Computational Science – ICCS’, LNCS . Springer (Workshop on practical aspects of High-level parallel programming – PAPP ) . Berthold J, Dieterle M, Lobachev O, Loogen R () Distributed memory programming on many-cores - a case study using Eden divide-&-conquer skeletons. ARCS , Workshop on ManyCores. VDE Verlag . Berthold J, Dieterle M, Loogen R () Implementing parallel google map-reduce in Eden. Europar’, LNCS , Springer, pp – . Berthold J, Dieterle M, Loogen R, Priebe S () Hierarchical master–worker skeletons. Practical Aspects of Declarative Languages (PADL ), LNCS , Springer, pp –
. Berthold J, Klusik U, Loogen R, Priebe S, Weskamp N () High-level process control in Eden. EuroPar – Parallel Processing, LNCS , Springer, pp – . Berthold J, Loogen R () Skeletons for recursively unfolding process topologies. Parallel Computing: Current & Future Issues of High-End Computing, ParCo , NIC Series, vol , pp – . Berthold J, Loogen R () Parallel coordination made explicit in a functional setting. Implementation and Application of Functional Languages (IFL ), Selected Papers, LNCS , Springer, pp – (awarded best paper of IFL’) . Berthold J, Loogen R () Visualizing parallel functional program runs – case studies with the Eden trace viewer. Parallel Computing: Architectures, Algorithms and Applications, ParCo , NIC Series, vol , pp – . Berthold J, Marlow S, Zain AA, Hammond K () Comparing and optimising parallel Haskell implementations on multicore. rd International Workshop on Advanced Distributed and Parallel Network Applications (ADPNA-). IEEE Computer Society . Berthold J, Zain AA, Loidl H-W () Scheduling light-weight parallelism in ArtCoP. Practical Aspects of Declarative Languages (PADL ), LNCS , Springer, pp – . Cole M () Algorithmic skeletons: structured management of parallel computation. MIT Press, Cambridge . Dean J, Ghemawat S () MapReduce: simplified data processing on large clusters. Commun ACM ():– . Dieterle M, Horstmeyer T, Loogen R () Skeleton composition using remote data. Practical Aspects of Declarative Programming (PADL ), LNCS , Springer, pp – . Encina A, Llana L, Rubio F, Hidalgo-Herrero M () Observing intermediate structures in a parallel lazy functional language. Principles and Practice of Declarative Programming (PPDP ), ACM, pp – . Encina A, Rodríguez I, Rubio F () pHood: a tool to analyze parallel functional programs. Implementation of Functional Languages (IFL’), Seton Hall University, New York, pp –. Technical Report, SHU-TR-CS--- . Hammond K, Berthold J, Loogen R () Automatic skeletons in template Haskell. Parallel Process Lett ():– . Hammond K, Michaelson G (eds) () Research directions in parallel functional programming. Springer, Heidelberg . Hidalgo-Herrero M, Ortega-Mallén Y () An operational semantics for the parallel language Eden. Parallel Process Lett ():– . Hidalgo-Herrero M, Ortega-Mallén Y () Continuation semantics for parallel Haskell dialects. Asian Symposium on Programming Languages and Systems (APLAS ), LNCS , Springer, pp – . Hidalgo-Herrero M, Ortega-Mallén Y, Rubio F () Analyzing the influence of mixed evaluation on the performance of Eden skeletons. Parallel Comput (–):– . Hidalgo-Herrero M, Ortega-Mallén Y, Rubio F () Comparing alternative evaluation strategies for stream-based parallel functional languages. Implementation and Application of Functional
Languages (IFL ), Selected Papers, LNCS , Springer, pp – Horstmeyer T, Loogen R () Graph-based communication in Eden. In: Trends in Functional Programming, vol . Intellect Klusik U, Loogen R, Priebe S, Rubio F () Implementation skeletons in Eden – low-effort parallel programming. Implementation of Functional Languages (IFL ), Selected Papers, LNCS , Springer, pp – Lobachev O, Loogen R () Towards an implementation of a computer algebra system in a functional language. AISC/Calculemus/MKM , LNAI , pp – Loidl H-W, Rubio Diez F, Scaife N, Hammond K, Klusik U, Loogen R, Michaelson G, Horiguchi S, Pena Mari R, Priebe S, Portillo AR, Trinder P () Comparing parallel functional languages: programming and performance. Higher-Order Symbolic Computation ():– Loogen R, Ortega-Mallén Y, Peña R, Priebe S, Rubio F () Parallelism abstractions in Eden. In [], chapter , Springer, pp – Loogen R, Ortega-Mallén Y, Peña-Marí R () Parallel functional programming in Eden. J Funct Prog ():– Peña R, Segura C () Non-determinism analysis in a parallelfunctional language. Implementation of Functional Languages (IFL ), LNCS , Springer Rabhi FA, Gorlatch S (eds) Patterns and skeletons for parallel and distributed computing. Springer, Heidelberg Reppy J, Russo CV, Yiao Y () Parallel concurrent ML. International Conference on Functional Programming (ICFP) , ACM The GHC Developer Team. The Glasgow Haskell Compiler. http://www.haskell.org/ghc Trinder P, Hammond K, Loidl H-W, Peyton Jones S () Algorithm + strategy = parallelism. J Funct Prog ():– Trinder PW, Loidl HW, Pointon RF () Parallel and distributed Haskells. J Funct Prog ( & ):– Zain AA, Berthold J, Hammond K, Trinder PW () Orchestrating production computer algebra components into portable parallel programs. Open Source Grid and Cluster Conference (OSGCC)
Eigenvalue and Singular-Value Problems
Bernard Philippe, Ahmed Sameh
Campus de Beaulieu, Rennes, France
Purdue University, West Lafayette, IN, USA

Definition
Given a matrix A ∈ R^{n×n} or A ∈ C^{n×n}, an eigenvalue problem consists of computing eigen-elements of A, which are:
either eigenvalues: these are the roots λ of the nth-degree characteristic polynomial det(A − λI) = 0. () The set of all eigenvalues is called the spectrum of A: Λ(A) = {λ1, . . . , λn}. Eigenvalues can be real or complex, whether the matrix is real or complex.
or eigenpairs: in addition to an eigenvalue λ, one seeks the corresponding eigenvector x ∈ C^n − {0}, which is a solution of the singular system (A − λI)x = 0. ()
When the matrix A is real symmetric or complex Hermitian, the eigenvalues are real.
Given a matrix A ∈ R^{m×n} (or A ∈ C^{m×n}), the singular-value decomposition (SVD) of A consists of computing orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n} (or unitary matrices U ∈ C^{m×m} and V ∈ C^{n×n}) such that the m × n matrix Σ̃ = U^T A V () (or Σ̃ = U^H A V) has zero entries except for its leading p × p real diagonal submatrix Σ = diag(σ1, . . . , σp) with σ1 ≥ . . . ≥ σp ≥ 0. The first p columns of U = [u1, . . . , um] and of V = [v1, . . . , vn] are called the left and right singular vectors, respectively. The singular triplets are (σi, ui, vi) for i = 1, . . . , p.

Discussion
Mathematical Fundamentals
Eigenproblems. Equation () has n complex roots, some
of them possibly multiple. For every eigenvalue λ ∈ C, the dimension of the eigenspace Xλ = ker(A − λI) is at least 1. The algebraic multiplicity of λ (its multiplicity as a root of ()) is larger than or equal to its geometric multiplicity (the dimension of Xλ). When all the geometric multiplicities are equal to the algebraic ones, a basis X = [x1, . . . , xn] ∈ C^{n×n} of eigenvectors (say right eigenvectors) can be built by concatenating bases of the invariant subspaces Xλ where λ ∈ Λ(A). In such a situation, the matrix A is said to be diagonalizable, since by expressing the same operator in the basis X,
D = X^{-1} A X = diag(λ1, . . . , λn). ()
By denoting X^{-H} = Y = [y1, . . . , yn], Eq. () becomes D = Y^H A X with Y^H X = I. For i = 1, . . . , n, the vector yi is the left eigenvector corresponding to the eigenvalue λi. The well-known Jordan normal form generalizes the diagonalization when Eq. () does not hold. However, the problem of determining such a decomposition is ill-posed. Transforming it into a well-posed problem can be done by proving the existence of matrices with an assigned block structure in a neighborhood of the matrix. For more details see Kagström and Ruhe's contribution in []. A decomposition which is numerically computable is the Schur decomposition, in which a unitary matrix U exists such that the complex matrix
T = U^H A U =
⎛ λ1  τ12  ⋯   τ1n     ⎞
⎜      ⋱    ⋱    ⋮     ⎟
⎜            ⋱  τn−1,n ⎟
⎝                 λn    ⎠   ()
is upper triangular. Eigenvectors can then be computed from T. When the matrix A is Hermitian – or real symmetric – so is the matrix T, which becomes diagonal and real. The symmetric eigenvalue problem deserves a lot of attention since it arises frequently in practice. When the matrix A is real unsymmetric, pairs of complex conjugate eigenvalues may exist. In order to avoid complex arithmetic in that case, the real Schur decomposition extends the Schur decomposition by allowing 2 × 2 diagonal blocks with pairs of conjugate eigenvalues.
SVD problems. Since the complex case is similar to the real one, and since the SVD of A^T is obviously obtained from the SVD of A, the study is restricted to the situation A ∈ R^{m×n} with m ≤ n. Let A have the singular-value decomposition (); then the symmetric matrix
B = A^T A ∈ R^{n×n}, ()
has eigenvalues σ1² ≥ . . . ≥ σn² ≥ 0, with corresponding eigenvectors (vi), (i = 1, . . . , n). B is called the normal matrix, while the augmented symmetric matrix
Aaug = ⎛ 0    A ⎞   ()
       ⎝ A^T  0 ⎠
has eigenvalues ±σ1, . . . , ±σn, corresponding to the eigenvectors
(1/√2) ⎛ ui  ⎞ ,   i = 1, . . . , n.
       ⎝ ±vi ⎠
Every method for computing singular values of A is based on computing the eigenpairs of B or Aaug.

The Symmetric Eigenvalue Problem
Three classes of methods are available for solving problems () and () when A is symmetric:
Jacobi iterations: The oldest method, introduced by Jacobi in [], consists of annihilating successively off-diagonal entries of the matrix via orthogonal similarity transformations: A1 = A and A_{k+1} = R_k^T A_k R_k, where R_k is a plane rotation. The scheme is organized into sweeps of n(n−1)/2 rotations to annihilate every off-diagonal pair of symmetric entries once. One sweep involves O(n³) operations when symmetry is exploited in the computation. The method was abandoned due to its high computational cost but has been revived with the advent of parallelism. However, depending on the parallel architecture, it remains more expensive than other solvers.
QR iterations: The QR algorithm relies on the following iteration:
A1 = A and Q1 = I
For k ≥ 1,
  (Q_{k+1}, R_{k+1}) = qr(A_k)   (QR factorization)
  A_{k+1} = R_{k+1} Q_{k+1},   ()
where the QR factorization is defined in [linear least squares and orthogonalization 12]. Under weak assumptions, the matrix A_k approaches a diagonal matrix for sufficiently large k. A direct implementation of () would involve O(n³) arithmetic operations at each iteration. If,
however, one obtains the tridiagonal matrix T = Q^T A Q via orthogonal similarity transformations at a cost of O(n³) arithmetic operations, the number of arithmetic operations of a QR iteration is reduced to O(n²), or even O(n) if only the eigenvalues are needed. To accelerate convergence, a shifted version of the QR algorithm is given by
T1 = T and Q1 = I
For k ≥ 1,
  (Q_{k+1}, R_{k+1}) = qr(T_k − μ_k I)   (QR factorization)
  T_{k+1} = R_{k+1} Q_{k+1} + μ_k I,   ()
where μ_k is an approximation of the smallest eigenvalue of T_k.
Sturm sequence evaluation: The Sturm sequence in λ ∈ R is defined by p_0(λ) = 1 and, for k = 1, . . . , n, by p_k(λ) = det(A_k − λI_k), where A_k and I_k are the principal submatrices of order k of A and of the identity matrix. The number N(λ) of sign changes in the sequence is equal to the number of eigenvalues of A which are smaller than λ. N(λ) can also be computed as the number of negative diagonal entries of the matrix U in the LU factorization of the matrix A − λI = LU. As for the QR iteration, the algorithm is applied to the tridiagonal matrix T, so that each Sturm sequence computation has complexity O(n). Sturm sequences allow one to partition a given interval [a, b] as desired to obtain each of the eigenvalues in [a, b] to a desired accuracy. When needed, the eigenvectors are then obtained by inverse iteration. Many references exist which describe the theory underlying the three methods outlined above, e.g., see [].
Parallel Jacobi algorithms. A parallel version of the cyclic Jacobi algorithm was given by Sameh []. It is obtained by the simultaneous annihilation of several off-diagonal elements by a given orthogonal matrix U_k, rather than only one rotation as is done in the serial version. For example, let A be of order 8 (see Table ) and consider the orthogonal matrix U_k as the direct sum of four independent plane rotations simultaneously determined. An example of such a matrix is
Uk = Rk (, ) ⊕ Rk (, ) ⊕ Rk (, ) ⊕ Rk (, ),
Eigenvalue and Singular-Value Problems. Table Annihilation scheme as in [] (first regime)
where R_k(i, j) is the rotation which annihilates the (i, j) off-diagonal element (⊕ indicates that the rotations are assembled in a single matrix and extended to order n by the identity). Let one sweep be the collection of such orthogonal similarity transformations that annihilate the element in each of the n(n − 1)/2 off-diagonal positions (above the main diagonal) only once; then, for a matrix of order 8, the first sweep will consist of seven successive orthogonal transformations, with each one annihilating distinct groups of at most four elements simultaneously, as described in Table . For the remaining sweeps, the structure of each subsequent transformation U_k, k > 7, is chosen to be the same as that of U_j where j = 1 + (k − 1) mod 7. In general, the most efficient annihilation scheme consists of (2r − 1) similarity transformations per sweep, where r = ⌊(n + 1)/2⌋, in which each transformation annihilates ⌊n/2⌋ different off-diagonal elements (see []). Several other annihilation schemes are possible which are based on round-robin techniques. Luk and Park [] have demonstrated that various parallel Jacobi rotation ordering schemes are equivalent to the sequential row-ordering scheme, and hence share the same convergence properties. The above algorithms are well suited for shared memory computers. While they can also be implemented on distributed memory systems, their efficiency on such systems may suffer due to communication costs. In order to increase the granularity of the computation (i.e., to increase the number of floating point operations between two communications), block algorithms are considered. A parallel Block-Jacobi algorithm is given by Gimenez et al. in []. This algorithm takes advantage of the symmetry of the matrix.
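For reference (added here, not part of the original text), the angle of the plane rotation R_k(i, j) that annihilates the symmetric pair a_{ij} = a_{ji} of the current iterate is given by the classical Jacobi condition:

\tan(2\theta_{ij}) \;=\; \frac{2\,a_{ij}}{a_{ii} - a_{jj}}, \qquad |\theta_{ij}| \le \frac{\pi}{4} \quad (\theta_{ij} = \pi/4 \text{ when } a_{ii} = a_{jj}).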
Tridiagonalization of a symmetric matrix by Householder reductions. The goal is to compute a symmetric tridiagonal matrix T, orthogonally similar to A: T = Q^T A Q, where Q = H_1 ⋯ H_{n−2} combines Householder reductions. These transformations are defined in [linear least square and orthogonalization 12]. The transformation H_1 is chosen such that n − 2 zeros are introduced in the first column of H_1 A, the first row being unchanged:
H_1 A =
⎛ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⎞
⎜ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
⎜ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
⎜ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
⎜ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
⎝ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎠ .
Therefore, by symmetry, applying H_1 on the right results in the following pattern:
A_2 = H_1 A H_1 =
⎛ ⋆ ⋆ 0 0 0 0 ⎞
⎜ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
⎜ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
⎜ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
⎜ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
⎝ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎠ .  ()
In order to take advantage of the symmetry, the reduction A_2 = H_1 A H_1 can be implemented by a symmetric rank-two update, which is a routine of the Level 2 BLAS []. The tridiagonal matrix T is obtained by successively repeating similar reductions on columns 2 to n − 2. The total procedure involves 4n³/3 + O(n²) arithmetic operations. Assembling the matrix Q = H_1 . . . H_{n−2} requires 4n³/3 + O(n²) additional operations. As indicated in [linear least square and orthogonalization 12], successive applications of Householder reductions can be compacted into blocks which allow the use of routines of the Level 3 BLAS []. Such a blocked reduction is implemented in routine DSYTRD of LAPACK [], which takes advantage of the symmetry of the process.
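For completeness (added for clarity; not part of the original entry), the symmetric rank-two form of the update can be written explicitly. With the Householder reflection H_1 = I − βvv^T, β = 2/(v^T v), applied to the symmetric matrix A, a standard derivation gives

p = \beta A v, \qquad w = p - \tfrac{\beta}{2}\,(p^{T}v)\,v, \qquad H_1 A H_1 = A - v\,w^{T} - w\,v^{T},

which is exactly the rank-two update performed by the Level 2 BLAS routine mentioned above.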
A parallel version of the algorithm is implemented in routine PDSYTRD of ScaLAPACK []. Parallel QR for the symmetric tridiagonal eigenvalue problem. The QR scheme is the method of choice when
all the eigenvalues and the eigenvectors are sought. A divide-and-conquer algorithm was introduced by Dongarra and Sorensen in []. It consists of partitioning the tridiagonal matrix by rank-one tearings. A tridiagonal matrix T of order n can be partitioned as follows:
T = ⎛ T1  0  ⎞ + ρ f f^T ,
    ⎝ 0   T2 ⎠
where T1 and T2 are two tridiagonal matrices of order n1 and n2 such that n = n1 + n2, and where f is a vector with zero entries except in positions n1 and n1 + 1. If the Schur decompositions T1 = Q1 D1 Q1^T and T2 = Q2 D2 Q2^T are already known, the problem is reduced to computing the eigenvalues of D̃ = Q^T T Q = D + ρ z z^T, where
Q = ⎛ Q1  0  ⎞ ,   D = ⎛ D1  0  ⎞ ,   and   z = Q^T f .
    ⎝ 0   Q2 ⎠         ⎝ 0   D2 ⎠
The eigenvalues of T are interleaved with the diagonal entries of D and satisfy a simple secular equation obtained from the characteristic polynomial of D̃. The eigenvectors are then explicitly known, but they must be expressed back in the original basis by pre-multiplying them by Q. The back transformation may even include the initial orthogonal transformation which reduces the original symmetric matrix to tridiagonal form. By p recursive tearings, the problem is reduced to 2^p independent eigenvalue problems of order m = n/2^p. Then, the technique described above is recursively applied in parallel to update the eigendecompositions. The total number of arithmetic operations is O(n³).
Parallel Sturm sequences. Sturm sequences are often used when only part of the spectrum, in an interval [a, b], is sought. Several authors have discussed the parallel computation of Sturm sequences for a tridiagonal matrix T, among them Lo et al. in []. It is hard to parallelize the LU factorization of (T − λI), which provides the number of sign changes in the Sturm sequences, because the diagonal entries of U are obtained by a nonlinear recurrence. One
possible approach is to consider the linear three-term recursion which computes the Sturm sequence terms p_k(λ) = det(T_k − λI_k), for k = 1, . . . , n. Since this computation corresponds to solving a triangular banded system of three diagonals, it can be parallelized, for instance, by the algorithm described by Chen et al. in []. An analysis of the situation is given in []. Because the sequence (p_k(λ))_{k=0,...,n} could suffer from overflow or underflow, Zhang proposed the computation of a scaled sequence in [] and provided the backward error of the sequence. To parallelize the computation, a more beneficial approach is the simultaneous computation of Sturm sequences, replacing bisection of intervals by multisection. However, multisections are efficient only when most of the created subintervals contain eigenvalues. Therefore, in [] a strategy in two steps is proposed: (1) isolating all the eigenvalues in disjoint intervals, and (2) extracting each eigenvalue from its interval. Multisections are used for step (1), and bisections or other root finders are used for step (2). This approach proved to be very efficient. When the eigenvectors are needed, they are computed independently by inverse iteration. A difficulty could arise if one wishes to compute all the eigenvectors corresponding to a cluster of very poorly separated eigenvalues. Demmel et al. discussed in [] the reliability of the Sturm sequence computation in floating point arithmetic, where the sequence is no longer monotonic. Therefore, in very rare situations, it is possible to obtain a wrong answer with regular bisection. A robust algorithm, called MRRR, developed by Dhillon et al. [], is implemented in LAPACK as routine DSTEMR for the computation of high-quality eigenvectors. Several approaches are discussed in [] for possible parallel implementations of the scheme in DSTEMR.
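To make the three-term recursion mentioned above concrete (this display is added for clarity and uses the standard notation for a symmetric tridiagonal matrix T with diagonal entries α_1, …, α_n and off-diagonal entries β_2, …, β_n):

p_0(\lambda) = 1, \qquad p_1(\lambda) = \alpha_1 - \lambda, \qquad p_k(\lambda) = (\alpha_k - \lambda)\,p_{k-1}(\lambda) - \beta_k^{2}\,p_{k-2}(\lambda), \quad k = 2,\dots,n .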
The Unsymmetric Eigenvalue Problem
Reduction to the Hessenberg form. Similar to the symmetric case, the reduced form is obtained by Householder reductions. But here, the lack of symmetry implies that, unlike the picture in (), zeros are not introduced in the upper part of the matrix. Therefore, the process results in an upper Hessenberg matrix. The total procedure involves 10n³/3 + O(n²) arithmetic operations. Assembling the matrix Q = H_1 . . . H_{n−2} requires 4n³/3 + O(n²) additional operations. Also, successive applications of Householder reductions can be compacted into blocks which allow the use of the Level 3 BLAS []. The algorithm is implemented in routine DGEHRD of LAPACK []. A parallel version of the algorithm is implemented in routine PDGEHRD of ScaLAPACK [].
Parallel solution of the Hessenberg eigenproblem. Several attempts at generalizing the divide-and-conquer approach used in the symmetric case have been considered, including Dongarra and Sidani in [] and Adams and Arbenz in []. In these algorithms, the Hessenberg matrix is partitioned into the sum of a two-block upper-triangular matrix and a rank-one update. Compared to the symmetric case, there are two major drawbacks that limit the benefit of the tearing procedure: (1) the condition number of the eigenproblems defined by the two diagonal blocks of the first matrix can be much higher than that of the original eigenproblem, as shown by Jessup in []; (2) the Newton iterations involved in the updating process include solving triangular or Hessenberg linear systems, rendering the procedure rather more expensive than without tearing. Henry et al. discussed in [] existing approaches for realizing a parallel QR procedure and ended up considering an improved implicit multishifted strategy which allows the use of the Level 3 BLAS []. The algorithm is implemented in subroutine PDLAHQR of ScaLAPACK [].

The Singular-Value Problem
As mentioned above, methods for solving the symmetric eigenvalue problem can be reconsidered as methods for solving SVD problems. In this section, the study is restricted to the case where A ∈ R^{m×n} with m ≥ n.
Parallel Jacobi algorithms for SVD. In the special case where m ≫ n, one should first perform an initial “skinny” QR factorization of A = QR, where Q ∈ R^{m×n} is orthogonal and R ∈ R^{n×n} is triangular. Next, the SVD of R = Ũ Σ V^T provides the SVD of A = U Σ V^T with U = Q Ũ. Computing the QR factorization in parallel has been outlined in [linear least squares and orthogonalization 12]. The complete process is obtained at a cost of O(mn²) arithmetic operations.
One-sided and two-sided Jacobi algorithms. The first
one-sided algorithm was introduced by Hestenes in []. Applying a rotation R to both sides of B = A^T A to annihilate the entry β_ij is equivalent to post-multiplying A by R so as to make the columns i and j of AR orthogonal. Therefore, any parallel Jacobi algorithm for solving a symmetric eigenvalue problem can be considered as a parallel one-sided Jacobi algorithm for solving an SVD problem. Each sweep is a sequence of orthogonal transformations Ṽ_k, each made of a set of independent rotations, that are successively applied on the right side by A_{k+1} = A_k Ṽ_k^T (A_1 = A). When the right singular vectors are needed, they are accumulated by V_{k+1} = V_k Ṽ_k (V_1 = I). At convergence, the singular values are given by the norms of the columns of A_k, and the normalized columns are the corresponding left singular vectors. Each sweep performs O(mn²) arithmetic operations. Brent and Luk [] and Sameh [] described possible implementations of this algorithm on multiprocessors. Charlier et al. [] demonstrate that an implementation of Kogbetliantz's algorithm for computing the SVD of upper-triangular matrices is quite effective on a systolic array of processors. Therefore, the first step in such a Jacobi SVD scheme is the QR factorization of A, followed by computing the SVD of R ∈ R^{n×n}. Kogbetliantz's method for computing the SVD of a real square matrix A mirrors the method for computing the eigenpairs of symmetric matrices, in that the matrix A is reduced to diagonal form by an infinite sequence of plane rotations:
A_{k+1} = U_k A_k V_k^T ,   k = 1, 2, . . . ,   ()
where A_1 ≡ A, and U_k = U_k(i, j, θ_ij^k), V_k = V_k(i, j, φ_ij^k) are plane rotations. It follows that A_k approaches the diagonal matrix Σ = diag(σ1, σ2, . . . , σn), where σ_i is the ith singular value of A, and the products (U_k . . . U_2 U_1), (V_k . . . V_2 V_1) approach matrices whose ith column is the respective left and right singular vector corresponding to σ_i.
Block Jacobi algorithms. The above algorithms are well-
suited for shared memory architectures. While they can also be implemented on distributed memory systems, their efficiency on such systems will suffer due to communication costs. In order to increase the granularity of the computation (i.e., to increase the number of floating
point operations between two communications), block algorithms are considered. For one-sided algorithms, each processor is allocated a block of columns instead of a single column. The computation remains the same as discussed above, with the ordering of the rotations within a sweep as given in []. For the two-sided version, the allocation manipulates 2-D blocks instead of single entries of the matrix. Bečka and Vajtešic propose in [] a modification of the basic algorithm in which one annihilates, in each step, two off-diagonal blocks by performing a full SVD on a small-sized matrix. While reasonable efficiencies are realized on distributed memory systems, this block strategy increases the number of sweeps needed to achieve convergence. We note that parallel Jacobi algorithms can only surpass the speed of the bidiagonalization schemes of ScaLAPACK when the number of available processors is much larger than the size of the matrix under consideration. Bidiagonalization of a matrix by Householder reductions.
The classical algorithm which reduces A to an upper-bidiagonal matrix B, via the orthogonal transformation B = U^T A V, was introduced by Golub and Kahan (see for instance []). The algorithm requires O(mn²) arithmetic operations; U and V can be assembled in O(mn²) and O(n³) additional operations, respectively. A parallel version of this algorithm is implemented in routine PDGEBRD of ScaLAPACK. The one-sided reduction for bidiagonalizing a matrix was proposed by Ralha []. It appears to be better suited for parallel implementation. It consists of determining, in a first step, the orthogonal matrix V which would appear in the tridiagonalization of A^T A: T = V^T A^T A V = B^T B. The matrix V can be obtained by using Householder or Givens reductions, without first forming A^T A explicitly. The second step performs a QR factorization of F = AV: F = QR. Since F^T F = R^T R = T is tridiagonal, its Cholesky factor R is upper-bidiagonal, which means that any two nonadjacent columns of F are orthogonal. This allows a simplified QR factorization. Bosner and Barlow introduced in [] two adaptations of Ralha's approach: a block version which allows the use of Level 3 BLAS routines, and a parallel version. Once the bidiagonalization is performed, all that remains is to compute the SVD of B. When only
the singular values are sought, this can be done by the iterative Golub and Kahan algorithm, whose number of arithmetic operations is O(n) per iteration (see for instance []) and which is performed only on a uniprocessor. The whole singular-value decomposition is implemented in routine PDGESVD of ScaLAPACK.
Bibliographic Notes and Further Reading
Eigenvalue problems and singular-value decomposition: []. Algorithms: []. Linear algebra: [, , ]. Parallel algorithms for dense computations: [, ].
Bibliography . Adams L, Arbenz P () Towards a divide and conquer algorithm for the real nonsymmetric eigenvalue problem. SIAM J Matrix Anal Appl ():– . Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J, Du Croz J, Greenbaum A, Hammarling S, McKenney A, Sorensen D () LAPACK users’ guide, rd edn. Society for Industrial and Applied Mathematics, Philadelphia, Library available at http://www.netlib.org/lapack/ . Bai Z, Demmel J, Dongarra J, Ruhe A, van der Vorst H (eds) () Templates for the solution of algebraic eigenvalue problems – a practical guide. SIAM, Software–Environments–Tools, Philadelphia . Beˇcka M, Vajtešic M () Block-Jacobi SVD algorithms for distributed memory systems. Parallel algorithms and applications, Part I in , –, Part II in , – . Blackford LS, Choi J, Cleary A, D’Azevedo E, Demmel J, Dhillon I, Dongarra J, Hammarling S, Henry G, Petitet A, Stanley K, Walker D, Whaley RC () ScaLA-PACK users’ guide. Society for Industrial and Applied Mathematics, Philadelphia, Available on line at http://www.netlib.org/lapack/lug/ . Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S, Henry G, Heroux M, Kaufman L, Lumsdaine A, Petitet A, Pozo R, Remington K, Whaley RC () An updated set of basic linear algebra subprograms (BLAS). ACM Trans Math Soft –: – . Bosner N, Barlow JL () Block and parallel versions of onesided bidiagonalization. SIAM J Matrix Anal Appl :– . Brent RP, Luk FT () The solution of singular-value and symmetric eigenvalue problems on multiporcessor arrays. SIAM J Sci Stat Comput :– . Charlier J-P, Vanbegin M, Van Dooren P () On efficient implementations of kogbetliantz’s algorithm for computing the singular value decomposition. Numer Math :–
. Chen SC, Kuck DJ, Sameh AH () Practical parallel band triangular system solvers. ACM Trans Math Software :– . Demmel J, Dhillon I, Ren H () On the correctness of some bisection-like parallel eigenvalue algorithms in floating point arithmetic. ETNA :– . Demmel J () Applied numerical linear algebra. SIAM, Philadelphia . Dhillon D, Parlett B, Vömel C () The design and implementation of the MRRR algorithm. ACM Trans Math Software :– . Dongarra JJ, Duff IS, Sorensen DC, Van der Vorst H () Numerical linear algebra for high performance computers. SIAM, Philadelphia . Dongarra JJ, Sidani M () A parallel algorithm for the nonsymmetric eigenvalue problem. SIAM J Sci Comput :– . Dongarra JJ, Sorensen DC () A fully parallel algorithm for the symmetric eigenvalue problem. SIAM J Sci Stat Comput :s–s . Dongarra JJ, Van De Gein R () Parallelizing the QR algorithm for the unsymmetric algebraic eigenvalue problem: myths and reality. SIAM J Sci Comput :– . Gallivan KA, Plemmons RJ, Sameh AH () Parallel algorithms for dense linear algebra computations. In: Gallivan KA, Heath MT, Ng E, Ortega JM, Peyton BW, Plemmons RJ, Romine CH, Sameh AH, Voigt RG (eds) Parallel algorithms computations. SIAM, Philapdelphia . Gimenez D, Hernandez V, van de Gein R, Vidal AM () A Block Jacobi method on a mesh of processors. Conc Pract Exp –:– . Golub GH, Van Loan CF () Matrix computations, rd edn. Johns Hopkins University Press, Baltimore . Henry G, Watkins D, Dongarra J () A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures. SIAM J Sci Comput :– . Hestenes MR () Inversion of matrices by biorthogonalization and related results. J Soc Ind Appl Math :– . Jacobi CGJ () Uber ein Leiches Vehfahren Die in der theorie der Sacular-storungen Vorkom-mendern Gleichungen Numerisch Aufzuosen. Crelle’s J für reine und angewandte Mathematik :– . Jessup ER () A case against a divide and conquer approach to the nonsymmetric eigenvalue problem. J Appl Num Math : – . Kagström B, Ruhe A () An algorithm for numerical computation of the Jordan normal form of a complex matrix. ACM Trans Math Software :– . Lo SS, Philippe B, Sameh A () A multiprocessor algorithm for the symmetric tridiagonal eigenvalue problem. SIAM J Sci Stat Comput :s–s . Luk F, Park H () On parallel Jacobi orderings. SIAM J Sci Stat Comput :– . Parlett BN () The symmetric eigenvalue problem. SIAM, Classics in Applied Mathematics, Philadelphia . Ralha R () One-sided reduction to bidiagonal form. Linear Algebra Appl :–
. Sameh A () On Jacobi and Jacobi-like algorithms for a parallel computer. Math Comp :– . Sameh A () Solving the linear least squares problem on a linear array of processors. In: Snyder L, Gannon DB, Siegel HJ (eds) Algorithmically specialized parallel computers. Academic, H.J. Siegel . Trefethen LN, Bau D III () Numerical linear algebra. SIAM, Philadelphia . Zhang J () The scaled sequence computation. In: Proceedings of the SIAM Conference on Applied Linear Algebra. July –,
EPIC Processors David I. August, Arun Raman Princeton University, Princeton, NJ, USA
Definition Explicitly Parallel Instruction Computing (EPIC) refers to architectures in which features are provided to facilitate compiler enhancements of instruction-level parallelism (ILP) in all programs, while keeping hardware complexity relatively low. Using ILP-enhancement techniques such as speculation and predication, the compiler identifies the operations that can execute in parallel in each cycle and communicates a plan of execution to the hardware.
Discussion
Introduction
The performance of modern processors is dependent on their ability to execute multiple instructions per cycle. In the early 1990s, instruction-level parallelism (ILP) was the only viable approach to achieve higher performance without rewriting software. Programs for ILP processors are written in a sequential programming model, with the compiler and hardware being responsible for automatically extracting the parallelism in the program. Explicitly Parallel Instruction Computing (EPIC) refers to architectures in which features are provided to facilitate compiler enhancements of ILP in all programs. These architectures allow the compiler to generate an effective plan of execution (POE) of a given program, and provide mechanisms to communicate the compiler's POE to the hardware.
EPIC processors combine the relative hardware simplicity of Very Long Instruction Word (VLIW) processors and the runtime adaptability of superscalar processors. VLIW processors require the compiler to specify a POE consisting of a cycle-by-cycle description of independent operations in each functional unit, and the hardware follows the POE to execute the operations. While VLIW considerably simplifies the hardware, it also results in machine-code incompatibility between implementations of a VLIW architecture. Due to the strict adherence to the compiler's POE, code compiled for one implementation will not execute correctly on other implementations with different hardware resources or latencies. Superscalar processors discover ILP and construct a POE entirely in hardware. This means that code compiled for a particular superscalar architecture will execute correctly on all implementations of that architecture. However, ILP extraction in hardware entails significant complexity, resulting in diminishing performance returns and an adverse impact on clock rate. EPIC processors retain VLIW's philosophy of statically exposing ILP and constructing the POE, thereby keeping the hardware complexity relatively low, but incorporate the ability of superscalar processors to cope with dynamic factors such as variable memory latency. Combined with techniques to provide binary compatibility for correctness, EPIC processors provide a means to achieve high levels of ILP across diverse applications while reducing the hardware complexity needed to do so.
Design Principles
In EPIC architectures, the compiler is responsible for designing a POE. This means that the compiler must identify the inherent parallelism of a sequential program and map it efficiently to the parallel hardware resources in order to minimize the program's execution time. The main challenge faced by the compiler in determining a good POE stems from dynamic events that determine the program's runtime behavior. Will a load–store pair access the same memory location? If yes, then the operations must be executed sequentially; otherwise, they may be scheduled in any order. Will a conditional branch be taken? If yes, then operations along the taken path may be scheduled in parallel with operations before the branch; otherwise, they must not be executed. In such situations, the compiler can speculate that a load–store pair will not alias, or that a branch will be taken, with the compiler's confidence in the speculation typically being determined by the outcomes of profiling runs on representative inputs. EPIC architectures support compiler speculation by providing mechanisms in hardware to ensure correctness if speculation fails. Finally, the architecture must be rich enough for the compiler to express its POE; specifically, which operations are to be issued in parallel and which hardware resources they must use. Section “ILP Enhancement” presents the major techniques used to enhance the ILP of programs. Section “Enabling Architectural Features” describes the micro-architectural features that enable these and other ILP enhancement techniques.
ILP Enhancement
The code example in Fig. is used to illustrate the main compiler transformations for ILP enhancement that are made possible by EPIC architectures. In the example, a condition is evaluated to alternatively update a memory location p4 with the contents of location p2 or p3. Outside the if-construct, the variable val is updated with the value at location p5, which potentially aliases with p4. The processor model assumed for illustration purposes, shown in Table , is a six-issue processor capable of executing one branch per cycle, with no further restrictions on the combination of operations that may be concurrently issued. Conditional branches require separate comparison and control transfer operations. All operations are assumed to have a latency of one cycle, with the exception of memory loads, which have a latency of two cycles. Figure a shows the scheduled assembly code. The compiler conservatively respects all dependences, resulting in a rather sparse schedule with several slots going to waste. The execution time along either branch is ten cycles.

EPIC Processors. Table Assumed processor model to illustrate ILP enhancement techniques
Issue width: six instructions
Latency of conditional branches: two cycles
Latency of memory loads: two cycles
Latency of other instructions: one cycle
Speculation Compiler-controlled speculation refers to breaking inherent programmatic dependences by guessing the outcome of a runtime event at compile time. As a result, the available ILP in the program is increased by reducing the height of long dependence chains and by increasing the scheduling freedom among the operations. Control Speculation
Many general-purpose applications are branch-intensive. The latency to resolve a branch directly affects ILP, since potentially multiple issue slots may go to waste. It is crucial to overlap other instructions with branches. Control speculation breaks control dependences, which occur between branches and other operations. An operation is control dependent on a branch if the branch determines whether the operation will be executed at runtime. By guessing the direction of a branch, the control dependence may be broken, effectively making the operation's execution independent of the branch. Control speculation allows operations to move across branches, thus reducing dependence height and resulting in a more compact schedule.
Data Speculation
Many general-purpose applications exhibit irregular memory access patterns. Often, a compiler is conservative because it respects a memory dependence that occurs infrequently or because it cannot determine that the dependence does not actually exist. In either case, data speculation breaks data flow dependences between memory operations. Two memory operations are flow dependent on one another if the first operation writes a value to a memory location and the second operation potentially reads from the same location. By guessing that the two memory operations will access different locations, the load operation may be hoisted above the
if (*p1 == 0)
    *p4 = *p2;
else
    *p4 = *p3;
val += *p5;
EPIC Processors. Fig. Example C code to illustrate ILP enhancement techniques
EPIC Processors. Fig. Predication and speculation have a synergistic relationship: applying speculation after predication results in the best improvement in ILP. (a) Sequential schedule. (b) Schedule using control and data speculation alone. (c) Schedule using predication alone. (d) Schedule using predication and speculation combined
EPIC provides for more aggressive code motion: operations dependent on the load may also be hoisted above potentially aliasing stores. Applying speculation to the code in Fig.  results in the tighter schedule shown in Fig. b. For each speculated load, a check operation remains at its original position to detect whether a dependence actually existed for this execution and to initiate repair if required. In general, all data-speculative and all potentially excepting control-speculative operations require checks. In Fig. b, operations 4′, 7′, and 9′ are the previously discussed symbolic check operations.
Predication
Predicated execution is a mechanism that supports conditional execution of individual operations based on Boolean guards, which are implemented as predicate register values. A compiler converts control flow into predicates by applying if-conversion []. If-conversion translates conditional branches into predicate-defining operations and guards operations along alternative paths of control under the computed predicates. A predicated operation is fetched regardless of its predicate value. An operation whose predicate is TRUE is executed normally. Conversely, an operation whose predicate is FALSE is prevented from modifying the processor state. With if-conversion, complex nests of branching code can be replaced by a straight-line sequence of predicated code. Predication increases ILP by allowing separate control flow paths to be overlapped and simultaneously executed in a single thread of control.
Figure c illustrates an if-conversion of the code segment from Fig. . The predicate for each operation is shown within angle brackets; for example, operation  is predicated on pT. The absence of a predicate indicates that the operation is always executed. Predicates are computed using predicate define operations, such as operation 11. The predicate pT is set to 1 if the condition evaluates to true, and to 0 if the condition evaluates to false. The predicate pF is the complement of pT. Operations 4 and 7, and operations 5 and 8, are executed concurrently, with the appropriate ones taking effect based on the predicate values. The net result of applying predication is that the dependence height of the code segment is reduced to nine cycles.

Combining Speculation and Predication
Both speculation and predication provide effective means to increase ILP. However, the example shows that their means of improving performance are fundamentally different. Speculation allows the compiler to break control and memory dependences, while predication allows the compiler to restructure program control flow
and to overlap separate execution paths. The problems attacked by both techniques often occur in conjunction; therefore, the techniques can be mutually beneficial. Figure d illustrates the use of predication and speculation in combination. If-conversion removes the branch, resulting in a stream of predicated operations. As before, data speculation breaks the dependences between operations 5 and 9, and operations 8 and 9, allowing the compiler to move operation 9 to the top of the block. Even though no branches remain in the code, control speculation is still useful to break dependences between predicate definitions and guarded instructions. In this example, the control dependences between operations 11 and 4, and operations 11 and 7, are eliminated by removing the predicates on operations 4 and 7. This form of speculation in predicated code is called promotion. As a result of this speculation, the compiler can hoist operations 4 and 7 to the top of the block to achieve a more compact schedule of height six cycles. In summary, predication alone is only % faster and speculation alone is only % faster than the original code segment. As the example shows, speculation in the form of promotion can have a greater positive effect on performance after if-conversion than before. This synergistic relationship between predication and speculation yields a combined speedup of % over the original code segment.
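To make the if-conversion step described above concrete, here is a C-level rendering of the running example (operation numbers as in the schedules); the two guarded stores stand in for the predicated operations ⟨pT⟩ and ⟨pF⟩. This is an illustration of the semantics, not the generated machine code, and the function name is invented.

#include <stdio.h>

/* C-level model of the if-converted example: both paths are evaluated and the
 * predicates decide which store takes effect (real hardware squashes the
 * operation whose predicate is false instead of branching). */
int if_converted(int *p1, int *p2, int *p3, int *p4, int *p5, int r4) {
    int pT = (*p1 == 0);        /* (11) pT, pF = (r1 == 0)          */
    int pF = !pT;
    int r2 = *p2;               /* (4) load, promoted (speculative)  */
    int r3 = *p3;               /* (7) load, promoted (speculative)  */
    if (pT) *p4 = r2;           /* (5)  guarded by <pT>              */
    if (pF) *p4 = r3;           /* (8)  guarded by <pF>              */
    return r4 + *p5;            /* (9) load, (10) add                */
}

int main(void) {
    int c1 = 0, x = 10, y = 20, out = 0, z = 7;
    int r = if_converted(&c1, &x, &y, &out, &z, 1);
    printf("%d %d\n", r, out);  /* prints 8 10 */
    return 0;
}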
Enabling Architectural Features
Since EPIC delegates the responsibilities of enhancing ILP and planning an execution schedule to the compiler, the architecture must provide features:
1. To overcome the worst impediments to a compiler’s ability to enhance ILP, namely, frequent control transfers and ambiguous memory dependences
2. To specify an execution schedule
Speculation Support
As discussed in Section “ILP Enhancement”, an architecture that supports speculation must provide mechanisms to detect potential exceptions on control-speculative operations as they occur, to record information about data-speculative memory accesses as they occur, and then to check at an appropriate time whether an exception should be taken or data-speculative repair should be initiated.
First, speculative operations must be distinguished from nonspeculative operations, since the latter should report exceptions immediately. Also, loads that have not been speculated with regard to data dependence need not interact with memory conflict detection hardware. For this, each operation that can be speculated has an extra bit, called the S-bit, associated with it. This bit is set on operations that are either control speculated or are data dependent on a data-speculative load. Each load has an additional bit, called the DS-bit, which is set only on data-speculative loads.
Second, to defer the reporting of exceptions arising from speculative operations until it is clear that the operation would have been executed in the original program, a 1-bit tag called the NaT (Not a Thing) bit is added to each register. When a speculative load causes an exception, the NaT bit is set instead of immediately raising the exception. Speculative instructions that use a register with its NaT bit set propagate the bit through their destination registers. A later check operation determines whether the speculated operation would indeed have been executed in the original program; if so, the deferred exception is raised, and otherwise it is simply discarded.
Third, a mechanism must be provided to store the source addresses of data-speculative loads until their independence with respect to intervening stores can be established. This functionality can be provided by the Memory Conflict Buffer (MCB) [] or the Advanced Load Address Table (ALAT) []. The MCB temporarily associates the destination register number of a speculative load with the address from which the value in the register was speculatively loaded. Destination addresses of subsequent stores are checked against the addresses in the buffer to detect memory conflicts. The MCB is queried by explicit data speculation check instructions, which initiate recovery if a conflict was detected.
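The NaT mechanism described above can be pictured with a small software model; this is a toy illustration with invented helper names, not the actual hardware or the Itanium instruction set.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of deferred exceptions: each "register" carries a NaT bit that
 * speculative operations propagate until a check is reached. */
typedef struct { long value; bool nat; } SpecReg;

/* Speculative load: a faulting access sets NaT instead of trapping. */
static SpecReg spec_load(const long *addr) {
    SpecReg r = { 0, false };
    if (addr == NULL) r.nat = true;   /* defer the exception */
    else              r.value = *addr;
    return r;
}

/* Speculative consumers propagate NaT into their destination register. */
static SpecReg spec_add(SpecReg a, SpecReg b) {
    return (SpecReg){ a.value + b.value, a.nat || b.nat };
}

/* chk.s-style check, executed only on the path that really uses the value. */
static long check(SpecReg r) {
    if (r.nat) {
        fprintf(stderr, "deferred exception raised by check\n");
        return 0;   /* a real machine would branch to recovery code here */
    }
    return r.value;
}

int main(void) {
    long x = 40, y = 2;
    SpecReg a = spec_add(spec_load(&x), spec_load(&y)); /* hoisted above a branch */
    SpecReg b = spec_load(NULL);                        /* would fault: NaT set   */
    bool take_faulting_path = false;
    if (take_faulting_path)
        printf("%ld\n", check(b));   /* branch not taken: no check, exception vanishes */
    printf("%ld\n", check(a));       /* prints 42 */
    return 0;
}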
Predication Support
A set of predicate registers is used to store the results of compare operations. For example, the Intel Itanium processor family has a set of 64 (1-bit) predicate registers. An EPIC processor must also implement some form of predicate squash logic, which prevents instructions with false predicates from committing their results. The IMPACT EPIC architecture [] also provides predication-aware bypass/interlock logic, which forwards results based on the predicate values associated with the generating instructions.
Static Scheduling Support
EPIC provides many features for enabling the compiler to specify a high-quality Plan Of Execution (POE). These features include the ability to specify multiple operations per instruction, the notion of architecturally visible latencies, and an architecturally visible register structure.

Multiple Operations per Instruction (MultiOp)
EPIC architectures allow the compiler to specify explicit information about independent operations in a program. A MultiOp is a specification of a set of operations that are issued simultaneously to be executed by the functional units. Exactly one MultiOp is issued per cycle of the virtual time. The virtual time is the schedule built by the compiler, and may differ from the actual time if the hardware inserts stalls due to dynamic events. For example, the Itanium architecture [] defines instruction groups that have no read-after-write (RAW) or write-after-read (WAR) register dependences, delimited by stops. The Itanium implementation uses 128-bit instruction bundles comprising three 41-bit opcodes and a 5-bit template that includes intrabundle stop bits. The use of template bits greatly simplifies instruction decoding. The use of stops enables explicit specification of parallelism.

Architecturally Visible Latencies
A lot of the hardware complexity of a superscalar processor arises from the need to maintain an illusion of sequential execution (i.e., each operation completes before another begins) in the face of non-atomicity of the operations. To avoid the complexity, EPIC processors expose operations as being nonatomic; instead, read-and-write events of an operation are viewed as the atomic events. There is a contractual specification of assumed latencies of operations that must be honored by both the hardware and the compiler. A MultiOp that takes more than one cycle to generate all the results is said to have non-unit assumed latency (NUAL). The guarantee of NUAL affords multiple benefits to EPIC processors. Since the hardware is certain that the read event of an operation will not occur before the write event of the operation that produces the value to be read, there is no need for stall/interlock capability. Since the compiler is certain that an operation will not write its result before its assumed latency has elapsed, it
can schedule other operations that are anti- or output-dependent on the operation in question, resulting in tighter schedules. As mentioned earlier, EPIC processors may still have interlocking to account for the deviation of actual latencies of operations from their assumed latencies. However, MultiOp and NUAL permit an in-order implementation of the architecture while still extracting substantial ILP.

Register Structure
ILP processors require a large number of registers. Specifically, in EPIC processors, techniques such as control and data speculation cause increased register pressure. Fundamentally, since multiple operations are executing in parallel, there is a demand for a larger store for temporary values. Superscalar processors architecturally expose only a small fraction of the physical registers and rely on register renaming to generate parallel schedules. However, this requires complex hardware for dynamic scheduling. Since EPIC demands a schedule from the compiler in order to simplify the hardware, it becomes necessary to expose a larger set of architectural registers to generate the same parallel schedule as achieved through hardware register renaming.
EPIC architectures also provide a mechanism to rotate the register file to speed up inner-loop execution. Speeding up loops involves executing instructions from multiple loop iterations in parallel. One way to do this is by using Modulo Scheduling []. Modulo scheduling schedules one iteration of the loop such that, when this schedule is repeated at regular intervals, no intra- or inter-iteration dependence is violated and no resource conflict arises between the dynamic operations within and across loop iterations. While generating code, it is important that false dependences do not arise owing to dynamic instances of the same operation on different iterations writing to the same registers. While the loop could be unrolled, followed by static register renaming, this typically results in substantial code growth. EPIC architectures implement a rotating register file that provides register renaming such that successive writes to the same virtual register actually go to different physical registers. This is accomplished through the use of a rotating register base (RRB) register. The physical register number accessed by each operation is the sum of the virtual register number specified in the instruction
and the number in the RRB register, modulo the number of registers in the rotating register file. The RRB is updated at the end of each iteration using special operations. In this manner, EPIC architectures provide compiler-controlled dynamic register renaming. Another interesting feature is the register stack, which is used to simplify and speed up function calls. Instead of spilling and filling all the registers at procedure call and return sites, EPIC processors enable the compiler to rename the virtual register identifiers. The physical register accessed is determined by renaming the virtual register identifier in the instruction through a base register. The callee gets a physical register frame that may overlap with the caller’s frame; the overlap constitutes the registers that are passed as parameters.
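In code, the renaming performed by the rotating register file amounts to the following mapping (a sketch; the register count and the update rule for the RRB are machine specific):

/* Physical register selection for the rotating portion of the register file:
 * successive iterations see the same virtual register name mapped onto
 * different physical registers, because the RRB changes between iterations. */
unsigned phys_reg(unsigned virt, unsigned rrb, unsigned num_rotating) {
    return (virt + rrb) % num_rotating;
}

/* The RRB itself is advanced by the special loop-closing operations mentioned
 * above; one plausible form (the direction is implementation specific):      */
unsigned next_rrb(unsigned rrb, unsigned num_rotating) {
    return (rrb + num_rotating - 1) % num_rotating;
}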
Cache Management
EPIC architectures allow the compiler to explicitly control data movement through the cache hierarchy by exposing the hierarchy architecturally. Generating a high-quality schedule often requires the compiler to be able to accurately predict the level in the cache hierarchy where a load operation will find its data. This is normally achieved by profiling or analytical means, and the source-cache specifier is encoded as part of the load instruction. This informs the hardware of the expected location of the datum as well as the assumed latency of the load. EPIC architectures also provide a target cache specifier for load/store operations that allows the compiler to instruct the hardware where the datum should be placed in the cache hierarchy for subsequent references to that datum. This allows the compiler to explicitly control the contents of the caches. The compiler could, for example, prevent pollution of the caches by excluding data with poor temporal locality from the higher-level caches, and remove data from the last-level cache after the last use. Many streaming programs operate on data without accessing them again. To improve the latency of such accesses, EPIC architectures provide a data prefetch cache. Prefetch load instructions can be used to prefetch data with poor temporal locality into the data prefetch cache. This prevents the displacement of first-level data cache contents that may potentially have much better temporal locality.
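EPIC encodes such hints directly in the load/store opcodes; in ordinary C, the closest widely available analogue is a prefetch intrinsic, shown here purely as an analogy (GCC’s __builtin_prefetch, with locality 0 marking streaming data that should not displace useful cache contents).

/* Analogy only: express "this data has no temporal locality" so the hardware
 * can keep it from polluting the higher-level caches, much as an EPIC target
 * cache specifier would. */
double stream_sum(const double *a, long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], /*rw=*/0, /*locality=*/0);
        sum += a[i];
    }
    return sum;
}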
Related Entries
Cydra
Dependences
Instruction-Level Parallelism
Modulo Scheduling and Loop Pipelining
Multiflow Computer
Parallelization, Automatic
Bibliographic Notes and Further Reading
The EPIC philosophy was primarily developed by HP researchers in the early s [, ]. As mentioned in the introduction, the EPIC philosophy was fundamentally influenced by the VLIW philosophy developed in the s in the Multiflow and Cydrome processors [, ]. EPIC inherited various features from Cydrome, most significantly:
● Predicated execution
● Support for software pipelining of loops in the form of rotating register files
● A rudimentary MultiOp instruction format
● Hardware support for speculation, including speculative opcodes, a speculative tag bit in each register, and deferred exception handling
Many of the ideas on speculation and its hardware support were developed independently by researchers in the IMPACT group at the University of Illinois [] and IBM Research []. The most famous commercial EPIC processor is the Itanium processor family jointly developed by HP and Intel. The first processor in the family was released in , with successors having improved memory subsystems, multiple cores on the same die, and improved power efficiency. More information on the EPIC philosophy and implementation can be found in the technical reports and manuals on the topic [, ].
Bibliography
1. Allen JR, Kennedy K, Porterfield C, Warren J () Conversion of control dependence to data dependence. In: Proceedings of the th ACM Symposium on Principles of Programming Languages, pp –, January
2. Gallagher DM, Chen WY, Mahlke SA, Gyllenhaal JC, Hwu WW () Dynamic memory disambiguation using the memory conflict buffer. In: Proceedings of the th International Conference on Architectural Support for Programming Languages and Operating Systems, pp –, October
3. Intel Corporation () Intel Itanium architecture software developer’s manual. Application architecture, vol , Revision . Santa Clara, CA
4. August DI, Connors DA, Mahlke SA, Sias JW, Crozier KM, Cheng B, Eaton PR, Olaniran QB, Hwu WW () Integrated predication and speculative execution in the IMPACT EPIC architecture. In: Proceedings of the th International Symposium on Computer Architecture, pp –, June
5. Rau BR () Iterative modulo scheduling: an algorithm for software pipelining loops. In: Proceedings of the th International Symposium on Microarchitecture, pp –, December
6. Schlansker MS, Rau BR () EPIC: An architecture for instruction-level parallel processors. Technical Report HPL-, Compiler and Architecture Research, HP Laboratories, Palo Alto, February
7. Schlansker MS, Rau BR () EPIC: Explicitly Parallel Instruction Computing. Computer ():–
8. Fisher JA () Very long instruction word architectures and the ELI-. In: ISCA ’: Proceedings of the th Annual International Symposium on Computer Architecture, ACM, New York, pp –
9. Beck GR, Yen DW, Anderson TL () The Cydra minisupercomputer: architecture and implementation. J Supercomput :–
10. Mahlke SA, Chen WY, Hwu WW, Rau BR, Schlansker MS () Sentinel scheduling for VLIW and superscalar processors. In: Proceedings of the th International Conference on Architectural Support for Programming Languages and Operating Systems, pp –, October
11. Ebcioglu K () Some design ideas for a VLIW architecture for sequential-natured software. In: Cosnard M, Barton MH, Vanneschi M (eds) Parallel Processing (Proc. IFIP WG . Working Conference on Parallel Processing, Pisa, Italy), North Holland, pp –
Erlangen General Purpose Array (EGPA)
Jens Volkert
Johannes Kepler University Linz, Linz, Austria

Synonyms
MEMSY

Definition
During the s, the concept of the multiprocessor class Erlangen General Purpose Array (EGPA) was developed, mainly by W. Händler. Such a system is a collection of nodes, where each node consists of at least one processor and one shared memory unit. The nodes are organized in several hierarchically ordered planes (layers) and one top-node. Each plane is a collection of m² identical nodes arranged in an m × m grid, where each node is connected to its four nearest neighbors via shared memory. At the border of the plane, the nodes are interconnected in such a way that the layer forms a torus. Each upper plane is smaller than the layer immediately below: it has a quarter of the nodes of the lower plane. Each node that is not part of the lowest plane has access to the shared memory of four child nodes of the layer immediately below. The base layer, called the working plane, is the part of the system where most of the real work is to be done. The upper layers and the top-node feed the working plane with data and offer support functions. In larger systems, only some of the lower planes are realized. In this case, the top-node is connected to the highest realized plane via a bus.
A special EGPA system is MEMSY. In such a system, only the two lower planes and the top-node are realized. Furthermore, by adding a bus system, long-distance communication via messages is possible. Consequently, several programming models, not only shared memory, are supported. From  to , three such systems of the EGPA class have been realized: the Pilot Pyramid, the DIRMU Pyramid, and the MEMSY Prototype.

Discussion
Introduction
In , W. Händler published first ideas [A] and proposed one year later, together with F. Hofmann and H. J. Schneider, a new subclass of multiprocessors []: the Erlangen General Purpose Array (EGPA; Erlangen is a university city near Nürnberg in Bavaria). (Much of the following text is taken from [, , ].) The essential features and design objectives of these MIMD computers are:
1. Homogeneity: There is only one identical type of element – the node, often called PMM (processor-memory-module).
2. Memory-coupling: Connection between nodes is via memory.
3. Restricted neighborhood: No node has access to the whole memory, and no memory block can be accessed by all nodes.
4. Hierarchy: The PMMs are arranged in several layers and these layers form a total order.
5. Regularity: The nodes of one layer are ordered like a grid.
W. Händler explained these features as follows: the reasons for these demands are speed (memory coupling), programmability (regularity, homogeneity), control (hierarchy), technical limits (restricted neighborhood), and expandability (all).
In the following years, several EGPA systems were built. The first one, the pilot pyramid, was assembled in  and was used until . It consisted of four PMMs at the working plane level (A-processors) and one PMM at level  (B-processor). From  to , DIRMU (Distributed Reconfigurable Multiprocessor Kit []) was used for testing applications written for EGPA systems. DIRMU was a kit consisting of PMMs in which processor-ports of one PMM could be interconnected with memory-ports of other PMMs by cables. Using several of these PMMs, different topologies could be realized: rings, trees, and so on. In particular, an EGPA pyramid with three layers could be built. From  to , a new EGPA system started running in Erlangen. It was called MEMSY (Modular Expandable Multiprocessor System). In this multiprocessor, three layers were realized. It consisted of nodes in the working plane and nodes in the next layer. These machines were used to demonstrate the feasibility and usefulness of EGPA systems. In the following, the principles of EGPA systems are discussed, and the pilot pyramid and MEMSY serve as examples.
Motivation
EGPA systems are general-purpose machines, but they were especially designed for numerical simulations. With the emergence of powerful computer systems, the interest in computational science was awakened. This computational approach in the natural sciences and engineering was and is very successful in the context of fluid dynamics, condensed matter physics, plasma
physics, applied geophysics, and others. The corresponding models of physical phenomena are either continuum models or many-body models. What is essential is that there is an inherent “locality” in these approaches. The designers of EGPA kept this fact in mind when they developed their machines according to the possibilities the computer technology offered at that time, the s.
Design Goals
The EGPA architecture was defined with the following design goals in mind:
Scalability: The architecture should be scalable with no theoretical limit. The communication network should grow with the number of processing elements in order to accommodate the increased communication demands in larger systems.
Portability: Algorithms should run on EGPA systems of any size.
Cost effectiveness: As far as possible, off-the-shelf components should be used.
Flexibility: The architecture should be usable for a great variety of user problems.
Efficiency: The computing power of the system should be large enough to handle real problems which occur in scientific research.
System Topology
An EGPA system consists of identical PMMs (nodes): one processing unit and one attached communication (multiport) memory form one PMM (Processor Memory Module). The processing element consists of one or more processors. Optionally, there can be a coprocessor and/or a private memory only accessible by the processing unit itself. The PMMs are hierarchically arranged in several layers (planes). An example with four layers (A, B, C, D) is given in Fig. a. At the topmost layer, there is only a single PMM, which serves as a front-end to the system. At each lower layer, each PMM is connected to exactly four neighbors at the same layer, i.e., each processor has access to the communication memories of its neighboring PMMs. Thus, at each plane, the hardware has a grid-like structure.
Erlangen General Purpose Array (EGPA). Fig. (a) EGPA with PMMs in the working plane; (b) address space of a node
As the leftmost node in each row is connected to the rightmost PMM, and the corresponding nodes in each column are connected as well, each layer forms a torus. (In Fig. a, for clarity, only two of these turn-around connections are drawn.) In addition to the horizontal accesses, each node, except those of the very lowest layer, has access to four PMMs at the next lower layer. Consequently, the address space of a processor looks as shown in Fig. b. Each access with an address lying in the address space part of a child or neighbor is translated by hardware to a shared memory address of that child or neighbor, and that location of the child or neighbor is accessed (e.g., the address . . . causes an access to the cell . . . of the neighbor in the north). The vertical connections are equipped to broadcast data downward to all four of the lower PMMs simultaneously, say, to transmit code segments. On the other hand, though the lower nodes do not have memory access to those at higher layers, they are able to interrupt their supervising PMM. These substructures, consisting of four nodes and their supervisor, are called elementary pyramids. Several of them are highlighted in Fig. a. Each elementary pyramid has an I/O processor with corresponding I/O devices, mainly disks. The I/O processor, which has access to all memories of its elementary pyramid, is controlled by the supervising PMM. The topmost PMM is connected to a network containing a software development system.
The overall arrangement of an EGPA system is pyramidal. Each pyramid can be extended downward to any size by adding new levels. Such an extension significantly increases the computing power of the system. An EGPA system is a distributed memory multiprocessor. Remote communication between nodes is done via chains of copies and not via messages as usual.
System Software
Each node has its own local operating system that renders exactly those services for which the necessary resources are locally available. If a process needs a system service, this service is either rendered at once, or the addressed operating system cannot perform it and then informs its superior processor, i.e., that processor’s operating system, of the request. In the pilot pyramid, this technique was realized as follows. For each A-processor, a shadow process existed in the B-processor. At the four A-processors, only a user interface to the operating system of the B-processor was implemented. If there was a request to the operating system of the B-processor, this interface communicated with its shadow process. The shadow process then sent the request to the proper operating system, which executed the desired function. This approach can be extended to larger EGPA computers, since each such system is composed of elementary pyramids (see Fig. a). Each B-processor is
part of an elementary pyramid one level higher, and so on. Consequently, the technique of shadow processes could be used in the upper elementary pyramids as well.
The proposed user interface for programming such a multiprocessor was, in addition to several tools for testing parallel programs, the EGPA monitor. This monitor consists of an EGPA frame program and a pool of subroutines. If a parallel program is executed, the frame program runs first, installing the necessary system processes and then starting the user routines. The subroutines of the monitor are tools for controlling user programs. They allow the user to distribute programs and data to the system, to control the processes, and to execute coordinating tasks (see []). Two very important tools are EXECUTE (PROGNAME, PROC), which starts routine PROGNAME at the indicated processor (child or neighbor), and WAIT (PROGNAME, PROC), which waits until the routine has finished at that processor. The EGPA monitor was realized for the Pilot Pyramid, but the concept can be carried over to larger systems. It supports only one parallel program at a time, in contrast to MEMSOS (see later) used in the MEMSY pilot.
The Pilot Pyramid
The pilot pyramid, equivalent to an elementary pyramid, was assembled in  and was used until  (see Fig. ). It consisted of four PMMs at the working plane level (A-processors) and one PMM at level  (B-processor). All nodes were commercially available AEG / computers, which offered a multiport memory. The characteristics of these processors were internal processor cycles of , , and ns and a word length of bits. The access time to a word was – μs. Each A-processor had a memory of KB and the B-processor KB. For more details, see []. The operating system was an expansion of the originally available operating system MARTOS. The EGPA monitor could be called from user programs written in FORTRAN or SL (a subset of ALGOL).
Application Programming with the EGPA Monitor
EGPA was designed mainly for solving problems using data partitioning. The corresponding programming model is known as SPMD (Single Program, Multiple Data).
Erlangen General Purpose Array (EGPA). Fig. Pilot Pyramid
Before implementing a parallel program, the user has to partition his problem domain in such a way that as much work as possible can be done in parallel. Furthermore, he has to assign the data to the processing elements himself and, in contrast to a global shared memory system, in EGPA the user had to take into account that the needed data should not only exist somewhere within the system but also be in the right memory at the right time. The synchronization and the control of data were done by using the EGPA monitor. The allocation of data was effected by segments, and most data transports happened under user control. There were two possibilities for synchronization: under system control using messages, or under user control by special routines of the monitor. The latter was a factor of a thousand faster, but deadlocks were possible if the user made a mistake. Nevertheless, it was preferred by nearly all users, and the speedups gained were much better. One very important area for using EGPA systems is partial differential equations. Under the assumption that the region of interest is a square (or a cube), represented by a grid for numerical purposes, and that a grid-point-wise relaxation method is performed in red-black order, the partitioning is very simple: to each processor one part is assigned. Each processor handles its portion of the grid and needs only values produced by neighboring processors. In Fig. a, the access area of one processor is hatched.
On an EGPA system, one iteration step performed at a processor p is as follows (a sketch in C of this scheme is given below):
1. Relax red points which do not depend on points of any neighbor.
2. Wait until all neighbors have signaled that their values can be used, e.g., the western neighbor has set the variable “ready” to the iteration number (p accesses this variable using the name “ready&western”).
3. Relax red points which depend on black points of neighbors. These are points at the border of p.
4. Relax black points.
5. Ready := iteration number + 1; /∗ signal to the neighbors that p’s own borders are ready for the next iteration ∗/
For this simple case, speedups near the number of processors in the working layer were achieved. If it was possible to map a rectangle onto a non-square physical space (see, e.g., Fig. b), the results were equivalent. If the domain under investigation is not so well suited, the programmer had to search for other solutions. If, e.g., the domain is a circle, it is not possible to give all nodes the same load and, consequently, the efficiency of the system is smaller. Sometimes it helps to divide the region into several blocks. The grid is then folded onto the multiprocessor in such a way that each processor operates on a part of several blocks and each block is calculated by several processors simultaneously (Fig. c). For this, the torus structure of the layer is necessary. In Fig. d, this is illustrated for a circle; each square is distributed to all nodes of the working plane.
In many cases, the application designer has to search for a new algorithm to use the power of the system. In doing so, he has to take into account the interconnection structure of an EGPA system.
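A minimal C sketch of one such iteration step follows; the data structures and function names are invented for illustration (the original programs used FORTRAN or SL with the EGPA monitor), and the neighbors’ “ready” flags are modeled as pointers into their communication memories.

#include <stdatomic.h>

/* Toy model of one node of the working plane.  The four neighbor pointers
 * stand for the "ready" variables in the neighbors' communication memories
 * (the SL names ready&western, ready&northern, ...). */
enum { WEST, NORTH, EAST, SOUTH };

typedef struct {
    double     *u;           /* local part of the grid, including border rows/columns */
    int         nx, ny;
    atomic_int  ready;       /* own flag, initialized to 1, read by the neighbors     */
    atomic_int *neighbor_ready[4];
} NodeState;

/* Placeholders for the actual red/black updates. */
static void relax_red_interior(NodeState *nd) { (void)nd; }
static void relax_red_border  (NodeState *nd) { (void)nd; }
static void relax_black       (NodeState *nd) { (void)nd; }

void iteration_step(NodeState *nd, int iter) {
    relax_red_interior(nd);                               /* step 1                     */
    for (int d = 0; d < 4; d++)                           /* step 2: wait for neighbors */
        while (atomic_load(nd->neighbor_ready[d]) < iter)
            ;                                             /* busy wait                  */
    relax_red_border(nd);                                 /* step 3                     */
    relax_black(nd);                                      /* step 4                     */
    atomic_store(&nd->ready, iter + 1);                   /* step 5: signal neighbors   */
}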
Matrix multiplication (H = F · G) with m × m nodes in the working plane shall serve as an example. At first glance, the interconnection topology of an EGPA seems to be a serious restriction for the calculation of any new element, since data stored in remote memories (in the sense of restricted neighborhoods) are needed and have to be transported using the communication memories. The following algorithm is based on an idea of Cannon []. At the beginning, all matrices are distributed as evenly as possible over the lowest plane and H is set to 0; Hij, Fij, and Gij are stored in node pij (see Fig. , start position). The start position is transformed by rotating the partial matrices Fij and Gij by varying amounts within the rows respectively columns of the working plane until the data are distributed as depicted in Fig.  for step . For example, p reads G and copies it into the communication memory of processor p; these data are read and written into its own shared memory by p. These data transports can be done in parallel by the hardware; the toroidal connections are used for this. Step i: Each processor pij multiplies the partial matrices Fik and Gkj stored in its memory at the beginning of step i and adds the resulting matrix to Hij. Subsequently, the processor writes the matrix Fik into the memory of the processor “west” of it and the matrix Gkj into the memory of the processor “north” of it. After m steps, the result H is distributed on the working plane as shown in Fig.  (Result). Parts of the program are shown below for demonstration purposes, written in SL (a subset of ALGOL). It is one SPMD program per layer: the same code runs at each processor of one layer. For that reason, relative names were used, in which a suffix such as &WEST or &NORTH refers to the corresponding neighbor (see the code below).
Erlangen General Purpose Array (EGPA). Fig. Data grid partitioning for pointwise relaxation method
Erlangen General Purpose Array (EGPA). Fig. Matrix multiplication for m =  nodes: partial matrix distribution (start position, steps 1–4, and result)
Allocation of data at each A-processor:

/∗ installation of a global (shared) segment with name “MATRIX” in its own shared memory ∗/
&SEG (MATRIX,GLOBAL);
/∗ consequently there exist segments “MATRIX” installed in the shared memory of all its neighbours; only access to those segments of its western and northern neighbour is needed, therefore ∗/
&SEG (MATRIX&WEST,GLOBAL);
&SEG (MATRIX&NORTH,GLOBAL);
/∗ declaration part ∗/
GLOBAL (MATRIX) [] REAL FMATRIX, GMATRIX, HMATRIX, INT FSEM, GSEM;
VALGLOB (MATRIX&WEST) REF [] REAL FMATRIX&WEST, REF INT FSEM&WEST;
VALGLOB (MATRIX&NORTH) REF [] REAL GMATRIX&NORTH, REF INT GSEM&NORTH;

Essential part of the main program running at each A-processor:

/∗ P x P is the number of A-processors ∗/
FSEM := GSEM := 1;
FOR STEP FROM 1 TO m DO
  /∗ FSEM = STEP and GSEM = STEP indicate that the data needed for step STEP are already stored ∗/
  WHILE FSEM /= STEP OR GSEM /= STEP DO WAIT (T) OD;
  /∗ T avoids too many memory accesses if coherent caches are used; not in the Pilot Pyramid, where WAIT is not necessary ∗/
  MULT (HMATRIX,FMATRIX,GMATRIX);
  /∗ local matrix multiplication followed by addition of the result to HMATRIX ∗/
  FSEM := GSEM := -STEP;
  /∗ indicator that the data are used and can be overwritten ∗/
  WHILE FSEM&WEST /= -STEP DO WAIT (T) OD;
  /∗ western neighbour has not finished ∗/
  COPY (FMATRIX,FMATRIX&WEST);
  /∗ subroutine which copies FMATRIX into FMATRIX of the western neighbour ∗/
  FSEM&WEST := STEP + 1;
  /∗ indicator that the data have been transported ∗/
  WHILE GSEM&NORTH /= -STEP DO WAIT (T) OD;
  COPY (GMATRIX, GMATRIX&NORTH);
  GSEM&NORTH := STEP + 1
OD
The code for obtaining the data distribution at the beginning of step  (see Fig. ), as well as the code for COPY and MULT, is not shown. These routines need the sizes of the partial matrices, which depend on the position of the processor.
Erlangen General Purpose Array (EGPA). Fig. MEMSY prototype (A-, B-, and C-plane; memory sharing within a plane, memory sharing between planes, common bus)
Therefore, the EGPA monitor provides each processor with its position (column, row) in the working plane together with the value of m.
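For readers not familiar with SL, the data movement of the algorithm can also be followed in the compact sequential C simulation below; this is an illustration only (one matrix element per node, invented test values), not the parallel EGPA program.

#include <stdio.h>
#include <math.h>

#define M 4   /* m = 4: one matrix element per node of the m x m working plane */

/* Rotate row i of a toward the "west" (left) by k positions. */
static void rotate_row(double a[M][M], int i, int k) {
    double t[M];
    for (int j = 0; j < M; j++) t[j] = a[i][(j + k) % M];
    for (int j = 0; j < M; j++) a[i][j] = t[j];
}

/* Rotate column j of a toward the "north" (up) by k positions. */
static void rotate_col(double a[M][M], int j, int k) {
    double t[M];
    for (int i = 0; i < M; i++) t[i] = a[(i + k) % M][j];
    for (int i = 0; i < M; i++) a[i][j] = t[i];
}

int main(void) {
    double F[M][M], G[M][M], F0[M][M], G0[M][M], H[M][M] = {{0}};
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++) {
            F[i][j] = F0[i][j] = i * M + j + 1;       /* arbitrary test data */
            G[i][j] = G0[i][j] = (i == j) ? 1 : i - j;
        }

    /* Start position: rotate row i of F by i and column j of G by j. */
    for (int i = 0; i < M; i++) rotate_row(F, i, i);
    for (int j = 0; j < M; j++) rotate_col(G, j, j);

    /* m steps: local multiply-accumulate, then shift F west and G north by one. */
    for (int step = 0; step < M; step++) {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++) H[i][j] += F[i][j] * G[i][j];
        for (int i = 0; i < M; i++) rotate_row(F, i, 1);
        for (int j = 0; j < M; j++) rotate_col(G, j, 1);
    }

    /* Verify against the ordinary matrix product of the original data. */
    double err = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++) {
            double ref = 0;
            for (int k = 0; k < M; k++) ref += F0[i][k] * G0[k][j];
            err += fabs(H[i][j] - ref);
        }
    printf("total deviation from the direct product: %g\n", err);
    return 0;
}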
MEMSY
One essential result of all investigations concerning EGPA was the following: for many kinds of computations, the main computing load was carried by the lowest layer. The higher levels were mainly used for distributing partial tasks and data and for the realization of communication mechanisms. The consequence was that the higher layers were left underutilized. To remove this problem, the first idea was to taper the pyramid so that each lower level in an elementary pyramid has more than merely four PMMs. Experience with a test system showed that excessive tapering (more than nine nodes in the elementary pyramid) can lead to bottlenecks at the higher layers. During these investigations, the best solution was found: the B-plane must be powerful enough (an elementary pyramid with five nodes) that the upper levels can be removed if the topmost node is connected to the B-plane by a bus. Such a special EGPA system is called MEMSY.
A prototype with nodes in the A-plane and nodes in the B-plane had been realized. The theoretical peak performance was GMIPS or MFLOPS (double precision); the measured performance was up to MFLOPS [, ]. (Most of the following information presented in this chapter is taken from [].) The node structure of MEMSY differed from that of the pilot pyramid. Each node consisted of a Motorola multiprocessor board MVEME (MC processors, MHz clock rate) and additional hardware. It contained a local memory, an attached shared memory, and an I/O bus with access to different devices.
The programming model of MEMSY was defined as a set of library calls which could be called from C and C++. In contrast to the EGPA model, it was not only based on the shared memory paradigm of EGPA. Instead, the model defined a variety of different mechanisms for communication and coordination. From these mechanisms, the application programmer could pick those calls which were best suited for his particular problem. The use of shared memory was based on the concept of “segments,” very much like the original shared memory mechanism provided by UNIX System V. A process which wants to share data with another process (possibly on another node) first has to create a shared memory segment of the needed size. To let the operating system select the correct location for this memory segment, the process has to specify with which neighboring nodes this segment needs to be shared. After the segment has been created, other processes may map the same segment into their address space by means of an “attach” operation. The segments may also be unmapped and destroyed dynamically.
For coordination purposes, three mechanisms had been provided: short (two-word) messages, semaphores, and spinlocks. There was a second message type for high-volume and fast data transfers between any two processors of the system. To express parallelism within a node, the programmer had to create multiple processes on each processing element by a special variant of the system call “fork.” Parallelism between nodes was handled by a configuration description, which forced the application environment to start one initial process on each node that the application should run on. In this starting phase, the application was consistent with the SPMD
model. Through the execution of fork-calls, a configuration could be established which goes beyond this model. The operating system of the MEMSY prototype was MEMSOS; it was based on UNIX V/ Release of Motorola, which was adapted to the multiprocessor architecture of the processor board []. In MEMSOS, most of the users’ needs were supported by the implementation of the “application concept.” An application was defined as the set of all processes belonging to one single user program. A process belonging to an application could be identified by (a) a globally unique application id, (b) a task id, which was assigned to each process (called a task) of an application and which was locally unique for this application and processor node, and (c) a globally unique node id, which identified any node in the system. This concept allowed very simple steering and controlling of applications. MEMSOS could distinguish different applications and was able to run several of them concurrently in “space sharing” as well as in “time sharing” style. Additionally, a very important new concept, called “gang scheduling,” had been implemented. Its purpose was to ensure the concurrent execution of the tasks of one application on all nodes at the same time. This was achieved by creating a run queue for each application on each allocated node. A global component, called applserv, retrieved information about the nodes and the applications started and running there. It established a ranking of applications which were to be scheduled.
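Since the segment mechanism described above is compared to System V shared memory, its life cycle (create, attach, detach, destroy) can be illustrated with the standard System V calls. This shows the generic UNIX interface, not the MEMSY library itself (whose call names are not given here); the MEMSY variant additionally took the set of sharing neighbor nodes into account.

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    /* create a shared memory segment of the needed size */
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    /* "attach": map the segment into this process's address space */
    char *mem = shmat(id, NULL, 0);
    if (mem == (char *)-1) { perror("shmat"); return 1; }

    strcpy(mem, "data visible to every process that attaches the segment");
    printf("%s\n", mem);

    shmdt(mem);                     /* unmap the segment              */
    shmctl(id, IPC_RMID, NULL);     /* destroy it when no longer used */
    return 0;
}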
Exemplary Applications
In order to validate the concepts of EGPA, certain applications from different scientific areas were adapted to the programming models of the EGPA monitor and of MEMSOS. In some cases, it was necessary to develop new parallel algorithms. The corresponding programs were written many years ago and their listings are no longer available; therefore, it is not possible to give an impression of the effort required to write such a program, e.g., in lines of code. Only the matrix multiplication (see above) gives an idea: lines (an estimate of the part not shown is included). The authors of such parallel programs wrote in their papers mainly about the principles of the underlying parallel algorithms and the runtimes. In many cases, they only published the maximal speedups obtained with
examples which were well-dimensioned for the system used. This means that the problem size is maximal and, e.g., in the case of the relaxation, the region under investigation is a square and not a circle or something else (see Fig. d). In a few cases, the dependency of the speedup on the problem size is given (see the Gauss–Seidel results and Table ). Using the Pilot Pyramid, the maximum speedup was . Here are some of the measured speedups; the example “text formatting” demonstrates that not all types of tasks run well on an EGPA system.
Linear algebra [, ]: matrix inversion: Gauss–Jordan (.), column substitution (.); matrix multiplication (.); Gauss–Seidel iteration (speedup/dimension of the dense matrix: /, ./, ./, ./, ./)
Differential equations []: relaxation (.)
Image processing and graphics []: topographical representation (.); illumination of a topographical model (.); vectorizing a grey-level matrix (.); distance transformation – each processor working on a fixed part of the data (.), dynamic assignment of varying parts of the data (.)
Nonlinear programming []: search for a minimum of a multi-dimensional objective function (.)
Graph theory: network flow with neighborhood aid (.)
Text formatting [] (.)

Erlangen General Purpose Array (EGPA). Table Results with processors in the working plane []
Poisson equation with Dirichlet conditions (speedup/number of cells in finest grid): (/), (./), (./)
Steady-state Stokes equation with staggered grid (speedup/number of cells in finest grid): (./), (./), (./)
In order to have a larger system, the DIRMU kit (Distributed Reconfigurable Multiprocessor, based on a processor/coprocessor pair per PMM) was used for building an EGPA with three layers. Among other things, that multiprocessor was used to investigate multigrid methods, which were established in industry at that time. Though in the coarser grids the work per processor is very small, the results were very good (see Table ). In this context, it is worth mentioning that on a system which only realized a working plane consisting of DIRMU PMMs, adaptive multigrid approaches were implemented with high efficiency [].
On the MEMSY prototype, several algorithms were realized with great success. The measured efficiency was mostly greater than %. This is true for all the above-mentioned algorithms which were ported from the pilot or DIRMU to MEMSY [], but also for the following real-world applications []:
● Calculation of the electronic structure of polymers
● Parallel molecular dynamics simulation
● Parallel processing of seismic data by wave analogy of the common depth point
● Mapping parallel program graphs into MEMSY by the use of a hierarchical parallel algorithm
● Parallel application for calculating one-/two- and ecp-electron integrals for polymers
Only for sparse matrix algorithms were efficiencies near % measured.
Outlook
At the same time as the EGPA systems mentioned here were realized, a very similar system, named PAX [], was built in Japan. Also in Japan, the system HAP embodied many of the principles of EGPA systems []. In the s, large commercial multiprocessors came onto the market. Distributed-memory MIMD machines were now available that consisted of many more processors than, e.g., the MEMSY prototype. Even global shared memory systems with cache coherence were larger with respect to the number of processors. Especially the latter multiprocessors were easier to program than EGPA systems with their restricted neighborhoods. Consequently, the research on EGPA machines was stopped, even in Erlangen.
Related Entries
Computational Sciences
Distributed-Memory Multiprocessor
Metrics
SPMD Computational Model
Topology Aware Task Mapping
Bibliographic Notes and Further Reading
Further papers dealing with the EGPA pilot are [, , , , , , ]. Publications concerning MEMSY can be found at: www.informatik.uni-erlangen.de/Projects/MEMSY/papers.html
Bibliography
1. Bode A, Fritsch F, Händler W, Henning W, Hofmann F, Volkert J () Multi-grid oriented computer architecture. ICPP, pp –
2. Cannon LE () A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University
3. Dal Cin M, Hohl W, Dalibor S, Eckert T, Grygier A, Hessenauer H, Hildebrand U, Hönig J, Hofmann F, Linster C-U, Michel E, Pataricza A, Sieh V, Thiel T, Turowski S () Architecture and realization of the modular expandable multiprocessor system MEMSY. In: Proceedings of the parallel systems fair of the th international parallel processing symposium (IPPS ’), Cancun, pp –
4. Finnemann H, Brehm J, Michel E, Volkert J () Solution of the neutron diffusion equation through multigrid methods implemented on a memory-coupled -processor system. Parallel Comput (–):–
5. Finnemann H, Volkert J () Parallel multigrid algorithms implemented on memory-coupled multiprocessors. Nucl Sci Eng :–
6. Fritsch G, Volkert J () Multiprocessor systems for large numerical applications. Parcella, pp –
7. Fritsch G, Henning W, Hessenauer H, Klar R, Linster CU, Oelrich CW, Schlenk P, Volkert J () Distributed shared-memory architecture MEMSY for high performance parallel computations. Comput Archit News ():–
8. Fritsch G, Müller H () Parallelization of a minimization problem for multiprocessor systems. In: CONPAR. Lecture Notes in Computer Science, vol . Springer, Berlin/Heidelberg/New York, pp –
9. Geus L, Henning W, Vajtersic M, Volkert J () Matrix inversion algorithms for a pyramidal multiprocessor system. Comput Artif Intell (Nr ):–
10. Gossmann M, Volkert J, Zischler H () Matrix image processing and graphics on EGPA. EGPA-Bericht, IMMD, Universität Erlangen-Nürnberg
11. Händler W () A unified associative and von-Neumann processor EGPP and EGPP array. In: Sagamore Computer Conference, pp –
12. Händler W, Klar R () Fitting processors to the needs of a general purpose array (EGPA). MICRO :–
13. Händler W, Herzog U, Hofmann F, Schneider HJ () Multiprozessoren für breite Anwendungsbereiche: Erlangen General Purpose Array. ARCS, pp –
14. Händler W, Bode A, Fritsch G, Henning W, Volkert J () A tightly coupled and hierarchical multiprocessor architecture. Comput Phys Commun :–
15. Händler W, Fritsch G, Volkert J () Applications implemented on the Erlangen general purpose array. In: Parcella ’. Akademie-Verlag, Berlin, pp –
16. Händler W, Herzog U, Hofmann F, Schneider HJ () A general purpose array with a broad spectrum of applications. In: Informatik Fachberichte, vol . Springer, Berlin/Heidelberg/New York, pp –
17. Händler W, Maehle E, Wirl K () DIRMU multiprocessor configurations. In: Proceedings of the international conference on parallel processing, pp –
18. Henning W, Volkert J () Programming EGPA systems. In: Proceedings of the th international conference on distributed computing systems, Denver, pp –
19. Henning W, Volkert J () Multigrid algorithms implemented on EGPA multiprocessor. ICPP, pp –
20. Hercksen U, Klar R, Kleinöder W () Hardware measurements of storage access conflicts in the processor array EGPA. In: ISCA ’: Proceedings of the th annual symposium on computer architecture, ACM Press, New York, La Baule, pp –
21. Hofmann F, Dal Cin M, Grygier A, Hessenauer H, Hildebrand U, Linster CU, Thiel T, Turovski S () MEMSY – a modular expandable multiprocessor system. In: Proceedings of parallel computer architectures: theory, hardware, software, applications. Lecture Notes in Computer Science, vol . Springer, London, pp –
22. Hoshino T, Kamimura T, Kageyama T, Takenouchi K, Abe H () Highly parallel processor array “PAX” for wide scientific applications. In: Proceedings of the international conference on parallel processing, IEEE Computer Society, Columbus, pp –
23. Linster CU, Stukenbrock W, Thiel T, Turowski S () Das Multiprozessorsystem MEMSY. In: Wedekind (Hrsg.) Verteilte Systeme. Wissenschaftsverlag, Mannheim/Leipzig/Wien/Zürich, pp –
24. Momoi S, Shamada S, Koboyashi M, Tshikawa T () Hierarchical array processor system (HAP). In: Proceedings of the conference on algorithms and hardware for parallel processing, COMPAR, pp –
25. Rathke M () Parallelisieren ordnungserhaltender Programmsysteme für hierarchische Multiprozessorsysteme. PhD thesis, Arbeitsberichte des IMMD, Universität Erlangen
26. Rusnock K, Raynoha P () Adapting the Unix operating system to run on a tightly coupled multiprocessor system. VMEbus Syst ():–
27. Vajtersic M () Parallel Poisson and biharmonic solvers implemented on the EGPA multiprocessor. In: ICPP
ES Earth Simulator
Ethernet
Miguel Sanchez
Universidad Politécnica de Valencia, Valencia, Spain
Synonyms
IEEE 802.3; Interconnection network; Network architecture; Thick Ethernet; Thin Ethernet
Definition
Ethernet is the name of the first commercially successful Local Area Network technology, invented by Robert Metcalfe and David Boggs while working at Xerox’s Palo Alto Research Center (PARC) in . While the prototype they developed worked at 2.94 Mbps, the commercial successor was called Ethernet and worked at 10 Mbps. Several faster versions have been developed and marketed since then. Ethernet networks are based on the idea of a shared serial bus and the use of the Carrier Sense Multiple Access with Collision Detection (CSMA/CD) technique for network peers to communicate. Ethernet media access is fully distributed. Ethernet’s evolution brought the use of switching and full-duplex links instead of CSMA/CD. At the same time, coaxial cable was first replaced by twisted pair and then by fiber optics.
Discussion

Introduction
The world of computing has always been a scenario of fast changes, where the state of the practice is challenged on a daily basis. Most of the advances in our field have come from this attitude of challenging the existing wisdom.
Thanks to that, we moved from the mainframe to the mini, from the mini to the personal computer, and now to embedded systems. When we lived in a world of a few computers, networking seemed not to be in great demand, but as the number of computers in the enterprise grew, a new need developed: to make efficient use of companies’ computing resources. It was the dawn of the Local Area Network (LAN). In the early days of computing, most interconnection focused on adding dumb terminals to the existing mainframe computers. For that purpose, different variations of serial connections sufficed. However, as first minicomputers and then personal computers started to spread in the enterprise, there was a need for another type of interconnection. In this new environment, computing power started to move from a central location to being distributed to each desk. Together with that, the need to keep systems interconnected became apparent, especially at those locations that had a significant number of mini or personal computers. Xerox Corporation played an important role in the development of both the personal computer and LANs. More specifically, Xerox founded the Palo Alto Research Center (PARC) in  to conduct independent research. Many current technologies can be traced back to PARC, from the personal computer to the Graphical User Interface to Ethernet LANs. The Xerox Alto was the first personal computer; it was developed in  (Fig. ). The advent of Ethernet as a successful wired networking technology is based on a wireless precursor
Ethernet. Fig. Original Ethernet drawing by Robert Metcalfe
developed at the University of Hawaii and first deployed in  by Norman Abramson and others. They used amateur radios to create a computer network so teletypes at different locations could access a time-sharing system. It was called the ALOHA network. All terminals used a shared frequency to transmit to the host. Simultaneous transmissions destroyed each other, but if a transmission was successful it was acknowledged by the host. A network technology was needed to link together Alto computers with printers and other devices. No existing technology was found suitable, so a team at PARC was set up to develop a new technology. That was the embryo of Ethernet.
Carrier Sense Multiple Access with Collision Detection
The basic principle of the ALOHA network, which allowed a node to transmit at any time, was simple but not very efficient. Harvard University student Robert Metcalfe sought to improve the basic idea, and so he did his PhD on a better system and joined PARC. The main improvement was that network nodes should check for an ongoing transmission before starting their own. The way to do this was to check whether a carrier signal was present or not. This technique is also called Listen Before Talk (LBT). This gave way to a technique called Carrier Sense Multiple Access (CSMA). CSMA could be improved further by adding another mechanism called Collision Detection (CD). It consisted of stopping an ongoing transmission as soon as a collision was detected. A collision happens when two or more network nodes transmit at the same time. The purpose of this second improvement was to reduce the duration of collisions. This newer version of the access method was called Carrier Sense Multiple Access with Collision Detection, or CSMA/CD, and it was the basis for the Ethernet network [].
A third improvement over ALOHA was what to do after an unsuccessful transmission (collision). While ALOHA called for a random (but short) timer, Metcalfe’s proposal was to set the retransmit timer as a random slot number chosen from a special set. Ethernet’s Binary Exponential Backoff algorithm calls for random numbers to be selected from a range that starts as {0, 1} and doubles after each collision, up to a maximum range of {0, …, 1023}. This algorithm ensures that the probability of a new collision after a collision is halved. While CSMA/CD works on continuous time, Binary Exponential Backoff works on slotted time, each slot being the time required to transmit 512 bits.
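A sketch of the resulting (truncated) binary exponential backoff rule in C follows; the constants reflect the standard Ethernet values described above, and the random number source is a placeholder.

#include <stdlib.h>

/* Truncated binary exponential backoff: after the n-th successive collision,
 * wait a random number of slot times drawn from {0, ..., 2^min(n,10) - 1}.
 * One slot time is the time needed to transmit 512 bits; in the standard,
 * transmission is abandoned after 16 failed attempts. */
unsigned backoff_slots(unsigned collisions) {
    unsigned k = collisions < 10 ? collisions : 10;   /* cap the exponent at 10    */
    unsigned range = 1u << k;                         /* 2, 4, 8, ..., up to 1024  */
    return (unsigned)rand() % range;                  /* 0 ... range - 1           */
}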
probability of a new collision after a collision is halved. While CSMA/CD works on continuous time, Binary Exponential Backoff works on slotted time, each slot being the time required to transmit bits. Robert Metcalfe was helped by Stanford student Dave Boggs and they created a working system by networking together several Alto computers.
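The retransmission rule lends itself to a compact sketch. The cap of ten doublings and the limit of sixteen attempts used below are the conventional Ethernet parameters, assumed here only for illustration:

```python
import random

def backoff_slots(collision_count, cap_exponent=10, max_attempts=16):
    """Pick the number of slot times to wait after the n-th collision.

    Sketch of truncated binary exponential backoff in the style of
    classic Ethernet; the cap of 10 doublings and the limit of 16
    attempts are the conventional parameters, assumed for illustration.
    """
    if collision_count > max_attempts:
        raise RuntimeError("too many collisions: frame is discarded")
    exponent = min(collision_count, cap_exponent)
    # The range starts as {0, 1} and doubles after each collision.
    return random.randint(0, 2 ** exponent - 1)
```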
Ethernet Topologies The original Ethernet design used a bus topology with a coaxial cable as the shared medium for all transmissions. Data transmission used a baseband signal with Manchester encoding. A bus was found to be a suitable way to broadcast information to all the network nodes, and no switching was needed. The network bus was terminated with a resistive load on each end to reduce signal reflection. Maximum segment length was up to m. Repeaters could be used to connect two or more segments together, with no more than four consecutive repeaters allowed. Maximum network diameter was , m (Fig. ). Coaxial cable, especially the thick variety, was difficult to use in an office setting due to its size, and it was expensive. The topology also had the main drawback that the bus was a single point of failure. In order to reduce the cost of the network, some revisions of Ethernet called for a thinner cable. This so-called Thin Ethernet was still based on a bus topology, but the thinner cable was not only cheaper but also easier to install in cable runs. Unfortunately, the new setup was not without its own troubles: the attachment of each station was made using so-called T-connectors, which required the coaxial cable to be cut at each joint and two connectors to be added. Both the number of connectors and the physical characteristics of the cable (RG-) limited the length of Thin Ethernet segments to m.
Ethernet. Fig. Manchester encoding as used on Ethernet
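The encoding shown in the figure can be sketched in a few lines. The convention that a ‘1’ is a low-to-high transition in the middle of the bit time (the IEEE convention used by Ethernet) is assumed by default; the opposite convention can be selected with a flag:

```python
def manchester_encode(bits, one_is_rising=True):
    """Encode a sequence of bits into half-bit-time signal levels.

    Every bit produces a transition in the middle of its bit time.
    The default convention (a '1' is a low-to-high transition) is the
    one assumed here for Ethernet; pass one_is_rising=False for the
    opposite convention.
    """
    levels = []
    for b in bits:
        rising = (b == 1) if one_is_rising else (b == 0)
        levels.extend([0, 1] if rising else [1, 0])
    return levels

# Example: the bit pattern 1 0 1 -> six half-bit levels
print(manchester_encode([1, 0, 1]))  # [0, 1, 1, 0, 0, 1]
```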
The availability of broadband coaxial cable on some premises, installed for cable TV distribution, enabled another variation of Ethernet, this time using broadband signals; this required a change in operation because transmission over such cable is unidirectional. This broadband version of Ethernet was never very successful, but it allowed a maximum distance of , m. In a quest for cheaper alternatives, coaxial cable was replaced by twisted pair, and the bus topology gave way to a star topology. While the first attempt at this only worked at a fraction of the coaxial data rate, the same speed was achieved with a later version. The main advantage of using twisted pair was its wide availability in office settings, as it was already used for voice communications. Twisted pair was also cheaper and easier to install and handle. But the use of a star topology called for a central multi-port repeater, or hub. A hub was, in essence, a bus-in-a-box. The star topology meant that if a cable was severed, only one network node was affected, whereas in a bus topology a severed coaxial cable stops the whole network. The star topology proved to be much more robust than its bus counterpart, but each cable run was limited to m, giving a maximum distance of m for each segment. Support for fiber optic media came from the need for both longer-distance links and electrically isolated links. The first use of fiber optic media in Ethernet networks was for the interconnection of repeaters; it later developed into an alternative medium to coaxial or twisted pair. While more expensive than copper, fiber optic media allowed longer distances of up to , m. Ethernet fiber optic links were point to point, which meant a star topology. A passive star over fiber was proposed but never implemented.
The First Ethernet, and the First Commercial Ethernet Success The first version of Ethernet was deployed at PARC in ; it ran at . Mbps and used -bit values for source and destination addresses. The first bits of the data field were reserved for a type identifier, so different protocols could be used over the same network, each one using a different value of the type identifier. Data frames were protected against errors by means of a Cyclic Redundancy Check of bits placed at the end of the frame.
The original name for this network was the Alto Aloha Network, as it was developed for interconnecting Alto computers with each other and with laser printers. One of the original prototype boards can now be found at the Smithsonian. However, it was soon recognized that such a name suggested the network would only work with Alto computers. The Ethernet name was vendor neutral and was based on an old concept from wireless transmission: during the nineteenth century, it was believed that wireless signals propagated through an invisible medium called the ether, a notion that the physicists Michelson and Morley had discredited in .
IEEE . also developed several choices for physical media and used a naming convention that prepended the data rate to a word describing the modulation scheme (baseband or broadband), followed by a code for the physical medium used. BASE described a Mbps version of Ethernet using Manchester encoding over RG-X coaxial cable, while BASE-T described a similar network over twisted-pair cable. Eventually twisted pair became the cabling of choice, mostly using two of the four pairs of Category and Category cable. While twisted-pair setups required the use of expensive active hubs, the savings from using cheaper cable and the improved reliability of the network compensated for the additional costs.
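A minimal sketch of this naming convention, using a few well-known designations purely as examples (they are illustrations, not an exhaustive list):

```python
import re

def parse_phy_name(name):
    """Split an IEEE 802.3-style physical-layer name into its parts.

    The convention described above: data rate in Mbps, then BASE or
    BROAD for the modulation scheme, then a segment-length or medium
    code (e.g., '5' for thick coax, 'T' for twisted pair).
    """
    m = re.match(r"(\d+)(BASE|BROAD)-?(.+)", name)
    rate, scheme, medium = m.groups()
    return int(rate), scheme, medium

print(parse_phy_name("10BASE5"))     # (10, 'BASE', '5')
print(parse_phy_name("10BASE-T"))    # (10, 'BASE', 'T')
print(parse_phy_name("100BASE-TX"))  # (100, 'BASE', 'TX')
```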
DEC, Intel and Xerox In , Xerox, DEC and Intel joined forces to develop a common specification for Ethernet products, and the first commercially successful Ethernet network was born. The specification was published as the “blue book” []. The new Ethernet specification, also known as DIX (an acronym based on the three companies’ names), called for a Mbps data rate, -bit addresses, a -bit type field, and a -bit Cyclic Redundancy Check (CRC) field (Fig. ). The data field was of variable size, up to , bytes. It was a new and improved version of the first Ethernet, and now it was set to become a viable commercial product. At the same time, the market for personal computers blossomed with many different models. While this first commercially available version of Ethernet sold well, there were some concerns about potential lock-in by the three companies, and the Institute of Electrical and Electronics Engineers (IEEE) came into play. Ethernet would be standardized within networking working group , which met for the first time on February . Ethernet became the IEEE . standard in []. The availability of the . standard eased the doubts about the maturity level of Ethernet networks. At the same time, many different companies started selling compatible products, many of them intended for the — by then — new IBM PC.
Ethernet. Fig. Ethernet frame format
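As a rough illustration of the frame layout in the figure, the following sketch assembles a DIX-style frame. The 6-byte addresses, 2-byte type field, and 4-byte CRC-32 trailer are the classic field widths, assumed here; the preamble, minimum-length padding, and the exact bit-ordering rules of the real frame check sequence are omitted for brevity:

```python
import struct
import zlib

def build_dix_frame(dst, src, ethertype, payload):
    """Assemble a simplified DIX-style Ethernet frame (a sketch only).

    dst and src are 6-byte addresses, ethertype is a 16-bit value, and
    the trailer is a CRC-32 over header and payload. This is an
    illustration of the layout, not a wire-accurate implementation.
    """
    header = dst + src + struct.pack("!H", ethertype)
    fcs = struct.pack("!I", zlib.crc32(header + payload) & 0xFFFFFFFF)
    return header + payload + fcs

# Example with placeholder addresses and the IPv4 type value
frame = build_dix_frame(b"\xff" * 6, b"\x02\x00\x00\x00\x00\x01", 0x0800, b"hello")
print(len(frame))  # 6 + 6 + 2 + 5 + 4 = 23 bytes
```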
CSMA/CD Ethernet Modeling and Problems One of the early criticisms of Ethernet was the fact that media access is probabilistic (nondeterministic), which means that transmission latency cannot be known beforehand. The deterministic behavior of other media access algorithms, like IEEE . (Token Ring), was considered superior, but average transmission latency is much lower on Ethernet than on a token-passing network (except under very heavy network loads). Ethernet (and CSMA/CD by extension) was also found to exhibit the so-called capture effect, which happens when two stations contending for the medium collide and both have more frames to transmit. The winner transmits successfully and keeps an edge in future contention over the station that lost the transmission slot after the collision. This effect causes a transient channel-access unfairness that in turn can interfere with higher-layer protocols, such as TCP, which can amplify the effect. A second cause of concern was the behavior of heavily loaded Ethernet networks [, ]. As traffic grows beyond a certain limit, the chance that a transmission attempt ends in a collision grows too. Having an increased chance of collision when the traffic load is high introduces a positive feedback that can bring the network to a congestion state. But as the Ethernet data rate goes up, so
does the overload limit. Again, the use of full-duplex Ethernet avoids this problem.
Next Stop: Mbps Ethernet Once the Mbps Ethernet IEEE standard was approved, the group was tasked to develop a Mbps version. This faster version was known as Fast Ethernet. Different standards allowed this faster speed over different types of physical media, using both copper and fiber optics. Twisted-pair cable is not very well suited for high-speed communication, so different modulation schemes were used over different types of twisted-pair cable, giving rise to several physical-layer specifications: BASE-TX uses BB MLT- coded signals over Category twisted-pair cable, while BASE-T uses BT OAM- coded signals over four twisted pairs of Category . Similarly, BASE-FX is the Fast Ethernet media access for multimode fiber. Depending on whether half-duplex or full-duplex links are used, the maximum link distance for fiber media is m or km, respectively.
Switched Ethernet Ethernet evolved not only by becoming faster but also by adding new components to the original architecture. A shared bus proved insufficient for large networks. The maximum distance constraints, together with traffic considerations, opened the way to bridges. One way to increase the capacity of a network segment was to split it into two smaller segments, each one now having fewer computers competing for the available capacity. But the two segments then needed to be interconnected so that distributed applications could still work as before. The original bus topology gave way to a star topology using twisted-pair cable. Hubs provided the interconnection between multiple stations, but there was a problem: all the stations had to work at the same data rate in the same collision domain. Bridges could be used to split one collision domain into several interconnected network segments where only intersegment traffic was shared. Both hubs and repeaters enabled the interconnection of two or more network segments; the term “hub” is used only for the interconnection of point-to-point links, while repeaters could be used to interconnect two coaxial-cable Ethernet segments.
From the logical point of view, both exhibit the same behavior, working at the physical layer. A repeater retransmits a digital signal so it can reach a longer distance; it does not create a separation of the collision domain. A bridge is an interconnection device that works at the Link Layer. Bridges connect two or more cable runs called network segments. Bridges understand MAC addresses and create several independent collision domains; only traffic intended for another network segment is actually forwarded by a bridge. The aggregate bandwidth of a network interconnected by a bridge is higher than if the interconnection uses a repeater, but so is the latency. The term “bridge” was usually associated with the interconnection of large, usually coax-based segments, while the term “switch” referred mostly to interconnection devices using point-to-point links, where each port would have a workstation or server. From the logical point of view, bridges and switches are similar. Little by little, bridges were replaced by new devices called switches that offered similar functionality but with many ports per unit. Switches started to replace network hubs; there was a performance increase, but the cost was high too. The different topologies of Ethernet had to be loop-free, but given that bridges or switches were used to interconnect network segments, there was a risk of creating network loops. On the other hand, network loops are a way to introduce redundancy into the topology of a network, a feature that could be very useful if it did not interfere with Ethernet media access. The problem was that, without a new mechanism, frames would be forwarded forever around any loop in the network topology. The brilliant solution to the problem was the Spanning-Tree protocol [], invented by Radia Perlman while working at Digital Equipment Corporation and ratified as the IEEE .d standard. This protocol, running on switches and bridges, creates a spanning tree out of the existing network topology. Frames are only forwarded on network segments that belong to the spanning tree and, since a tree is a loop-free topology, the problem of forwarding loops is solved. Whenever one of the links in use stops working, a new spanning
tree is calculated making use of the available redundant links. Bridges and switches exchange control frames periodically to maintain the spanning tree. A switch, like a bridge, works at the Link Layer and creates an independent collision domain on each port. Switches are store-and-forward devices that can easily handle different data rates on different ports, and they can handle simultaneous frame forwarding on different ports, an event that would have caused a collision had a hub been used. Different proprietary flow-control mechanisms were developed for half-duplex switches based on back-pressure congestion management: a switch may transmit a carrier signal on those ports where it wants to alleviate congestion. Ethernet switches may use different approaches to switching. The simpler way is the store-and-forward approach, where a frame is stored in a buffer and retransmitted only once it has been received entirely by the switch. Cut-through switches store only the very first bytes of the frame, and once the destination address is known the forwarding of the frame can start. This second approach leads to a significant reduction of the switching latency experienced by frames. An Ethernet network that uses switches instead of hubs enables a new improvement: full-duplex links. Provided that the physical medium has two different circuits for simultaneous transmission and reception (e.g., two different copper pairs on BASE-TX), it is possible for a station to transmit and receive at the same time. While full-duplex Ethernet works at the same data rate as half-duplex, higher throughput can be achieved with full-duplex. Besides, collisions do not occur on a full-duplex point-to-point link, and a collision-free link is more effective because the time previously devoted to collisions can be used to transmit more data. However, network adapters still need to keep the CSMA/CD functionality for backward compatibility. Together with full-duplex operation came the .x flow-control mechanism. Sometimes a receiver on a full-duplex link may not be able to handle frames fast enough. Ethernet flow control is based on the receiver transmitting a special pause frame to a multicast MAC address. Such a frame reports the amount of time (in bit-time units) the sender on that full-duplex link must wait before the next transmission. If a receiver is
able to accept data sooner than expected, a new pause frame with zero wait-time can be transmitted.
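Because the pause value is expressed in bit times, the actual quiet time depends on the link speed. A small sketch follows; the 512-bit quantum is the conventional unit, assumed here, and the example link speed is arbitrary:

```python
def pause_duration_seconds(pause_quanta, bit_rate_bps, bits_per_quantum=512):
    """Convert a pause-frame value into seconds of quiet time.

    The pause frame carries a count of quanta; each quantum is
    conventionally 512 bit times (assumed here). A value of zero
    cancels any pause still in effect.
    """
    return pause_quanta * bits_per_quantum / bit_rate_bps

# Example: 1000 quanta on a hypothetical 100 Mbps link
print(pause_duration_seconds(1000, 100e6))  # 0.00512 s
```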
Ethernet Gets Faster In May , the Gigabit Ethernet Alliance was founded by eleven companies interested in the development of Gigabit Ethernet (i.e., , Mbps Ethernet). The original aim of this new Ethernet version, faster than Fast Ethernet, was to make server and backbone connections even faster. Gigabit Ethernet was covered by the IEEE .z and .ab standards and was approved in . Gigabit Ethernet can use multi-mode fiber, single-mode fiber, or twisted-pair copper cable. Gigabit Ethernet over copper (known as BASE-T) poses a special challenge for twisted-pair connections, which could use Category cable only if it met certain restrictions; this is the reason for Category E and Category copper cable. Different techniques were used to make gigabit speed over copper possible (e.g., QAM modulation to utilize the channel bandwidth more efficiently). Many changes were needed beyond speed. Half-duplex operation was still allowed, so a technique called carrier extension was added to allow Ethernet to keep CSMA/CD and the ordinary m maximum network distance: an extension field is appended as needed to ensure a minimum frame length of bytes. Another optional improvement was frame bursting, which allows the transmission of consecutive data frames after a single MAC arbitration on half-duplex channels. In June , Gigabit Ethernet was ratified. With this new version of Ethernet, the performance level reaches or surpasses that of other Metropolitan Area Network (MAN) and Wide Area Network (WAN) technologies, such as OC- and OC- SONET or SDH links. This is the first version of Ethernet supporting only fiber optic media (GBASE-CX barely supports m of copper cable, just for attaching a switch to a router). Link ranges are between m and km, and both multi-mode and single-mode fibers are supported.
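The carrier extension rule described above can be illustrated with a short sketch; the 512-byte slot commonly cited for gigabit half-duplex operation is an assumption used only for this example:

```python
def carrier_extension_bytes(frame_len, slot_bytes=512):
    """Extension bytes appended in half-duplex gigabit operation.

    Sketch only: the 512-byte slot size is the value commonly cited
    for gigabit carrier extension and is assumed here.
    """
    return max(0, slot_bytes - frame_len)

print(carrier_extension_bytes(64))   # 448 extension bytes for a minimum frame
print(carrier_extension_bytes(600))  # 0, frame already fills the slot
```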
Ethernet and Beyond New versions of Ethernet are arriving in the form of Gbps Ethernet and Gbps Ethernet. The IEEE .ba standard
was ratified in June . Somewhat unusually, an interim Gbps version was developed largely as a step toward Gbps Ethernet. The .ba standard calls for full-duplex fiber optic links. Data Center Ethernet (DCE) extends the Ethernet architecture to improve networking and management in data centers. Several features, such as Priority-Based Flow Control, Enhanced Transmission Selection (standardized as .Qaz), multipathing, and Congestion Management (.Qau), focus on improving network capabilities and manageability. The vision of DCE is a unified fabric where LAN, storage area network (SAN), and inter-process communication (IPC) traffic converge. Technologies like Fibre Channel over Ethernet (FCoE) enable data centers to consolidate interconnection and internetworking technologies into a single network, which may simplify connectivity and reduce data center power consumption. Whether this is the last stop for Ethernet technology is unclear; however, if the past is any indication of what lies ahead, an even faster version of Ethernet should be expected within a few years.
Related Entries
Buses and Crossbars
Clusters
Distributed-Memory Multiprocessor
Network Interfaces
Network of Workstations

Bibliographic Notes and Further Reading
Access to many standards has been fee-based. The IEEE started a move a few years ago by opening its standards documents to the public for free: published standards become available for free download some months after publication. That means the details of any approved Ethernet standard are easily available by visiting http://standards.ieee.org/getieee/..html. Unfortunately, standards are usually written so that all details are covered, but they are often not reader-friendly. For a good understanding of Ethernet technology, Charles Spurgeon’s guide [] is a very appropriate text.

Bibliography
1. Metcalfe RM, Boggs DR () Ethernet: distributed packet switching for local computer networks. Commun ACM ():–
2. The Ethernet () A local area network: data link layer and physical layer specifications (version .). Digital Equipment Corporation, Intel, Xerox
3. ANSI/IEEE Standard .– Carrier sense multiple access with collision detection. IEEE, October
4. Boggs DR, Mogul JC, Kent CA () Measured capacity of an Ethernet: myths and reality. In: Proceedings of the ACM SIGCOMM symposium on communications architectures and protocols, Stanford, CA (also in Comput Commun Rev (), August)
5. Takagi H, Kleinrock L () Throughput analysis for persistent CSMA systems. IEEE Trans Commun COM-():–
6. Perlman R () Interconnections: bridges, routers, switches, and internetworking protocols, 2nd edn. Addison-Wesley, Reading, MA
7. Spurgeon CE () Ethernet—the definitive guide. O’Reilly & Associates, Sebastopol, CA

Event Stream Processing Stream Programming Languages

Eventual Values Futures
Exact Dependence Dependence Abstractions
Exaflop Computing Exascale Computing
Exaop Computing Exascale Computing
Exascale Computing Ravi Nair IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
Synonyms Exaflop computing; Exaop computing
Definition Exascale computing refers to computing with systems that deliver performance in the range of 10¹⁸ (exa) operations per second.
Discussion Introduction In the computing literature, it is customary to measure progress in factors of 1,000. The term exa, standing for 10¹⁸, is derived from the Greek word ἕξ (hex), meaning six. Thus exascale computing refers to systems that execute between 10¹⁸ and 10²¹ operations per second. This performance level represents a 1,000-fold increase in capability over petascale computing, which itself represents a 1,000-fold increase over terascale computing. There is often confusion over how to denote the performance of a system. The most common measure is the number of double-precision floating-point operations that are completed every second (flops) when the system is running the Linpack dense linear algebra benchmark. By this measure, an exascale system is one which can execute Linpack at a rate of one exaflops or higher. For a given system, this is usually less than the peak double-precision floating-point execution rate, typically by –%, because of practical limitations in exploiting the full floating-point capability of a system. In many systems that implement only single-precision floating point in hardware, the peak single-precision floating-point capability may be more than double the flops as measured for Linpack. Petascale computing was ushered in with the introduction of the IBM Roadrunner system in May , when it ran Linpack at . petaflops. If trends continue as in the past, it is reasonable to expect an exascale system to be operational around .
A list of the Top supercomputers in the world has been maintained since June and is a useful indicator of trends in the capabilities of the largest scientific machines. The list shows that there has indeed been a trend of doubling of Linpack performance every year, but that the manufacturer of the highest-performing system and the architecture of the system have been changing over time. While shared-memory computers and even uniprocessors made the list in earlier days, today the list is dominated by cluster supercomputers and massively parallel processors, the latter being distinguished from the former by their highly customized interconnection between nodes. With growing interest in the use of large-scale machines for applications that are considered commercial (or, more appropriately, nonscientific), the term exascale is also being used to refer to systems which can carry out operations per second ( exaops). There continues to be ambiguity in what constitutes an operation in this terminology. An operation is variously defined as an instruction, an arithmetic or logical operation, an operation intrinsic to the execution of an algorithm, and so on. This has led to one abstract definition of an exascale system to be one that has a performance capability roughly times more than a system solving the same problem in . In any case, the needs of the two worlds, scientific and commercial, appear to be converging especially from the points of view of compute capability, system footprint, energy requirements, and cost. It remains to be seen whether, in comparison to systems of , the first exascale system will execute only Linpack times faster, or whether it will solve a variety of problems, both scientific and commercial, faster by a factor of .
The Importance of Exascale Large supercomputers are already being used to solve important problems in a variety of areas. The International Exascale Software Project (IESP) has categorized the present areas of application of supercomputers as follows:
● Material Science
● Energy Science
● Chemistry
● Earth Systems
● Astrophysics and Astronomy
● Biology/Life Systems
● Health Sciences
● Nuclear and High Energy Physics
● Fluid Dynamics
Each of these areas has specific problems and a set of algorithms customized to solve those problems. However, there is an overlap in the use of algorithms across these domains, because of the similarity in the basic nature of some of the problems. Many of these problems could benefit from even larger supercomputers, but the expectation is that exascale systems will also bring a qualitative change to problem solving, allowing them to be used for new types of problems and in new areas.

There are two ways in which larger supercomputers can help in the solution of existing problems. The first is called strong scaling, where the problem remains the same, but the solution takes less time. Thus, a weather calculation that forecasts the weather a week ahead is more useful if it takes h to solve rather than days. The other form of scaling is called weak scaling, where the size of the problem that can be solved is vastly increased with the help of exascale computers and their significantly larger resources. Thus the quality of a weather forecast can be improved if it is based on global data rather than local data, if the forecast is based on data from a larger number of more closely located sensors, or if the model is increased in complexity, for example, through the use of more types of meteorological information.

Beyond just scaling of existing applications, large-scale computing enables new types of computation. It becomes possible to model in a single application a physical phenomenon at different scales in time, space, or complexity simultaneously, for example, at the atomic, molecular, granular, component, or structural scale, or at the molecular, cellular, tissue, organ, or organism scale. This capability is already becoming evident at the petascale level and will be further exploited at the exascale level. Another new and interesting class of applications that will be further exploited in exascale computing is the quantification of uncertainty in mathematical models. These applications involve the use of massive parallel computing to understand the behavior
of physical systems in the presence of both systematic and random variations. The list of areas that could benefit from the use of supercomputers has been increasing steadily. Applications for supercomputing are also emerging in areas that were not traditionally considered scientific; examples are visualization, finance, and optimization. Supercomputers have been used in the past by large movie production companies to render animated movies frame-by-frame over long periods of time. However, for typical scientific applications, the results produced by supercomputers were often packaged and shipped to a significantly smaller computer which the scientist used to analyze and visualize the results. The point where it will be too expensive to ship the vast amount of data produced by supercomputers is fast approaching. Increasingly, supercomputers will be used to process the results of their computation and produce visualization data to be shipped over to the scientist. The area of business analytics, traditionally considered a commercial application, is fast embracing the use of massive computers. The potential use of massive computation in such areas suggests that exascale computers are likely to represent the point at which supercomputers shift from being the largely simulation-oriented scientific machines that they have traditionally been to more versatile machines solving fundamentally different types of problems for a larger section of humanity.
The Challenges of Exascale The principal engine for the advances made by supercomputers over the past decades has been Dennard scaling, through which the electronics industry has enjoyed the perfect combination of increased density, better performance, and reduced energy per operation with successive improvements in silicon technology. When all dimensions of an electronic circuit on a chip, for example, the transistor gate width, the diffusion depth, and the oxide thickness, are reduced by a factor α, and when the voltage at the circuit is also reduced by a factor α, it is possible to increase the density of packing of devices on the chip by a factor of α², increase the frequency of operation of the devices by a factor of α, and reduce the energy per operation by a factor of α³. With process technologists providing lithography improvements that
allowed a doubling of transistor density roughly every years, with microarchitects discovering new ways to expose parallelism at the instruction level, and with system architects engineering sophisticated techniques to combine multiple processors into ever larger parallel systems, it was not difficult to maintain a steady doubling of supercomputer system performance every year at almost constant cost. Going forward from the current nm technology node, it will become increasingly difficult to keep up with Dennard scaling. With oxide thickness down to a few atoms, voltages cannot be scaled proportionally without serious compromise to the reliability and predictability of the operation of devices. It is no longer reasonable to expect the same frequency improvement for constant power density that was enjoyed in the past. At the same time, the nature of typical application programs makes the detection and exploitation of additional parallelism within a single thread of execution difficult. Thus, scaling the total number of threads in a system remains the main option to increase the performance of future large-scale systems. With current petascale systems already sporting hundreds of thousands of cores, the expected number of cores in an exascale system is likely to be in the hundreds of millions. Due to various restrictions that will be enumerated below, most of this increase will have to come from using whatever lithography advances are possible to increase the density of cores on a chip. The task of achieving the goal of building an exascale system by is daunting. There are challenges in implementing the hardware, in designing the software environment, in programming the system, and in keeping the machine running reliably for long periods at a stretch. Listed below are brief descriptions of some of these challenges.
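The idealized scaling relations described above can be summarized in a few lines; this is a textbook idealization, not a model of any particular process:

```python
def dennard_scale(alpha):
    """Idealized Dennard scaling when dimensions and voltage shrink by alpha.

    Returns the relative change in device density, operating frequency,
    and energy per operation, ignoring leakage and the voltage-scaling
    limits discussed above.
    """
    return {
        "density": alpha ** 2,          # devices per unit area
        "frequency": alpha,             # switching speed
        "energy_per_op": 1 / alpha ** 3,
    }

print(dennard_scale(1.4))  # roughly one technology generation
```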
Power A quick glance at the Top list shows that the power ratings for the top two petascale systems are, respectively, MW/Petaflops for the Cray system and . MW/Petaflops for the IBM Roadrunner system. At this rate, an exascale system could consume as much as – MW of power, a quantity that is larger than the amount of power consumed by a medium-sized
US city. The cost of acquiring the largest supercomputer has remained largely constant over the years at around $– million. This is the cost of about – MW-years of energy. Thus it is unlikely that exascale installations will be willing to provide more than about – MW of power, a third of which can be expected to be lost in cooling and power distribution. There is thus a gap of x–x between the power efficiency of today’s machines and the desired efficiency of exascale machines. Traditional CMOS scaling would have bridged that gap in about four technology generations, but with the slowing down of CMOS scaling and with longer intervals between introductions of technology generations, it is imperative to look for other avenues to achieve the desired power efficiency. This gap must be bridged for all components of the system – the processor, the memory system, and the interconnection network. At the processor level, it is likely that exascale systems will include domain-specific accelerators because it is easier to build hardware that is power efficient in a specific domain rather than across a general range of applications. Vector units, which perform the same computation on multiple data simultaneously, could play an even bigger role in exascale systems precisely for this reason. Unlike what is seen in desktop and server processors today, processor designers will be forced to take a minimalist view in every part of the processor, paring overhead down to a minimum without significant impact on performance. Power-efficient optics technology will need to be deployed in interconnection networks if the total system bandwidth has to be scaled proportionally from what exists today. Memory system designs, both at the main memory level and at the cache level, will have to be reinvented with the above stringent power-efficiency requirements in mind.
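A rough way to see why the power budget is so constraining is to divide it by the target operation rate. The budget used below is only a placeholder, since the text leaves the exact figure open:

```python
def joules_per_op(power_watts, ops_per_second=1e18):
    """Energy budget per operation for a machine of the given power.

    Illustrative only: the power figure is a free parameter.
    """
    return power_watts / ops_per_second

# e.g., a hypothetical 20 MW budget leaves 20 pJ for every operation
print(joules_per_op(20e6))  # 2e-11 J, i.e., 20 picojoules
```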
Memory One of the trends seen in the top supercomputers is that the fraction of the total cost of the system and the fraction of the total system power going into memory has been steadily rising, even though the ratio of memory capacity to the computational performance (bytes/flops) has been decreasing. In the Top machines, this ratio has decreased from byte/flop to as low as . byte/flop in some cases. The result of this is that there is a greater pressure on applications
to adapt to the reduced amount of memory by carefully managing the locality of applications, and a pressure on processor designers to provide ever-growing cache capacities and deeper cache hierarchies, both of which add to the challenge of streamlining the processor design to achieve a low power-performance ratio for exascale systems. Another problem is the bandwidth at the memory interface. DRAM designs have been catering mainly to the commodity market and hence, while their capacities have been keeping up with Moore’s Law, their bandwidths have not kept up with increases in processor performance. Cost considerations have prevented supercomputer vendors from designing custom DRAM solutions. However, the rampant data explosion in all types of information processing is forcing even commodity memory vendors to rethink the designs of memory interfaces and modules. It is conceivable that such new designs along with new storage-class memory technologies and new packaging technologies will do a better job in meeting the needs of exascale systems.
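The bytes/flops ratio mentioned above translates directly into total memory capacity; both numbers in the example are assumptions chosen only for illustration:

```python
def memory_capacity_bytes(peak_flops, bytes_per_flop):
    """Total memory implied by a bytes/flops provisioning ratio.

    Illustrative only: the text above notes that the ratio has been
    shrinking over time, so both arguments are assumptions.
    """
    return peak_flops * bytes_per_flop

# e.g., an exaflops machine provisioned at 0.1 byte/flop
print(memory_capacity_bytes(1e18, 0.1))  # 1e17 bytes, i.e., 100 PB
```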
Communication An important aspect of all large-scale scientific applications is the amount of time spent communicating between computation nodes. Applications impose a certain communication pattern which often reflects the locality of interactions of objects in the physical world. Thus the topologies of interconnection networks in large systems tend to be optimized for such communication patterns. Increasingly, however, the nature of algorithms and the types of problems being solved are imposing greater randomness in the pattern of communications and hence greater pressure on the available bandwidth. One measure used to characterize the behavior of systems for nonlocal communication patterns is the bisection bandwidth. This is the least bandwidth available between any two partitions of the system each having half the total number of computing nodes. Petascale systems for the High Productivity Computing Systems (HPCS) program are required to have a bisection bandwidth of at least . PB/s. But few organizations will either need or will be able to afford such a high bandwidth per flops ratio.
At the exascale level, costs will force the bandwidth per flops ratio to be even lower. The ExaScale Computing Study sponsored by DARPA estimated the bisection bandwidth requirements at the exascale level to be in a range between PB/s and EB/s. Such a high bandwidth will force a predominant use of optics technology across the system, not just between racks, but also at the board level and perhaps even at the chip level. Exascale computing would benefit from lower costs through commoditization of optics components and from further advances in optics technology, like silicon photonics.
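To make the definition of bisection bandwidth concrete, consider one illustrative topology, a binary hypercube; the node count and link bandwidth below are arbitrary:

```python
def hypercube_bisection_bw(num_nodes, link_bw):
    """Bisection bandwidth of a binary hypercube (one illustrative topology).

    Cutting a hypercube into two equal halves severs num_nodes/2 links,
    so the bisection bandwidth is that many link bandwidths. Used here
    only to make the definition above concrete.
    """
    return (num_nodes // 2) * link_bw

# e.g., 1024 nodes with hypothetical 100 GB/s links -> 51.2 TB/s
print(hypercube_bisection_bw(1024, 100e9))
```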
Packaging As mentioned earlier, the power goal of – MW for an exascale system will impose challenging requirements in all aspects of the system. One such aspect is packaging. The number of racks in an exascale system cannot be allowed to increase dramatically from today because of limitations in the size and cost of installations. Thus each rack will have to accommodate more boards and each board will have to accommodate more chips than at present. Increased packaging density will increase power density at various levels of the system and will make testing, debugging, and servicing of the system more challenging. Two technologies that are maturing and that could help in this respect are D stacking of silicon dies and the use of silicon carriers. D stacking not only helps in providing a compact footprint but also reduces the latency of communication between circuits on different layers. With judicious use of through-silicon vias (TSVs), D technology could also provide greater bandwidth between sections of the processor design without having to resort to large, expensive dies. In comparison to the mm pitch of printed-circuit board technology, silicon carrier technology allows mounting of dies closer than um, wiring pitch of – um, and an I/O pitch of – um. The higher signaling rates, the bandwidth, and the higher I/O density promised by silicon carrier technology at lower power and in a compact footprint will go a long way toward meeting exascale goals. Supercomputers have always been at the forefront in packaging; exascale promises to be no different.
Reliability The number of components in a system has a direct bearing on the reliability of the system. Today’s petascale systems, even with their redundancy and errorcorrecting techniques at the cache and memory levels, have a mean time between failures (MTBF) of less than a month. A direct extrapolation for a -fold increase in the number of components brings this number down to a few minutes. The conventional method of handling failures is to take a checkpoint of the relevant state of an application at an interval well under the MTBF of the system. When a failure occurs, the system is interrupted, the state reverted back to a previously checkpointed state, and execution resumed. The maintenance of checkpoints requires the system to be periodically interrupted and typically involves copying of processor and memory state to some other part of memory or to storage. The overhead of this checkpointing process could impede forward progress of the application if it has to be performed too often. New techniques for detection of faults and for recovery from faults are needed for future exascale systems. While coding and redundancy techniques could be used to improve the MTBF of the system, they come at a cost – in area, in power, and often in performance. With the exascale goal already a significant challenge on all these fronts, there is a growing belief that reliability techniques can no longer be application agnostic, and that applications will need to guide when and how checkpointing needs to be done in future large systems. Usability The eventual success of exascale computing will be measured not by whether it can execute the Linpack benchmark at an exaflops rate but by how successfully it is embraced by the community and how fundamentally it transforms both science and commerce. The preceding discussion about the challenges of implementing an exascale system suggests that the role of software will be important in the effective exploitation of such a system. On the one hand, life for application developers will be made easier if the system supports today’s programming models without change, but on the other hand, software will have to work around the necessary restrictions in hardware due to reasons mentioned above. Perhaps the most effective way in which this can be accomplished is through widespread adoption of a
new exascale programming model. Neither MPI, the standard message-passing paradigm, nor OpenMP, the standard shared-memory paradigm, alone can exploit a system almost certain to include a cluster of shared-memory SIMD processors. There is growing interest in a hybrid model, for example, one that uses OpenMP at the local chip level and MPI at higher levels. New programming languages classified as Partitioned Global Address Space (PGAS) languages, like X, UPC, Co-Array Fortran, and Titanium, provide the programmer with the view of one large shared address space for the entire system, but still allow the programmer to specify partitions of the address space local to each thread. Various parts of the software stack will also be expected to incorporate features to tolerate the expected high rate of failures as discussed in the previous section. While this could be handled directly through the programming language, ease-of-use considerations are likely to force reliability to be taken care of either at the compiler level or in the runtime system, as in the Map-Reduce programming model. The role of compilers and runtime systems also becomes important if the system is heterogeneous like some of today’s petascale systems or if domain-specific accelerators and field-programmable gate arrays (FPGAs) become part of the system hardware. While a small number of proficient users, the so-called top-gun programmers, would enjoy programming with such a diverse array of components, the vast majority of users would prefer a compiler or the system to decide when and how to employ these components to deliver optimal performance for their programs. A new organization, the International Exascale Software Project (IESP), has been formed with the recognition that unless a coordinated global effort is begun now, the programming community will not be ready to utilize exascale systems effectively when they are deployed. The IESP has taken upon itself the task of outlining a roadmap for the development of systems software (including operating systems, runtime systems, I/O systems, systems management software), of development environments (including programming models, frameworks, compilers, libraries, and debugging tools), of algorithm support (including data management, analysis, and visualization), and also of algorithms for common applications.
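Returning to the checkpointing overhead discussed under Reliability above, Young’s well-known first-order approximation gives a feel for how often checkpoints must be taken as the system MTBF shrinks; the figures in the example are hypothetical:

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds.
    mtbf_s: mean time between failures of the whole system, in seconds.
    The numbers below are hypothetical, chosen only to show how a
    shrinking MTBF forces more frequent checkpoints.
    """
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# A 5-minute checkpoint with a 1-day MTBF vs. a 30-minute MTBF
print(optimal_checkpoint_interval(300, 86400) / 60)  # about 120 min
print(optimal_checkpoint_interval(300, 1800) / 60)   # about 17 min
```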
Conclusion Exascale systems promise to usher in a new era of computing where massive supercomputers will be used not only for traditional simulations of large scientific problems but also to process the vast amount of data that is being produced especially from new types of ubiquitous devices. The success of such large systems, however, will be determined by how effectively they are utilized. Exascale systems can be expected to blaze new trails in power-efficient hardware design, in a system-wide approach to ensuring availability despite expected failures, and in new software approaches to maximize the utilization of the available computation power.
Related Entries
Chapel (Cray Inc. HPCS Language)
Checkpointing
Clusters
Coarray Fortran
Distributed-Memory Multiprocessor
Fault Tolerance
Hybrid Programming With SIMPLE
LINPACK Benchmark
Massive-Scale Analytics
Metrics
Networks, Direct
Networks, Fault-Tolerant
Networks, Multistage
PGAS (Partitioned Global Address Space) Languages
Power Wall
Reconfigurable Computers
TOP
UPC

Bibliographic Notes and Further Reading
The most comprehensive description of the challenges of exascale systems can be found in the report of a study [] under the sponsorship of the DARPA Information Processing Techniques Office. The software challenges are detailed in a similar study described in []. The International Exascale Software Project [] is also actively pursuing an agenda to ensure adequate software exploitation of future exascale systems. DARPA also sponsored a study on reliability aspects of exascale systems, and the resulting whitepaper is available at []. The Top Web site [] gives a historical trend of the performance and capabilities of the highest-performing scientific computers. The limits of CMOS scaling and its effect on power consumption in future technologies are described well in []. An excellent overview of the architecture, power consumption, and economics of large datacenters may be found in []. There are several new technologies that promise to supplant or complement nonvolatile storage. A good description of storage-class memories may be found in []. There is recent interest also in the use of such memories as extensions to traditional DRAM main memory []. A comprehensive description of challenges in packaging and potential solutions can be found in []. The promise of D technology is discussed in []. An introduction to recent advances in silicon photonics appears in [].
Bibliography
1. Barroso LA, Hölzle U () The datacenter as a computer: an introduction to the design of warehouse-scale machines. In: Hill M (ed) Synthesis lectures on computer architecture. Morgan and Claypool Publishers, San Rafael, CA. Available: http://www.morganclaypool.com/doi/abs/./SEDVYCAC. Accessed Feb
2. Elnozahy EN (ed) () System resilience at extreme scale. Available: http://www.darpa.mil/ipto/personnel/docs/ExaScale_Study_Resiliency.pdf. Accessed Feb
3. Frank D () Power-constrained CMOS scaling limits. IBM J Res Dev (/):–
4. Freitas RF, Wilcke WW () Storage-class memory: the next storage system technology. IBM J Res Dev (/):–
5. International Exascale Software Project () Main page. Available: http://www.exascale.org/iesp/Main_Page. Accessed Feb
6. Knickerbocker JU et al () Development of next-generation system-on-package (SOP) technology based on silicon carriers with fine-pitch chip interconnection. IBM J Res Dev (/):–
7. Kogge PM (ed) () ExaScale computing study: technology challenges in achieving exascale systems. Available: http://www.darpa.mil/ipto/personnel/docs/ExaScale_Study_Initial.pdf. Accessed Feb
8. Qureshi MK, Srinivasan V, Rivers JA () Scalable high performance main memory system using phase-change memory technology. In: Proceedings of ISCA-. Austin, TX, USA
9. Sarkar V (ed) () Exascale software study: software challenges in extreme scale systems. Available: http://www.darpa.mil/ipto/personnel/docs/ExaScale_Study_Software.pdf. Accessed Feb
10. Top Supercomputer Sites () Home page. Available: http://top.org/. Accessed Feb
11. Topol AW et al () Three-dimensional integrated circuits. IBM J Res Dev (/):–
12. Tsybeskov L, Lockwood DJ, Ichikawa M () Silicon photonics: CMOS going optical. Proc IEEE ():–
Execution Ordering Scheduling Algorithms
Experimental Parallel Algorithmics Algorithm Engineering
Extensional Equivalences Behavioral Equivalences
F Fast Fourier Transform (FFT) FFT (Fast Fourier Transform)
Fast Multipole Method (FMM) N-Body Computational Methods
Fast Poisson Solvers Rapid Elliptic Solvers
Fat Tree Networks, Multistage
Fault Tolerance Hans P. Zima, Allen Nikora California Institute of Technology, Pasadena, CA, USA
Definition Fault tolerance denotes the capability of a system to adhere to its specification and deliver correct service in the presence of faults.
Discussion Introduction Modern society relies increasingly on highly complex computing systems across a wide spectrum of applications, ranging from commercial transactions to the control of transportation and communication systems,
the electrical power grid, nuclear reactors, and space missions. In many cases, human lives and huge material values depend on the well-functioning of these systems even under adverse conditions. Despite the successes in verification and validation technology over past decades, theoretical as well as practical studies convincingly demonstrate that large systems typically do contain design faults, even when subject to the strictest development disciplines. Moreover, even a perfectly designed system may be subject to external faults, such as radiation effects and operator errors. As a consequence, it is essential to provide methods that avoid system failure and maintain the functionality of a system, possibly with degraded performance, even in the case of faults. This is called fault tolerance. A fault is defined as a defect in a system that may cause an error during its operation. If an error affects the service to be provided by a system, a failure occurs. Fault-tolerant systems were built long before the advent of the digital computer, based on the use of replication, diversified design, and federation of equipment. In an article on Babbage’s difference engine published in , Dionysius Lardner wrote []:“The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers; and this check is rendered still more decisive if they make their computation by different methods.” Early fault-tolerant computers include NASA’s Self-Testing-and-Repairing (STAR) computer, developed for a -year mission to the outer planets in the s, and the computers onboard the Jet Propulsion Laboratory’s Voyager systems. Today, highly sophisticated fault-tolerant computing systems control the new generation of fly-by-wire aircraft, such as the Airbus and Boeing airliners, protecting against design as well as hardware faults. Perhaps the most widespread use of fault-tolerant computing has been in the area of commercial transactions systems, such as automatic teller machines and airline reservation systems.
The focus of this article is on software methods for dealing with faults that may arise in hardware or software systems. Before entering into a discussion on fault tolerance, the broader concept of dependability is explored below.
Dependability
Fault tolerance is one important aspect of a system’s dependability, a property that has been defined by the IFIP . Working Group on Dependable Computing and Fault Tolerance as the “trustworthiness of a computing system which allows reliance to be justifiably placed on the service it delivers.” Whereas fault tolerance deals with the problem of ensuring correct system service in the presence of faults, the notion of dependability also includes methods for removing faults during system design, for example, via program verification, or for preventing faults a priori, e.g., by imposing coding rules that outlaw common design faults.
Basic Attributes of Dependability The attributes of dependability specify a set of properties that can be used to assess how a system satisfies its overall requirements. A system’s dependability specification needs to include the requirements for each of the individual attributes in terms of the acceptable frequency and severity of service failures for specified classes of faults and the system’s operating environment. Short definitions for a typical list of attributes are provided below; a detailed discussion can be found in [, , ]. The reliability, R(t), of a system at a time, t, is specified as the probability that the system will provide correct service up to time t. For example, the reliability required for aircraft control systems has been specified as R(t) > 1 − 10⁻⁹ for t = 1 h (this is often informally characterized as “nine nines”). The mean time to failure (MTTF) is the expected time that a system will operate correctly before failure. System availability at a time, t, is defined as the probability that the system can provide correct service at time t. The safety of a system is characterized by the absence of catastrophic consequences on its user and the environment it is operating in; more formally, it can be defined as the probability of failing in a controlled, or safe, manner (see section “Controlled Failure”). Confidentiality is a system’s capability to prevent the unauthorized disclosure of information. Maintainability characterizes the ability of a system to undergo modifications, including repairs: one specific aspect is the time it takes to restore a failed system to normal operation.
Threats A system is associated with a specification, which describes the service the system is intended to provide to its user (a human or another system). It consists of one or more interacting components. A threat is any fact or event that negatively affects the dependability of a system. Threats can be classified as faults, errors, or failures; their relationship is illustrated by the fault-error-failure chain [] shown in Fig. . A fault is a defect in a system. Faults can be dormant – e.g., incorrect program code that is not executed – and have no effect. When activated during system operation, a fault leads to an error, which is an illegal system state. A fault inside a component is called internal; an external fault is caused by a failure propagated from another component, or from outside the system. Errors may be propagated through a system, generating other errors. For example, a faulty assignment to a variable may result in an error characterized by an illegal value for that variable; the use of the variable for the control of a for-loop can lead to ill-defined iterations and other errors, such as illegal accesses to data sets and buffer overflows. A failure occurs if an error reaches the service interface of a system, resulting in system behavior that is inconsistent with its specification. The execution of a system can be modeled by a sequence of states, with state transitions caused by atomic actions. In a first approximation, the set of all system states can be partitioned into correct states and error states. By separating those error states that allow a recovery from those that represent system failure, the set of all error states is further partitioned into tolerated error states and failure states (Fig. ). The transitions between these state categories can be described by unambiguously classifying each action as a correct, fault, or recovery action []:
● A correct action, executed in a correct state, results in a correct state
Fault Tolerance. Fig. Threats: the fault-error-failure chain
Fault Tolerance. Fig. State-space partitioning
A fault action, executed in a correct state, results in an error state (tolerated or failure)
A correct or fault action, executed in a tolerated error state, results in an error state (tolerated or failure) ● A recovery action, executed in a tolerated error state, results in a correct state. ● For a class of systems that are referred to as fail-controlled recovery from failure is possible using special protocols (see section “Controlled Failure”). A system is fault tolerant if it never enters a failure state: this means that errors may occur, but they never reach the service boundary of the system and always allow recovery to take place. The implementation of fault tolerance in general implies three steps: error detection, error analysis, and recovery. The main categories of faults are physical, design, and interaction faults. A physical fault manifests itself by malfunctioning of hardware, which is not the result of a design fault. Examples include a permanent fault caused by the breakdown of a processor core due to overheating, a transient fault caused by radiation and resulting in a bitflip in a register or memory, or a fault caused by a human operator.
F
Design faults may originate in either hardware or software. An illegal condition in a while loop leading to infinite iteration or an assignment involving incompatible variable types are examples for software design errors. Interaction faults occur during system operation and are caused by the environment in which the system operates. They include illegal input data or operator errors as well as natural faults caused by radiation hitting the system. In addition to these main categories, faults can be characterized along a set of additional properties, often across the above classes. This includes the domain of a fault – hardware or software, its intent – accidental or malicious (e.g., faults caused by viruses, worms, or intrusion), or its persistence – hard or soft. A hard fault (such as a device breakdown) makes a component permanently unusable. In contrast, a soft fault is a temporary fault that can be transient or intermittent. A component affected by such a fault can be reused after the resulting error has been processed. For example, a bitflip in a register caused by a proton hitting a circuit is a transient fault referred to as a Single Event Upset (SEU). Reflecting the diversity of faults, the failures caused by them can be characterized by a broad range of nonorthogonal properties. These include their domain – value or timing, the capability for control, signalling – signalled or unsignalled failures, and their consequences, ranging from minor to catastrophic.
Fault Prevention
Fault prevention addresses methods that prevent faults from being incorporated into a system. In the software domain, such methods include restrictive coding standards that avoid common programming faults, the use of object-oriented techniques such as modularization and information hiding, and the provision of high-level programming languages and abstractions. Firewalls are mechanisms for preventing malicious faults. An example of hardware fault prevention is shielding against radiation-caused faults [].
Fault Removal
Despite the existence of fault-tolerance methods, a software system's dependability will tend to increase with
the number of faults identified and removed during its development. Fault removal refers to the set of techniques that eliminate faults during the design and development process. Verification and Validation (V&V) are important in this context: Verification refers to methods for the review, inspection, and test of systems, with the goal of establishing that they conform to their specification. Validation checks the specification in order to determine if it correctly expresses the needs of the system’s users. Verification methods can be classified as either dynamic or static, depending on whether or not they involve execution of the underlying program. Both techniques require that a system’s functional and behavioral specification be transformed into a mathematically based specification language, such as first-order logic []. System developers also need to specify a set of correctness properties, which must be satisfied. For example, a correctness property for a camera mounted on a robotic planetary exploration spacecraft would be that the camera’s lens cover must be closed if the camera is pointed within a given angle of the Sun. Static verification of a program implies a static proof – performed before execution of the program – that for all legal inputs the program conforms to its specification. Early techniques include Hoare’s logic and Dijkstra’s predicate transformers [, ]. Theorem provers use mathematical proof techniques to provide a rigorous argument that the correctness properties are satisfied. Some theorem provers, such as the early versions of the Boyer-Moore theorem prover [], are designed to operate in a completely automated fashion. Others, such as the Prototype Verification System (PVS) [], are interactive, requiring the user to “steer” the proof. Model checkers determine whether a system’s correctness properties can be violated by exploring its computational state space. If a violation is detected, the model checker returns a counterexample describing how the violation can occur. This also means that model checkers can be used to generate test cases for the implemented system []. Model checkers are able to identify types of faults associated with multithreaded systems such as deadlocks, lack of progress cycles, and race conditions. The more widely used model checkers include the SPIN model checker [], UPPAAL [], and SMV (Symbolic Model Verifier) [].
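As an illustration only (the data structure and names below are hypothetical, not part of any cited system), such a correctness property can also be stated as an executable assertion that testing or dynamic verification can try to violate:

    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical camera state; for illustration of a safety property only. */
    struct camera_state {
        double sun_angle_deg;    /* angle between camera boresight and the Sun */
        bool   lens_cover_open;
    };

    #define SUN_LIMIT_DEG 30.0   /* hypothetical limit taken from the requirements */

    /* Property: if the camera points within SUN_LIMIT_DEG of the Sun,
       the lens cover must be closed. */
    void check_lens_cover_property(const struct camera_state *c) {
        assert(!(c->sun_angle_deg < SUN_LIMIT_DEG && c->lens_cover_open));
    }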
Although static verification techniques can identify faults early in a system's development, these techniques face a number of theoretical and practical challenges and limitations. Specifically, many decision problems are undecidable, that is, there is no general method that provides a complete solution []. Other problems for which solution algorithms exist – including many graph problems – are NP-complete, that is, computationally intractable due to exponential complexity. Finally, methods such as model checking often need to deal with the scalability challenge, that is, an exploding state space. Furthermore, substantial effort can be required to create the formal specifications to which these tools will be applied. Learning the specification language can require substantial effort as well; even more important is learning the skill of abstracting from nonessential details of the system's functionality and behavior. For these reasons, these techniques are usually applied only to the critical elements of a system (e.g., the fault detection, identification, and repair component of onboard spacecraft control software, or fault response systems for nuclear power plants).
Dynamic verification technologies are based on the actual execution of the program; the most important are test methods. The test of a program for a specific input either demonstrates that the program operates correctly for this specific input, or it identifies an error. Thus, as already noted by Edsger Dijkstra, tests can prove the existence of a fault but never the absence of all faults [].
Controlled Failure
Non-fault tolerant systems may be fail-controlled, meaning that they fail only in specific, predefined modes, and only to a manageable extent, avoiding complete disruption. Rather than providing the capability of resuming normal operation, a failure in such a system puts it into a state from which recovery is possible after the failure has been detected and identified. As an example, the onboard software controlling robotic planetary exploration spacecraft for those portions of a mission during which there is no critical
activity (such as detumbling the spacecraft after launch or descent to a planetary surface) can be organized as a fail-controlled system. When a fault is detected during operation, all active command sequences are terminated, components inessential for survival are powered off, and the spacecraft is positioned into a stable, sunpointed attitude. Critical information regarding its state is transmitted to ground controllers via an emergency link; rebooting and restoring the health of the spacecraft is then delegated to controllers on Earth. Systems that preserve continuity of service can be significantly more difficult to design and implement than fail-controlled systems. Not only is it necessary to determine that a fault has occurred, the software must be able to determine the effects of the fault on the system’s state, remove the effects of the fault, and then place the system into a state from which processing can proceed.
General Methods for Software Fault Tolerance
Ensuring fault tolerance, that is, continuity of service in the presence of faults, requires replication of functionality. Many of the techniques used to achieve replication of functionality can be categorized as either design diversity or self-checking software. Using design diversity to tolerate faults in the design of a software system requires that there be at least two variants of a system (i.e., different designs and/or implementations based on a common specification), a decider to provide an error-free result from the variants' execution, and an explicit specification of decision points in the common specification (i.e., a specification of when decisions have to be performed and of the data upon which decisions have to be performed). Two well-known types of design diversity are recovery blocks and N-version programming.
Recovery Blocks
Recovery blocks [] make use of sequentially applied design diversity. In this case, the decider is an acceptance test that checks the computation results against context-specific criteria to determine whether the result can be accepted as correct. If the acceptance test decides
that the result is correct, the system continues execution and does not attempt to apply any corrective action. On the other hand, if the acceptance test indicates that the result is incorrect, the result is discarded and a variant (termed an "alternate" for recovery blocks) is applied to the inputs to retry the computation. If all of the alternates have been attempted and no results have met the acceptance criteria, additional steps (such as exception handling, see section "Exception Handling") are required to recover from the fault. A Petri-net model of a recovery block is shown in Fig. .
N-Version Programming
For N-version programming [], design diversity is applied by executing all of the variants (termed versions) in parallel. The decider then votes on all of the results: if a majority of the results meet the decider's criteria for correctness, the system is considered to be operating normally and continues execution. Otherwise, further action is required to recover from the fault. A Petri-net model for N-version programming is shown in Fig. . A variant of N-version programming, known as back-to-back testing [], keeps only one of the N versions in the fielded system, but uses all of the versions during testing. If a defect is found in one of the N versions during test, the other versions are analyzed to determine
whether they also have that defect. The goal of back-to-back testing is to improve the reliability of a fielded system without incurring the memory space or execution time penalties that might be associated with N-version programming. However, fielding only a single version removes the fault-tolerance capability associated with N-version programming.
N-Self-Checking Programming
N-self-checking programming [] combines the idea of an acceptance test from recovery blocks with the parallel execution of N ≥ 2 variants. In this case, each of the variants has an acceptance test associated with it. The acceptance tests associated with each variant, or the comparison algorithms associated with a pair of variants, can be the same or specifically derived for each of them from a common specification. As the variants are executed, the acceptance test associated with each variant determines whether the result meets the criteria for correctness. From the set of correct results, one is selected as the collective output for all the variants and execution proceeds normally. If the set of correct results is empty, the system is considered to have failed and further action is required to restore it to nominal operation. A Petri-net model for N-self-checking software is shown in Fig. .
Fault Tolerance. Fig. Petri-net model for recovery block (primary and alternate execution, acceptance test, alternate selection; the software fails if no further alternates are available)
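The control flow of a recovery block can be sketched in C as follows; this is a minimal illustration with hypothetical function types for the alternates and the acceptance test, not code from the cited work:

    #include <stdbool.h>
    #include <stddef.h>

    typedef bool (*alternate_fn)(const double *in, double *out);        /* primary or alternate */
    typedef bool (*acceptance_fn)(const double *in, const double *out); /* acceptance test */

    /* Recovery block: try the primary, then further alternates, until the
       acceptance test passes; return false if all alternates are exhausted
       (further recovery, e.g., exception handling, is then required). */
    bool recovery_block(const double *in, double *out,
                        alternate_fn alternates[], size_t n_alternates,
                        acceptance_fn accept) {
        for (size_t i = 0; i < n_alternates; i++) {
            /* a full implementation would restore saved state before each retry */
            if (alternates[i](in, out) && accept(in, out))
                return true;    /* result accepted: continue normal execution */
        }
        return false;           /* no acceptable result produced */
    }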
Fault Tolerance. Fig. Petri-net model for N-version programming (all versions executed, results gathered, and a decision algorithm applied; the software fails if no acceptable result is provided)
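Similarly, the decision step of N-version programming can be sketched in C, assuming three versions and a hypothetical tolerance-based agreement criterion (in a real system the versions would be executed in parallel):

    #include <math.h>
    #include <stdbool.h>

    #define N_VERSIONS 3

    typedef double (*version_fn)(double input);

    /* Decider: accept a result if a strict majority of the versions agree
       with it within a tolerance. */
    bool n_version_vote(version_fn versions[N_VERSIONS], double input,
                        double tol, double *agreed) {
        double r[N_VERSIONS];
        for (int i = 0; i < N_VERSIONS; i++)
            r[i] = versions[i](input);      /* executed in parallel in practice */
        for (int i = 0; i < N_VERSIONS; i++) {
            int votes = 0;
            for (int j = 0; j < N_VERSIONS; j++)
                if (fabs(r[j] - r[i]) <= tol)
                    votes++;
            if (2 * votes > N_VERSIONS) {   /* strict majority agrees with r[i] */
                *agreed = r[i];
                return true;
            }
        }
        return false;                       /* no majority: further action required */
    }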
Fault Tolerance. Fig. Petri-net model for self-checking programming (N self-checking components executed, acceptable results gathered and one selected; the software fails if no result is selected)
For self-checking software components that are based on the association of two variants, one variant only may be written for the purpose of fulfilling the functions expected from the component, while the other variant can be written as an extended acceptance
test. For example, it could perform the required computations with a different accuracy, compute the inverse function from the primary variant’s results (if possible), or exploit application-specific intermediate results of the primary variant. Algorithm-based fault tolerance
(ABFT) belongs to this category []. Many of the fault-tolerant systems encountered in real life are based on self-checking software. For example, the software for the Airbus A and A flight control systems and the Swedish railways' interlocking system are based on parallel execution of two variants – in the latter case, the variants' results are compared, and operation is stopped if the results of the variants disagree. It is important to note that the actual effectiveness of fault-tolerance techniques based on multiple versions may not be as high as the theoretical limits, because even independent programming teams may make similar errors []. However, recent work indicates that in spite of this limitation, N-version techniques can still be a viable way of tolerating faults [].
Introspection-Based Adaptive Fault Tolerance
Different components of a system may have different requirements for fault tolerance. For example, a space mission may, at one end of the spectrum, contain program components that perform low-level image processing with minimal fault-tolerance requirements (it does no harm if one in a few million pixels is lost), while at the other end the software for spacecraft control and communication needs maximum protection. Such
systems require support for adaptive fault tolerance, based on a fault model and a characterization of applications and their components with respect to their overall importance, exposure to faults, and requirements for fault tolerance. An introspection-based approach can support such an environment by providing a capability for execution-time monitoring and self-checking, analysis, and feedback-oriented recovery []. An introspection module, as illustrated in Fig. , is the core element of such a system, connected to the application via a set of software sensors for receiving input from, and a set of actuators for generating feedback to, an application module. An inference engine connected to a knowledge base controls the operation of the introspection module. Introspection modules are organized into an introspection graph that reflects the organization and fault-tolerance requirements of the application. The application-specific error detection, analysis, and recovery actions of an introspection module can be controlled via assertions inserted in a program, directions for the generation of specific checks, and corresponding recovery actions. This information can be provided by a user or automatically generated from the source program or a pre-existing knowledge base characterizing properties of the application [, ]. A detailed analysis of the range of fault-tolerance requirements for a complex real-time environment is provided in [].
Fault Tolerance. Fig. Introspection module: software sensors and actuators connect the application segment(s) to monitoring, analysis, prognostics, and feedback/recovery components, controlled by an inference engine with a knowledge base; control links connect to and from other introspection modules
Fault Tolerance in Concurrent Systems
Safety and Liveness Properties
A concurrent system is introduced as an interconnected collection of a finite number of autonomous processes. Processes can either interact via a global shared memory, or communicate via message-passing over a network. No assumptions are made about individual and relative speeds of processes and delays involved in access to shared memory or message-passing communication, except for the requirement that the speed of active processes is finite and positive, and message delays are finite. (Thus, this model does not cover real-time constraints.) An execution of such a system can be modeled by a sequence of system states, with a transition between successive states taking place as a result of an atomic action in a process or a communication step.
Lamport [] introduced two key types of correctness properties for parallel systems: safety properties and liveness properties. A safety property can be described by a predicate over the state that must hold for all executions of the system. It rules out that "bad things happen." An example of a safety property is the requirement of an airline reservation system to provide mutual exclusion for accesses to any record representing a specific flight on a given day. A more mundane example is the requirement for a system controlling the traffic lights of a street intersection that the lights for two crossing streets may never be green at the same time.
A liveness property requires that each process, which in principle is able to work, will be able to do so after a finite time. This property can be expressed via a predicate that must eventually be satisfied, guaranteeing that "a good thing will finally happen." For instance, if a set of processes compete for a resource, each of the processes should be able to acquire it after a finite amount of time. Examples of violations of liveness include preventing a process from reaching regular termination or from providing a specified service. Another example is a deadlock involving two or more processes, which cyclically block each other indefinitely in an attempt to access common resources. Alpern and Schneider have shown that every possible property of the set of executions can be expressed as a conjunction of safety and liveness properties [].
The manifestation of faults in a concurrent system, and related methods for their toleration, depend largely
on detailed system properties and the programming model under consideration. In the next subsection, a number of typical problems in concurrent systems are briefly discussed; a comprehensive treatment of such issues can be found in []. In addition to the applicability of the general methods discussed in section "General Methods for Software Fault Tolerance", fault tolerance can often be achieved using specialized techniques.
Example Problems
Mutual exclusion constrains access to a shared resource – for example, a variable, record, file, or output device – such that at any time at most one process may access the resource (a safety property). Any correct solution must also guarantee that each process requesting the resource is granted access after a finite waiting period (a liveness property). A simple method for implementing mutual exclusion in a shared memory system is via a binary semaphore [] that controls access to the resource. The corruption of the semaphore value (by any process or as a result of a hardware fault) destroys the safety property of the algorithm. An error in the semaphore implementation may destroy the liveness property. The mutual exclusion problem is discussed in detail in [].
State-based synchronization makes the progress of a process dependent on a condition in its environment. For example, in a simple version of the producer-consumer problem, the producer and consumer are autonomous processes operating asynchronously, with a cyclic buffer between them acting as a temporary storage for data items: in each cycle of the producer, it generates one data item and places it into the buffer; conversely, the consumer removes in each cycle one item from the buffer and then processes it. Wrong synchronization (as a result of a design error or a corruption of variables used for the coordination) can lead to a number of errors, including the corruption of a buffer element, an attempt of the producer to write into a full buffer, or an attempt of the consumer to read from an empty buffer. The result is a violation of the safety property, and possibly of liveness.
A deadlock is a system state where two or more processes block each other indefinitely in an attempt to access a set of resources. In a typical example, process p1 holds resource r1 and requests resource r2, while process p2 holds r2 and requests r1. This is a violation of the liveness requirement. It can be addressed in a number
of ways, including deadlock prevention, e.g., by prioritizing resources and allowing access requests only in rising priority order, or by periodically checking for a deadlock and taking a recovery action if it occurs. A different example of a liveness violation is represented by the famous dining philosophers problem, where two processes conspire to block a third process indefinitely from accessing a common resource. Problems of this kind can be avoided by defining an upper limit for the number of unsuccessful resource requests, and prioritizing the blocked process once this limit is reached.
In a system where processes communicate over a network, a network fault can propagate into a number of different faults in the process system. These include (1) transmission of wrong message content, (2) sending a message to the wrong destination, and (3) total loss of the message, resulting in a possibly indefinite delay of the intended receiver. A way to deal with the last of these faults is the use of a watchdog timer that regularly checks the progress of processes. Byzantine faults are arbitrary value or timing faults that can occur as a result of radiation hitting a circuit, or of a malicious intruder. Under certain conditions, such faults can be tolerated using a sophisticated form of replication [].
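Returning to the mutual-exclusion example above, the binary-semaphore scheme can be sketched with POSIX semaphores as follows (illustrative only; the flight-record resource is hypothetical). Note that corruption of the semaphore value would destroy the safety property, as discussed above:

    #include <semaphore.h>

    /* Binary semaphore guarding a shared record (e.g., seats on one flight). */
    static sem_t record_lock;      /* initialized elsewhere: sem_init(&record_lock, 0, 1); */
    static int   seats_available;  /* the shared resource */

    void reserve_seat(void) {
        sem_wait(&record_lock);    /* enter critical section (blocks while held) */
        if (seats_available > 0)
            seats_available--;
        sem_post(&record_lock);    /* leave critical section */
    }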
Checkpointing and Rollback Recovery in Message-Passing Systems
This section deals with fault tolerance for long-running applications in message-passing systems implemented on top of a reliable communication layer (which has to be provided by the underlying hardware and software layers of the network). The state of such systems consists of the set of local states of the participating processes, extended by the state of the communication system, which records the messages that are in-flight at any given time. Consistency of the global state implies that the dependences created between processes as a result of message transmissions must be observable in the state: more specifically, at the time a process records receipt of a message, the state of the sender must reflect the previous sending of that message. A recovery approach based on checkpointing and rollback requires a stable storage device that is not affected by faults in the system. At a checkpoint, a process saves recovery information to this storage device during fault-free operation; after occurrence of an error
(e.g., as a result of a failing process), the information stored at the checkpoint can be used to restart the process. Recovery can be managed by the application; here a system-supported approach is discussed. There are two categories of recovery-based protocols. The first relies only on the use of checkpoints, whereas the second approach in addition performs logging of all nondeterministic events, which allows the deterministic regeneration of an execution after the occurrence of a fault (which is not possible if only checkpoints are used). This is important in applications that heavily communicate with the outside world (which cannot be rolled back!). However, from the viewpoint of today’s large-scale scientific applications, the first approach, which will just be denoted as checkpointing is more relevant, and will be the sole focus in the rest of this section. There are two main methods for performing checkpointing, which are characterized as uncoordinated or coordinated. In uncoordinated checkpointing, each process essentially decides autonomously when to take checkpoints. During recovery, a consistent global checkpoint needs to be found taking into account the dependences between the individual process checkpoints. Uncoordinated checkpointing has the advantage of allowing a process to take into account local knowledge, for example, for minimizing the amount of data that needs to be saved. However, there are serious disadvantages, including the domino effect that can result in rollback propagation all the way to the beginning of the computation and the need to record multiple checkpoints in each process, with the necessity of periodical cleanups via garbage collection. As a result of these shortcomings, the dominating method used in today’s systems, and particularly in supercomputers, is coordinated checkpointing. In this approach, a global synchronization step involving all processes is used to form a consistent global state. Only one checkpoint is needed in each process, and no garbage collection is required. A straightforward centralized two-stage protocol for the creation of a checkpoint requires blocking of all communication while the protocol executes. In a first step, a coordinator process broadcasts a quiesce message to all processes requesting them to stop. After receiving an acknowledgment for this request from all processes, the coordinator broadcasts a checkpoint message that requests processes to take their checkpoint and acknowledge completion. Finally,
after receipt of this acknowledgment from all processes, the coordinator broadcasts a proceed message to all processes, allowing them to continue their work. On a failure in a process, the whole application rolls back to the last checkpoint and resumes execution from there. Coordinated checkpointing provides a relatively simple approach to recovery. Drawbacks include the overhead for communication in the checkpointing and recovery protocols, and the time required for storing the relevant data sets to a stable storage device, with the obvious consequences for scalability. In fact, a recent study on exascale computing [] shows that checkpointing along the lines of the above protocol seems no longer feasible, due to the existence of billions of threads in such a system. A potential solution would require new storage technology for saving the recovery information. A comprehensive treatment of rollback-recovery protocols, including log-based approaches, can be found in []. Coordinated checkpointing for large-scale supercomputers is discussed in detail in [], including a study of scalability and an approach to deal with failures during the checkpointing and recovery phases.
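A schematic sketch of a blocking coordinated checkpoint using MPI is shown below. It is a simplification of the protocol described above: the barriers stand in for the quiesce/checkpoint/proceed message exchange, it assumes the application has already drained its own messages, and save_local_state() is a hypothetical application-provided routine:

    #include <mpi.h>
    #include <stdio.h>

    void save_local_state(const char *filename);   /* hypothetical: write state to stable storage */

    /* Simplified blocking coordinated checkpoint for one checkpoint epoch. */
    void coordinated_checkpoint(MPI_Comm comm, int epoch) {
        int rank;
        char filename[64];
        MPI_Comm_rank(comm, &rank);

        MPI_Barrier(comm);   /* "quiesce": all processes have stopped application messaging */
        snprintf(filename, sizeof filename, "ckpt_epoch%d_rank%d.dat", epoch, rank);
        save_local_state(filename);
        MPI_Barrier(comm);   /* "proceed": every process has written its checkpoint */
    }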
Fault Injection Testing
Fault injection testing refers to a technique for revealing defects in a system's ability to tolerate faults []. Fault injection inserts faults into a system or mimics their effects, combined with a strategy that specifies the type and number of faults to be injected into system components, the frequency at which faults are to be injected, and dependencies among fault types. A number of fault injectors have been developed [, , ], which allow faults to be inserted into a computing system's memory or processor registers while the system is running, in a manner that is transparent to the executing application and system software. This allows developers of fault-tolerant systems to observe the system operating in an environment closely related to the expected environment during fielded use. As faults are injected into the system, system engineers are able to observe the immediate effects of each fault as well as the way in which it propagates through the system. When conducting fault injection testing, it is necessary to design test strategies that will cause the system under test to execute its fault-handling capabilities under a wide variety of conditions. One particular strategy, implemented by the Ballista automated testing
methodology [], is to change the parameters passed to software modules and observe the result. In Ballista, a test passes if the system does not hang or crash, ignoring the validity of the output values produced. The JPL Implementation of a Fault Injector (JIFI) [] toolset is typical of modern software-controlled fault injectors. It allows automated campaigns of random, uniformly distributed bitflip faults to be injected into application registers and/or memory space. In order to characterize the fault injection results, it is necessary to verify the final result and classify its effect on the system with respect to its liveness properties and the functional correctness of the output. The main goal of this methodology is to characterize an application's sensitivity to transient faults. For example, it is possible to determine whether the application is more likely to crash or produce incorrect results, or which subroutines of the program are most vulnerable.
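The core of such a software-implemented fault injector can be illustrated by a few lines of C that flip one randomly chosen bit in a target memory region (a sketch only, not the interface of the cited tools):

    #include <stdint.h>
    #include <stdlib.h>

    /* Flip one randomly chosen bit in a target memory region,
       mimicking a transient fault such as a single event upset. */
    void inject_random_bitflip(void *target, size_t nbytes) {
        uint8_t *bytes = (uint8_t *)target;
        size_t   byte  = (size_t)rand() % nbytes;
        unsigned bit   = (unsigned)rand() % 8u;
        bytes[byte] ^= (uint8_t)(1u << bit);
    }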
Exception Handling
An exception is a special event that signals the occurrence of one among a number of predetermined faults in a program. Exception handling can be considered as an additional fault-tolerance technique in that it provides a structured response to such events. There are two types of exception handling, resumption and termination []. In resumption, if an exception is encountered at a given command in a program, the program is temporarily halted and control is transferred to a handler associated with that program. The handler performs the computations required to mitigate the effects of the exception and then transfers control back to the program. The handler may return to the command immediately following the one at which the exception was encountered, or to a different location that may depend on the details of the program's state when the exception was signalled. By contrast, termination simply leads to the (controlled) halting of the program in which the exception was observed. Although resumption has a potential advantage in that the interrupted program may be resumed upon successful execution of the handler, there are also advantages associated with choosing termination []. For example, with termination a programmer is encouraged to recover a consistent state of a module in which an exception is detected before signalling it, so that further calls to the module's procedures find the module state consistent. With resumption, a programmer
does not know if control will come back after signalling an exception. In addition, recovering a consistent state before signalling (e.g., by undoing all changes made since the procedure start) defeats the purpose of resumption, which is to save the work done so far between the procedure entry and the detection of the exception. If a consistent state is not recovered, the handler may never resume execution of the module after the signalling command, and the module will remain in the intermediate, most likely inconsistent, state that existed when the exception was detected, meaning that further module calls can lead to additional exceptions and failures.
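In C, termination-style handling is often emulated with setjmp/longjmp; the following minimal sketch (hypothetical module) restores module state before signalling and then terminates the interrupted operation rather than resuming it:

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf handler_ctx;   /* context of the termination handler */

    static void module_operation(int request) {
        if (request < 0) {
            /* restore the module's invariants here, then signal the exception */
            longjmp(handler_ctx, 1);   /* control does not return to this point */
        }
        /* normal processing of the request */
    }

    int main(void) {
        if (setjmp(handler_ctx) != 0) {
            fprintf(stderr, "exception: operation terminated\n");
            return 1;
        }
        module_operation(-1);
        return 0;
    }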
Future Directions
As computing systems become increasingly embedded in and critical to society, more effective capabilities to deal with faults induced by exceptional environmental conditions, specification and design defects, or both, will be required. As an example, consider implantable medical devices such as insulin pumps or defibrillators. These systems already execute safety-critical software; future versions of these devices intended to respond to a wider set of medical conditions will be substantially more complex than those in operation today. New challenges for fault tolerance will also come from the increased degree of automation in transportation systems (railways, airplanes, automobiles) as well as extreme-scale systems for large-scale simulation with billion-way parallelism []. Last but not least, future robotic space missions in our solar system and beyond will demand an unprecedented degree of autonomy when operating in environments in which direct communication with Earth will no longer be possible within practical time intervals. The additional complexity of these systems and the uncertainties of the environments in which they operate will introduce more opportunities for specification and design defects [], for which effective fault-tolerance techniques must be developed and deployed.
Related Entries
Checkpointing
Compilers
Deadlocks
Distributed-Memory Multiprocessor
Exascale Computing
Race Detection Techniques
Synchronization
Bibliographic Notes and Further Reading
The basic concepts of dependability are discussed in detail in the seminal work of Avizienis, Laprie, Randell, and Landwehr [, ]. A comprehensive discussion of software fault tolerance can be found in the collection of articles edited by Lyu []. In [] dependability and fault tolerance are discussed with a focus on hardware issues. Faults and fault tolerance for algorithms in distributed systems, with a detailed treatment of provably solvable and unsolvable issues, can be found in [].
The premier conference on dependability is the annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), with DSN-2010 held in Chicago. This conference was established in 2000 by combining the IEEE-sponsored International Symposium on Fault-Tolerant Computing (FTCS), which had been held since 1971, with the Working Conference on Dependable Computing for Critical Applications (DCCA) sponsored by IFIP WG 10.4. Other conferences related to dependability include the International Symposium on Software Reliability Engineering (ISSRE) organized by the IEEE Computer Society and the Reliability, Availability, and Maintainability Symposium (RAMS) organized by the IEEE Reliability Society. In addition to the many journals discussing fault tolerance in the context of more general or related areas such as software engineering, programming languages and systems, and system security, the IEEE Transactions on Dependable and Secure Computing (TDSC) has a strong focus on dependability and fault tolerance. The bibliography in [] provides a good overview of publications in the field.
Bibliography
Alpern B, Schneider FB () Defining liveness. Inf Process Lett ():–
Amnell A, Behrmann G, Bengtsson J, D'Argenio PR, David A, Fehnker A, Hune T, Jeannet B, Larsen KG, Möller MO, Pettersson P, Weise C, Wang Yi () UPPAAL – now, next, and future. In: Proceedings of modelling and verification of parallel processes (MOVEP'2k), June. LNCS tutorial, pp –
Avizienis A, Laprie JC, Randell B () Fundamental concepts of dependability. UCLA CSD Report, University of California, Los Angeles
Avizienis A, Laprie JC, Randell B, Landwehr C () Basic concepts and taxonomy of dependable and secure computing. IEEE Trans Dependable Secure Comput ():–
Avizienis AA () The methodology of N-version programming. In: Lyu MR (ed) Software fault tolerance. Wiley, Chichester
Boyer RS, Kaufmann M, Moore JS () The Boyer-Moore theorem prover and its interactive enhancement. Comput Math Appl ():–
Cai X, Lyu MR, Vouk MA () An experimental evaluation of reliability features of N-version programming. In: Proceedings of the IEEE international symposium on software reliability engineering, pp –
Cristian F () Exception handling and tolerance of software faults. In: Lyu MR (ed) Software fault tolerance. Wiley, New York, pp –
Crow J, Owre S, Rushby J, Shankar N, Srivas M () A tutorial introduction to PVS. http://www.csl.sri.com/papers/wift-tutorial/
Dijkstra EW () Co-operating sequential processes. In: Genuys F (ed) Programming languages. Academic, New York, pp –
Dijkstra EW () Notes on structured programming. In: Dahl OJ, Dijkstra EW, Hoare CAR (eds) Structured programming. Academic, London, pp –
Dijkstra EW () A discipline of programming. Prentice Hall, Englewood Cliffs
Elnozahy ENM, Alvisi L, Wang Y-M, Johnson DB () A survey of rollback recovery protocols in message-passing systems. ACM Comput Surv ():–
Enderton HB () A mathematical introduction to logic, 2nd edn. Academic, San Diego
Kogge P et al () ExaScale computing study: technology challenges in achieving ExaScale systems. Technical report, DARPA Information Processing Techniques Office (IPTO). http://www.notur.no/news/inthenews/files/exascale_final_report., September
Gargantini A, Heitmeyer CL () Using model checking to generate tests from requirements specifications. In: Proceedings of the joint European software engineering conference and ACM SIGSOFT symposium on the foundations of software engineering, Toulouse, France, September
Gärtner FC () Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Comput Surv ():–
Goldberg D, Li M, Tao W, Tamir Y () The design and implementation of a fault-tolerant cluster manager. Technical report CSD-, Computer Science Department, University of California, Los Angeles, October
Gunnels JA, Katz DS, Quintana-Ortí ES, van de Geijn RA () Fault-tolerant high-performance matrix multiplication: theory and practice. In: Proceedings of the international conference on dependable systems and networks (DSN), Goeteborg, Sweden, June/July
Hoare CAR () An axiomatic basis for computer programming. Commun ACM ():–
Holzmann GJ () The SPIN model checker. Primer and reference manual. Addison-Wesley, New York
Hopcroft JE, Motwani R, Ullman JD () Introduction to automata theory, languages, and computation, 3rd edn. Addison Wesley Higher Education, New York
James M, Shapiro A, Springer P, Zima H () Adaptive fault tolerance for scalable cluster computing in space. IJHPCA ():–
Kalbarczyk ZT, Iyer RK, Bagchi S, Whisnant K () Chameleon: a software infrastructure for adaptive fault tolerance. IEEE Trans Parallel Distrib Syst ():–
Knight JC, Leveson NG () An experimental evaluation of the assumption of independence in multiversion programming. IEEE Trans Softw Eng ():–
Koopman P () Ballista design and methodology. http://www.ece.cmu.edu/koopman/ballista/reports/desmthd.pdf
Lamport L () Proving the correctness of multiprocess programs. IEEE Trans Softw Eng ():–
Laprie JC, Arlat J, Beounes C, Kanoun K () Definition and analysis of hardware- and software-fault-tolerant architectures. Computer ():–
Lardner D () Babbage's calculating engine. Edinburgh Review, July. Reprinted in: Morrison P, Morrison E (eds) Charles Babbage and his calculating engines. Dover, New York
Lyu MR (ed) () Software fault tolerance. Wiley, Chichester
McCluskey EJ, Mitra S () Fault tolerance. In: Tucker AB (ed) Computer science handbook, 2nd edn. Chapman and Hall/CRC Press, London
McMillan KL () Symbolic model checking – an approach to the state explosion problem. PhD dissertation, Carnegie Mellon University
Nikora AP, Munson JC () Building high-quality software predictors. Softw Pract Exper ():–
Randell B, Xu J () The evolution of the recovery block concept. In: Lyu MR (ed) Software fault tolerance. Wiley, Hoboken, pp –
Schneider FB () On concurrent programming. Springer, New York
Shirvani PP () Fault-tolerant computing for radiation environments. Technical report, PhD thesis, Center for Reliable Computing, Stanford University, Stanford, California, June
Critical Software () csXCEPTION. http://www.criticalsoftware.com/products services/csxception/
Some RR, Agrawal A, Kim WS, Callum L, Khanoyan G, Shamilian A () Fault injection experiment results in space-borne parallel application programs. In: Proceedings of the IEEE aerospace conference, Big Sky, MT, USA, March
Some RR, Kim WS, Khanoyan G, Callum L, Agrawal A, Beahan J () A software-implemented fault injection methodology for design and validation of system fault tolerance. In: Proceedings of the international conference on dependable systems and networks (DSN), Goteborg, Sweden, pp –, July
Stott DT, Floering B, Kalbarczyk Z, Iyer RK () A framework for assessing dependability in distributed systems with lightweight fault injectors. In: Proceedings of the international computer performance and dependability symposium (IPDS), IEEE Computer Society, Washington, DC, pp –
Tel G () Introduction to distributed algorithms, 2nd edn. Cambridge University Press, Cambridge
Vouk MA () On back-to-back testing. In: Proceedings of the third annual conference on computer assurance (COMPASS), pp –
Wang L, Pattabiraman K, Kalbarczyk Z, Iyer RK () Modeling coordinated checkpointing for large-scale supercomputers. In: Proceedings of the international conference on dependable systems and networks (DSN)
Zima HP, Chapman BM () Supercompilers for parallel and vector computers. ACM Press Frontier Series, New York

Fences
Memory Models
Synchronization
FFT (Fast Fourier Transform)
Franz Franchetti, Markus Püschel
Carnegie Mellon University, Pittsburgh, PA, USA
ETH Zurich, Zurich, Switzerland

Synonyms
Fast algorithm for the discrete Fourier transform (DFT)

Definition
A fast Fourier transform (FFT) is an efficient algorithm to compute the discrete Fourier transform (DFT) of an input vector. Efficient means that the FFT computes the DFT of an n-element vector in O(n log n) operations, in contrast to the O(n^2) operations required for computing the DFT by definition. FFTs exist for any vector length n and for real and higher-dimensional data. Parallel FFTs have been developed since the advent of parallel computing.

Discussion
Introduction
The discrete Fourier transform (DFT) is a ubiquitous tool in science and engineering including in digital signal processing, communication, and high-performance computing. Applications include spectral analysis, image compression, interpolation, solving partial differential equations, and many other tasks. Given n real or complex inputs x_0, ..., x_{n−1}, the DFT is defined as

    y_k = Σ_{0 ≤ ℓ < n} ω_n^{kℓ} x_ℓ,   0 ≤ k < n,   ()

with ω_n = exp(−2πi/n), i = √−1. Stacking the x_ℓ and y_k into vectors x = (x_0, ..., x_{n−1})^T and y = (y_0, ..., y_{n−1})^T yields the equivalent form of a matrix–vector product:

    y = DFT_n x,   DFT_n = [ω_n^{kℓ}]_{0 ≤ k,ℓ < n}.   ()
Computing the DFT by its definition () requires Θ(n^2) many operations. The first fast Fourier transform algorithm (FFT) by Cooley and Tukey in 1965 reduced the runtime to O(n log(n)) for two-powers n and marked the advent of digital signal processing. (It was later discovered that this FFT had already been derived and used by Gauss in the nineteenth century but was largely forgotten since then [].) Since then, FFTs have been the topic of many publications and a wealth of different algorithms exist. This includes O(n log(n)) algorithms for any input size n, as well as numerous variants optimized for various computing platforms and computation requirements. The by far most commonly used DFT is for two-power input sizes n, partly because these sizes permit the most efficient algorithms.
The first FFT explicitly optimized for parallelism was the Pease FFT published in 1968. Since then specialized FFT variants were developed with every new type of parallel computer. This includes FFTs for data flow machines, vector computers, shared and distributed memory multiprocessors, streaming and SIMD vector architectures, digital signal processing (DSP) processors, field-programmable gate arrays (FPGAs), and graphics processing units (GPUs). Just like Pease's FFT, these parallel FFTs are mainly for two-powers n and are adaptations of the same fundamental algorithm to structurally match the target platform.
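For reference, a direct C implementation of definition () requires two nested loops and Θ(n^2) operations; the following sketch (illustrative only) is the baseline that FFTs improve upon:

    #include <complex.h>
    #include <math.h>

    /* Direct evaluation of the DFT definition: Theta(n^2) operations. */
    void dft_direct(int n, const double complex *x, double complex *y) {
        const double two_pi = 6.283185307179586;
        for (int k = 0; k < n; k++) {
            double complex sum = 0;
            for (int l = 0; l < n; l++)   /* y_k = sum_l omega_n^(k*l) * x_l */
                sum += cexp(-I * two_pi * ((double)k * l) / n) * x[l];
            y[k] = sum;
        }
    }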
On contemporary sequential and parallel machines it has become very hard to obtain high-performance DFT implementations. Beyond the choice of a suitable FFT, many other implementation issues have to be addressed. Up to the 1990s, there were many public FFT implementations and proprietary FFT libraries available. Due to the code complexity inherent to fast implementations and the fast advances in processor design, today only a few competitive open source and vendor FFT libraries are available in the parallel computing space.
FFTs: Representation
Corresponding to the two different ways () and () of representing the DFT, FFTs are represented either as sequences of summations or as factorizations of the transform matrix DFT_n. The latter representation is adopted in Van Loan's seminal book [] on FFTs and is used in the following. To explain this representation, assume as an example that DFT_n in () can be factored into four matrices,

    DFT_n = M_4 M_3 M_2 M_1.   ()

Then () can be computed in four steps as t = M_1 x, u = M_2 t, v = M_3 u, y = M_4 v. If the matrices M_i are sufficiently sparse (have many zero entries), the operations count compared to a direct computation is decreased and () is called an FFT. For example, DFT_4 can be factorized as

    DFT_4 = [ 1  .  1  . ] [ 1  .  .  . ] [ 1  1  .  . ] [ 1  .  .  . ]
            [ .  1  .  1 ] [ .  1  .  . ] [ 1 −1  .  . ] [ .  .  1  . ]
            [ 1  . −1  . ] [ .  .  1  . ] [ .  .  1  1 ] [ .  1  .  . ]
            [ .  1  . −1 ] [ .  .  . −i ] [ .  .  1 −1 ] [ .  .  .  1 ],   ()

where omitted values (shown as dots) are zero. This example also demonstrates why the matrix–vector multiplications in () are not performed using a generic sparse linear algebra library, but, since the M_i are known and fixed, by a specialized program. Conversely, every FFT can be written as in () (with varying numbers of factors). The matrices M_i in FFTs are not only sparse but also structured, as a glimpse at () illustrates. This structure can be efficiently expressed using a formalism based on matrix algebra, which also clearly expresses the parallelism inherent to an FFT.
Matrix formalism and parallelism. The n × n identity matrix is denoted with I_n, and the butterfly matrix is a DFT of size 2:

    DFT_2 = [ 1  1 ]
            [ 1 −1 ].   ()

The Kronecker product of matrices A and B is defined as

    A ⊗ B = [a_{k,ℓ} B],   for A = [a_{k,ℓ}].

It replaces every entry a_{k,ℓ} of A by the matrix a_{k,ℓ} B. Most important for FFTs are the cases where A or B is the identity matrix. As examples consider

    I_4 ⊗ DFT_2 = [ 1  1  .  .  .  .  .  . ]
                  [ 1 −1  .  .  .  .  .  . ]
                  [ .  .  1  1  .  .  .  . ]
                  [ .  .  1 −1  .  .  .  . ]
                  [ .  .  .  .  1  1  .  . ]
                  [ .  .  .  .  1 −1  .  . ]
                  [ .  .  .  .  .  .  1  1 ]
                  [ .  .  .  .  .  .  1 −1 ],

    DFT_2 ⊗ I_4 = [ 1  .  .  .  1  .  .  . ]
                  [ .  1  .  .  .  1  .  . ]
                  [ .  .  1  .  .  .  1  . ]
                  [ .  .  .  1  .  .  .  1 ]
                  [ 1  .  .  . −1  .  .  . ]
                  [ .  1  .  .  . −1  .  . ]
                  [ .  .  1  .  .  . −1  . ]
                  [ .  .  .  1  .  .  . −1 ],
with the corresponding dataflows shown in Fig. . Note that the dataflows are from right to left to match the order of computation in (). I_4 ⊗ DFT_2 clearly expresses block parallelism: four butterflies computing on contiguous subvectors; whereas DFT_2 ⊗ I_4 expresses vector parallelism: four butterflies operating on interleaved subvectors, which is the same as one vector butterfly operating on vectors of length four, as emphasized in Fig. (b). More precisely, consider the code for DFT_2 (i.e., y = DFT_2 x):

    y[0] = x[0] + x[1];
    y[1] = x[0] - x[1];
Then code for DFT_2 ⊗ I_4 is obtained by replacing every scalar operation by a four-way vector operation:

    y[0:3] = x[0:3] + x[4:7];
    y[4:7] = x[0:3] - x[4:7];

Here, x[a:b] denotes (Matlab or FORTRAN style) the subvector of x starting at a and ending at b.
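Correspondingly, the block parallelism of I_4 ⊗ DFT_2 maps onto a loop over four independent butterflies on contiguous blocks, which could be parallelized, for example, with OpenMP (an illustrative sketch, assuming x and y as above):

    /* y = (I_4 tensor DFT_2) x: four independent butterflies on contiguous blocks */
    #pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        y[2*i]     = x[2*i] + x[2*i + 1];
        y[2*i + 1] = x[2*i] - x[2*i + 1];
    }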
Here, x[a:b] denotes (Matlab or FORTRAN style) the subvector of x starting at a and ending at b. These examples illustrate how the tensor product captures parallelism. To summarize: block parallelism (n blocks): In ⊗ A,
()
vector parallelism (n-way): A ⊗ In ,
()
where A is any matrix. The stride permutation matrix Lmn m permutes the elements of the input vector as in + j ↦ jm + i, ≤ i < m, ≤ j < n. If the vector x is viewed as an n × m
y
a
x
y = (I4 ⊗ DFT2)x
y
b
x
y = (DFT2 ⊗ I4)x
FFT (Fast Fourier Transform). Fig. Dataflow (right to left) of a block parallel and its “dual” vector parallel construct (figure from [])
matrix, stored in row-major order, then L^{mn}_m performs a transposition of this matrix. Further, if P is a permutation (matrix), then A^P = P^{−1} A P is the conjugation of A with P.
Cooley–Tukey FFT. The fundamental algorithm at the core of the most important parallel FFTs derived in the literature is the general-radix decimation-in-time Cooley–Tukey type FFT, expressed as

    DFT_n = (DFT_k ⊗ I_m) T^n_m (I_k ⊗ DFT_m) L^n_k,   n = km.   ()

Here, k is called the radix and T^n_m is a diagonal matrix containing the twiddle factors. The algorithm factors the DFT into four factors as in (), which shows the special case n = 4 = 2 × 2. Two of the four factors in () contain smaller DFTs; hence the algorithm is divide-and-conquer and has to be applied recursively. At each step of the recursion the radix is a degree of freedom. For two-power sizes n = 2^ℓ, () is sufficient to recurse up to n = 2, which is computed by definition (). Figure shows the special case 16 = 4 × 4 as matrix factorization and as corresponding data-flow graph (again to be read from right to left). The smaller DFTs are represented as blocks with different shades of gray. A straightforward implementation of () suggests four steps corresponding to the four factors, where two steps call smaller DFTs. However, to improve locality, the initial permutation L^n_k is usually not performed but interpreted as data access for the subsequent computation, and the twiddle diagonal T^n_m is fused with the subsequent DFTs.
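The stride permutation that appears here, viewed as the transposition of an n × m matrix stored in row-major order (see the definition of L^{mn}_m above), can be sketched in C as follows (illustrative only):

    #include <complex.h>

    /* y = L(mn,m) x: the element at index i*n + j moves to index j*m + i,
       i.e., an n x m matrix stored in row-major order is transposed. */
    void stride_permutation(int m, int n, const double complex *x, double complex *y) {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                y[j*m + i] = x[i*n + j];
    }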
FFT (Fast Fourier Transform). Fig. Cooley–Tukey FFT () for 16 = 4 × 4, DFT_16 = (DFT_4 ⊗ I_4) T^16_4 (I_4 ⊗ DFT_4) L^16_4, as matrix factorization (a) and as (complex) data-flow graph (b) (from right to left). Some lines are bold to emphasize the strided access (figure from [])
This strategy is chosen, for example, in the library FFTW 3.x, and the code can be sketched as follows:

    void dft(int n, complex *y, complex *x) {
      int k = choose_factor(n);
      int m = n / k;
      // t1 is a temporary buffer of length n
      // t1 = (I_k tensor DFT_m) L(n,k) * x
      for (int i = 0; i < k; ++i)
        dft_iostride(m, k, 1, t1 + m*i, x + i);
      // y = (DFT_k tensor I_m) diag(d(j)) * t1
      for (int i = 0; i < m; ++i)
        dft_scaled(k, m, precomp_d[i], y + i, t1 + i);
    }

    // DFT variants needed
    void dft_iostride(int n, int istride, int ostride,
                      complex *y, complex *x);
    void dft_scaled(int n, int stride, complex *d,
                    complex *y, complex *x);
The DFT variants needed for the smaller DFTs are implemented similarly based on (). There are many additional issues in implementing () to run fast on a nonparallel platform. The focus here is on mapping () to parallel platforms for two-power sizes n.
Parallel FFTs: Basic Idea
The occurrence of tensor products in () shows that the algorithm has inherent block and vector parallelism as explained in () and (). However, depending on the platform and for efficient mapping, the algorithm should exhibit one or both forms of parallelism throughout the computation to the extent possible. To achieve this, () can be formally manipulated using well-known matrix identities shown in Table . The table makes clear that there is a virtually unlimited set of possible variants of (), which also explains the large set of publications on FFTs. These variants hardly differ in operations count but in structure, which is crucial for parallelization. The remainder of this entry introduces the most important parallel FFTs derived in the literature. All these FFTs can be derived from () using Table . The presentation is divided into iterative and recursive FFTs. Each FFT is visualized for size n = 16 in a form similar to () (and again from right to left) to emphasize block and vector parallelism. In these visualizations, the twiddle factors are dropped since they do not affect the dataflow and hence pose no structural problem for parallelization.
FFT (Fast Fourier Transform). Table Formula identities to manipulate FFTs. A is n × n, and B and C are m × m. A^⊺ is the transpose of A

    (BC)^⊺          = C^⊺ B^⊺
    (A ⊗ B)^⊺       = A^⊺ ⊗ B^⊺
    I_mn            = I_m ⊗ I_n
    A ⊗ B           = (A ⊗ I_m)(I_n ⊗ B)
    I_n ⊗ (BC)      = (I_n ⊗ B)(I_n ⊗ C)
    (BC) ⊗ I_n      = (B ⊗ I_n)(C ⊗ I_n)
    A ⊗ B           = L^{mn}_n (B ⊗ A) L^{mn}_m
    (L^{mn}_m)^{−1} = L^{mn}_n
    L^{kmn}_n       = (L^{kn}_n ⊗ I_m)(I_k ⊗ L^{mn}_n)
    L^{kmn}_{km}    = (I_k ⊗ L^{mn}_m)(L^{kn}_k ⊗ I_m)
    L^{kmn}_k       = L^{kmn}_{km} L^{kmn}_{kn}
Iterative FFTs
The historically first FFTs that were developed and adapted to parallel platforms are iterative FFTs. These algorithms implement the DFT as a sequence of nested loops (usually three). The simplest are radix-r forms (usually r = 2, 4, 8), which require an FFT size of n = r^ℓ; more complicated mixed-radix variants always exist. They all factor DFT_n into a product of ℓ matrices, each of which consists of a tensor product and twiddle factors. Iterative algorithms are obtained from () by recursive expansion, flattening the nested parentheses, and other identities in Table . The most important iterative FFTs are discussed next, starting with the standard version, which is not optimized for parallelism but included for completeness. Note that the exact form of the twiddle factors differs in these FFTs, even though they are denoted with the same symbol.
Cooley–Tukey iterative FFT. The radix-r iterative decimation-in-time FFT

    DFT_{r^ℓ} = ( ∏_{i=0}^{ℓ−1} (I_{r^i} ⊗ DFT_r ⊗ I_{r^{ℓ−i−1}}) D^{r^ℓ}_i ) R^{r^ℓ}_r,   ()
is the prototypical FFT algorithm and is shown in Fig. . R^{r^ℓ}_r is the radix-r digit reversal permutation and the diagonal D^{r^ℓ}_i contains the twiddle factors in the ith stage. The radix-2 version is implemented by Numerical Recipes using a triple loop corresponding to the two tensor products (inner two loops) and the product (outer loop).
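A minimal C sketch of such a triple-loop radix-2 implementation is shown below (an illustration of the loop structure, not the Numerical Recipes code): the outer loop runs over the log2(n) stages of the product, the middle loop over the blocks, and the inner loop over the butterflies within a block; the initial bit-reversal permutation corresponds to R^n_2:

    #include <complex.h>
    #include <math.h>

    /* Iterative radix-2 decimation-in-time FFT, in place; n must be a power of two. */
    void fft_radix2(int n, double complex *x) {
        /* bit-reversal permutation (the R^n_2 stage) */
        for (int i = 1, j = 0; i < n; i++) {
            int bit = n >> 1;
            for (; j & bit; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
        }
        /* outer loop: the log2(n) stages of the product */
        for (int len = 2; len <= n; len <<= 1) {
            double complex wlen = cexp(-2.0 * 3.14159265358979323846 * I / len);
            for (int start = 0; start < n; start += len) {   /* blocks */
                double complex w = 1.0;
                for (int k = 0; k < len / 2; k++) {          /* butterflies with twiddles */
                    double complex u = x[start + k];
                    double complex v = w * x[start + k + len/2];
                    x[start + k]         = u + v;
                    x[start + k + len/2] = u - v;
                    w *= wlen;
                }
            }
        }
    }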
FFT (Fast Fourier Transform). Fig. Iterative FFT () for n = 16 and r = 2: ((I_1 ⊗ DFT_2 ⊗ I_8) D^16_0) ((I_2 ⊗ DFT_2 ⊗ I_4) D^16_1) ((I_4 ⊗ DFT_2 ⊗ I_2) D^16_2) ((I_8 ⊗ DFT_2 ⊗ I_1) D^16_3) R^16_2
Formal transposition of () yields the iterative decimation-in-frequency FFT:

    DFT_{r^ℓ} = R^{r^ℓ}_r ∏_{i=0}^{ℓ−1} D^{r^ℓ}_i (I_{r^{ℓ−i−1}} ⊗ DFT_r ⊗ I_{r^i}).   ()
Both () and () contain the bit reversal permutation R^{r^ℓ}_r. The parallel and vector structure of the occurring butterflies depends on the stage. Thus, even though every stage is data parallel, the algorithm is well suited neither for machines that require block parallelism nor for machines that require vector parallelism. For this reason, very few parallel triple-loop implementations exist, and compiler parallelization and vectorization tend not to succeed in producing any speedup when targeting the triple-loop algorithm.
Pease FFT. A variant of () is the Pease FFT

    DFT_{r^ℓ} = ( ∏_{i=0}^{ℓ−1} L^{r^ℓ}_r (I_{r^{ℓ−1}} ⊗ DFT_r) D^{r^ℓ}_i ) R^{r^ℓ}_r,   ()
shown in Fig. for r = 2. It has constant geometry, that is, the control flow is the same in each stage, and it maximizes block parallelism by reducing the block size to r, on which single butterflies are computed. However, the Pease FFT also requires the digit reversal permutation. Each stage of the Pease algorithm consists of the twiddle diagonal and a parallel butterfly block, followed by the same data exchange across parallel blocks, specified through a stride permutation. The Pease FFT was originally developed for parallel computers, and its regular structure makes it a good choice for field-programmable gate arrays (FPGAs) or ASICs. Formal transposition of () yields a variant with the bit reversal at the end.
Korn–Lambiotte FFT. The Korn–Lambiotte FFT is given by

    DFT_{r^ℓ} = R^{r^ℓ}_r ( ∏_{i=0}^{ℓ−1} L^{r^ℓ}_{r^{ℓ−1}} D^{r^ℓ}_i (DFT_r ⊗ I_{r^{ℓ−1}}) ),   ()
and is the algorithm that is dual to the Pease FFT in the sense used in Fig. . Namely, it also has constant geometry, but maximizes vector parallelism, as shown in Fig. for r = 2. Each stage contains one vector butterfly operating on vectors of length n/r, and a twiddle diagonal. As a last step it performs the digit reversal permutation. The Korn–Lambiotte FFT was developed for early vector computers. It is derived from the Pease
FFT (Fast Fourier Transform). Fig. Pease FFT in () for n = 16 and r = 2: (L^16_2 (I_8 ⊗ DFT_2) D^16_0) (L^16_2 (I_8 ⊗ DFT_2) D^16_1) (L^16_2 (I_8 ⊗ DFT_2) D^16_2) (L^16_2 (I_8 ⊗ DFT_2) D^16_3) R^16_2
FFT (Fast Fourier Transform). Fig. Korn–Lambiotte FFT in () for n = 16 and r = 2: R^16_2 (L^16_8 D^16_0 (DFT_2 ⊗ I_8)) (L^16_8 D^16_1 (DFT_2 ⊗ I_8)) (L^16_8 D^16_2 (DFT_2 ⊗ I_8)) (L^16_8 D^16_3 (DFT_2 ⊗ I_8)). Vbutterfly = vector butterfly
algorithm through formal transposition, followed by the translation of the tensor product from a parallel into a vector form.
Stockham FFT. The Stockham FFT

    DFT_{r^ℓ} = ∏_{i=0}^{ℓ−1} (DFT_r ⊗ I_{r^{ℓ−1}}) D^{r^ℓ}_i (L^{r^{ℓ−i}}_r ⊗ I_{r^i}),   ()
is self-sorting, that is, it does not have a digit reversal permutation. It is shown in Fig. for r = 2. Like the Korn–Lambiotte FFT, it exhibits maximal vector parallelism, but the permutations change across stages. Each of these permutations is a vector permutation, but the vector length increases by a factor of r in each stage (starting with 1). Thus, for most stages a sufficiently long vector length is achieved. The Stockham FFT was originally developed for vector computers. Its structure is also suitable for graphics processors (GPUs), and indeed most current GPU FFT libraries are based on the
VShuffle
Stage 2 Vbutterfly
VShuffle
Stockham FFT. The formal transposition of () is also called Stockham FFT.
Recursive FFT Algorithms
The second class of Cooley–Tukey-based FFTs are recursive algorithms, which reduce a DFT of size n = km into k DFTs of size m and m DFTs of size k. The advantage of recursive FFTs is better locality and hence better performance on computers with deep memory hierarchies. They also can be used as kernels for iterative algorithms. For parallelism, recursive algorithms are derived, for example, to maximize the block size for multicore platforms, or to obtain vector parallelism for a fixed vector length for platforms with SIMD vector extensions. The most important recursive algorithms are discussed next.
Recursive Cooley–Tukey FFT. The recursive, general-radix decimation-in-time Cooley–Tukey FFT was shown before in (). Typically, k is chosen to be small.
FFT (Fast Fourier Transform). Fig. Stockham FFT in () for n = 16 and r = 2. Vbutterfly = vector butterfly, VShuffle = vector shuffle
FFT (Fast Fourier Transform). Fig. Recursive radix-2 decimation-in-time FFT for n = 16 (recursive FFT stages and accumulated shuffles): (DFT_2 ⊗ I_8) T_8^16 (I_2 ⊗ ((DFT_2 ⊗ I_4) T_4^8 (I_2 ⊗ ((DFT_2 ⊗ I_2) T_2^4 (I_2 ⊗ DFT_2) L_2^4)) L_2^8)) L_2^16
Typically, k is chosen to be small, with values up to . If () is applied to n = r^ℓ recursively with k = r, the algorithm is called radix-r decimation-in-time FFT. As explained before, the initial permutation is usually not performed but propagated as data accesses into the smaller DFTs. For radix-2 the algorithm is shown in Fig. . Note that the dataflow is equal to Fig. , but the order of computation is different, as emphasized by the shading. Formal transposition of () yields the recursive decimation-in-frequency FFT

DFT_n = L_m^n (I_k ⊗ DFT_m) T_m^n (DFT_k ⊗ I_m),  n = km.  ()
Recursive application of () and () eventually leads to prime sizes k and m, which are handled by a special prime-size FFT. For two-powers n the butterfly matrix DFT_2 terminates the recursion. The implementation of () and () is more involved than the implementation of iterative algorithms, in particular in the mixed-radix case. The divide-and-conquer nature of () and () makes them good choices for machines with memory hierarchies, as at some
recursion level the working set will be small enough to fit into a certain cache level, a property sometimes called cache oblivious. Both () and () contain both vector and parallel blocks and stride permutations. Thus, despite their inherent data parallelism, they are not ideal for either parallel or vector implementations. The following variants address this problem. Four-step FFT. The four-step FFT is given by

DFT_n = (DFT_k ⊗ I_m) T_m^n L_k^n (DFT_m ⊗ I_k),  n = km,  ()
and shown in Fig. . It is built from two stages of vector FFTs, the twiddle diagonal, and a transposition. Typically, k, m ≈ √n is chosen (also called "square root decomposition"). Then, () results in the longest possible vector operations except for the stride permutation in the middle. The four-step FFT was originally developed for vector computers and the stride permutation (or transposition) was originally implemented explicitly while the smaller FFTs were expanded with some other FFT—typically iterative. The transposition can
FFT (Fast Fourier Transform). Fig. Four-step FFT for n = 16 and k = m = 4 (two stages of vector FFTs separated by a shuffle)
be implemented efficiently using blocking techniques. Equation () can be a good choice on parallel machines that execute operations on long vectors well and on which the overhead of a transposition is not too high. Examples include vector computers and machines with streaming memory access like GPUs. Six-step FFT. The six-step FFT is given by

DFT_n = L_k^n (I_m ⊗ DFT_k) L_m^n T_m^n (I_k ⊗ DFT_m) L_k^n,  n = km,  ()
and shown in Fig. . It is built from two stages of parallel butterfly blocks, the twiddle diagonal, and three global transpositions (all-to-all data exchanges). () was originally developed for distributed memory machines and out-of-core computation. Typically, k, m ≈ √n is chosen to maximize parallelism. The transposition was originally implemented explicitly as all-to-all communication while the smaller FFTs were expanded with some other FFT algorithm—typically iterative. As in (), the required matrix transposition can be blocked for more efficient data movement. () can be a good choice on parallel machines that have
multiple memory spaces and require explicit data movement, like message passing, off-loading to accelerators (GPUs and FPGAs), and out-of-core computation. Multicore FFT. The multicore FFT for a platform with p cores and cache block size μ is given by

DFT_n = ((L_p^{kp} ⊗ I_{m/(pμ)}) ⊗ I_μ) (I_p ⊗ (DFT_k ⊗ I_{m/p})) T_m^n (I_p ⊗ (I_{k/p} ⊗ DFT_m) L_{k/p}^{n/p}) ((L_p^{pm} ⊗ I_{k/(pμ)}) ⊗ I_μ),  n = km,  ()
and is a version of () that is optimized for homogeneous multicore CPUs with memory hierarchies. An example is shown in Fig. . () follows the recursive FFT () closely but ensures that all data exchanges between cores and all memory accesses are performed with cache block granularity. For a multicore with cache block size μ and p cores, () is built solely from permutations that permute entire cache lines and p-way parallel compute blocks. This property allows for parallelization of small problem sizes across a moderate
FFT (Fast Fourier Transform). Fig. Six-step FFT for n = 16 and k = m = 4 (parallel DFT stages separated by all-to-all communication): L_4^16 (I_4 ⊗ (DFT_2 ⊗ I_2) T_2^4 (I_2 ⊗ DFT_2) L_2^4) L_4^16 T_4^16 (I_4 ⊗ (DFT_2 ⊗ I_2) T_2^4 (I_2 ⊗ DFT_2) L_2^4) L_4^16
FFT (Fast Fourier Transform). Fig. Multicore FFT for n = 16, k = m = √n = 4, p = 2 cores, and cache block size μ = 2 (parallel DFT blocks separated by cache-block exchanges)
FFT (Fast Fourier Transform). Fig. Short vector FFT in () for n = 16, k = m = 4, and (complex) vector length ν = 2 (vector FFT stages, in-register shuffles, and a vector shuffle)
number of cores. Implementation of () on a cache-based system relies on the cache coherency protocol to transmit cache lines of length μ between cores and requires a global barrier. Implementation on a scratchpad-based system requires explicit sending and receiving of the data packets, and depending on the communication interface additional synchronization may be required. The smaller DFTs in () can be expanded, for example, with the short vector FFT (discussed next) to optimize for vector extensions. SIMD short vector FFT. For CPUs with SIMD ν-way vector extensions like SSE and AltiVec and a memory hierarchy, the short vector FFT is defined as

DFT_n = ((DFT_k ⊗ I_{m/ν}) ⊗ I_ν) T_m^n (I_{k/ν} ⊗ (L_ν^m ⊗ I_ν)(I_{m/ν} ⊗ L_ν^{ν²})(DFT_m ⊗ I_ν)) (L_{k/ν}^{n/ν} ⊗ I_ν),  n = km,  ()

and can be implemented using solely vector arithmetic, aligned vector memory access, and a small number of vector shuffle operations. An example is shown in Fig. . All compute operations in () have complex ν-way vector parallelism. The only operation that is not ν-way vectorized is the stride permutation L_ν^{ν²}, which can be implemented efficiently using in-register shuffle instructions. () requires the support or implementation of complex vector arithmetic and packs ν complex elements into a machine vector register of width ν. A variant that vectorizes the real rather than the complex dataflow exists. Vector recursion. The vector recursion performs a locality optimization for deep memory hierarchies for the first stage (I_k ⊗ DFT_m) L_k^n of (). Namely, in this stage DFT_m is further expanded using again () with m = m₁m₂ and the resulting expression is manipulated to yield

(I_k ⊗ DFT_m) L_k^n = (I_k ⊗ (DFT_{m₁} ⊗ I_{m₂}) T_{m₂}^m) (L_k^{km₁} ⊗ I_{m₂}) (I_{m₁} ⊗ (I_k ⊗ DFT_{m₂}) L_k^{km₂}) (L_{m₁}^m ⊗ I_k).  ()
While the recursive FFT () ensures that the working set will eventually fit into any level of cache, large two-power FFTs induce large two-power strides. For caches with lower associativity these strides result in a high number of conflict misses, which may impose
FFT (Fast Fourier Transform). Fig. Vector recursive FFT for n = 16. The vector recursion is applied once and yields vector shuffles, two recursive FFTs, and two iterative vector stages
a severe performance penalty. For large enough two-power sizes, in the first stage of () every single load will result in a cache miss. The vector recursion alleviates this problem by replacing the stride permutation in () by stride permutations of vectors, at the expense of an extra pass through the working set. Since () matches the form (I_k ⊗ DFT_n) L_k^{kn}, it is recursively applicable and will eventually produce child problems that fit into any cache level. The vector recursion produces algorithms that are a mix of iterative and recursive, as shown in Fig. .
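Underlying all of the recursive variants above is the same n = k·m splitting: k-point DFTs along one dimension of the data viewed as a k × m array, a pointwise twiddle multiplication, m-point DFTs along the other dimension, and a transposed read-out. The sketch below illustrates only that data flow, not an optimized implementation; numpy's FFT is used for the small sub-DFTs, and the function name is made up for this example.

import numpy as np

def split_radix_km(x, k):
    """One Cooley-Tukey splitting step of an n-point DFT with n = k*m:
    column DFTs, twiddle factors, row DFTs, transposed read-out."""
    n = x.shape[0]
    m = n // k
    A = x.reshape(k, m)                          # A[a, b] = x[a*m + b]
    B = np.fft.fft(A, axis=0)                    # k-point DFTs down the columns
    c1 = np.arange(k).reshape(k, 1)
    b = np.arange(m).reshape(1, m)
    C = B * np.exp(-2j * np.pi * c1 * b / n)     # twiddle diagonal
    D = np.fft.fft(C, axis=1)                    # m-point DFTs along the rows
    return D.T.reshape(-1)                       # output index c1 + k*c2

x = np.random.rand(16) + 1j * np.random.rand(16)
assert np.allclose(split_radix_km(x, 4), np.fft.fft(x))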
Other FFT Topics So far the discussion has focused on one-dimensional complex two-power size FFTs. Some extensions are mentioned next. General size recursive FFT algorithms. DFT algorithms fundamentally different from () include prime-factor (n is a product of coprime factors), Rader (n is
prime), and Bluestein or Winograd (any n). In practice, these are mostly used for small sizes, which then serve as building blocks for large composite sizes via (). The exception is Bluestein's algorithm, which is often used to compute large sizes with large prime factors or large prime numbers. DFT variants and other FFTs. In practice, several variants of the DFT in () are needed, including forward/inverse, interleaved/split complex format, complex/real input data, in-place/out-of-place, and others. Fortunately, most of these variants are close to the standard DFT in (), so fast code for the latter can be adapted. An exception is the DFT for real input data, which has its own class of FFTs. Multidimensional FFT algorithms. The Kronecker product naturally arises in 2D and 3D DFTs, which respectively can be written as

DFT_{m×n} = DFT_m ⊗ DFT_n,  ()
DFT_{k×m×n} = DFT_k ⊗ DFT_m ⊗ DFT_n.  ()
For a 2D DFT, applying identities from Table to () yields the row-column algorithm

DFT_{m×n} = (DFT_m ⊗ I_n)(I_m ⊗ DFT_n).  ()
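A quick numerical sanity check of the row-column factorization: applying 1D DFTs along one axis and then along the other axis of an m × n array reproduces the 2D DFT (numpy's fft2 is taken as the reference here).

import numpy as np

X = np.random.rand(4, 6) + 1j * np.random.rand(4, 6)

# (I_m (x) DFT_n): n-point DFTs along the rows; (DFT_m (x) I_n): m-point DFTs down the columns.
row_column = np.fft.fft(np.fft.fft(X, axis=1), axis=0)
assert np.allclose(row_column, np.fft.fft2(X))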
The 2D vector-radix algorithm can also be derived with identities from Table from ():

DFT_{mn×rs} = (DFT_{m×r} ⊗ I_{ns})^{I_m ⊗ L_r^{rn} ⊗ I_s} (T_n^{mn} ⊗ T_s^{rs}) (I_{mr} ⊗ DFT_{n×s})^{I_m ⊗ L_r^{rn} ⊗ I_s} (L_m^{mn} ⊗ L_r^{rs}).  ()
Higher-dimensional versions are derived similarly, and the associativity of ⊗ gives rise to more variants.
Related Entries
ATLAS (Automatically Tuned Linear Algebra Software)
FFTW
Spiral
Bibliographic Notes and Further Reading
The original Cooley–Tukey FFT algorithm can be found in []. The Pease FFT in [] is the first FFT derived and represented using the Kronecker product formalism. The other parallel FFTs were derived in [] (Korn–Lambiotte FFT), [] (Stockham FFT), [] (four-step FFT), and [] (six-step FFT). The vector-radix FFT algorithm can be found in [], the vector recursion in [], the short vector FFT in [], and the multicore FFT in []. A good overview of FFTs, including the classical parallel variants, is given in Van Loan's book [] and the book by Tolimieri, An, and Lu []; both are based on the formalism used here. Also excellent is Nussbaumer's FFT book []. An overview of real FFTs can be found in []. At the point of writing, the most important fast DFT libraries are FFTW by Frigo and Johnson [, ], Intel's MKL and IPP, and IBM's ESSL and PESSL. FFTE is currently used in the HPC Challenge as the Global FFT benchmark reference implementation. Most CPU, GPU, and FPGA vendors maintain DFT libraries. Some historic DFT libraries like FFTPACK are still widely used. Numerical Recipes [] provides C code for an iterative radix-2 FFT implementation. BenchFFT provides up-to-date FFT benchmarks of a large number of single-node DFT libraries. The Spiral system is capable of generating parallel DFT libraries directly from the tensor product-based algorithm description [, ].
Bibliography
Bailey DH () FFTs in external or hierarchical memory. J Supercomput :–
Cooley JW, Tukey JW () An algorithm for the machine calculation of complex Fourier series. Math Comput :–
Franchetti F, Püschel M () Short vector code generation for the discrete Fourier transform. In: Proceedings of the International Symposium on Parallel and Distributed Processing, pp –, Washington, DC, USA
Franchetti F, Voronenko Y, Püschel M () FFT program generation for shared memory: SMP and multicore. In: Proceedings of the ACM/IEEE conference on Supercomputing, New York, NY, USA
Franchetti F, Püschel M, Voronenko Y, Chellappa S, Moura JMF () Discrete Fourier transform on multicore. IEEE Signal Proc Mag, special issue on "Signal Processing on Platforms with Multiple Cores" ():–
Frigo M, Johnson SG () FFTW: An adaptive software architecture for the FFT. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol , pp –, Seattle, WA
Frigo M, Johnson SG () The design and implementation of FFTW. Proc IEEE, special issue on "Program Generation, Optimization, and Adaptation" ():–
Harris DB, McClellan JH, Chan DSK, Schuessler HW () Vector radix fast Fourier transform. In: Proc Inter Conf Acoustics, Speech, and Signal Processing (ICASSP), pp –, IEEE Comput Soc Press, Los Alamitos
Heidemann MT, Johnson DH, Burrus CS () Gauss and the history of the fast Fourier transform. Arch Hist Exact Sci :–
Korn DG, Lambiotte JJ Jr () Computing the fast Fourier transform on a vector computer. Math Comput ():–
Norton A, Silberger AJ () Parallelization and performance analysis of the Cooley-Tukey FFT algorithm for shared-memory architectures. IEEE Trans Comput ():–
Nussbaumer HJ () Fast Fourier transformation and convolution algorithms, 2nd ed. Springer, New York
Pease MC () An adaptation of the fast Fourier transform for parallel processing. J ACM ():–
Press WH, Flannery BP, Teukolsky SA, Vetterling WT () Numerical recipes in C: the art of scientific computing, 2nd ed. Cambridge University Press, Cambridge
Püschel M, Moura JMF, Johnson J, Padua D, Veloso M, Singer BW, Xiong J, Franchetti F, Gačić A, Voronenko Y, Chen K, Johnson RW, Rizzolo N () SPIRAL: Code generation for DSP transforms. Proc IEEE, special issue on "Program Generation, Optimization, and Adaptation" ():–
Swarztrauber PN () Multiprocessor FFTs. Parallel Comput :–
Tolimieri R, An M, Lu C () Algorithms for discrete Fourier transforms and convolution, 2nd ed. Springer, New York
Van Loan C () Computational frameworks for the fast Fourier transform. SIAM, Philadelphia, PA, USA
Voronenko Y, de Mesmay F, Püschel M () Computer generation of general size linear transform libraries. In: Proceedings of the annual IEEE/ACM International Symposium on Code Generation and Optimization, pp –, Washington, DC, USA
Voronenko Y, Püschel M () Algebraic signal processing theory: Cooley-Tukey type algorithms for real DFTs. IEEE Trans Signal Proc ():–
Fast Algorithm for the Discrete Fourier Transform (DFT)
FFT (Fast Fourier Transform)
FFTW
FFTW is a C library for computing the Discrete Fourier Transform (DFT). It was originally developed at MIT by Matteo Frigo and Steven G. Johnson. FFTW is an example of autotuning in two ways. First, the library is adaptive by allowing for different recursion strategies. The best one is chosen at the time of use by a feedback-driven search. Second, the recursion is terminated by small optimized kernels (called codelets) that are generated by a special purpose compiler. The library continues to be maintained by the developers and is widely used due to its excellent performance. The name FFTW stands for the somewhat whimsical "Fastest Fourier Transform in the West."
Related Entries
Autotuning
FFT (Fast Fourier Transform)
Spiral
Bibliography
Frigo M, Johnson SG () FFTW: An adaptive software architecture for the FFT. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), pp –
Frigo M () A fast Fourier transform compiler. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation (PLDI). ACM, New York, pp –
Frigo M, Johnson SG () The design and implementation of FFTW. Proceedings of the IEEE ():–
File Systems
Robert B. Ross
Argonne National Laboratory, Argonne, IL, USA
Synonyms
Cluster file systems
Definition
A parallel file system (PFS) is a system software component that organizes many disks, servers, and network links to provide a file system name space that is accessible from many clients; distributes data across many devices to enable high aggregate bandwidth; and coordinates changes to file system data so that clients' views of the data are kept coherent.
Discussion Introduction File systems are the traditional mechanism for reading and writing data on persistent storage. The current standard interface for file systems, part of the Portable Operating System Interface (POSIX) standard, was first defined in []. As networking of computers became common, sharing storage between multiple computers became desirable. In [], the Network File System (NFS) standard was created, allowing a file system stored locally on one computer to be accessed by other computers. With the advent of NFS, shared file name spaces were possible – many computers could see and access the same files at the same time. Figures a and b show a high-level view of a local and network file system, respectively. In the local file system case, a single software component manages local storage resources (typically one or more disks) and allows processes running on that computer to store and retrieve data organized as files and directories. In the network file system case, the software allowing access to storage is split from the software managing the storage. The file system client presents the traditional view of files and directories to processes on a computer, while the file system server software manages the storage and coordinates access from multiple clients.
File Systems. Fig. Local, network, and parallel file systems: (a) local file system, (b) networked file system, (c) parallel file system
As distributed-memory parallel computers became popular, the utility of a shared file name space became obvious as a way for parallel programs to store and retrieve data without concern for the actual location of files and file data. However, accessing data stored on a single server quickly became a bottleneck, driving the implementation of parallel file systems – file systems that opened up many concurrent paths between computers and persistent storage. Figure c depicts a parallel file system. With multiple network links, servers, and local storage units, high degrees of concurrent access are possible. This enables very high aggregate I/O rates for the parallel file system as a whole when it is accessed by many processes on different nodes in the system, especially when those accesses are relatively evenly distributed across the storage servers. A wide variety of network configurations
and technologies are used in PFS deployments, from storage area networks and InfiniBand to Gigabit Ethernet and TCP/IP. This discussion focuses on parallel file system deployments where processes accessing the file system are on compute nodes distinct from the nodes that manage storage of data (storage nodes). This model is the most common in high-performance computing (HPC) deployments, because (a) there are typically many more compute nodes than needed to provide storage services, (b) placing disks in all compute nodes has an impact on their reliability and power consumption, (c) servicing I/O operations for some other process on a node executing an HPC application can perturb performance of that application, and (d) keeping storage nodes distinct enables higher-reliability components to be used and helps colocate storage resources in the machine room. That said, Beowulf-style commodity clusters and clusters used for Data Intensive Scalable Computing (DISC) applications are typically configured with local drives, and organizing those drives with a parallel file system can be an effective way to provide a low-cost, high-performance, cluster-wide storage resource.
Data Distribution A number of approaches are used for managing data distribution in parallel file systems. Figures a and b show a simple example file system name space and a distribution of that name space onto multiple storage nodes, respectively. In the name space, two HPC applications have generated three files. A fusion code has written a large checkpoint, while a DISC application is analyzing a pair of relatively small images from observations of the night time sky. The checkpoint has been split into a set of blocks distributed across multiple storage nodes, while the data for each of the smaller images resides on a single server. In addition to distributing the file data across storage nodes, in this example the file system metadata has also been distributed, as depicted by directory data residing on different storage nodes. File system metadata is the data that represents the directory tree as well as information such as the owners and permissions of files. File data is typically divided into fixed-size units called stripe units and distributed in a round-robin manner over a fixed set of servers. This approach is similar
File Systems. Fig. Data distribution in a parallel file system (PFS): (a) parallel file system name space, (b) distribution of data across parallel file system servers
to the way in which data is distributed over disks in a disk array. How PFSes keep track of where data is stored will be covered in the section on Metadata Storage and Access. In a configuration such as the one shown here, the file system deployment is said to be symmetric, meaning that metadata and data are both distributed and stored on the same storage nodes. Not all parallel file systems operate this way, and those that keep metadata on separate servers are considered asymmetric. There are advantages to both approaches. The symmetric approach provides a high degree of concurrency for metadata operations, which can be important for applications that generate and process many files. On the other hand, storing metadata on a smaller set of nodes leads to greater locality of access for metadata and simplifies fault tolerance for the name space. Both approaches are commonly used, and some parallel file systems, such as GPFS [] and PVFS [], can be deployed with either configuration depending on workload expectations.
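As a concrete illustration of round-robin striping, the sketch below maps a logical file extent onto (server, stripe unit, offset) pieces, given a stripe-unit size and an ordered list of servers. The function and server names are hypothetical, for illustration only, and do not correspond to any particular file system's API.

def map_extent(offset, length, stripe_unit, servers):
    """Split a logical [offset, offset+length) extent into per-server pieces
    under simple round-robin striping over `servers`."""
    pieces = []
    end = offset + length
    while offset < end:
        stripe_index = offset // stripe_unit           # which stripe unit globally
        server = servers[stripe_index % len(servers)]  # round-robin placement
        within = offset % stripe_unit                  # offset inside the stripe unit
        chunk = min(stripe_unit - within, end - offset)
        pieces.append((server, stripe_index, within, chunk))
        offset += chunk
    return pieces

# A 150,000-byte write starting at byte 100,000, with 64 KiB stripe units over 4 servers.
for piece in map_extent(100_000, 150_000, 64 * 1024, ["s0", "s1", "s2", "s3"]):
    print(piece)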
Access Patterns
The designs of parallel file systems have been heavily influenced by the characteristics and access patterns of HPC applications. These applications execute at very large scale, with current systems allowing hundreds of thousands of application processes to execute at once. Potentially all of those processes might initiate an I/O operation at any time, including all at once. Additionally, HPC applications are primarily scientific simulations. These codes operate on complex data models, such as regular or irregular grids, and these data models are reflected in the data that they store. The end result is that HPC access patterns are often highly concurrent and complex, including noncontiguity of access []. This often arises from access to subarrays of large global arrays. A primary use case for parallel file systems in HPC environments is checkpointing. Checkpointing is the writing of application state to persistent storage for the purpose of enabling an application to restart at some point in the event of a system failure. This state is typically referred to as a "checkpoint." If a node on which the application is running crashes, then the application can be restarted using the checkpoint to avoid having to recalculate early results. Typically an application will compute for some amount of time, and then all the processes will pause and write their checkpoint so that the checkpoint represents the same point in time across all the processes. Users are often surprised that their serial workloads are not faster when run on a parallel file system. The cost of coordination in the PFS in conjunction with different caching policies and the latency of communicating with file servers can result in performance no different from, or even worse than, that of local storage. Serial workloads are simply not the primary focus of PFS designs.
Consistency In the context of PFSes, consistency is about the guarantees that the file system makes with respect to concurrent access by processes, particularly processes on different compute nodes. The stronger these guarantees, the more coordination is generally necessary, inside the PFS, to ensure the guarantees.
For example, most parallel file systems ensure that data written by one process is visible to all other processes immediately following the completion of the write operation. If no other copies of file data are kept in the system, then this is easy to enforce: the server owning the data simply always returns data from the most recent completed write. Some parallel file systems make a stronger guarantee, that of atomicity of writes. This means that other processes accessing the PFS will always see either all, or none, of a write operation. This guarantee requires coordination, because if a write spans multiple servers, then the servers need to agree on which data to return in reads if reading is allowed while a write is in progress. Alternatively, the PFS could force readers to wait when a write is ongoing. Many parallel file systems do guarantee atomicity of writes, and this approach of delaying reading when writes are ongoing is commonly used to provide this guarantee. These PFSes include a distributed locking component. Distributed locking systems in parallel file systems are typically single-writer, multireader, meaning that they allow many client nodes to simultaneously read a resource but permit only a single client node access when writing is in progress. Scalable parallel file systems perform locking on ranges of bytes in files, usually rounded to some block size. This allows concurrent writes to different ranges from different compute nodes. The use of distributed locking in parallel file systems has two significant impacts on designs that rely on them. First, the granularity of locking places a limit on concurrency and leads to false sharing during writes. False sharing occurs when two processes on different compute nodes attempt to write to different bytes of a file that happen to map to a single lock. While the two processes are not really performing shared access to the same bytes, the locking system behaves as if they were, serializing their writes. Given the nature of HPC access patterns, false sharing is common. Second, availability is impacted. If a compute node fails while holding a lock, access to the locked resource is halted until the lock can be reclaimed. Determining the difference between a busy compute node and a failed one can be difficult at large scale, especially in the presence of lightweight, non-preemptive kernels. For these reasons, some PFS designs eschew the use of distributed
locking, instead slightly relaxing consistency guarantees to a level that may be supported without locking infrastructure [].
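The false-sharing effect of coarse lock granularity can be seen with a small sketch: two writers touching disjoint byte ranges still collide whenever their ranges round to the same lock block. The block size and byte ranges below are made-up values for illustration.

def lock_blocks(offset, length, block_size):
    """Set of lock blocks covering the byte range [offset, offset+length)."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return set(range(first, last + 1))

BLOCK = 64 * 1024  # assumed lock granularity (bytes)

# Two processes write disjoint byte ranges...
a = lock_blocks(offset=10_000, length=4_096, block_size=BLOCK)
b = lock_blocks(offset=20_000, length=4_096, block_size=BLOCK)

# ...but both ranges fall in lock block 0, so the writes serialize (false sharing).
print("conflicting lock blocks:", a & b)   # -> {0}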
Parallel File System Design
Five major mechanisms shape a PFS design:
● Underlying storage abstraction
● Metadata storage and access
● Caching
● Tolerance of failures
● Enforcement of consistency semantics
Consistency semantics have been discussed. The other four topics are covered next.
Underlying Storage Abstraction Some early parallel file systems relied on storage area networks (SANs) to provide shared access to physical drives. In that approach, data was accessed directly by block locations on drives or logical volumes. Eventually designers realized that interposing storage servers had some advantages, but this block access model was retained. IBM’s General Parallel File System (GPFS) uses this model, with software on the server side providing access to storage over an interconnection network such as Gigabit Ethernet. Figure a shows the division of functionality in such a parallel file system. On the client side, PFS software translates application I/O operations directly into block accesses on specific file servers and obtains permission to access those blocks. File servers receive block access requests and perform them on locally accessible storage. An important development that shaped more recent PFS designs was the concept of object storage devices. Object storage devices (OSDs) are storage devices (e.g., disks, disk arrays) that allow users to store data in logical containers called objects. By abstracting away block locations, file systems built using this abstraction are simpler to build, and blocks can be managed locally, at the device itself. Recently developed parallel file systems, such as Lustre [], PVFS [], and PanFS [], all use object-based access. Figure b shows how functionality is divided in such a design. Application accesses are translated into accesses to objects stored on file
File Systems
File Systems. Fig. The two most common approaches to data storage in parallel file systems are block-based (left) and object-based (right). (a) Block-based data access: the PFS client translates application I/O requests into accesses to data blocks stored on remote servers, with blocks allocated from a global pool when needed; the file system server provides access to locally attached storage devices using a block protocol similar to SCSI. (b) Object-based data access: the PFS client translates application I/O requests into accesses to objects stored on remote servers; the file system server provides access to locally attached storage devices through an object interface, with objects mapped to local storage blocks on the server
servers. File servers translate these accesses into locations on locally accessible storage and manage allocation of space when needed. A less obvious advantage of the object-based approach is the possibility of algorithmic distribution of file data. This relates to file system metadata.
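The difference between the two models can be summarized in code: a block server exposes fixed-size numbered blocks and leaves allocation to the client, whereas an object server exposes byte-addressable named objects and allocates space itself. The following is a toy sketch of the object side only, with invented names; it is not any real OSD interface.

class ToyObjectServer:
    """Minimal object-storage abstraction: named objects, byte-granular I/O,
    with space allocation handled on the server side."""
    def __init__(self):
        self.objects = {}                  # object id -> bytearray

    def write(self, oid, offset, data):
        buf = self.objects.setdefault(oid, bytearray())
        if len(buf) < offset + len(data):  # server extends the object as needed
            buf.extend(b"\x00" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data

    def read(self, oid, offset, length):
        return bytes(self.objects.get(oid, bytearray())[offset:offset + length])

srv = ToyObjectServer()
srv.write("file42.obj0", 0, b"checkpoint data")
print(srv.read("file42.obj0", 0, 10))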
Metadata Storage and Access There are three major components of PFS metadata: directory contents, file and directory attributes, and file data locations. Directory contents are simply the listing of files and directories residing in a single directory. Typically the contents for a single directory are stored on a single server. This can become a bottleneck for workloads that generate a lot of files in a single directory, so some parallel file systems will split the contents of a single directory across multiple servers (e.g., GPFS []). File and directory attributes are pieces of information such as the owner, group, modification time, and permissions on a file or directory. Some of this information changes infrequently (e.g., owner), while
other information can change rapidly (e.g., modification time). Some parallel file systems keep all these attributes on a single server, while others distribute the attributes for different files to different servers to allow concurrent access. File data locations are the metadata necessary to map a logical portion of a file to where it is stored in the file system. For a block-based PFS, this information generally consists of a list of blocks or (server, block) pairs, similar to the traditional inode structure in a local file system. For an object-based PFS, a list of objects or (server, object) pairs is kept along with some parameters that define how data is mapped from the logical file into the objects. In a typical round-robin distribution, a stripe unit parameter defines how much data is placed in each object for a single stripe, and this pattern is repeated as necessary over the set of objects. This algorithmic approach to data distribution has the advantages that file data location metadata does not grow over time and does not need to be modified as the file grows.
Updates to metadata are usually handled in one of two ways. First, if the PFS already implements a distributed locking system, that system might be used to synchronize access to metadata by clients. Clients can then read, modify, and write metadata themselves in order to make updates. This approach has the advantage of allowing caching of metadata as well. Second, the PFS might implement a set of atomic operations that allow clients to update metadata without locks. For example, a special “insert directory entry” operation can be used by a client to add a new file. The server receiving this message is then responsible for ensuring atomicity of the operation, blocking read operations, or returning previous state until it completes its modifications. This approach reduces network traffic for modifications, but it does not lend itself to caching of modifiable data on the compute node side.
Caching Caching of file system data contributes greatly to the observed performance of local file systems, in part because access patterns on local file systems (e.g., editing files, compiling code, surfing the Internet) often focus on a relatively small subset of data with multiple accesses to the same data in a short period of time. Implementing caching of file system data in the local file system context is also fairly straightforward, because the file system code sees all accesses and can ensure that the cache is kept consistent with what is on storage without any communication. Caching in the context of a parallel file system is more complex when consistency semantics are taken into account. Keeping multiple copies of file system data and metadata consistent across multiple compute nodes requires the same sort of infrastructure as needed to enforce serialization of changes to file data and to ensure notification of changes. Typically parallel file systems that allow caching of data on compute nodes rely on a distributed locking system to ensure that all cached copies are the same, or coherent. Thus the caches are organized much the same as a software distributed memory system. The benefits of caching in parallel file systems are highly dependent on the applications using the system. Checkpointing operations do not typically benefit from caching, because the amount of data written usually exceeds data cache sizes, and the data is not accessed
again soon after. Likewise, large analysis applications that read datasets in parallel often exhibit access patterns that confuse typical cache read-ahead algorithms. Nevertheless, if the parallel file system is also used for home directory storage, or if applications access many small files, then caching can be beneficial.
Fault Tolerance Leading HPC systems use parallel file systems with hundreds of servers and thousands of drives. With that many components, failures are inevitable. In order to ensure that data is not lost, and that the file system remains available in the event of component failures, parallel file systems rely on a collection of fault tolerance techniques. Nearly all parallel file systems rely on redundant array of independent disk (RAID) techniques to cope with drive failures, usually employed only on drives directly accessible by servers (i.e., not between multiple servers). These techniques use parity or erasure codes to allow data to be reconstructed in the event of a failure using the data that remains. Typically, RAID is performed on blocks of storage, but some systems perform RAID on an object-by-object basis []. Server failures are often handled through the use of failover techniques. Figures a and b depict a set of PFS servers configured with shared storage to allow for failover. In Fig. a, no failures have occurred. The system is running in an active–passive configuration, meaning that an additional server is sitting idle in case a failure occurs. When a server does fail (Fig. b), the idle server is brought into active service to take over where the previous server failed. The active–passive configuration allows for a certain number of failures to occur without degrading service. Alternatively, in an active–active configuration all servers are active all the time, with some server “doubling up” and handling the workload of a failed server when a failure occurs. This can lead to a significant degradation of performance. While not depicted in the figures, this configuration can also tolerate storage controller failures, because all drives are accessible through either of the storage controllers. If a storage controller were to fail, the servers on the remaining functional storage controller would take over the responsibilities of the servers that could
File Systems. Fig. Typical PFS configuration using shared backend storage to enable server failover, before (a) and after (b) a server failure. (a) Fault-tolerant storage configuration with active–passive failover, no failures. (b) Fault-tolerant storage configuration after a failure; the previously idle server has taken over. Compute nodes have been omitted from the figure
no longer see storage, maintaining accessibility but with obvious performance penalties. Designs exist that do not share access to storage on the back-end [, ]. As opposed to the approach described above, these designs apply RAID techniques across multiple servers. This approach eliminates the need for expensive enterprise storage on the back-end, but it requires additional coordination between clients and servers in order to correctly update erasure code blocks. In the presence of concurrent writers to the same file region (e.g., writing to different portions of the same file stripe), performance can be significantly reduced.
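A minimal illustration of the parity idea behind RAID-style protection (whether applied across local drives or, in some designs, across servers): the parity unit is the XOR of the data units, so any single lost unit can be rebuilt from the survivors. This sketches only the principle, not a real RAID layout or erasure code.

from functools import reduce

def xor_parity(units):
    """Bytewise XOR of equally sized data units."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), units)

data = [b"AAAA", b"BBBB", b"CCCC"]      # three data units of one stripe
parity = xor_parity(data)

# Unit 1 is lost; rebuild it from the remaining units plus the parity unit.
survivors = [data[0], data[2], parity]
rebuilt = xor_parity(survivors)
assert rebuilt == data[1]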
Parallel File System Interfaces Most applications access parallel file systems through the same interface provided for local file systems, and the industry standard for file access is the POSIX standard. As mentioned previously, the POSIX standard was developed primarily with local file systems and singleuser applications in mind, and as a result it is not an ideal interface for parallel I/O. One reason is its use of the open model for translating a file name into a reference for use in I/O. This model forces name translation to take place on every process, which can be expensive at scale.
Additionally, the POSIX read and write I/O calls allow for description only of contiguous I/O regions. Thus, complex I/O patterns cannot be described in a single call, leading instead to a long sequence of small I/O operations. Moreover, the POSIX consistency semantics require atomicity of writes and immediate visibility of changes on all nodes. These semantics are stronger than are needed for many HPC applications, and require communication to enforce. An effort is underway to define a set of extensions to the POSIX standard for HPC I/O. This effort includes a set of new function calls that address the three issues outlined here. Implementations of these calls have not yet begun to appear in operating systems, but a few custom implementations do exist in specific parallel file systems, and early results are encouraging. The MPI-IO interface standard is the only current alternative to POSIX for low-level file system access.
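The contiguous-only nature of POSIX read/write is easy to quantify: a strided subarray access decomposes into one small request per row under plain POSIX, whereas a richer interface (MPI-IO datatypes, or the list-style extensions mentioned above) can describe the whole pattern in a single call. The sketch below just counts the pieces; the sizes are illustrative.

def posix_pieces(row_bytes, rows, file_row_bytes, start=0):
    """(offset, length) list for reading a rows x row_bytes subarray out of a
    file whose full rows are file_row_bytes long: one contiguous piece per row."""
    return [(start + r * file_row_bytes, row_bytes) for r in range(rows)]

# A 1024 x 1024 block of 8-byte values out of a 16384-column global array:
pieces = posix_pieces(row_bytes=1024 * 8, rows=1024, file_row_bytes=16384 * 8)
print(len(pieces), "separate contiguous requests under plain POSIX I/O")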
Related Entries
Checkpointing
Clusters
Cray XT and Cray XT Series of Supercomputers
Cray XT and Seastar -D Torus Interconnect
Exascale Computing
Fault Tolerance
HDF
IBM Blue Gene Supercomputer
MPI-IO
NetCDF I/O Library, Parallel
Roadrunner Project, Los Alamos
Synchronization
Software Distributed Shared Memory
Bibliographic Notes and Further Reading Early Parallel File Systems The Vesta parallel file system [, ] was a file system built at the user level (i.e., not in the kernel) and tested on IBM SP systems. AIX Journaled File System volumes were used for storage by file system servers. Vesta is notable for using an algorithmic distribution of data, exposing a two-dimensional view of files, and providing numerous collective I/O modes and control over consistency semantics.
The Frangipani [] parallel file system and Petal [] virtual block device were originally developed at Digital Equipment Corporation and are well-documented examples of one way of organizing a block-based parallel file system. The Petal system combines a collection of block devices into a single, large, replicated pool that the Frangipani system uses as its underlying storage.
Current Parallel File Systems Currently the largest HPC systems tend to use one of four parallel file systems: IBM's General Parallel File System (GPFS); Sun's Lustre; Panasas's PanFS; and the Parallel Virtual File System (PVFS), a community-built parallel file system whose development is led by Argonne National Laboratory and Clemson University. GPFS [] grew out of the Tiger Shark multimedia file system. It uses a block-based approach, and can operate either with clients physically attached to a storage area network or, more typically, with clients accessing storage attached to file system servers via a shared disk model. A lock-based approach is used for coordinating access to blocks of data, and through its locking system GPFS is able to cache data on the client side and provide full POSIX semantics. Lustre [] is an asymmetric parallel file system that uses software-based object storage servers (OSSes) to provide data storage. It is the file system chosen by Cray for use on the Cray XT// systems. File data is striped across objects stored on multiple OSSes for performance. A locking subsystem is used to coordinate file access by clients and allows for coherent, client-side caching. Failure tolerance is provided by using shared storage and traditional server failover. PanFS [] is an asymmetric parallel file system that uses a clustered metadata system in conjunction with lightweight data storage blades that provide data access via the ANSI T- object storage device (OSD) protocol [, ]. It is the file system used on the IBM Roadrunner system at Los Alamos National Laboratory. PanFS is notable in that it does not rely on shared storage on the back-end; instead clients compute and update parity information directly on data storage blades. Parity is calculated on a per-file basis, and objects in which file data is stored are spread across all storage blades to enable declustering of rebuilds in the event of a failure.
The PVFS [] design team chose to specifically target HPC workloads with their design. It is one of the two parallel file systems deployed on the IBM Blue Gene/P system at Argonne National Laboratory; GPFS is the other. Coordinated caching and locking were not used, eliminating false sharing and locking overheads, and API extensions are available that allow complex noncontiguous accesses to be described with single I/O operations. Data is distributed algorithmically across a user-specifiable number of servers, and an object-based model is used that allows servers to locally allocate storage space. Failure tolerance is provided via shared storage between file system servers.
Acknowledgments This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-ACCH. The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC-CH. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.
Bibliography
Schmuck F, Haskin R () GPFS: a shared-disk file system for large computing clusters. In: First USENIX Conference on File and Storage Technologies (FAST), Monterey, January, pp –
Lang S, Carns P, Latham R, Ross R, Harms K, Allcock W () I/O performance challenges at leadership scale. In: Proceedings of Supercomputing. Portland, November
Nieuwejaar N, Kotz D, Purakayastha A, Ellis CS, Best M () File-access characteristics of parallel scientific workloads. IEEE T Parall Distr ():– [Online]. Available: http://www.computer.org/tpds/td/abs.htm
Carns PH, Ligon III WB, Ross RB, Thakur R () PVFS: a parallel file system for Linux clusters. In: Proceedings of the Annual Linux Showcase and Conference. Atlanta: USENIX Association, October, pp –. [Online]. Available: http://www.mcs.anl.gov/thakur/papers/pvfs.ps
Braam PJ () The Lustre storage architecture. Cluster File Systems, Tech. Rep. [Online]. Available: http://lustre.org/docs/lustre.pdf
Welch B, Unangst M, Abbasi Z, Gibson G, Mueller B, Small J, Zelenka J, Zhou B () Scalable performance of the Panasas parallel file system. In: Proceedings of the USENIX Conference on File and Storage Technologies (FAST). San Jose
Weil SA, Brandt SA, Miller EL, Long DDE, Maltzahn C () Ceph: a scalable, high-performance distributed file system. In: Proceedings of the USENIX Symposium on Operating Systems Design and Implementation. Berkeley, pp –
Corbett PF, Feitelson DG () The Vesta parallel file system. ACM Trans Comput Syst ():–
Corbett PF, Feitelson DG, Prost J-P, Baylor SJ () Parallel access to files in the Vesta file system. In: Proceedings of Supercomputing. IEEE Computer Society Press, Portland, pp –
Thekkath C, Mann T, Lee E () Frangipani: a scalable distributed file system. In: Proceedings of the Sixteenth ACM Symposium on Operating System Principles (SOSP). Saint-Malo, October
Lee EK, Thekkath CA () Petal: distributed virtual disks. In: Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems. Cambridge, October, pp –
ANSI INCITS T Committee, Project t/-d working draft: Information technology – SCSI object-based storage device commands (OSD). July
Weber RO () SCSI object-based storage device commands- (OSD-), revision a. INCITS Technical Committee T/-D working draft, work in progress, Tech. Rep., January
IEEE () POSIX (ISO/IEC) [IEEE/ANSI Std .] Information Technology – Portable Operating System Interface (POSIX) – Part: System Application Program Interface (API) [C Language]. IEEE, New York, NY, USA
Nowicki B () NFS: Network File System Protocol Specification, RFC, Sun Microsystems, Inc. March, http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc.html
Fill-Reducing Orderings METIS and ParMETIS
First-Principles Molecular Dynamics Car-Parrinello Method
Fixed-Size Speedup Amdahl’s Law
FLAME libflame
Floating Point Systems FPS-B and Derivatives Chris Hsiung Hewlett Packard, Palo Alto, CA, USA
Synonyms SIMD (single instruction, multiple data) machines
Definition FPS AP-B was the first commercially successful attached array processor product designed and manufactured by Floating Pont Systems Inc., in . It is designed to be attached to a host computer, such as the DEC PDP-, as a number crunching accelerator. The key innovation behind this family of array processors is the horizontal micro-coding of the wide instruction set architecture with pipelined execution units to achieve high parallelism. Floating Point Systems Inc., later introduced several derivative products to -bit arithmetic and floating point calculations, by employing higher speed silicon technology, VLSI, and multiple floating point coprocessors.
Discussion Introduction Floating Point Systems (FPS) Inc. was founded in by former Techtronix engineer C. Norm Winningstad in Beaverton, Oregon, to manufacture low-cost, high performance attached floating point accelerators, mainly, for minicomputers, targeting signal processing applications. FPS AP-B was the first attached processor product marketed under the company’s own brand. It was co-designed by George O’Leary and Alan Charlesworth, first delivered in . Its pipelined architecture gave it a peak performance of MFLOPS (Million FLoating-point Operations Per Second), with -bit floating point calculations. By , roughly , machines had been delivered []. Since FPS-B
was designed for mini-computer host machines, such as the DEC PDP-, a larger memory version, FPS-L, was also introduced for mainframe hosts such as the IBM- series. In order to expand to more general purpose scientific and engineering applications, FPS announced the FPS- system in . The word length was extended from -bit to -bit. The addressing capability was expanded from -bit to -bit, and the memory size grew to . MW (Million -bit Words). Even though the peak speed went down a bit, from MFLOPS to MFLOPS, the size of the problem and the precision of the calculations have both increased. The silicon technology remained the same, TTL. In , FPS announced the first -bit machine at higher speed with enhanced architecture, which is the FPS-/MAX. This was the first VLSI design which incorporated matrix accelerator (MAX) boards. This is a SIMD (Single Instruction stream and Multiple Data stream) design that can incorporate up-to MAX boards, each board having computation power equivalent to that of two FPS- processors. With a total of FPS- CPU’s, the peak computational power reached MFLOPS. In , FPS announced the first ECL version, FPS, with a peak speed of MFLOPS. In the meantime, FPS- also went through a makeover with the new VLSI technology, which can now be attached to both the MicroVAX or the Sun workstations. Another derivative of the FPS-B system is a MIMD (Multiple Instruction and Multiple Data streams) version called FPS-, announced in . It incorporated a shared-memory bus architecture among a control processor, three arithmetic coprocessors and an I/O processor, connected to a host. The control processor is the traditional AP-B, and the arithmetic coprocessors, XP, were a new design, but by employing the same concept as AP-B. Since this MIMD architecture approach digressed from the array concept, we will not go into any length in this discussion. One word of clarification regarding AP. In the high performance computing circle, AP stands for Array Processors, even though, unlike the ICL DAP family, there was not necessarily any “array” of processors presented in the architecture. So, realistically, AP can be viewed as Attached Processors, or, we can interpret “array processor” as the attached processor designed
Floating Point Systems FPS-B and Derivatives
especially for efficient array operations. For this family of systems, the common grounds are the horizontal micro-coding concept (each instruction contains multiple fields of micro-instructions, which are decoded in parallel) and the pipelined architecture.
FPS AP-B Hardware
The FPS AP-B was a micro-coded, pipelined, array processor design, attached to host computers such as the DEC PDP-/. Data transfer between host and AP was done via DMA (Direct Memory Access). Its cycle time was ns ( MHz), with a peak speed (multiply-add) of MFLOPS. For a detailed architecture block diagram, please consult Hockney and Jesshope's book, Fig. ., p. , in []. The whole idea of micro-coding was based on the concept of the (synchronous) parallel execution of multiple execution units. A single -bit instruction word was divided into an arithmetic field (address calculation), floating point fields, and two data pad register fields, all under the control of the decoding and execution unit. The address and logic unit (ALU) was -bit wide. There was a set of -bit S-registers for the ALU. The floating point units (FMUL and FADD) were -bit wide. There were two data pad registers (DX and DY), each -bit wide. They were both for receiving data from memory and used as operands in floating point calculations. The FADD was a two-stage adder, and FMUL was a three-stage multiplier. In order to keep the pipelines moving, they needed to be explicitly "pushed." For example, in order to push the result of a FADD operation into its destination register, one needed to issue another FADD operation in the next cycle, either with a valid floating add of another pair of operands (FADD DX, DY), or with another FADD operation (FADD) without any operands. The same principle was true with FMUL. There were three kinds of memory units: the program memory, the main (data) memory, and the table memory. The program memory was K × -bit. The table memory was K × -bit. It contained approximations for sine/cosine and other special functions, and was used to calculate intrinsic function values. The main memory was interleaved into two banks, odd and even. For the fast memory option ( ns access time), each bank was capable of producing one result every other cycle
(two cycle latency to data pad register). Hence, in the case of fast memory, sequential accesses could be done at each cycle with no delay. For slow memory, with ns access time, each bank could only deliver one result every three cycles. The floating point format of the -bit data was a bit unconventional in its word length. The exponent field had -bits in Radix-, and the mantissa had -bits in two's complement format. Thus, the exponent had a dynamic range of to the power ±, and the mantissa had -digit accuracy. Hence, it provided much better range and accuracy compared to the IBM -bit format. Extra guard bits were also kept in the calculation in order to improve precision. Format conversion between host and array was done on the fly during the data move. Data moves into, or out of, the AP-B were performed by an I/O port (IOP) or a general programmable I/O port (GPIOP) via DMA operations, stealing cycles if necessary. There were two data widths, -bit and -bit. The bandwidth of moving data into AP memory was . MW/s, and that of moving data out was . MW/s. Special features to support real-time audio and video processing were also provided.
Software and Performance
In order to achieve optimal performance on pipelined and horizontally micro-coded machines, it was best to write compact loops, based on skillful arrangement of the critical path. If done right, it was truly a thing of beauty. In carrying out matrix-vector type operations, for example, the main loop body could sometimes be written as a one-line (single instruction) loop structure. Within that one line, there could be a floating multiply-add pair (possibly unrelated ops, but typically "chained" together), indexing, memory ops, and a conditional test and branch. Sometimes, an extra cycle could even be saved by storing/accessing some temporary data in the table memory, since the table memory had an independent data path. Of course, once the loop body was set, the programmer needed to wrap the initialization part and the winding-down code around the loop in order to complete the whole execution. The whole philosophy was very much like coding for vector machines, except that there were multiple instruction fields in the horizontally micro-coded instruction set.
There was no hardware support for divide in the machine. The loop body for a vector divide was three cycles, for example. A scalar divide was much longer. Initially, in order to use the array processor, users had to call library procedures from the main program running on the host machine. Users used library procedures to transfer data to and from the attached machine, and then library procedures to do the computation. Library procedures were mainly used to execute vector (array) operations, including intrinsic functions, basic arithmetic operations, a signal processing library, an image processing library (including D FFT, convolution, and filtering), and an advanced math library (including sparse matrix solution and eigenvalues, and Runge–Kutta integration). The next step was the availability of a cross Fortran compiler on the host machine, to compile Fortran IV code into object code for the AP, to be executed on the array processor side. With this offering, the AP became much more versatile. But, since the AP-B did not provide a real time clock, most timing analyses were done on the host side, through APSIM, a timing simulator. Of course, with the use of a simulator, which was thousands of times slower, it could only be useful for shorter running codes.
FPS-
The FPS-, announced in , was a -bit enhancement of the -bit AP-B. It incorporated some VLSI components in its design. The cycle time was ns, a bit slower than its predecessor. This gave a peak speed of MFLOPS. Other than the double precision design, there were other improvements as well. First, the instruction memory (program memory) and the data memory were merged into one large main memory. An instruction cache was loaded automatically from the instruction memory as needed. Hence this instruction cache (size: , -bit) replaced the program memory of the AP-B. There was also a stack memory, consisting of -bit entries used to store subroutine return addresses, to expedite procedure calls. The table memory, also called auxiliary memory, became primarily a random-access memory, used to store intermediate data from registers (e.g., as a register overflow area). However,
the first K of memory space was read-only, used to store commonly used constants and table values. The main memory consisted of memory modules. Each module had two boards (hence two banks). The memory size ranged from the initial . MW (with K-bit dynamic NMOS) to . MW later (with K-bit dynamic NMOS). All memory accesses were done in a pipelined fashion, both for main memory and table memory. Memory mapping and protection were also available. The FPS- did provide a programmable real-time clock and a CPU timer, which enabled accurate timing of program executions.
FPS-/MAX
The FPS-/MAX was introduced in . The "MAX" here stood for matrix algebra accelerator. The machine consisted of one FPS- processor and up to MAX boards. Each board had the capability of two FPS- CPUs. In addition, four vector registers, each with , elements, were added at each CPU. In total, the computing power consisted of CPUs, with a peak performance of MFLOPS. The MAX boards could execute just a handful of vector operations (SDOT, vector multiply, vector add, scalar vector multiply add, etc.). The logic was implemented with CMOS VLSI technology. The vector registers were memory mapped to the top MW of the MW main memory. That meant vector operations were executed simply by reading from, or writing to, the appropriate memory locations. Of course, in order to take advantage of the extra computing power, the arrays had to be very long. Hence, this type of SIMD architecture was more of a special-purpose machine.
FPS-
The FPS-, announced in , was a (custom air-cooled Fairchild K) ECL implementation of the (Schottky TTL technology) FPS-. It shortened the clock cycle time from to ns, a speedup of more than three times. Other improvements in the instruction cache and data memory (static NMOS) boasted a total improvement of x to x over the FPS- generation.
Bibliographic Notes and Further Reading
For interested readers, the best textbook covering the complete family of FPS products is []. A complete history of all parallel computers can be found in []. In , FPS brought to market its long anticipated T-Series product, a hypercube, distributed memory, message passing multicomputer, with Occam software. Because of its arcane software environment and non-general-purpose nature, it was not very successful in the marketplace. FPS then changed direction and brought the FPS- product family to the market. These developments are beyond the scope of this entry. Interested readers can find more details in "FPS Computing: A History of Firsts," by Howard Thraikill, former President and CEO of FPS, in [].
Bibliography
1. Hockney RW, Jesshope CR () Parallel computers : architecture, programming, and algorithms. IOP, Bristol
2. Wilson GV The history of the development of parallel computing. http://ei.cs.vt.edu/~history/Parallel.html
3. Ames KR, Brenner A () Frontiers of supercomputing II: a national reassessment. University of California Press, Berkeley
Flow Control
José Flich
Technical University of Valencia, Valencia, Spain
Synonyms
Backpressure
Definition
Flow control is a synchronization protocol for transmitting and receiving units of information. It determines the advance of information between a sender and a receiver, enabling and disabling the transmission of information. Since messages are usually buffered at intermediate switches, flow control also determines how resources in a network are allocated to messages traversing the network.
Discussion
Flow control is defined, in its broad sense, as a synchronization protocol that dictates the advance of information from a sender to a receiver. Flow control determines how resources in a network are allocated to packets traversing the network. First, basic definitions and types of flow control are introduced. Then, the differentiation between flow control and switching is highlighted. Finally, basic flow control mechanisms are described and briefly compared.
Messages, Packets, Flits, and Phits
Usually, the sender needs to transmit information to the receiver. To do this, the sender transmits blocks of information, referred to as messages, through a path. A header including the routing information is prepended to the message. Depending on the switching technique used (see below and the "switching" entry), the message may be further divided into packets. A packet is made up of a header (with the proper information to route the packet through the network) and a payload. Every message or packet is then further split into flits. The term flit (flow control bit) is defined as the minimum amount of information that is flow controlled. Flits can then be divided into smaller data units called phits (physical units). A phit is usually the amount of information that is forwarded in a cycle through a network link. Depending on the link width, a certain number of phits will be required to transmit a flit. Phits are not flow controlled; they only exist because of the link width. Figure shows the breakdown of a message into packets, flits, and phits. Not all communication systems use these four types of data units. Indeed, in some networks, packets are not needed and the entire message is transmitted (see Wormhole Switching in the Switching Techniques entry). Also, in some networks, the flit size equals the phit size, and thus there is no need to split a flit into smaller phits. In addition, as will be highlighted later, the flit size can span from a few bits to the entire packet size.
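As a concrete illustration of this hierarchy, the short program below breaks a message into packets, flits, and phits. All of the sizes are assumed example values chosen for the illustration, not values taken from this entry.

#include <stdio.h>

/* Assumed example sizes only. */
enum {
    MESSAGE_BYTES = 2048,   /* message produced by the sender               */
    PACKET_BYTES  = 256,    /* maximum packet payload after packetization   */
    FLIT_BITS     = 128,    /* unit of flow control                         */
    PHIT_BITS     = 32      /* link width: amount transferred per cycle     */
};

int main(void)
{
    int packets        = (MESSAGE_BYTES + PACKET_BYTES - 1) / PACKET_BYTES;
    int flits_per_pkt  = (PACKET_BYTES * 8 + FLIT_BITS - 1) / FLIT_BITS;
    int phits_per_flit = (FLIT_BITS + PHIT_BITS - 1) / PHIT_BITS;

    printf("%d packets, %d flits per packet, %d phits per flit\n",
           packets, flits_per_pkt, phits_per_flit);
    return 0;
}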
Types of Flow Control
The sender and receiver can be connected in different ways; for instance, they can be directly connected through a link, or they can be connected through a
Flow Control. Fig. Messages, packets, flits, and phits
network (made of routers, switches, and links) where several routers/switches are visited, and links traversed, prior to reaching the receiver. In both cases, a flow control protocol is defined to set up the proper transmission of information. Therefore, different levels of flow control can be used at the same time: end-to-end flow control and link-level flow control.
Bufferless Flow Control Versus Buffered Flow Control
Usually, flow control is used to prevent the overflow of buffering resources when transmitting information. When the queue at the receiver gets full, the flow control must stop the transmission of information at the sender. Once there is room at the queue, the flow control resumes the transmission of information. Therefore, flow control is closely related to buffer management. However, there are cases where buffers (or queues) are not used along the transmission path of messages. In that case a bufferless flow control is defined. This is the simplest form of flow control: information is simply forwarded, and potentially the information gets discarded or misrouted (due to contention along the path from the sender to the receiver). Bufferless flow control is also used in the so-called lossy networks, where packets may be dropped (discarded) at the receiver when the buffer fills up. In such networks, the sender will need to retransmit the packet (via a time-out or a NACK control packet sent back from the receiver by upper communication layers, e.g., the transport layer). However, in buffered networks, some flow control mechanism is usually used to prevent packet dropping, and such networks are known as lossless (or flow controlled) networks. This is an important distinction between types of networks, with huge implications. Indeed, the type of network (lossy or lossless) constrains the solutions for packet routing, congestion management, and deadlock issues. Also, network performance will be significantly different. Typically, storage/system-area networks (SAN) used in parallel systems and on-chip networks are lossless, and local area networks (LAN) and wide area networks (WAN) are lossy.

Flow Control Versus Switching
Flow control is also closely related to switching. Switching determines when and how the information advances through a switch or router. Nowadays, the most frequently used switching mechanisms are wormhole and virtual cut-through. One of the key differences between wormhole switching and virtual cut-through switching lies in the size of buffers and messages. In wormhole switching, buffers at routers are usually smaller than message sizes, having slots for only a few flits (and messages can be tens of flits in size or even much larger). However, in virtual cut-through switching, buffers must be larger than packets so as to store an entire packet in a router. Indeed, in virtual cut-through, whenever a packet blocks in the network, the packet needs to be stored completely at the blocking router. On the contrary, in wormhole switching, the message is spread along its path across several routers. This fact greatly affects the flow control mechanism and, indeed, affects the flit size of the flow control protocol. In wormhole switching, the flit size is a small part of the message, whereas in virtual cut-through, the flit size is the size of the packet. Because original messages from the sender could be larger than buffers, packetization (splitting a message into bounded packets) is required in virtual cut-through.
As information may travel across multiple routers, the switching inside a router is also seen as a means of flow control. Indeed, some authors do not distinguish between flow control and switching [], and they describe the most frequently used flow control mechanisms as wormhole flow control and virtual cut-through flow control. However, a basic distinction exists depending on where the flow control is applied. If applied inside a router or switch, then a switching mechanism is used; if applied over a link connecting two nodes, then flow control is used. Keep in mind also end-to-end flow control as a third possibility. Figure shows a pair of connected devices (source and destination nodes) connected through a series of switches and links. The figure illustrates the different levels of flow control (end-to-end and link-level) and where switching is performed (inside the switches). Taking into account this basic differentiation between flow control and switching, the following sections describe the most frequently used link-level flow control mechanisms. They are described together with the switching mechanisms they usually work with. Another key property that distinguishes flow control from switching is that flow control is used as a means to provide backpressure so as not to overflow receiving buffers, whereas switching decides when and how information is moved from the input ports of a switch/router to its output ports.
Impact of Flow Control and Buffer Size
Flow control is a key component in a network. An optimized flow control protocol will keep a fraction
of the buffers in use most of the time and will minimize the time flits spend waiting at queues, thus achieving high performance in the network. However, a suboptimal flow control protocol may keep resources underutilized and flits blocked most of the time waiting at buffers, thus ending up with a network with low performance (low throughput and high latencies). In a flow control protocol, the buffer resources at the receiver end must be efficiently used to keep maximum performance. To do this, the buffer depth must be sized according to the latency of the path between the transmitter and the receiver (the time it takes for a flit to travel from the transmitter to the receiver). As both receiver and transmitter will exchange flow control information, the buffer must be sized, at minimum, to allow for the round-trip time of the link. This value is the path latency multiplied by two (two-way transmission) plus the time it takes to generate and process the flow control command at both sides. Notice that this is needed in any type of flow-controlled transmission, either link-level or end-to-end. Flow control is also highly related to congestion management. Congestion management deals with high levels of blocking in the network among different messages. To solve such a situation, a congestion management technique may reduce the injection of traffic from a router or from an end node, thus resembling flow control methods (as they dictate when flows advance). However, the reader must be aware that in a congestion management technique the goal is to reduce the blocking experienced in the network, whereas in a flow control method the goal is not to overrun a buffer.
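The sizing rule can be made concrete with a small calculation. In the sketch below all latencies and rates are assumed example values; the point is only the structure of the computation (round trip plus flow control processing, times the link rate).

#include <stdio.h>

int main(void)
{
    /* Assumed example values, not figures from the entry. */
    double link_latency_cycles  = 8.0;  /* one-way flit propagation time        */
    double fc_processing_cycles = 4.0;  /* generate + process the control info  */
    double flits_per_cycle      = 1.0;  /* link injection rate                  */

    /* Round trip: data one way, flow control notification back, plus overhead. */
    double round_trip = 2.0 * link_latency_cycles + fc_processing_cycles;

    /* Minimum buffer depth so the sender never stalls while the flow control
     * information is still in flight. */
    double min_depth = round_trip * flits_per_cycle;

    printf("minimum buffer depth: %.0f flits\n", min_depth);
    return 0;
}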
Flow Control. Fig. Different levels of flow control (end-to-end flow control between the source and destination nodes, point-to-point flow control on each link, and switching inside the switches)
Link-Level Flow Control
The most commonly used link-level flow control protocols are Xon/Xoff, credit-based, and Ack/Nack. However, there are other solutions that, although basic and probably low in performance, help to better understand the concept. This is the case for a handshake protocol. Next, each of these flow control protocols is described.
Handshake protocol. The most straightforward method to implement flow control between two devices is by setting up a handshake protocol. As an example, a sender may assert a request signal to the receiver, thus notifying it that it has a flit to send. The receiver (depending on its availability for storing new flits) may assert a grant signal to let the transmitter know that there is a slot at the queue. Then, the transmission of the flit may begin. For the next flit the handshake needs to be repeated (the handshake works at the flit granularity). Although such a protocol is correct and ensures that no queue will overrun, the problem is the low performance achieved in the transmission. Indeed, the delay incurred in the control signal exchange is not overlapped with the transmission of flits, and therefore bubbles are introduced along the transmission path. Another potential problem is the excessive exchange of control information (request and grant signals) between the transmitter and the receiver for every transmitted flit. The following link-level flow control methods reduce such overhead.
Xon/Xoff flow control. Xon/Xoff flow control is also known as Stop&Go. It is mostly used in networks with wormhole switching. Basically, the receiver notifies the sender either to stop or to resume sending flits once high and low buffer occupancy levels are reached, respectively. To do this, two queue thresholds are set at the receiving side. When the queue fills and passes the Xoff threshold, the receiver side sends an off flow control signal to the transmitter to stop transmission (in order to prevent buffer overrun). At the transmitter side, when the off control signal is received, the transmission is stopped and the sender enters the off mode. Transmission will be resumed only when notified by the receiver. Indeed, when the queue drains and reaches the Xon threshold, the receiver sends an on flow control signal to the transmitter. The transmitter enters the on mode again and resumes transmission.
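A minimal sketch of the receiver-side Xon/Xoff logic is shown below. The queue capacity and the two thresholds are assumptions made for the example, and the signalling functions are simple stand-ins for the control wires or control packets a real switch would use.

/* Receiver-side Xon/Xoff (Stop&Go) sketch.  Capacity and thresholds are
 * assumed example values, not figures from the entry. */
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_CAPACITY 16
#define XOFF_THRESHOLD 12   /* ask the sender to stop above this occupancy  */
#define XON_THRESHOLD   4   /* ask the sender to resume at this occupancy   */

static int  occupancy      = 0;      /* flits currently buffered            */
static bool sender_stopped = false;

/* Stubs standing in for the real signalling path back to the sender. */
static void send_xoff(void) { puts("-> Xoff"); }
static void send_xon(void)  { puts("-> Xon");  }

void on_flit_arrival(void)                  /* called when a flit is received  */
{
    occupancy++;
    if (!sender_stopped && occupancy >= XOFF_THRESHOLD) {
        send_xoff();                        /* sender enters the off mode */
        sender_stopped = true;
    }
}

void on_flit_consumed(void)                 /* called when a flit is forwarded */
{
    occupancy--;
    if (sender_stopped && occupancy <= XON_THRESHOLD) {
        send_xon();                         /* sender re-enters the on mode */
        sender_stopped = false;
    }
}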
The Xon and Xoff thresholds are critical from the performance point of view. Indeed, the Xoff threshold must be set so as not to overrun the receiving queue, and the Xon threshold must be set so as not to introduce bubbles along the transmission path (once forwarding of flits is resumed). Therefore, the round-trip time must be taken into account when setting the thresholds. In addition, the Xoff and Xon thresholds must be far enough apart so as not to send many off and on control signals unnecessarily. It is common to set the Xoff threshold at / and the Xon threshold at / of buffer capacity. Control signals may be sent back to the sender either by using special control signals (thus not using the link data path) or by using special control packets through the link data path. With appropriate thresholds, Xon/Xoff significantly reduces the amount of flow control information exchanged. However, it requires large buffer memories to avoid buffer overruns. There are other approaches to setting up the thresholds that may be beneficial. This is the case for the Prizma switch [], where both thresholds are set to the same value. This is efficient if the link length is short enough (the round-trip time is short), as is the case for the Prizma switch. The benefit in this case is the use of a single control signal that, when set, means Go, and when reset, means Stop.
Credit-based flow control. Credit-based flow control is usually linked to networks using virtual cut-through switching. Indeed, the flit size is the packet size and, therefore, packet size is limited to a maximum (the buffer size needs to be larger than or equal to the packet size). In credit-based flow control the sender has a counter of credits. The counter is initially set to the number of flits/slots available at the receiver side. Typically, a credit means a slot for a flit is granted at the receiver side. There are, however, systems where a credit means a chunk of bytes, not necessarily the size of the packet (e.g., in the InfiniBand network a credit means a slot for bytes). The sender transmits a flit whenever it has credits available. If the credit counter is different from zero, then a flit can be transmitted, and the counter is decremented by one. If the counter reaches zero, then the transmitter cannot send more flits to the receiver. The receiver keeps track of receptions and transmissions. Whenever a flit is consumed, the receiver sends a new credit upstream. This is notified by a flow
control signal (again, this information can travel decoupled from the transmission link or as a control packet through the transmission link). Alternatively, and to reduce the overhead in transmitting control information, the receiver may collect credits and send them in a batch. However, in that case, performance may suffer since forwarding of flits may be delayed. Credit-based flow control requires less than half the buffer size required by Xon/Xoff flow control, thus being more effective regarding resource requirements. In addition, the link reaction time is shorter in credit-based flow control. When the sender runs out of credits, it stops sending flits. When a single credit is returned from the receiver, the transmission is resumed. In Xon/Xoff, the go threshold needs to be reached before sending the go signal, thus incurring higher transmission latencies. Figure shows a comparison between the Stop&Go flow control protocol (Xon/Xoff) and credit-based flow control. As can be seen, the credit-based flow control protocol is able to resume forwarding flits sooner than Stop&Go.
Ack/Nack flow control. Flow control is also linked to fault tolerance in message transmissions. Indeed, the
Ack/Nack flow control allows the receiver to notify the sender that the last flit sent was correct (received without error) or erroneous (due to a failure or even because there was no buffer available). There is no state information at the transmitter side. Indeed, the transmitter is optimistic and keeps sending flits to the receiver. In this sense, the Ack/Nack protocol is also known as optimistic flow control. This flow control method, although it reduces latency (the sender just transmits the flit when it is available), has poor bandwidth utilization (it sends flits that are dropped). In addition, the transmitter needs logic to deal with timeouts in order to trigger retransmissions (in the absence of either an ACK or a NACK signal). Also, it needs to keep a copy of the flit until it receives an ACK signal. Thus, the protocol incurs inefficient buffer utilization. Xon/Xoff and credit-based flow control methods do not support failures. Indeed, these protocols only care about buffer overflow. When using such protocols, in the face of a failure the system needs to rely on a higher-level flow control protocol, most of the time an end-to-end flow control protocol.
Flow Control. Fig. Flow control comparison (Xon/Xoff vs credits)
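Returning to credit-based flow control, the following sketch models the sender-side credit counter and the receiver-side buffer for a single link. The buffer size is an assumed example value, and the sketch abstracts away the time credits spend travelling back upstream.

/* Credit-based flow control sketch (one link, one buffer).  The buffer size
 * is an assumed example value. */
#define RECEIVER_SLOTS 8

static int credits   = RECEIVER_SLOTS;   /* sender-side credit counter     */
static int occupancy = 0;                /* receiver-side buffer occupancy */

/* Sender: may transmit only while it holds credits. */
int sender_try_transmit(void)
{
    if (credits == 0)
        return 0;            /* stalled: wait for a credit to come back */
    credits--;
    occupancy++;             /* flit arrives at the receiver buffer     */
    return 1;
}

/* Receiver: consuming (forwarding) a flit frees a slot and returns one
 * credit upstream. */
void receiver_consume_flit(void)
{
    if (occupancy > 0) {
        occupancy--;
        credits++;           /* credit travels back to the sender */
    }
}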
Other Specific Flow Control Protocols
Depending on the type of network used, other flow control protocols may be used to add new functionality, such as higher performance, lower flit latencies, or fault coverage. One example is the T-ERR [] flow control protocol, proposed for on-chip networks. T-ERR increases the operating frequency of a link with respect to a conventional design. In such situations, timing errors become likely on the link. These faults are handled by a repeater architecture leveraging a second clock to resample input data, so as to detect any inconsistency. Therefore, T-ERR obtains higher link throughput at the expense of introducing timing errors.

Flow Control and Virtual Channels
An important issue related to flow control is how to deal with virtual channels (Switch Architecture). Flow control can be instantiated for each virtual channel so that individual virtual channels do not overflow. However, this leads to the assumption that buffer resources are statically assigned to each virtual channel, leading to low memory utilization. The problem with this approach comes from the fact that every virtual channel must be deep enough to cover the round-trip time of the link and avoid introducing bubbles in the transmission line. To reduce such overhead, it is common to design a single (and shared) landing pad at the input port of the switch. The landing pad is flow controlled. Flits arriving at the landing pad are then moved to the appropriate virtual channel. More interesting approaches rely on the dynamic assignment of memory resources to virtual channels. Thus virtual channels may grow in depth depending on the traffic demands. In order to cope with such dynamicity, it is common to combine two flow control mechanisms. One example is the use of the Stop&Go flow control protocol per virtual channel and the credit-based flow control protocol for the entire memory at the input port of the switch. Thus, whenever a switch wants to forward a flit to the next switch, it needs to check whether there are global credits available at the input port and whether the receiving virtual channel is in the go state. Notice that with the proper setting of thresholds a particular
virtual channel will not consume the entire memory resources, thus leaving room for other virtual channels.
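A sketch of the combined check just described (credits for the shared input buffer plus per-virtual-channel Stop&Go state) could look like the following; the data structures and the number of virtual channels are assumptions made for the illustration.

#include <stdbool.h>

#define NUM_VCS 4

enum vc_state { VC_GO, VC_STOP };

struct downstream_port {
    int           shared_credits;    /* credits for the shared input buffer */
    enum vc_state vc[NUM_VCS];       /* per-virtual-channel Stop&Go state   */
};

/* A flit on virtual channel `vc` may be forwarded only if the shared buffer
 * still has credits AND that particular virtual channel is in the Go state. */
bool can_forward(const struct downstream_port *p, int vc)
{
    return p->shared_credits > 0 && p->vc[vc] == VC_GO;
}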
End-to-End Flow Control
Link-level flow control protocols can, in principle, be applied in end-to-end scenarios. Indeed, nothing prevents them from working there. This is the case for the Ack/Nack protocol implemented at end nodes, in order to guarantee the reception of messages at the destination. However, since the round-trip time is longer, large buffers are required at both sides to transmit multiple messages while waiting for ACK signals. Usually this is not a major problem, since end nodes implement much larger buffer resources than routers and switches. There are, however, other protocols designed at that level. Indeed, these flow control protocols are also known as message protocols. Two examples are the eager protocol and the rendezvous protocol. The eager protocol just assumes the receiver will have enough allocated space in its buffers to store the transmitted message. In case the receiver was not expecting the message, it has to allocate more buffering and copy the message. During that time, the network may fill up. Thus, the eager protocol may impact performance. In contrast, the rendezvous protocol does not send a message unless the receiver is aware of it. This is of particular interest when the amount of information to send is large. When using the rendezvous protocol, the transmitter first sends a single short message requesting buffer storage at the receiving side so as to allocate a large buffer. The receiver, after allocating the buffer, sends an acknowledgment to the sender. Upon reception, the sender injects all the messages, which include a sequence number in their header. The receiver just stores every incoming message at the correct buffer slot, depending on the sequence number the message has. Upon reception of all the messages, the data is at the receiver and ordered. The use of the eager or the rendezvous protocol may also affect the way programs are coded. This is the case for MPI programming, where different commands can be implemented with different end-to-end flow control protocols. Also, out-of-order delivery of
messages can be avoided with the rendezvous protocol. As there are applications that do not work in networks with unordered delivery of messages (e.g., when using adaptive routing), the rendezvous protocol may be a good choice for exploiting the benefits of adaptive routing without suffering out-of-order delivery.
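The rendezvous exchange can be sketched as follows. The message types, fields, and helper function are invented for the illustration and do not correspond to any particular MPI implementation or network API; they only show how sequence numbers let segments land in place regardless of arrival order.

/* Rendezvous-style end-to-end exchange, sketched with invented message
 * types and fields (not a real MPI or network API). */
#include <stddef.h>
#include <string.h>

enum msg_type { REQ_BUFFER, BUFFER_READY, DATA_SEGMENT };

struct msg {
    enum msg_type type;
    size_t        total_bytes;  /* REQ_BUFFER: space the receiver must allocate        */
    int           seq;          /* DATA_SEGMENT: position of this segment in the buffer */
};

/* Receiver side: because every data segment carries a sequence number, it can
 * be stored at its final position no matter in which order segments arrive. */
void place_segment(char *big_buffer, size_t seg_bytes,
                   const struct msg *m, const char *payload)
{
    memcpy(big_buffer + (size_t)m->seq * seg_bytes, payload, seg_bytes);
}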
Related Entries
SIMD (single instruction, multiple data) machines
Congestion Management
Switching Techniques
Bibliographic Notes and Further Reading
There are several books that cover the field of interconnection networks. High-performance interconnection networks (used in massively parallel systems and/or clusters of computers) are covered in the books of Duato [] and Dally []. Both offer a good text for flow control and in some cases complementary views. In particular, Dally's book couples flow control with switching, providing a common text for both terms. In Duato's book the terms are clearly differentiated, and the book is more focused on switching (rather than flow control). For more specific and emerging scenarios, like on-chip networks, the book by De Micheli and Benini [] provides alternative flow control protocols (link-level flow control in Chap. and end-to-end flow control protocols in Chap. ). For computer networks (LANs and WANs) the reader should focus on books like Tanenbaum [] (Chap. for the transport layer and Chap. for the data link layer).

Bibliography
1. Dally W, Towles B () Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco, CA
2. Duato J, Yalamanchili S, Ni N () Interconnection networks: an engineering approach. Morgan Kaufmann, San Francisco, CA
3. de Micheli G, Benini L () Networks on chips: technology and tools. Morgan Kaufmann, San Francisco, CA
4. Tanenbaum AS () Computer networks. Prentice-Hall, Upper Saddle River, NJ
5. Tamhankar R, Murali S, Micheli G () Performance driven reliable link for networks-on-chip. In: Proceedings of the Asian Pacific Conference on Design Automation, Shanghai
6. Denzel WE, Engbersen APJ, Iliadis I () A flexible shared-buffer switch for ATM at Gb/s rates. Comput Netw ISDN Syst ()

Flynn's Taxonomy
Michael Flynn
Stanford University, Stanford, CA, USA

Synonyms
Definition
Flynn's taxonomy is a categorization of forms of parallel computer architectures. From the viewpoint of the assembly language programmer, parallel computers are classified by the concurrency in processing sequences (or streams), data, and instructions. This results in four classes: SISD (single instruction, single data), SIMD (single instruction, multiple data), MISD (multiple instruction, single data), and MIMD (multiple instruction, multiple data).
Discussion
Introduction
Developed in [] and slightly expanded in [], this is a methodology to classify general forms of parallel operation available within a processor. It was proposed as an approach to clarify the types of parallelism either supported in the hardware by a processing system or available in an application. The classification is based on the view of either the machine or the application by the machine language programmer. It implicitly assumes that the instruction set accurately represents a machine's micro-architecture. At the instruction level, parallelism means that multiple operations can be executed concurrently within a program. At the loop level, consecutive loop iterations are ideal candidates for parallel execution provided that there is no data dependency between subsequent loop iterations. Parallelism available at the procedure level depends on the algorithms used in the program. Finally, multiple independent programs can obviously execute in parallel. Different instruction sets (or instruction set architectures, or simply computer architectures) have been built to exploit this parallelism. In general, an instruction set architecture consists of instructions each
of which specifies one or more interconnected processor elements that operate concurrently, solving a single overall problem.
The Basic Taxonomy
The taxonomy uses the stream concept to categorize the instruction set architecture. A stream is simply a sequence of objects or actions. There are both instruction streams and data streams, and there are four simple combinations that describe the most familiar parallel architectures []:
1. SISD – single instruction, single data stream; this is the traditional uniprocessor (Fig. ), which includes pipelined, superscalar, and VLIW processors.
2. SIMD – single instruction, multiple data stream, which includes array processors and vector processors (Fig. shows an array processor).
3. MISD – multiple instruction, single data stream, which includes systolic arrays, GPUs, and dataflow machines (Fig. ).
4. MIMD – multiple instruction, multiple data stream, which includes traditional multiprocessors (multicore and multi-threaded) as well as the newer work on networks of workstations (Fig. ).

Flynn's Taxonomy. Fig. SISD – single instruction operating on a single data unit

As the stream description of instruction set architecture is the programmer's view of the machine, there are limitations to the stream categorization. Although it serves as useful shorthand, it ignores many subtleties of an architecture or an implementation. As pointed out in the original work, even an SISD processor can be highly parallel in its execution of operations. This parallelism need not be visible to the programmer at the assembly language level, but generally becomes visible at execution time through improved performance and, possibly, programming constraints.
SISD: Single Instruction, Single Data Stream
The SISD class of processor architecture includes most commonly available workstation-type computers. Although a programmer may not realize the inherent parallelism within these processors, a good deal of concurrency can be present. Obviously, pipelining is a powerful concurrency technique and was recognized in the original work. Indeed, in that work, the idea of simultaneously decoding multiple instructions was mentioned but was regarded as infeasible for the technology of the time. (That paper referred to the issuance of more than one instruction at a time as a bottleneck, and later publications sometimes refer to this problem as Flynn's bottleneck.) Of course, by now almost all machines use pipelining and many machines use one or another form of multiple instruction issue. These multiple issue techniques aggressively exploit parallelism in executing code, whether it is declared statically or determined dynamically from an analysis of the code stream. During execution, a SISD processor executes one or more operations per clock cycle from the instruction stream. Strictly speaking, the instruction stream consists of a sequence of instruction words. An instruction word is a container that represents the smallest execution packet visible to the programmer and executed by the processor. One or more operations are contained within an instruction word. In some cases, the distinction between instruction word and operations is crucial to distinguish between processor behaviors. When there is only one operation in an instruction word, it is simply referred to as an instruction. Scalar and superscalar processors execute one or more instructions per cycle, where each instruction contains a single operation. VLIW processors, on the other hand, execute a single instruction word per cycle, where this instruction word contains multiple operations.
Flynn’s Taxonomy. Fig. SIMD – array processor type using PEs with local memory
Flynn’s Taxonomy. Fig. MISD – a streaming processor example
The amount and type of parallelism achievable in a SISD processor has four primary determinants:
1. The number of operations that can be executed concurrently on a sustainable basis.
2. The scheduling of operations for execution. This can be done statically at compile time, dynamically at execution, or possibly both.
3. The order in which operations are issued and retired relative to the original program order – these operations can be in order or out of order.
4. The manner in which exceptions are handled by the processor – precise, imprecise, or a combination.
On the occurrence of an exception (or interrupt), a precise exception handler will (usually) complete processing of the current instruction and immediately process the exception. An imprecise handler may have already completed instructions that followed an instruction that caused the exception and will simply signal that the exception has occurred and it is left to the operating system to manage the recovery. Most SISD processors implement precise exceptions, although a few high-performance architectures allow imprecise floating-point exceptions.
Flynn's Taxonomy. Fig. MIMD – multiple instruction, multiple data
Scalar (Including Pipelined) Processors
A scalar processor is a simple (usually pipelined) processor that processes a maximum of one instruction per cycle and executes a maximum of one operation per cycle. These processors process instructions sequentially from the instruction stream. The next instruction is not processed until the execution of the current instruction is complete and its results have been stored. The semantics of the instruction determine the sequence of actions that must be performed to produce the desired result (instruction fetch, decode, data or register access, operation execution, and result storage). These actions can be overlapped, but the result must appear in the specified serial order. This sequential execution behavior describes the sequential execution model, which requires each instruction to be executed to completion in sequence. In the sequential execution model, execution is instruction-precise if the following conditions are met:
1. All instructions (or operations) preceding the current instruction (or operation) have been executed and all results have been committed.
2. All instructions (or operations) after the current instruction (or operation) are unexecuted or no results have been committed.
3. The current instruction (or operation) is in an arbitrary state of execution and may or may not have completed or had its results committed.
For scalar and superscalar processors with only a single operation per instruction, instruction-precise and operation-precise executions are equivalent. The traditional definition of sequential execution requires instruction-precise execution behavior at all times, mimicking the execution of a nonpipelined sequential processor. Most scalar processors directly implement the sequential execution model. The next instruction is not processed until all execution for the current instruction is complete and its results have been committed. Pipelining is based on concurrently performing different phases (instruction fetch, decode, execution, etc.) of processing an instruction, with a maximum of one instruction being decoded each cycle. Ideally, these phases are independent between different operations and can be overlapped; when this is not true, the processor stalls the process to enforce the dependency. Thus, multiple operations can be processed simultaneously with each operation at a different phase of its processing. For a simple pipelined machine, only one operation occurs in each phase at any given time. Thus, one operation is being fetched, one operation is being decoded, one operation is accessing operands, one operation is in execution, and one operation is storing results. The most rigid form of a pipeline, sometimes called the
static pipeline, requires the processor to go through all stages or phases of the pipeline whether required by a particular instruction or not. A dynamic pipeline allows the bypassing of one or more of the stages of the pipeline depending on the requirements of the instruction. There are at least two basic types of dynamic pipeline processors:
1. Dynamic pipelines that require instructions to be decoded in sequence and results to be executed and stored in sequence. This type of dynamic pipeline provides a small performance improvement over a static pipeline. In-order execution requires the actual change of state to occur in the order specified in the instruction sequence.
2. Dynamic pipelines that require instructions to be decoded in order, but where the execution stages of operations need not be in order. In these organizations, the address generation stage of the load and store instructions must be completed before any subsequent ALU instruction does a write back. The reason is that the address generation may cause a page fault and affect the execution sequence of following instructions. This type of pipeline can result in the imprecise exceptions mentioned earlier without the use of special order-preserving hardware.
Superscalar Processors
Even the most sophisticated scalar dynamic pipelined processor is limited to decoding a single operation per cycle. Superscalar processors decode multiple instructions in a cycle and use multiple functional units and a dynamic scheduler to process multiple instructions per cycle. These processors can achieve execution rates of several instructions per cycle (usually limited to or , but more is possible in favorable applications). A significant advantage of a superscalar processor is that processing multiple instructions per cycle is done transparently to the user and provides binary code compatibility. Compared to a dynamic pipelined processor, a superscalar processor adds a scheduling instruction window that analyzes multiple instructions from the instruction stream each cycle. Although processed in parallel, these instructions are treated in the same manner as in a pipelined processor. Before an instruction is
issued for execution, dependencies between the instruction and its prior instructions must be checked by hardware. Advanced superscalar processors usually include order-preserving hardware to ensure precise exception handling, simplifying the programmer's model. Because of the complexity of the dynamic scheduling logic, high-performance superscalar processors are usually limited to processing four to eight instructions per cycle.
VLIW Processors
VLIW processors also decode multiple operations per cycle and use multiple functional units. A VLIW processor executes operations from statically scheduled instruction words that contain multiple independent operations. In contrast to superscalar processors, which use hardware-supported dynamic analysis of the instruction stream to determine which operations can be executed in parallel, VLIW processors rely on static analysis by the compiler. VLIW processors are thus less complex than superscalar processors and have, at least, the potential for higher performance. For applications that can be effectively scheduled statically, VLIW implementations offer high performance. Unfortunately, not all applications can be so effectively scheduled, as execution does not proceed exactly along the path defined by the compiler code scheduler. Two classes of execution variations can arise and affect the scheduled execution behavior:
1. Delayed results from operations whose latency differs from the assumed latency scheduled by the compiler.
2. Exceptions or interrupts, which change the execution path to a completely different and unanticipated code schedule.
While stalling the processor can control delayed results, this can result in significant performance penalties. The most common execution delay is a data cache miss. Most VLIW processors avoid all situations that can result in a delay by avoiding data caches and by assuming worst-case latencies for operations. However, when there is insufficient parallelism to hide the worst-case operation latency, the instruction schedule can have many incompletely filled or empty operation slots in the instructions, resulting in degraded performance.
SIMD: Single Instruction, Multiple Data Stream
The SIMD class of processor architecture includes both array and vector processors. The SIMD processor is created around the use of certain regular data structures, such as vectors and matrices. From the reference point of an assembly-level programmer, programming a SIMD architecture appears to be very similar to programming a simple SISD processor except that some operations perform computations on these aggregate data structures. As these regular structures are widely used in scientific programming, the SIMD processor has been very successful in these environments. The array processor and the vector processor differ both in their implementations and in their data organizations. An array processor consists of interconnected processor elements, each having its own local memory space. A vector processor consists of a single processor that references a single global memory space and has special functional units that operate specifically on vectors.
Array Processors
The array processor consists of multiple processor elements connected via one or more networks, possibly including local and global inter-element communications and control communications. Processor elements operate in lockstep in response to a single broadcast instruction from a control processor. Each processor element has its own private memory, and data is distributed across the elements in a regular fashion that is dependent on both the actual structure of the data and also on the computations to be performed on the data. Direct access to global memory or another processor element's local memory is expensive, so intermediate values are propagated through the array through local interprocessor connections, which requires that the data be distributed carefully so that the routing required to propagate these values is simple and regular. It is sometimes easier to duplicate data values and computations than it is to effect a complex or irregular routing of data between processor elements. As instructions are broadcast, there is no means local to a processor element of altering the flow of the instruction stream; however, individual processor elements can conditionally disable instructions based on local status information – these processor elements are
idle when the specified condition occurs. Often implementations consist of an array processor coupled to a SISD general-purpose control processor that executes scalar operations and issues array operations that are broadcast to all processor elements in the array. The control processor performs the scalar sections of the application, interfaces with the outside world, and controls the flow of execution; the array processor performs the array sections of the application as directed by the control processor. The programmer's reference point for an array processor is typically the high-level language level; the programmer is concerned with describing the relationships between the data and the computations but is not directly concerned with the details of scalar and array instruction scheduling or the details of the interprocessor distribution of data within the processor. In fact, in many cases, the programmer is not even concerned with the size of the array processor. In general, the programmer specifies the size and any specific distribution information for the data, and the compiler maps the implied virtual processor array onto the physical processor elements that are available and generates code to perform the required computations.
Vector Processors
A vector processor is a single processor that resembles a traditional SISD processor except that some of the function units (and registers) operate on vectors – sequences of data values that are processed as a single entity. These functional units are deeply pipelined and have a high clock rate; although the vector pipelines have a latency as long as or longer than that of a normal scalar function unit, their high clock rate and the rapid delivery of the input vector data elements result in a significant throughput that cannot be matched by scalar function units. Early vector processors processed vectors directly from memory. The primary advantage of this approach was that the vectors could be of arbitrary length and were not limited by processor registers or resources; however, the high startup cost, limited memory system bandwidth, and memory system contention were significant limitations. Modern vector processors require that vectors be explicitly loaded into special vector registers and stored back into memory from these registers, the same course that modern scalar processors have taken for similar reasons. The vector register file consists of several sets of
vector registers (with perhaps registers in each set). Vector processors have several features that enable them to achieve high performance. One feature is the ability to concurrently load and store values between the vector registers and main memory while performing computations on values in the vector register file. This feature is important because the limited length of vector registers requires that long vectors be processed in segments. Not being able to overlap memory accesses and computations would pose a significant performance bottleneck. Just like SISD processors, vector processors support a form of result bypassing – in this case called chaining – that allows a follow-on computation to commence as soon as the first value is available from the preceding computation. Thus, instead of waiting for the entire vector to be processed, the follow-on computation can be largely overlapped with the preceding computation that it is dependent on. Sequential computations can be efficiently compounded and behave as if they were a single operation with a total latency equal to the latency of the first operation plus the pipeline and chaining latencies of the remaining operations, but with none of the start-up overhead that would be incurred without chaining. As with the array processor, the programmer's reference point for a vector machine is the high-level language. In most cases, the programmer sees a traditional SISD machine; however, as vector machines excel on vectorizable loops, the programmer can often improve the performance of the application by carefully coding the application, in some cases explicitly writing the code to perform register memory overlap and chaining, and by providing hints to the compiler that help to locate the vectorizable sections of the code. As languages are defined (such as Fortran or High-Performance Fortran) that make vectors a fundamental data type, the programmer is exposed less to the details of the machine and to its SIMD nature.
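As a simple illustration of the kind of loop this discussion has in mind, the C fragment below is a classic vectorizable kernel. The comments sketch, in generic terms rather than for any specific machine, how a vector processor could execute it in register-length segments with chaining.

/* A classic vectorizable loop (an "axpy" kernel).  A vector processor would
 * typically execute it in register-length segments: vector loads of x and y,
 * a vector multiply by the scalar a, and a vector add, with chaining letting
 * the add start as soon as the first product is available, followed by a
 * vector store of y. */
void axpy(long n, double a, const double *x, double *y)
{
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}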
MISD: Multiple Instruction, Single Data Stream
The MISD architecture is a pipelined ensemble where the functions of the individual stages are determined by the programmer before execution is begun; data is then streamed through the pipeline, forwarding results from one (stage) function unit to the next. At the micro-architecture level this is exactly what the vector processor does. However, in the vector pipeline
the programmer initially organizes the data into a suitable data structure, then streams the data elements into stages where the operations are simply fragments of an assembly-level operation, as distinct from being a complete operation. Surprisingly, some of the earliest attempts in the s could be seen as MISD computers (they were actually programmable calculators, as they were not "stored program machines" in the sense that the term is now used). They used plug boards for programs, where data in the form of a punched card was introduced into the first stage of a multistage processor. A sequential series of actions was taken where the intermediate results were forwarded from stage to stage until, at the final stage, a result would be punched into a new card. While the MISD term is not commonly used, there are interesting uses of the MISD organization. A frequently used synonym for MISD is "streaming processor." Some of these processors are commonly available. One example is the GPU (graphics processing unit); early and simpler GPUs provided limited programmer flexibility in selecting the function or operation of a particular stage (ray tracing, shading, etc.). More modern GPGPUs (general-purpose GPUs) are more truly MISD, with a more complete operational set at the individual stages in the pipeline. Indeed, many modern GPGPUs incorporate both MISD and MIMD, as these systems offer multiple MISD cores. Other MISD-style architectures include dataflow machines. In these machines, the source program is converted into a dataflow graph, each node of which is a required operation. Data is then streamed across the implemented graph. Each path through the graph is a MISD implementation. If the graph is relatively static for particular data sequences, then each path is strictly MISD; if multiple paths are invoked during program execution, then each path is MISD and there are MIMD instances in the implementation. Such dataflow machines have been realized by FPGA implementations.
MIMD: Multiple Instruction, Multiple Data Stream
The MIMD class of parallel architecture consists of multiple processors (of any type) and some form of interconnection. From the programmer's point of view, each processor executes independently but must cooperate to solve a single problem, although some
Flynn’s Taxonomy
form of synchronization is required to pass information and data between processors. Although no requirement exists that all processor elements be identical, most MIMD configurations are homogeneous with all processor elements identical. There have been heterogeneous MIMD configurations that use different kinds of processor elements to perform different kinds of tasks, but these configurations are usually for special-purpose applications. From a hardware perspective, there are two types of homogeneous MIMD: multi-threaded and multi-core processors; frequently implementations use both.
Multi-Threaded Processors
In multi-threaded MIMD, a single base processor is extended to include multiple sets of program and data registers. In this configuration, separate threads or programs (instruction streams) occupy each register set. As resources are available, the threads continue execution. Since the threads are independent, so is their resource usage. Ideally, the multiple threads make better use of resources such as floating point units and memory ports, resulting in a higher number of instructions executed per cycle. Multi-threading is a quite useful complement to large superscalar implementations, as these have a relatively extensive set of expensive floating point and memory resources. Critical threads (tasks) can be given priority to ensure short execution while less critical tasks make use of the underused resources. As a practical matter, these designs must ensure a minimum service for every task to avoid a type of task lockout. Otherwise, the implementation is straightforward, since all tasks can share data caches and memory without any interconnection network.
Multi-Core and Multiple Multi-Core Processor Systems
At the other end of the MIMD spectrum are multiple multi-core processors. These implementations must communicate results through an interconnection network and coordinate task control. Their implementation is significantly more complex than that of the simpler multi-threaded MIMD or even the more complex SIMD array processor. The interconnection network in the multiprocessor passes data between processor elements and synchronizes the independent execution streams between processor elements. When the memory of the
processor is distributed across all processors and only the local processor element has access to it, all data sharing is performed explicitly using messages and all synchronization is handled within the message system. When communications between processor elements are performed through a shared memory address space – either global or distributed between processor elements (called distributed shared memory to distinguish it from distributed memory) – there are two significant problems that arise. The first is maintaining memory consistency: the programmer-visible ordering effects of memory references, both within a processor element and between different processor elements. This problem is usually solved through a combination of hardware and software techniques. The second is cache coherency – the programmer-invisible mechanism that ensures that all processor elements see the same value for a given memory location. This problem is usually solved exclusively through hardware techniques. The primary characteristic of a large MIMD multiprocessor system is the nature of the memory address space. If each processor element has its own address space (distributed memory), the only means of communication between processor elements is through message passing. If the address space is shared (shared memory), communication is through the memory system. The implementation of a distributed memory machine is easier than the implementation of a shared memory machine, when memory consistency and cache coherency are taken into account, while programming a distributed memory processor can be more difficult. In recent times, large-scale (more than , cores) multiprocessors have used the distributed memory model based on implementation requirements.
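One common concrete realization of the distributed-memory, message-passing style is MPI. The minimal example below (assuming an MPI installation and at least two ranks) shows data sharing performed explicitly through messages rather than through a shared address space.

/* Minimal distributed-memory MIMD example: each processor element runs its
 * own instruction stream, and all data sharing is done with explicit
 * messages (here using MPI).  Run with at least two ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* data owned by rank 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);        /* explicit data sharing */
    }

    MPI_Finalize();
    return 0;
}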
Related Entries
Cray Vector Computers
Data Flow Computer Architecture
EPIC Processors
Fujitsu Vector Computers
IBM Blue Gene Supercomputer
Illiac IV
Superscalar Processors
VLIW Processors
Bibliographic Notes and Further Reading
Parallel processors have been implemented in one form or another almost as soon as the first uniprocessors. Most early implementations were straightforward MIMD multiprocessors. With the advent of SIMD array processors, such as Illiac IV [], the need for a distinction became clear. Because the taxonomy is simple, numerous extensions have been proposed; some fairly general, such as those of Handler [], Kuck [], and El-Rewini []. Other extensions were of a more specialized nature, such as Göhringer [] and Chemij []. Additional comments on the taxonomy and elaborations can be found in the standard texts on parallel architecture such as Hwang [], Hockney [], and El-Rewini []. Xavier [] and Quinn [] are helpful texts which view the taxonomy from the programmer's viewpoint.
Bibliography
1. Flynn MJ () Very high speed computers. Proc IEEE :–
2. Flynn MJ () Some computer organizations and their effectiveness. IEEE Trans Comput ():–
3. Flynn MJ, Rudd RW () Parallel architectures. ACM Comput Surv ():–
4. Barnes GH, Brown RM, Kato M, Kuck D, Slotnick DL, Stokes RQ () The ILLIAC IV computer. IEEE Trans Comput C-:–
5. Handler W () Innovative computer architecture – how to increase parallelism but not complexity. In: Evans DJ (ed) Parallel processing systems, an advanced course. Cambridge University Press, Cambridge, pp –
6. Kuck D () The structure of computers and computation, vol . Wiley, New York
7. El-Rewini H, Abd-El-Barr M () Advanced computer architecture and parallel processing. Wiley, New York
8. Göhringer D et al () A taxonomy of reconfigurable single-/multiprocessor systems-on-chip. Int J Reconfigurable Comput
9. Chemij W () Parallel computer taxonomy. MPhil thesis, Aberystwyth University
10. Hwang K, Briggs FA () Computer architecture and parallel processing. McGraw Hill, London, pp –
11. Hockney RW, Jesshope CR () Parallel computers . Adam Hilger/IOP Publishing, Bristol
12. Xavier C, Iyengar SS () Introduction to parallel algorithms. Wiley, New York
13. Quinn MJ () Designing efficient algorithms for parallel computers. McGraw Hill, London
Forall Loops
See Loops, Parallel
FORGE
John M. Levesque, Gene Wagenbreth
Cray Inc., Knoxville, TN, USA; University of Southern California, Topanga, CA, USA
Synonyms Interactive parallelization; Parallelization; Vectorization; Whole program analysis
Definition FORGE is an interactive program analysis package built on top of a database representation of an application. FORGE was the first commercial package that addressed the issue of performing interprocedural analysis for Program Consistency, Shared and Distributed Parallelization, and Vectorization. FORGE was developed by a team of computer scientists employed by Pacific Sierra Research and later Applied Parallel Research.
Discussion
Introduction
When FORGE was first marketed, whole program parallelization was not quick, easy, or convenient. Vectorization, by contrast, was a mature art for both users and compilers: many users could write vectorizable loops that a compiler could translate into efficient vector instructions. Distributed and shared memory multiprocessors could be treated as vector processors and parallelized in the same manner, but this is frequently very inefficient, because parallelization introduces much more overhead than vectorization. To overcome this overhead, parallelization requires a much higher granularity than vectorization, and parallelization of outer loops containing nested subroutine calls is usually the best approach. Achieving this high granularity was very difficult for a user to do by hand, and practically impossible for a compiler to accomplish automatically. FORGE was created to bridge
this gap, allowing the user to interactively initiate compiler analysis and view the results. The user controlled the parallelization process using high-level program knowledge, while FORGE performed the time-consuming, tedious analysis that computers are better at than programmers. At the time, a group at Rice University under the direction of Ken Kennedy was developing whole-program analysis software called ParaScope. At the same time, a few of the developers of the VAST vectorizing pre-compiler began a project to develop FORGE, based on a database representation of the input application. In addition to the source code, runtime statistics could be entered into the database to identify important features derived during execution of the application. On top of this database representation of the program, an interactive, window-based environment was built that allowed the user to:
1. Peruse the database to find inconsistencies in the program
   a. COMMON block analysis
   b. Tracing of variables through the call chain, showing aliases from subroutine calls, equivalences, and COMMON blocks
   c. Identification of variables that are used and not set
   d. Identification of variables that are set and not used
   e. Queries to identify variables that satisfy certain user-defined templates
2. Interactively vectorize the application
   a. Perform loop restructuring
      i. Loop splitting (see the sketch following this list)
      ii. Routine inlining: pull a routine into a loop
      iii. Routine outlining: push a loop into a routine
      iv. Loop inversion
3. Interactive shared memory parallelization
   a. Array scoping
   b. Storage conflicts due to COMMON blocks
   c. Identification of reduction variables
   d. Identification of ordered regions due to data dependencies
4. Interactive distributed memory parallelization
   a. User-directed loop parallelization
   b. User-directed array distribution
   c. Automatic loop parallelization based on array distribution
   d. Automatic array distribution based on parallel loop access
   e. Iteration of c. and d. for whole program parallelization
   f. Generation of HPF
5. Interactive program restructuring
   a. Loop inversion
   b. Loop splitting
   c. Array index reordering
   d. Routine inlining
   e. Routine outlining
   f. TASK COMMON generation
6. Interactive Cache View tool
   a. Given the hardware's Level 1 and Level 2 cache sizes and runtime statistics:
      i. Instrument the code
      ii. Identify cache alignment issues
      iii. Identify cache overflows
      iv. Propose blocking strategies
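To make one of these transformations concrete, the fragment below is a hedged sketch (not actual FORGE output; the subroutine and variable names are invented) of the loop-splitting restructuring listed under item 2.a.i: a loop that mixes a fully vectorizable statement with a recurrence is split so that each part can be treated separately.

      SUBROUTINE SPLITEX(A, B, C, D, N)
C     Illustrative only: loop splitting (fission) as a user-driven
C     restructuring.
      INTEGER N, I
      REAL A(N), B(N), C(N), D(0:N)
C
C     Before splitting:
C        DO 10 I = 1, N
C           A(I) = B(I) + C(I)
C           D(I) = D(I-1) + A(I)
C     10 CONTINUE
C
C     After splitting, the first loop vectorizes (or parallelizes)
C     freely, while the recurrence is isolated in the second loop:
      DO 20 I = 1, N
         A(I) = B(I) + C(I)
   20 CONTINUE
      DO 30 I = 1, N
         D(I) = D(I-1) + A(I)
   30 CONTINUE
      END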
General Structure
FORGE was written in the C programming language. The parser was constructed using the compiler-compiler tools LEX and YACC. The windowing system was originally a custom interface written for SUN, APOLLO, and DEC workstations; later a MOTIF-based X Window System interface was used.
The Database
To perform the aforementioned functions, FORGE needed significant information concerning the input application. At the end of the development of FORGE, the database could take as input the application source code (Fortran only) and runtime statistics. From the source code, the following data was extracted:
1. For each routine:
   a. Variable information
      i. Aliasing from COMMON, EQUIVALENCE, and SUBROUTINE arguments
      ii. Read and write references
      iii. Data type and size
      iv. Array dimensions
      v. Array indexing information
   b. Control structures within the routine (IFs and DOs)
2. Call chain information
   a. Static call chain: all possible calling sequences
   b. Dynamic call chain: derived from the runtime statistics
3. Variable aliasing through the call chain, both static and dynamic
4. Basic block (control flow) information for the entire application
   a. All IF constructs controlling the execution of the basic block
   b. All reaching definitions for variables used in the basic block
   c. All uses of variables set in the basic block
5. Array section analysis
   a. Information on every looping structure
      i. The section of each array that is accessed by each DO loop
6. Integer value information (FACTS) for array index and data dependency analysis
Database Design
The FORGE database was designed and implemented by the developers; an off-the-shelf database was not used. The database was generated incrementally, with the ability to update it automatically when the source code was modified. The intent was to spend time up front building the database so that interactive analysis within the FORGE environment could be performed quickly. FORGE stored a program as a Package. Each package contained a list of the source files in a program, along with other information such as directories to search, compiler options to use, and target hardware. Brute-force whole-program analysis was too time consuming on the hardware available at the time to allow convenient interactive program analysis. For that reason, a database was created containing only the data needed for parallelization of a program. Each program unit (subroutine, function, main program) was analyzed and stored separately in the database, on disk. Last-modified dates were saved for each file, and a checksum was saved for each program unit in a file. When the modification date for a file indicated that the database was out of date, the file was rescanned and the checksums for each routine compared to the checksums saved with the database; the database was regenerated only for modified routines. Building the initial database for a large program could take an hour. After the initial creation, updating the database for source file changes usually took a few seconds. When the user selected a package for analysis, tables were built in memory combining the databases for called and calling routines. This process typically took
a few seconds. The efficiency of the incremental database update and the in-memory handling of called routines were the key features that made interactive whole-program analysis feasible. The database was designed to supply information quickly to the vectorizer and parallelizer when performing array section and scalar dependency analysis. All of the analysis features used by the parallelization analysis were also made available to the user for program browsing; this only required the implementation of menus, since all the analysis was already implemented. The browser ended up as one of the most popular and powerful features of FORGE. Users could develop templates to describe routines that were called from the application but whose source was not available. These templates simply supplied information about the variables that were passed into the routine, either through the argument list or through COMMON blocks.
Runtime Statistics
FORGE had an option to instrument the application to gather information about important routines, DO loops, IF constructs, and cache utilization. The instrumented program could then be run with an important dataset, and an ASCII output would be generated that could be fed into the database to assist in performing analysis. There was an option to ignore static information in favor of the dynamic information when performing the aforementioned analysis. The following runtime statistics were gathered:
– CPU time of each routine, inclusive and exclusive
– Lengths of DO loops
– CPU time of each DO loop, inclusive and exclusive
– Frequency of IF constructs (optional)
– Array addresses (optional)
Interactive Tools Available to the User
Query Database
A popular tool named Query allowed the user to find all variables in a program that matched criteria defined in templates. Templates specified criteria for:
● Variable name
● Variable type
● Variable storage
● Statement context
● Statement type
Many common templates were supplied, or the user could create their own. Typical queries included:
● Show all variables read into the program
● Show all variables written out
● Show all variables used within a particular DO loop
● Show all variables set under a condition and used unconditionally
Cache Vu Tool
The Cache Vu tool, another tool within the FORGE system, was extremely powerful and informative. The runtime statistics needed to support this feature were very large and had to be collected in a controlled manner. FORGE was able to show the user visually which arrays were overlapping in the cache and being evicted from the cache due to cache associativity.
COMMON Block Grid
A grid was displayed with routines down the side and COMMON blocks across the top. When a COMMON block was selected, the variables within that COMMON block were shown across the top, indicating how each variable was referenced in each routine.
USE-DEF and DEF-USE
This was extremely useful in debugging an application (plans to attach FORGE to a debugger never materialized). The user could click on a variable in the source and use a drop-down menu to show its reaching definitions or its subsequent uses. The uses and/or sets were shown within the control structures controlling the execution of the line.
Tracing Variables Through the Call Chain
All occurrences of a variable were shown interspersed within the call chain, including IF constructs. Aliasing was shown where it occurred due to EQUIVALENCE, COMMON blocks, or SUBROUTINE calls.
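As a hedged illustration of the kinds of aliasing such a trace has to expose, consider the small, hypothetical fragment below (the routine and variable names are invented):

      SUBROUTINE UPDATE(X)
C     Hypothetical fragment illustrating three aliasing situations
C     that call-chain tracing must account for:
C     1. If a caller passes the COMMON array A as the actual
C        argument, the dummy argument X and A name the same storage.
C     2. T and TSTART share storage through the EQUIVALENCE.
C     3. Any routine declaring COMMON /BLK/ can read or write A and
C        T without their appearing in any argument list.
      REAL X(100)
      COMMON /BLK/ A(100), T
      REAL TSTART
      EQUIVALENCE (T, TSTART)
      X(1) = TSTART
      END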
Shared Memory Parallelizer
If runtime statistics were available, the window would present the user with the highest-level DO loop that was a potential candidate for parallelization. If the DO loop contained I/O or a call to an unknown routine, the user would be alerted to that. The user would then choose either to:
1. Parallelize automatically, or
2. Parallelize interactively.
The automatic mode would complete without interaction and show the user whether there were any dependency regions, storage conflicts, or other inhibitors. The interactive mode would converse with the user, showing each inhibitor as it was encountered. The user could then instruct the parallelizer to ignore dependencies, if appropriate, and continue. Finally, the user could output the parallelized code in OpenMP, Cray microtasking, or as a parallelized routine to be run with FORGE's shared memory runtime library. The latter was accomplished by outlining the DO loop into a subroutine that was called in parallel: shared variables were passed as arguments and private variables were made local to the outlined routine.
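The sketch below illustrates, under stated assumptions, the two forms of output just described for a trivial loop. The OpenMP directives are standard, but the loop itself and the routine names (ADDV, LOOP1) are invented for the example and do not reproduce FORGE's actual generated code or its runtime-library calls.

C     (a) Directive form: the loop annotated for shared-memory
C         execution with OpenMP.
      SUBROUTINE ADDV(A, B, N)
      INTEGER N, I
      REAL A(N), B(N)
!$OMP PARALLEL DO PRIVATE(I) SHARED(A, B, N)
      DO 10 I = 1, N
         A(I) = A(I) + B(I)
   10 CONTINUE
!$OMP END PARALLEL DO
      END
C
C     (b) Outlined form: the loop body is moved into a subroutine;
C         the shared arrays are passed as arguments and the loop
C         index is local (private).  Each parallel task would run
C         LOOP1 over its own iteration range; the dispatch through
C         FORGE's runtime library is not shown.
      SUBROUTINE LOOP1(A, B, N, ILO, IHI)
      INTEGER N, ILO, IHI, I
      REAL A(N), B(N)
      DO 20 I = ILO, IHI
         A(I) = A(I) + B(I)
   20 CONTINUE
      END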
Distributed Memory Parallelizer
The user could select a set of arrays to decompose and specify the decomposition through a menu; only single-index decomposition was supported. FORGE would then identify every DO loop that accessed the decomposed dimension, and the user could have FORGE distribute the loop iterations automatically using the owner-executes rule. During the loop parallelization, FORGE searched for additional arrays to decompose and loops to parallelize. This cycle of array decomposition and loop distribution continued until no new transformations were found. Parallelization was saved in an internal format between sessions. The user could choose to save the parallelization in a new copy of the source program with HPF directives. Executable parallel code could be generated either with the interactive parallelizer or with a batch translator named XHPF. The translated code contained calls to a runtime library. Storage for distributed arrays was allocated at runtime, as determined by the size of the array and the number of distributed CPUs used. Ghost cells were allocated as needed, parallel loop bounds were computed at runtime, and internode communication was performed before and after each loop as required. Communication was also inserted in serial code as required. One node was used to perform all I/O operations.
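To make this description concrete, the following is a hedged sketch of the kind of HPF-annotated code such a tool could emit; the directives are standard HPF, but the program, arrays, processor arrangement, and loop are invented for the example.

      PROGRAM HPFEX
      INTEGER I
      REAL A(1000), B(1000)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
!HPF$ ALIGN B(J) WITH A(J)
C     Initialize B (in a real code this data would come from
C     elsewhere).
      DO 5 I = 1, 1000
         B(I) = REAL(I)
    5 CONTINUE
C     Single-index (BLOCK) decomposition of A, with B aligned to it.
C     Under the owner-executes rule, each processor performs only
C     the iterations whose left-hand-side elements it owns; the
C     INDEPENDENT directive asserts that the iterations do not
C     interfere with one another.
!HPF$ INDEPENDENT
      DO 10 I = 2, 999
         A(I) = 0.5 * (B(I-1) + B(I+1))
   10 CONTINUE
      END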
Program Restructurer
The parallelized code was available to the user for all of the program transformations executed by FORGE. When the source code was modified by a transformation, the new version would be presented to the user; if the user accepted it, the database would be updated accordingly.
Parallel Program Generation
FORGE used the results of its analysis to generate parallel programs. It performed source-to-source translation, and the input could contain compiler parallelization directives. The directive syntax defined by most major vendors of the time (IBM, NEC, FUJITSU, DEC, INTEL, NCUBE) was supported, as well as the HPF and OpenMP standards. FORGE could take any of these as input, combine them with interactive or automatic analysis, and generate directives. FORGE could also generate a parallel program containing calls to a FORGE runtime library; versions of the library existed for shared and distributed memory parallelization for the major vendors. The user would compile the FORGE output with the vendor-supplied Fortran compiler, link with the appropriate FORGE library, and run the parallel application. Instrumented versions of the FORGE parallel libraries were provided: when linked with the instrumented libraries and run, timing data was created that could be viewed interactively with FORGE, superimposed on a dynamic call chain. Efficiencies and "hotspots" were identifiable and correctable. The batch tools SPF (shared memory parallelizer) and XHPF (HPF translator) performed the same translation functions as FORGE; a user could modify program directives with a conventional editor and run these tools as precompilers as part of the normal compilation stack.
Future Directions
There were three major deficiencies in FORGE. First, it handled only a subset of the Fortran language. Second, it was developed with a Unix X Window System interface, and a convenient native MS Windows interface was needed. Last, the display of parallel-loop inhibitors due to loop-carried dependencies often contained many false
dependencies. Minimizing these false dependencies, and providing a convenient way for the user to view and ignore dependencies, were problems for FORGE, as they were for all other parallelization tools then and now. Work continued in these areas until Applied Parallel Research went out of business.
Bibliography
. Balasundaram V, Kennedy K, Kremer U, McKinley K, Subhlok J () The ParaScope editor: an interactive parallel programming tool. In: Proceedings of the ACM/IEEE conference on Supercomputing. ACM, New York
. Kushner EJ () Automatic parallelization of grid-based applications for the iPSC/. In: Lecture notes in computer science, vol , pp –
. Gernt M () Automatic parallelization of a crystal growth simulation program for distributed-memory systems. In: Lecture notes in computer science, vol , pp –
. Applied Parallel Research. FORGE Baseline system user's guide
. Applied Parallel Research. FORGE High Performance Fortran xHPF user's guide
. Applied Parallel Research. FORGE Distributed memory parallelizer user's guide
. Song ZW, Roose D, Yu CS, Berlamont J () Parallelization of software for coastal hydraulic simulations for distributed memory parallel computers using FORGE. In: Power H (ed) Applications of high-performance computing in engineering IV, pp –
. Frumkin M, Hribar M, Jin H, Waheed A, Yan J () A comparison of automatic parallelization tools/compilers on the SGI Origin. In: Proceedings of the ACM/IEEE SC conference, Orlando. IEEE Computer Society, Washington
. Bergmark D () Optimization and parallelization of a commodity trade model for the IBM SP/, using parallel programming tools. In: Proceedings of the international conference on Supercomputing, Barcelona
. Kennedy K, Koelbel C, Zima H () The rise and fall of High Performance Fortran: an historical object lesson. In: Proceedings of the third ACM SIGPLAN conference on History of Programming Languages, San Diego. ACM, New York
. Saini S () NAS experiences of porting CM Fortran codes to HPF on IBM SP and SGI Power Challenge. NASA Ames Research Center, Moffett Field
Formal Methods–Based Tools for Race, Deadlock, and Other Errors Peter Müller ETH Zurich, Zurich, Switzerland
Definition Formal methods–based tools for parallel programs use mathematical concepts such as formal semantics, formal specifications, or logics to examine the state space of a program. They aim at proving the absence of common concurrency errors such as races, deadlocks, livelocks, or atomicity violations for all possible executions of a program. Common techniques applied by formal methods–based tools include deductive verification, model checking, and static program analysis.
Discussion
Introduction
Concurrent programs are notoriously difficult to test. Their behavior depends not only on the input but also on the scheduling of threads or processes. Many concurrency errors occur only for very specific schedulings, which makes them extremely hard to find and reproduce. An alternative to testing is to apply formal methods. Formal methods use mathematical concepts to examine all possible executions of a program. Therefore, they are able to guarantee certain properties for all inputs and all possible schedulings of a program. Most formal methods–based tools for concurrent programs focus on races and deadlocks, the two most prominent concurrency errors. Some of them also detect other errors such as livelocks, atomicity violations, undesired nondeterminism, and violations of specifications provided by the user. This entry covers tools based on three kinds of formal methods: (1) tools for deductive verification use mathematical logic to construct a correctness proof for the input program; (2) tools for state space exploration (model checkers) check properties of a program by exploring all its executions; and (3) tools for static program analysis approximate the possible dynamic behaviors of a program and then check properties of these approximations. This entry summarizes the
foundations and applications of these approaches, discusses their strengths and weaknesses, and names a few representative tools for each approach. The evaluation of the three approaches focuses on the following criteria: Soundness. A verification or analysis tool is sound if it produces only results that are valid with respect to the semantics of the programming language. In other words, a sound tool does not miss any errors. Soundness is the key property that distinguishes formal methods from testing. However, some formal methods–based tools sacrifice soundness in favor of reducing false positives or improving efficiency. Completeness. A verification or analysis tool is complete if it does not produce false positives, that is, each error detected by the tool can actually occur according to the semantics of the programming language. Minimizing the number of false positives is crucial for the practical applicability of a tool because each reported error needs to be subjected to further investigation, which usually involves human inspection. Modularity. A verification or analysis tool is modular if it can analyze the modules of a software system independently of each other and deduce the correctness of the whole system from the correctness of its components. Modularity allows a tool to analyze program libraries and improves scalability. Automation. A verification or analysis tool has a high degree of automation if it requires little or no user interaction. Typical forms of user interaction include providing the specification to be checked by the tool, assisting the tool through program annotations, and constructing proofs manually. A high degree of automation is necessary to make the application of formal methods-based tools economically feasible. Efficiency. A verification or analysis tool is efficient if it can analyze large programs in a relatively short amount of time and space. In general, the number of states and executions of a program grows exponentially in the size of the program (for instance, the number of variables, branches, and threads). Therefore, tools must be highly efficient in order to handle practical programs.
Deductive Verification
Overview
Tools for deductive verification construct a formal proof that demonstrates that a given program satisfies its specification. Common ways of constructing the correctness proof are (1) to use a suitable calculus such as the weakest precondition calculus to compute verification conditions – logical formulas whose validity entails the correctness of the program – and then prove their validity in a theorem prover, or (2) to use symbolic execution to compute symbolic states for each path through the program and then use a theorem prover to check that the symbolic states satisfy the specification. Both approaches employ a formal semantics of the programming language. The specification to be verified typically includes implicit properties (such as deadlock freedom) as well as properties explicitly specified by the user (for instance through pre- and postconditions). Deductive verifiers for concurrent programs apply various verification methodologies, which are illustrated by the Java example below. Monitor invariants (see entry Monitors, Axiomatic Verification of) allow one to express properties of the monitor variables (here, the fields in class Account) that hold when the monitor is not currently locked by any thread. For instance, one could express that the transaction count is a nonnegative number. Rely-guarantee reasoning [] allows one to reason about the possible interference of other threads. For instance, one could prove that no thread ever decreases the transaction count. Deadlock freedom can be proved by establishing a partial order on locks and proving that each thread acquires locks in ascending order (see entry on Deadlocks). For instance, one could specify that the order on the locks associated with Account objects is given by the less-than order on the accounts' numbers. The check that locks are acquired in ascending order would fail for the example because the locks of this and to are acquired regardless of that order. Therefore, the concurrent execution of x.transfer(y, a) and y.transfer(x, b) might deadlock.

class Account {
  int number;
  int balance;
  int transactionCount;

  void transfer(Account to, int amount) {
    if (this == to) return;
    acquire this;
    acquire to;
    balance = balance - amount;
    to.balance = to.balance + amount;
    transactionCount = transactionCount + 1;
    to.transactionCount = to.transactionCount + 1;
    release to;
    release this;
  }
  // constructors and other methods omitted
}
The absence of data races can be proved using a permission system that maintains an access permission for each shared variable. The permission gets created when the variable comes into existence; it can be transferred among threads and between threads and monitors, but not duplicated. Each shared variable access gives rise to a proof obligation that the current thread possesses the permission for that variable. Since access permissions cannot be duplicated, this scheme prevents data races. In the example above, one could store the permissions for balance and transactionCount in the monitor of their object. By acquiring the monitors for this and to, the current thread obtains those permissions and, therefore, is allowed to access the fields. When the monitors are released, the permissions are transferred back to the monitors. Such a permission system is for instance applied by concurrent separation logic []. An alternative technique for proving the absence of data races is through object ownership []. Ownership schemes associate zero or one owner with each object. An owner can be another object or a thread. A thread becomes the owner of an object by creating it or by locking its monitor. Each access to an object’s field gives rise to a proof obligation that the object is (directly or transitively) owned by the current thread. Since an object has at most one owner, this scheme prevents data races. Besides preventing data races, both permissions and ownership guarantee that the value of a variable does not change while a thread holds its permission or owns the object it belongs to. Therefore, they also give an atomicity guarantee, which allows verifiers to apply sequential reasoning techniques in order to prove additional program properties.
Discussion Program verifiers are sound unless there is a flaw in the underlying logic or the implementation of the verifier. In principle, program verification is complete if the underlying logic is complete with respect to the semantics of the programming language. However, program verifiers typically enforce some kind of verification methodology as described above. These methodologies simplify the verification of certain programs, but fail for others. For instance, a tool that requires lock ordering for deadlock prevention might not be able to handle programs that prevent deadlocks by other strategies. Another source of incompleteness are the limitations of theorem provers. Since the validity of verification conditions is in general undecidable, program verifiers based on automatic provers such as SMT solvers report spurious errors when the prover cannot confirm the validity of a verification condition. A number of program verifiers support modular reasoning. For instance, a permission system removes the necessity to inspect the whole program in order to detect data races. Nevertheless, modular verification is still challenging for certain program properties such as global invariants and liveness properties. Deductive program verifiers are typically not fully automatic. Users have to provide the specification the program is verified against. Even if the property to be proved is the absence of a simple error such as data races, users typically have to annotate the code with auxiliary assertions such as loop invariants. Moreover, interactive verifiers require users to guide the proof search. The efficiency of automatic verifiers (those verifiers where the only user interaction is by writing annotations) depends on the number and complexity of the generated verification conditions as well as on the performance of the underlying automatic theorem prover. The former aspect is positively influenced by applying a modular verification methodology, the latter benefits from the enormous progress in the area of automatic theorem proving. In summary, since program verifiers are sound and since modular verifiers scale well, they are well suited for software projects with the highest quality requirements, especially safety-critical software. However, the overhead of writing specifications as well as limitations of existing verification methodologies and automatic
theorem provers have so far prevented program verification from being applied routinely in mainstream software development.
Examples Most verifiers for concurrent programs are in the stage of research prototypes. Microsoft’s VCC is an automatic verifier for concurrent, low-level C code. It has been used to verify the virtualization layer Hyper-V. VCC applies a verification methodology based on ownership []. Permission systems are used by verifiers based on separation logic such as VeriFast for C programs [], SmallfootRG [], and Heap-Hop [], as well as by Chalice []. Both Heap-Hop and Chalice support shared variable concurrency and message passing. Lock-free data structures have been verified using the interactive verifier KIV [].
State Space Exploration Overview Software model checkers consider all possible states that may occur during the execution of a program. They check properties of individual states or sequences of states (traces) through an exhaustive search for a counterexample. This search is either performed by enumerating all possible states of a program execution (explicit state model checking) or by representing states through formulas in propositional logic (symbolic model checking). In both approaches, programs are regarded as transition systems whose transitions are described by a formal semantics of the programming language. The specifications to be checked are given as formulas in propositional logic for individual states (assertions) or as formulas in a temporal logic (typically LTL or CTL) for traces. The use of temporal logic allows model checkers in particular to check liveness properties such as “every request will eventually be handled.” Liveness properties are important for the correctness of all parallel programs, but especially for those based on message passing (such as MPI programs) because they often apply sophisticated protocols. Other errors such as deadlocks can be detected without giving an explicit specification. For instance for the account example above, a model checker detects that for some input values and thread interleavings, method transfer does not reach a valid terminal
state. The inputs and interleavings are then reported as a counterexample that can be inspected and replayed by the user. The main limitation of software model checking is the so-called state space explosion problem. The state space of a program execution grows exponentially in the number of variables and the number of threads. Therefore, even small programs have a very large state space and require elaborate reduction techniques to be efficiently checkable. Partial order reduction reduces the number of relevant interleavings by identifying operations that are independent of actions of other threads (for instance, local variable access). Abstraction is used to simplify the transition system before checking it. It is common that many aspects of a program execution are not relevant for a given property to be checked. For instance, the values of amount and balance in the transfer example are not relevant for the existence of a deadlock. Therefore, they can be abstracted away, which reduces the state space significantly. Yet another approach to reduce the state space is to limit the search to program executions with a small number of data objects, threads, or thread preemptions. Such limits compromise soundness, but practical experience shows that many concurrency errors may be detected with a small number of threads and preemptions. For instance, detecting the deadlock in method transfer requires only two Account objects, two threads, and one preemption.
Discussion
For finite transition systems (i.e., programs with a finite number of states, such as programs with an upper bound on the number of threads, the number of dynamically allocated data objects, and the recursion depth, in particular, terminating programs), model checking is sound since all possible states and traces can be checked. However, some model checkers explore only a subset of the possible states of a program execution based on heuristics that are likely to detect errors; this approach trades efficiency for soundness. Model checking for finite transition systems is in principle complete, although abstraction may lead to spurious errors. Counterexample-guided abstraction refinement [] is a technique to recover completeness by using the counterexample traces of spurious errors to iteratively refine the abstraction. In contrast to deductive verification, model checking is not limited by the expressiveness of a verification methodology. Model checkers are thus capable of analyzing arbitrary programs, including for instance optimistic concurrency, benevolent data races, and interlocked operations. In practice, software model checking can be inconclusive because the state space is too large, causing the model checker to abort. Although there is work on compositional model checking, most model checkers are not modular because it is difficult to summarize temporal properties at module boundaries. Model checking, including abstraction, is fully automatic. Users need to provide only the property to be checked. When the checking is not feasible with the available resources, users also need to manually simplify the program or provide upper limits on the number of objects, threads, etc. Due to the large state space of programs (in particular of parallel programs and programs that manipulate large data structures) and the non-modularity of the approach, model checking is typically too slow to be applied routinely during software development (in contrast to type checking and similar static analyses). However, for control-intensive programs with rather small data structures such as device drivers or embedded software, model checking is an efficient and very effective verification technique.
Examples Even though most software model checkers focus on sequential programs, there are a number of tools available that check concurrency errors. JavaPathFinder [] is an explicit state model checker for concurrent Java programs that has been applied in industry and academia. It detects data races and deadlocks as well as violations of user-written assertions and LTL properties. MAGIC [] checks whether a labeled transition system is a safe abstraction of a C procedure. The C procedure may invoke other procedures which are themselves specified in terms of state machines, which enables compositional checking. MAGIC uses counterexampleguided abstraction refinement to reduce the state space of the system.
BLAST [] finds data races in C programs. It combines counterexample-guided abstraction refinement and assume-guarantee reasoning to infer a suitable model for thread contexts, which results in a low rate of false positives. The Zing tool [] checks for data races, deadlocks, API usage rules, and other safety properties. Its distinctive features are the use of procedure summaries to make the model checking compositional and the use of Lipton's theory of reduction [] to prune irrelevant thread interleavings. CHESS [] is a tool for systematically enumerating thread interleavings in order to find errors such as data races, deadlocks, and livelocks in Win DLLs and managed .NET code. The analysis is complete, that is, every error found by CHESS is possible in an execution of the program. CHESS uses preemption bounding, that is, it limits the number of thread preemptions per program execution to reduce the state space. Modulo this preemption bound, CHESS is sound; all remaining errors require more preemptions than the bound. Soundness is achieved by capturing all sources of non-determinism, including memory model effects. Model checkers for message-passing concurrency include for instance MPI-Spin [], which checks Promela models of MPI programs, and ISP [], which operates directly on the MPI/C source code. Both tools detect deadlocks and assertion violations; MPI-Spin also checks liveness properties.
Static Program Analysis Overview Static analyzers compute static approximations of the possible behaviors of a program and then check properties of these approximations. Common forms of static analysis include, among others, data flow analysis and type systems. Both techniques are typically described formally as sets of inference rules that allow the analyzer to infer properties of execution states, for instance, the set of possible values of a variable. When annotations are given by the programmer, these rules are also used to check the annotations. Both data flow analysis and type systems have been applied extensively to detect concurrency errors.
Static data flow analyses typically apply the lockset algorithm to detect races. They compute the set of locks each thread holds when accessing any given variable. If, for each variable, the locksets of all accessing threads share at least one common lock (i.e., are not disjoint), then there is mutual exclusion on the accesses to that variable (see entry on Race Detection Techniques for details). In the example above, the static analysis would render the field accesses safe if it can show that for all accesses to x.balance and x.transactionCount, the executing thread holds at least the lock associated with the object x. Static analyses detect atomicity violations using Lipton's theory []. Via a classification of operations into right and left movers, an analysis can show that a given code block is atomic, that is, any execution of the block is equivalent to an execution without interference from other threads. For this purpose, the analysis must show that variable accesses are both left and right movers, which amounts to showing the absence of data races. In the example, the conditional statement does not access any shared variable and is, thus, a both-mover. The acquire statements are right movers, the release statements are left movers, and the field accesses are both-movers provided that there are no data races. Consequently, the method is atomic. Deadlocks can be detected by computing a static lock-order graph and checking this graph to be acyclic. A deadlock analysis would complain about the transfer example if it cannot rule out that two threads execute transfer concurrently with reversed arguments. Most data flow analyses infer the information needed to check for concurrency errors. This also involves the computation of a whole range of additional program properties, which are required for these checks. For instance, the analyses require alias information in order to decide whether two expressions refer to the same object at run time. A may-alias analysis is for instance used for sound deadlock checking (to check whether two threads potentially compete for the same lock), whereas a must-alias analysis is required for sound race checking (to check whether two locksets definitely overlap). Another common auxiliary analysis is thread-escape analysis, which distinguishes thread-local variables from shared variables; this is for instance useful to suppress warnings when accesses to thread-local variables are not guarded by any lock.
Discussion
Static analyses are in principle sound. However, many existing tools sacrifice soundness in favor of efficiency. For instance, several data race analyses omit a must-alias analysis, which is necessarily flow-sensitive, and, therefore, cannot check soundly whether two locksets overlap. Static analyses are by nature incomplete since they over-approximate the possible executions of a program. Another source of incompleteness is that static analyses – much like deductive verification – check whether programs comply with a given methodology, such as lock-based synchronization or static lock ordering. Programs that do not comply might lead to false positives. Whether or not a static analysis is modular depends mostly on the existence of program annotations. Most data flow analyses are whole-program analyses, but there exist a number of analyses that compute procedure summaries to enable modular checking. By contrast, type systems typically require annotations, which are then checked modularly. Aside from annotations that may be required from programmers, static analyses are fully automatic. However, they often produce a large number of false positives, which need to be inspected manually or filtered aggressively (which may compromise soundness). Static analyses are extremely efficient. Many static analyzers have been applied to programs consisting of hundreds of thousands or even millions of lines of code. This scale is orders of magnitude larger than for deductive verification and model checking. In summary, static analysis tools typically favor efficiency over soundness and completeness. This choice makes them ideal for finding errors in large applications, whereas deductive verification and model checking are more appropriate when strong correctness guarantees are needed.
Examples
Warlock [], RacerX [], and Relay [] use the lockset algorithm to detect data races in C programs. RacerX also computes a static lock-order graph to detect deadlocks. It uses various heuristics to improve the analysis results, for instance, to identify benign races. Relay is a modular analysis that limits the sources of unsoundness to four very specific sources, one of them being the syntactic filtering of warnings to reduce
the number of false positives. RacerX and Relay have been applied to millions of lines of code such as the Linux kernel. Chord [] is a context-sensitive, flow-insensitive race checker for Java that supports three common synchronization idioms: lexically scoped lock-based synchronization, fork/join, and wait/notify. Since Chord does not perform a must-alias analysis, it is not sound. Chord has been applied to hundreds of thousands of lines of code. rccjava [] uses a type system to check for data races in Java programs. The annotations required from the programmer include declarations of thread-local variables and of the locks that guard accesses to a shared variable. The type system has been extended to atomicity checking []. rccJava scales to hundreds of thousands of lines of code.
Related Entries
Deadlocks
Determinacy
Intel Parallel Inspector
Monitors, Axiomatic Verification of
Race Conditions
Race Detectors for Cilk and Cilk++ Programs
Race Detection Techniques
Owicki-Gries Method of Axiomatic Verification
Bibliographic Notes and Further Reading
This entry focuses entirely on tools that detect concurrency errors present in programs. Two related areas are (1) tools that prevent concurrency errors during the construction of the program and (2) tools that detect concurrency errors in other artifacts. Tools of the first kind support correctness by construction. Programs are typically constructed through a stepwise refinement of some high-level specification to executable code. Since the code is derived systematically from a formal specification, errors including concurrency errors are prevented. Interested readers may start from Abrial's book []. Tools of the second kind do not operate on programs but on other representations of software and hardware. Races and deadlocks may occur for instance also in workflow graphs; they can be detected by the
approaches described here, but also through Petri-net and graph algorithms []. Similarly, hardware can be verified not to contain concurrency errors such as deadlock [], an area in which model checking has been applied with great success.
Acknowledgments Thanks to Felix Klaedtke and Christoph Wintersteiger for their helpful comments on a draft of this entry.
Bibliography . Abrial J-R () Modeling in Event-B. Cambridge University Press, Cambridge . Andrews T, Qadeer S, Rajamani SK, Xie Y () Zing: exploiting program structure for model checking concurrent software. In: Gardner P, Yoshida N (eds) Concurrency Theory (CONCUR). Lecture Notes in Computer Science, vol . Springer, Berlin, pp – . Bryant R, Kukula J () Formal methods for functional verification. In: Kuehlmann A (ed) The best of ICCAD: years of excellence in computer aided design. Kluwer, Norwell, pp – . Calcagno C, Parkinson MJ, Vafeiadis V () Modular safety checking for fine-grained concurrency. In: Nielson HR, Filé G (eds) Static analysis (SAS). Lecture Notes in Computer Science, vol . Springer, Berlin, pp – . Chaki S, Clarke EM, Groce A, Jha S, Veith H () Modular verification of software components in C. IEEE Trans. Softw Eng ():– . Clarke EM, Grumberg O, Jha S, Lu Y, Veith H () Counterexample-guided abstraction refinement for symbolic model checking. J ACM ():– . Cohen E, Dahlweid M, Hillebrand M, Leinenbach D, Moskal M, Santen T, Schulte W, Tobies S () VCC: a practical system for verifying concurrent C. In: Berghofer S, Nipkow T, Urban C, Wenzel M (eds) Theorem proving in higher order logics (TPHOLs ). Lecture Notes in Computer Science, vol . Springer, Berlin, pp – . Engler DR, Ashcraft K () RacerX: effective, static detection of race conditions and deadlocks. In: Scott ML, Peterson LL (eds) Symposium on operating systems principles (SOSP). ACM, New York, pp – . Flanagan C, Freund SN () Type-based race detection for Java. In: Proceedings of the ACM conference on programming language design and implementation (PLDI). ACM, New York, pp – . Flanagan C, Freund SN, Lifshin M, Qadeer S () Types for atomicity: static checking and inference for Java. ACM Trans Program Lang Syst ():– . Henzinger TA, Jhala R, Majumdar R () Race checking by context inference. In: Pugh W, Chambers C (eds) Programming Language Design and Implementation (PLDI). ACM, New York, pp –
. Jacobs B, Leino KRM, Piessens F, Schulte W, Smans J () A programming model for concurrent object-oriented programs. ACM Trans Program Lang Syst ():– . Jacobs B, Piessens F () The VeriFast program verifier. Technical Report CW-, Department of computer science, Katholieke Universiteit Leuven . Jones CB () Specification and design of (parallel) programs. In: Proceedings of IFIP’. North-Holland, pp – . Leino KRM, Müller P, Smans J () Verification of concurrent programs with Chalice. In: Aldini A, Barthe G, Gorrieri R (eds) Foundations of security analysis and design V. Lecture Notes in Computer Science, vol . Springer, Berlin, pp – . Lipton RJ () Reduction: a method of proving properties of parallel programs. Commun ACM ():– . Musuvathi M, Qadeer S, Ball T, Basler G, Nainar PA, Neamtiu I () Finding and reproducing heisenbugs in concurrent programs. In: Draves R, van Renesse R (eds) Operating systems design and implementation (OSDI). USENIX Association, pp – . Naik M, Aiken A, Whaley J () Effective static race detection for Java. In: Schwartzbach MI, Ball T (eds) Programming language design and implementation (PLDI), ACM, pp – . O’Hearn PW () Resources, concurrency, and local reasoning. Theor Comput Sci (–):– . Siegel SF () Model checking nonblocking MPI programs. In: Cook B, Podelski A (eds) Verification, model checking, and abstract interpretation (VMCAI). Lecture Notes in Computer Science, vol , pp – . Sterling N () Warlock – a static data race analysis tool. In: USENIX Winter, pp – . Tofan B, Bäumler S, Schellhorn G, Reif W () Verifying linearizability and lock-freedom with temporal logic. Technical report, Fakultät für Angewandte Informatik der Universität Augsburg, . van der Aalst WMP, Hirnschall A, Verbeek HMWE () An alternative way to analyze workflow graphs. In: Pidduck AB, Mylopoulos J, Woo CC, Özsu MT (eds) Advanced information systems engineering (CAiSE). Lecture Notes in Computer Science, vol . Springer, pp – . Villard J, Lozes É, Calcagno C () Tracking heaps that hop with Heap-Hop. In: Esparza J, Majumdar R (eds) Tools and algorithms for the construction and analysis of systems (TACAS). Lecture Notes in Computer Science, vol . Springer, – . Visser W, Havelund K, Brat GP, Park S, Lerda F () Model checking programs. Autom Softw Eng ():– . Vo A, Vakkalanka S, DeLisi M, Gopalakrishnan G, Kirby RM, Thakur R () Formal verification of practical MPI programs. In: Principles and practice of parallel programming (PPoPP). ACM, pp – . Voung JW, Jhala R, Lerner S () Relay: static race detection on millions of lines of code. In: Crnkovic I, Bertolino A (eds) European software engineering conference and foundations of software engineering (ESEC/FSE). ACM, pp –
Fortran and Its Successors Michael Metcalf Berlin, Germany
Definition
Fortran is a high-level programming language that is widely used in scientific programming in the physical sciences, engineering, and mathematics. The first version was developed by John Backus at IBM in the 1950s. Following an initial, rapid spread in its use, the language was standardized, and it has since passed through many further standards, up to the most recent revision. The language is a procedural, imperative, compiled language with a syntax well suited to a direct representation of mathematical formulae. Individual procedures may either be grouped into modules or compiled separately, allowing the convenient construction of very large programs and of subroutine libraries. Fortran contains features for array processing, abstract and polymorphic data types, and dynamic data structures. Compilers typically generate very efficient object code, allowing an optimal use of computing resources. Modern versions of the language provide extensive facilities for object-oriented and parallel programming. Earlier versions have relied on auxiliary standards to facilitate parallel programming.
Discussion
The Development of Fortran
Historical Fortran
Backus, at the end of , began the development of the Fortran programming language (the name being an acronym derived from FORmula TRANslation), with the objective of producing a compiler that would generate efficient object code. The first version, known as FORTRAN I, contained early forms of constructs that have survived to the present day: simple and subscripted variables, the assignment statement, a DO-loop, mixed-mode arithmetic, and input/output (I/O) specifications. Many novel compiling techniques had to be developed. Based on the experience with FORTRAN I, it was decided to introduce a new version, FORTRAN II, in . The crucial differences between the two were the
introduction of subprograms, with their associated concept of shared data areas, and separate compilation. In , FORTRAN IV was released. It contained type statements, the logical-IF statement, and the possibility to pass procedure names as arguments. In , a standard for FORTRAN, based on FORTRAN IV, was published. This was the first programming language to achieve recognition as an American and subsequently international standard, and is now known as FORTRAN . The permissiveness of the standards led to a proliferation of dialects and a new revision was published in , becoming known as FORTRAN . It was adopted by the International Standards Organization (ISO) shortly afterwards. The new standard brought with it many new features, for instance the IF... THEN... ELSE construct, a character data type, and enhanced I/O facilities. Although only slowly adopted, this standard eventually gave Fortran a new lease of life that allowed it to maintain its position as the most widely used scientific programming language.
Modern Fortran
Fortran
Fortran’s strength had always been in the area of numerical, scientific, engineering, and technical applications. In order that it be brought properly up to date, an American subcommittee, working as a development body for the ISO committee ISO/IEC JTC/SC/WG (or simply WG), prepared a further standard, now known as Fortran . It was developed in the heyday of the supercomputer, when large-scale computations using vectorized codes were being carried out on machines like the Cray and CDC Cyber , and the experience gained in this area was brought into the standardization process by representatives of computer hardware and software vendors, users, and academia. Thus, one of the main features of Fortran was the array language, built on whole array operations and assignments, array sections, intrinsic procedures for arrays, and dynamic storage. It was designed with optimization in mind. Another was the notion of abstract data types, built on modules and module procedures, derived data types (structures), operator overloading and generic interfaces, together with pointers. Also important were new facilities for
numerical computation including a set of numeric inquiry functions, the parameterization of the intrinsic types, new control constructs (SELECT CASE and new forms of DO), internal and recursive procedures, optional and keyword arguments, improved I/O facilities, and new intrinsic procedures. The resulting language was a far more powerful tool than its predecessor and a safer and more reliable one too. Subsequent experience showed that Fortran compilers detected errors far more frequently than previously, resulting in a faster development cycle. The array syntax in particular allowed compact code to be written, in itself an aid to safe programming. Fortran was published by ISO in . Fortran
Following the publication of Fortran , the Fortran standards committees continued to operate, and decided on a strategy whereby a minor revision of Fortran would be prepared, establishing a principle of having major and minor revisions alternating. At the same time, the High-Performance Fortran Forum (HPFF) was founded. The HPFF was set up in an effort to define a set of extensions to Fortran such that it would be possible to write portable, single-threaded code when using parallel computers for handling problems involving large sets of data that can be represented by regular grids, allowing efficient implementations on SIMD (Single-Instruction, Multiple-Data) architectures. This version of Fortran was to be known as High Performance Fortran (HPF), and it was quickly decided, given its array features, that Fortran should be its base language. Thus, the final form of HPF was a superset of Fortran , the main extensions being in the form of directives. However, HPF included some new syntax too and, in order to avoid the development of divergent dialects of Fortran, it was agreed to include some of this new syntax into Fortran . Apart from this syntax, only a small number of other pressing but minor changes were made, and Fortran was adopted as a formal standard in . It is currently the most widely used version of Fortran.
Auxiliary Standards
With few exceptions, no Fortran standard up to and including Fortran included any feature intended
directly to facilitate parallel programming. Rather, this has had to be achieved through the intermediary of ad hoc industry standards, in particular HPF, MPI, OpenMP, and Posix Threads. The HPF directives, already introduced, take the form of Fortran comment lines that are recognized as directives only by an HPF processor. An example is !HPF$ ALIGN WITH b :: a1, a2, a3 to align three conformable arrays with a fourth, thus ensuring locality of reference. Further directives allow, for instance, aligned arrays to be distributed over a set of processors. MPI is a library specification for message passing. OpenMP supports multi-platform, shared-memory parallel programming and consists of a set of compiler directives, library routines, and environment variables that determine run-time behavior. Posix Threads is again a library specification, for multithreading. In any event, outside Japan, HPF met with little success, whereas the use of MPI and OpenMP has become widespread. Fortran
The following language standard, Fortran , was published in . Its main features were: the handling of floating-point exceptions, based on the IEEE model of floating-point arithmetic; allowing allocatable arrays as structure components, dummy arguments, and function results; interoperability with C; parameterized data types; object-orientation via constructors/destructors, inheritance, and polymorphism; derived type I/O; asynchronous I/O; and procedure variables. Although all this represented a major upgrade, especially given the introduction of object-oriented programming, there was no new development specifically related to parallel programming.
Fortran
Notwithstanding the fact that Fortran -conformant compilers had been very slow to appear, the standardization committees thought fit to proceed with yet another standard, Fortran . In contrast to the previous one, its single most important new feature is directly related to parallel processing – the addition of coarray handling facilities. Further, the DO CONCURRENT form of loop control and the CONTIGUOUS attribute
are introduced. These are further described below. Other major new features include: submodules, enhanced access to data objects, enhancements to I/O and to execution control, and more intrinsic procedures, in particular for bit processing. Fortran 2008 was published in 2010.
Selected Fortran Features
Given that all the actively supported Fortran compilers on the market offer at least the whole of Fortran 95, that version is the one used as the basis of the descriptions in this section, except where otherwise stated. Only those features are described that are relevant to vectorization or parallelization. Under Fortran 95, achieving effective parallelization requires the exploitation of a combination of efficient Fortran serial execution on individual processors with, say, MPI controlling the communication between processors.
The DO Constructs
Many iterative calculations can be carried out in the form of a DO construct. A simplified but sufficient nested form is illustrated by:

outer: DO
   inner: DO i = j, k, m    ! from j to k in steps of m (m is optional)
      :
      IF (...) CYCLE        ! take next iteration of construct 'inner'
      :
      IF (...) EXIT outer   ! exit the construct 'outer'
   END DO inner
END DO outer
(The Fortran language is case insensitive; however, to enhance the clarity of the code examples, Fortran keywords are distinguished by being written in upper case.) As shown, the constructs may be optionally named so that any EXIT or CYCLE statement may specify which loop is intended. Potential data dependencies between the iterations of a DO construct can inhibit optimization, including vectorization. To alleviate this problem, Fortran 2008 introduces the DO CONCURRENT form of the DO construct, as in
DO CONCURRENT (i = 1:m)
   a(k + i) = a(k + i) + scale * a(n + i)
END DO
that asserts that there is no overlap between the range of values accessed (on the right-hand side of the assignment) and the range of values altered (on the left-hand side). The individual iterations are thus independent.
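DO CONCURRENT also accepts an optional mask, in the same way as FORALL (described later). A small illustrative sketch, with invented array names, that skips zero divisors:

DO CONCURRENT (i = 1:n, b(i) /= 0.0)   ! the mask excludes zero divisors
   c(i) = a(i) / b(i)
END DO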
Array Handling

Array Variables
Arrays are variables in their own right; typically they are specified using the DIMENSION attribute. Every array is characterized by its type, rank (dimensionality), and shape (defined by the extents of each dimension). The lower bound of each dimension is by default 1, but arbitrary bounds can be explicitly specified. The DIMENSION keyword is optional; if omitted, the array shape must be specified after the array-variable name. The examples

REAL :: x(100)
INTEGER, DIMENSION(0:1000, -100:100) :: plane

declare two arrays: x is rank-1 and plane is rank-2. Fortran arrays are stored with elements in column-major order. Elements of arrays are referenced using subscripts, for example,

x(n)
x(i*j)

Array elements are scalar. The subscripts may be any scalar integer expression. A section is a part of an array variable, referenced using subscript ranges, and is itself an array:

x(i:j)               ! rank one
plane(i:j, k:l:m)    ! rank two
x(plane(7, k:l))     ! vector subscript
x(5:4)               ! zero length

An array of zero size is a legitimate object, and is handled without the need for special coding.

Array-valued constants and variables (array constructors) are available, enclosed in (/ ... /) or, in Fortran 2003, in square brackets, [ ]:

(/ 1, 2, 3, 4, 5 /)
(/ ( (/ 1, 2, 3 /), i = 1, 10) /)
(/ (i, i = 1, limit, 2) /)
[ (0, i = 1, 1000) ]
[ (0.01*i, i = 1, 100) ]

They may make use of an implied-DO loop notation, as shown, and can be freely used in expressions and assignments. A derived data type may contain array components. Given

TYPE point
   REAL, DIMENSION(3) :: vertex
END TYPE point
TYPE(point), DIMENSION(10)     :: points
TYPE(point), DIMENSION(10, 10) :: points_array

points(2) is a scalar (a structure) and points(2)%vertex is an array component of a scalar. A reference like points_array(n, 2) is an element (a scalar) of type point. However, points_array(n, 2)%vertex is an array of type real, and points_array(n, 2)%vertex(2) is a scalar element of it. An array element always has a subscript or subscripts qualifying at least the last name. Arrays of arrays are not allowed.

The general form of subscripts for an array section is

[lower] : [upper] [:stride]

where [ ] indicates an optional item. Examples for an array x(10, 100) are

x(i, 1:n)             ! part of one row
x(1:m, j)             ! part of one column
x(i, :)               ! whole row i
x(i, 1::3)            ! every third element of row i
x(i, 100:1:-1)        ! row i in reverse order
x((/1, 7, 3/), 10)    ! vector subscript
x(:, 1:7)             ! rank-2 section

A given value in an array can be referenced both as an element and as a section:

x(1, 1)      ! scalar (rank zero)
x(1:1, 1)    ! array section (rank one)

depending on the circumstances or requirements.

Operations and Assignments

As long as they are of the same shape (conformable), scalar operations and assignments are extended to arrays on an element-by-element basis. For example, given declarations of

REAL, DIMENSION(10, 20) :: x, y, z
REAL, DIMENSION(5)      :: a, b

it can be written:

x = y                            ! whole array assignment
z = x/y                          ! whole array division and assignment
z = 10.0                         ! whole array assignment of scalar value
b = a + 1.3                      ! whole array addition to scalar value
b = 7/a + x(1:5, 5)              ! array division, and addition to section
z(1:8, 5:10) = x(2:9, 5:10) + &
               z(1:8, 15:20)     ! array section addition and assignment
a(2:5) = a(1:4)                  ! overlapping section assignment

Elemental Procedures
In an assignment like y = SQRT(x)
an intrinsic function, SQRT, returns an array-valued result for an array-valued argument. It is an elemental procedure. Elemental procedures may be called with scalar or array actual arguments. The procedure is applied to array arguments as though there were a reference for each element. In the case of a function, the shape of the result is the shape of the array arguments. Most intrinsic functions are elemental, and Fortran 95 extended this feature to non-intrinsic procedures. It is a further aid to optimization on parallel processors. An elemental procedure requires the ELEMENTAL attribute and must be pure (i.e., have no side effects, see below).
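A minimal sketch of a user-defined elemental function (the name and body are invented for illustration); once declared ELEMENTAL, it may be applied to scalars or to conformable arrays alike:

ELEMENTAL REAL FUNCTION cube(x)
   REAL, INTENT(IN) :: x      ! elemental dummy arguments are scalars
   cube = x**3                ! no side effects: the function is pure
END FUNCTION cube

! y and x may be scalars or conformable arrays
y = cube(x)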
Array-Valued Functions and Arguments
Users can write functions that are array valued; they require an explicit interface and are usually placed in a module. This example shows such a module function, accessed by a main program:

MODULE show
CONTAINS
   FUNCTION func(s, t)
      REAL, DIMENSION(:)       :: s, t   ! An assumed-shape array, see below
      REAL, DIMENSION(SIZE(s)) :: func
      func = s * t**2
   END FUNCTION func
END MODULE show

PROGRAM example
   USE show
   REAL, DIMENSION(3) :: x = (/ 1., 2., 3. /), &
                         y = (/ 3., 2., 1. /), s
   s = func(x, y)
   PRINT *, s
END PROGRAM example
Fortran and Its Successors. Table 1 Classes of intrinsic procedures, with examples

Class of intrinsic array function    Examples
Numeric                              abs, modulo
Mathematical                         acos, log
Floating-point manipulation          exponent, scale
Vector and matrix multiply           dot_product, matmul
Array reduction                      maxval, sum
Array inquiry                        allocated, size
Array manipulation                   cshift, eoshift
Array location                       maxloc, minloc
Array construction                   merge, pack
Array reshape                        reshape
The many intrinsic functions that accept array-valued arguments should be regarded as an integral part of the array language. A brief summary of their categories appears in Table 1.

Assumed-Shape Arrays

Given an actual argument in a procedure reference, as in:

REAL, DIMENSION(0:10, 0:20) :: x
:
CALL calc(x)

the corresponding dummy argument specification must define the type and rank of the array, but it may omit the shape. Thus in

SUBROUTINE calc(d)
   REAL, DIMENSION(:, :) :: d

it is as if d were dimensioned (11, 21). The shape, not the bounds, is passed, meaning that the default lower bound is 1 and the default upper bound is the corresponding extent. However, any lower bound can be specified and the array maps accordingly. Assumed-shape arrays allow great flexibility in array argument passing.

In order to allow for optimization, especially on parallel and vector machines, the order of expression evaluation in executable statements is not specified. However, some potential optimizations might be lost if it is not certain that an array is in contiguous storage; for instance, if the array is an assumed-shape dummy argument array or a pointer array. In Fortran 2008, it is open to a programmer to assert that an array is contiguous by adding the CONTIGUOUS attribute to its specification, as in

REAL, CONTIGUOUS, DIMENSION(:, :) :: x

Automatic Arrays

Automatic arrays are useful for defining local, temporary work space in procedures, as in

SUBROUTINE exchange(x, y)
   REAL, DIMENSION(:)       :: x, y
   REAL, DIMENSION(SIZE(x)) :: work
   work = x
   x = y
   y = work
END SUBROUTINE exchange

Here, the array work is created on entering the procedure and destroyed on leaving. The actual storage is typically maintained on a stack.

Allocatable Arrays
Fortran provides dynamic allocation of storage, commonly implemented using a heap storage mechanism.
An example, for establishing a work array for a whole program, is

MODULE global
   INTEGER :: n
   REAL, DIMENSION(:, :), ALLOCATABLE :: work
END MODULE global

PROGRAM analysis
   USE global
   READ (*, *) n
   ALLOCATE(work(n, 2*n), STAT=status)
   :
   DEALLOCATE (work)
The work array can be propagated throughout the whole program via a USE statement in each program unit. An explicit lower bound may be specified and several entities may be allocated in one statement. Deallocation of arrays is automatic when they go out of scope.
Masked Assignment
Often, there is a need to mask an assignment. This can be done using either a WHERE statement:

WHERE (x /= 0.0) x = 1.0/x

an example that avoids division by zero by replacing only non-zero values by their reciprocals, or a WHERE construct:

WHERE (x /= 0.0)
   x = 1.0/x
   y = 0.0              ! all arrays assumed conformable
ELSEWHERE
   x = HUGE(0.)
   y = 1.0 - y
END WHERE
The ELSEWHERE statement may also have a mask clause, but at most one ELSEWHERE statement can be without a mask, and that must be the final one; WHERE constructs may be nested within one another.
The FORALL Statement and Construct
When a DO construct is executed, each successive iteration is performed in order, one after the other – an impediment to optimization on a parallel processor. To help in this situation, Fortran 95 introduced the
FORALL statement and construct, which can be considered as an array assignment expressed with the help of indices. An example is

FORALL (i = 1:n) x(i, i) = s(i)
where the individual assignments may be carried out in any order, and even simultaneously. In

FORALL (i = 1:n, j = 1:n, y(i,j) /= 0.) x(j,i) = 1.0/y(i,j)
the assignment is subject also to a masking condition. The FORALL construct allows several assignment statements to be executed in order, as in

FORALL (i = 2:n-1, j = 2:n-1)
   x(i,j) = x(i,j-1) + x(i,j+1) + x(i-1,j) + x(i+1,j)
   y(i,j) = x(i,j)
END FORALL
Assignment in a FORALL is like an array assignment: it is as if all the expressions were evaluated in any order, held in temporary storage, then all the assignments performed in any order. The first statement in a construct must fully complete before the second can begin. Procedures referenced within a FORALL must be pure (see below). In general, the FORALL has been poorly implemented and the DO CONCURRENT is seen as a better alternative: using DO CONCURRENT requires the programmer to ensure that the iterations are independent, whereas, with FORALL, the compiler is expected to determine whether a temporary copy is needed, which is often unrealistic.
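To make the comparison concrete, the diagonal assignment shown earlier can be written in either form; this is an illustrative sketch rather than text from either standard:

FORALL (i = 1:n) x(i, i) = s(i)    ! Fortran 95: the compiler must decide
                                   ! whether a temporary copy is needed

DO CONCURRENT (i = 1:n)            ! Fortran 2008: the programmer asserts
   x(i, i) = s(i)                  ! that the iterations are independent
END DO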
Pure Procedures
This is another Fortran 95 feature expressly for parallel computing, ensuring that execution of a procedure referenced from within a FORALL statement or construct (or, in Fortran 2008, from within a DO CONCURRENT construct) cannot have any side effects (whereby the result of one reference could cause a change to the result of another). Any such side effects in a function could impede optimization on a parallel processor – the order of execution of the assignments could affect the results. To control this situation, a PURE keyword may be added to the SUBROUTINE or FUNCTION statement – an assertion that the procedure:
● alters no global variable;
● performs no input/output operations;
● has no saved variables; and,
● in the case of functions, alters no argument.
These constraints are designed such that they can be verified by a compiler. All the intrinsic functions are pure. The FORALL statement and construct and the PURE keyword were adopted from HPF.
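A minimal sketch of a function satisfying these constraints (the name and formula are invented for illustration):

PURE REAL FUNCTION mean2(a, b)
   REAL, INTENT(IN) :: a, b     ! arguments may not be altered
   mean2 = 0.5*(a + b)          ! no I/O, no saved or global state
END FUNCTION mean2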
Coarrays (Fortran 2008 Only)
The objective of coarrays is to distribute over some number of processors not only data, as in an SIMD model, but also work, using the SPMD (Single-Program, Multiple-Data) model. The syntax required to implement this facility has been designed to make the smallest possible impact on the appearance of a program and to require a programmer to learn just a modest set of new rules. Data distribution is achieved by specifying the relationship among memory images using an elegant new syntax similar to conventional Fortran. Any object not declared using this syntax exists independently in all the images and can be accessed only from within its own image. Objects specified with the syntax have the additional property that they can be accessed directly from any other image. Thus, the statement

REAL, DIMENSION(n)[*] :: a, b

specifies two coarrays, a and b, that have the same size, n, in each image. Execution by an image of the statement

a(:) = b(:)[j]
causes the array b from image j to be copied into its own array a, where square brackets are the notation used to access an object on another image. On a shared-memory machine, an implementation of a coarray might be as an array of a higher dimension. On a distributed-memory machine with one physical processor per image, a coarray is likely to be stored at the same address in each physical processor. Work is distributed according to the concept of images, which are copies of the program that each have a separate set of data objects and a separate flow of control. The number of images is normally chosen on the command line, and its fixed value is available at execution time via an inquiry function, NUM_IMAGES. The images execute asynchronously and the execution
path in each may differ, possibly under the control of an image index to which the programmer has access (via the THIS_IMAGE function). When synchronization between two images is required, use can be made of a set of intrinsic synchronization facilities, such as SYNC_ALL or LOCK. Thus, it is possible in particular to avoid race conditions whereby one image alters a value still required by another, or one image requires an altered value that is not yet available from another. Handling this is the responsibility of the programmer. Between synchronization points, an image has no access to the fresh state of any other image. Flushing of temporary memory, caches, or registers is normally handled implicitly by the synchronization mechanisms themselves. In this way, a compiler can safely take advantage of all code optimizations on all processors between synchronization points without compromising data integrity. Where it might be necessary to limit execution of a code section to just one image at a time, a critical section may be defined using a CRITICAL ... END CRITICAL construct. The codimensions of a coarray are specified in a similar way to the specifications of assumed-size arrays, and coarray sub-objects may be referenced in a similar way to sub-objects of normal arrays. The following example shows how coarrays might be used to read values in one image and distribute them to all the others:

REAL :: val[*]
...
IF (THIS_IMAGE() == 1) THEN
   ! Only image 1 executes this construct
   READ (*, *) val
   DO image = 2, NUM_IMAGES()
      val[image] = val
   END DO
END IF
CALL SYNC_ALL()   ! Execution on all images pauses here
                  ! until all images reach this point
Coarrays can be used in most of the ways that normal arrays can. The most notable restrictions are that they cannot be automatic arrays, cannot be used for a function result, cannot have the POINTER attribute, and cannot appear in a pure or elemental procedure.
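As a further illustration, here is a minimal sketch (invented for this description, not taken from the standard) in which every image computes a partial value and image 1 combines them after a global barrier; Fortran 2008 spells the barrier as the image control statement SYNC ALL:

PROGRAM coarray_sum
   IMPLICIT NONE
   REAL    :: partial[*]            ! one copy on every image
   REAL    :: total
   INTEGER :: img
   partial = REAL(THIS_IMAGE())     ! stand-in for a real local computation
   SYNC ALL                         ! all partial values now visible
   IF (THIS_IMAGE() == 1) THEN
      total = 0.0
      DO img = 1, NUM_IMAGES()
         total = total + partial[img]
      END DO
      PRINT *, 'Sum over images: ', total
   END IF
END PROGRAM coarray_sum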
Related Entries
Code Generation
Data Distribution
Dependences
HPF (High Performance Fortran)
Loop Nest Parallelization
MPI (Message Passing Interface)
OpenMP
Parallelization, Basic Block
Run Time Parallelization

Bibliographic Notes and Further Reading
The development of Fortran 90 was very contentious, and an entire issue of the journal [] is devoted to its various aspects. The language itself is defined in the ISO publications [] and []. Fortran 95 and Fortran 2003 are informally but completely described in [] and [] and, with the latest additions contained in Fortran 2008, in []. Coarrays have evolved from a programming model initially intended for the Cray T3D. They were first proposed by Numrich and Steidel [] and subsequently defined for Fortran 95 by Numrich and Reid []. Their formal definition is in the Fortran 2008 standard [] and they are described informally also in [].

Bibliography
Adams JC et al (2009) The Fortran 2003 handbook. Springer, London/New York
() Comput Stand Interfaces
ISO/IEC 1539-1. ISO, Geneva
ISO/IEC 1539-1. ISO, Geneva
Metcalf M, Reid J, Cohen M (2004) Fortran 95/2003 explained. Oxford University Press, Oxford/New York
Metcalf M, Reid J, Cohen M (2011) Modern Fortran explained. Oxford University Press, Oxford/New York
Numrich RW, Steidel JL () F−−: a simple parallel extension to Fortran. SIAM News
Numrich RW, Reid JK (1998) Co-Array Fortran for parallel programming. ACM Fortran Forum, and Rutherford Appleton Laboratory report RAL-TR. ftp://ftp.numerical.rl.ac.uk/pub/reports/nrRAL.pdf

Fortran, Connection Machine
Connection Machine Fortran

Fortress (Sun HPCS Language)
Guy L. Steele Jr., Eric Allen, David Chase, Christine Flood, Victor Luchangco, Jan-Willem Maessen, Sukyoung Ryu
Oracle Labs, Burlington, MA, USA
Oracle Labs, Austin, TX, USA
Google, Cambridge, MA, USA
Korea Advanced Institute of Science and Technology, Daejeon, Korea

Definition
Fortress is a programming language designed to support parallel computing for scientific applications at scales ranging from multicore desktops and laptops to petascale supercomputers. Its generality and features for user extension also make it a useful framework for experimental design of domain-specific languages. The project originated at Sun Microsystems Laboratories as part of their work on the DARPA HPCS (High Productivity Computing Systems) program [] and later became an open-source project.

Discussion

Design Principles
Fortress provides the programmer with a single global address space, an automatically managed heap, and automatically managed thread parallelism. Programs can also spawn threads explicitly. The three major design principles for Fortress are:

● Mathematical syntax: Fortress syntax is inspired by mathematical tradition, using the full Unicode character set and two-dimensional notation where relevant. The goal is to simplify the desk-checking of code against what working scientists and mathematicians tend to write in their notebooks or on their whiteboards.
● Parallelism by default: Rather than grafting parallel features onto a sequential language design, Fortress uses parallelism wherever possible and reasonable. The design intentionally makes sequential idioms a little more verbose than parallel idioms, and intentionally makes side effects a little more verbose than "pure" (side-effect-free) code.
● A growable language []: As few features as possible are built into the compiler, and as many language design decisions as possible are expressed at the source-code level. In many ways, Fortress is a framework for constructing domain-specific languages, and while the first such language is aimed at scientific programmers, the intent is that radically different design choices can also be supported.

Fortress is a strongly typed language. The type system is object oriented, with parameterized polymorphic types but also with type inference, so that programmers frequently need not declare the types of bound variables explicitly. The objects enjoy multiple inheritance of code but not of data fields. A Fortress trait is much like an interface in the Java programming language but can contain concrete implementations of methods that can be inherited. A Fortress object can declare data fields and in other respects is also much like a Java class, but a Fortress object can extend (and therefore inherit from) only traits, not other objects. Methods may be overloaded; overload resolution is performed by multimethod dispatch using the dynamic types of all arguments. (In contrast, the Java programming language uses the dynamic type of the invocation target "before the dot" but the static types of the arguments in parentheses.) Fortress supports and favors the use of pure, functional object-oriented programming, but also supports side effects, including assignment to variables and to fields of objects that have been declared mutable. Atomic blocks are used to manage the synchronization of side effects. Fortress also favors divide-and-conquer approaches to parallelism (as opposed to, say, pipelining); generators split aggregate data structures and numeric ranges into portions that can be processed in parallel, and reducers combine or aggregate the results for further processing. Generators and reducers are classified according to their algebraic properties, such as associativity and commutativity; this classification is expressed through, and enforced by, specific traits defined by a standard library, enabling a generator to choose from a variety of implementation strategies.
Salient Features

Syntax
Fortress syntax is designed to resemble standard mathematical notation as much as possible while integrating programming-language notions of variable binding, assignment, and control structures. All mathematical operator symbols in the Unicode character set [], such as ⊙ ⊕ ⊗ ≤ ≥ ∩ ∪ ⊓ ⊔ ⊆ ⊑ ≺ ≽ ∑ ∏, may be defined and overloaded as user-defined infix, prefix, and postfix operators; a large number of bracketing symbols such as { } ⟨ ⟩ ⌜ ⌝ ⌞ ⌟ ⌊ ⌋ ⌈ ⌉ ∣ ∣ ∥ ∥ are likewise available for library or user (re-)definition. Relational operators may be chained, as in a ≤ n ≤ b; the compiler treats this as if it had been written (a ≤ n) ∧ (n ≤ b) but arranges to evaluate the expression n just once. Simple juxtaposition is itself a binary operator that can be defined and overloaded by Fortress source code; the standard library defines juxtaposition to perform function application (as in f(x, y) or print x), numerical multiplication (as in 2 n and x y), and string concatenation (as in "My name is" myName "."). Unicode characters may be used directly or represented in ASCII through a Wiki-inspired syntax that is reasonably readable in itself. Fortress also provides concise syntax to mimic mathematical notation: semicolons and parentheses may often be omitted, and curly braces { } denote sets (rather than blocks of statements). The notation is whitespace sensitive but not indentation sensitive. Fortress is expression oriented, and it favors the use of immutable bindings over mutable variables, requiring a slightly more verbose syntax for declaration of mutable variables and for assignment. The left-hand side of the figure below shows a sample Fortress function that performs a matrix/vector "conjugate gradient" calculation (based on the NAS "CG" conjugate gradient benchmark for parallel algorithms) []. The function conjGrad takes a matrix and a vector, performs a number of calculations, and returns two values: a result vector and a measure of the numerical error. The right-hand side of the figure presents the original
conjGrad⟦E extends Number, nat N,
         V extends Vector⟦E, N⟧,
         M extends Matrix⟦E, N, N⟧⟧(A: M, x: V): (V, E) = do
   z: V := 0
   r: V := x
   ρ: E := r⋅r
   p: V := r
   for j ← seq(1 # 25) do
      q = A p
      α = ρ/(p⋅q)
      z += α p
      ρ₀ = ρ
      r −= α q
      ρ := r⋅r
      β = ρ/ρ₀
      p := r + β p
   end
   (z, ∥x − A z∥)
end

z = 0
r = x
ρ = rᵀ r
p = r
DO i = 1, 25
   q = A p
   α = ρ/(pᵀ q)
   z = z + α p
   ρ₀ = ρ
   r = r − α q
   ρ = rᵀ r
   β = ρ/ρ₀
   p = r + β p
ENDDO
compute residual norm explicitly: ∥r∥ = ∥x − A z∥

Fortress (Sun HPCS Language). Fig. Sample Fortress code for a conjugate gradient calculation (left) and the original NAS CG pseudocode specification (right)
specification of the conjugate gradient calculation to show how the Fortress program resembles the problem specification. The principal differences are that the Fortress code provides type declarations – including a generic (type-parametric) function header – and distinguishes between binding and assignment of variables (within the loop body, q and α and ρ₀ are read-only variables that are locally bound using = syntax, whereas z and r and ρ and p are mutable variables that are updated by the assignment operator := and the compound assignment operators += and −=). Fortress is designed to grow over time to accommodate the changing needs of its users via a syntactic abstraction mechanism [] that, in effect, allows source-code declarations to augment the language grammar by specifying rewriting rules for new language constructs. Parsing of new constructs is done alongside parsing of primitive constructs, allowing programmers to detect syntax errors in use sites of new constructs early. Programs in domain-specific languages can be embedded in Fortress programs and parsed as part of their host programs. Moreover, the definition of many constructs
that are traditionally defined as core language primitives (such as for loops) can be moved into Fortress’ own libraries, thereby reducing the size of the core language.
Type System
Fortress has a rich type system designed to give programmers the expressive power they need to write libraries that provide features often built into other languages. The Fortress type system is a trait-based, object-oriented system that supports parametrically polymorphic named types (also known as generic traits) organized as a multiple-inheritance hierarchy. Fortress code is statically type-checked: it is a static ("compile-time") error if the type of an expression is not a subtype of the type required by its context. Objects may be declared to extend traits, from which they can inherit method declarations and definitions. If an object inherits an abstract method declaration, then it has the obligation to provide (or inherit) a concrete implementation of that method. A trait may also extend one or more other traits. Extension creates a subtype relationship; if type A extends type B, then every
object belonging to type A (that is, every instance of A) necessarily also belongs to type B. In addition to subtyping, Fortress allows the declaration of exclusion and comprises relationships among types. If type A excludes type B, then no value can be an instance of both A and B. If a type T comprises a set of types { U₁, U₂, . . . , Un }, then every instance of T is an instance of some Ui (possibly more than one). This allows a more flexible description of important relationships among types than simply the ability to declare that a class is final.
Types, values, and expressions  Fortress provides several kinds of types, including trait types, tuple types, and arrow types. At the top of the type hierarchy is the type Any, which comprises three types, Object, Tuple, and () (pronounced "void"); at the bottom of the hierarchy is the type BottomType. Every expression in Fortress has a static type. Every value in Fortress is either an object, a tuple, or the value (), and has a (run-time) tag, or ilk, which is a "leaf" in the type hierarchy: the only strict subtype of a value's tag is BottomType. Fortress guarantees that when an expression is evaluated, the ilk of the resulting value is a subtype of the type of the expression.

Traits  Like an interface in the Java programming language, a trait is a named program construct that declares a set of methods, and which may extend one or more other traits, inheriting their methods. However, unlike Java interfaces, whose methods are always abstract (i.e., they do not provide bodies), the methods of Fortress traits may be either abstract or concrete (having bodies). Like functions, methods may be overloaded, and a call to such a method is resolved using the run-time tags of its arguments.

Objects  Objects form the leaves of the trait hierarchy: no trait may extend an object trait type. Thus object declarations are analogous to final classes in the Java programming language. In addition to methods, objects may have fields, in which state may be stored. Numbers and booleans and characters are all objects. This does not imply that they are always heap-allocated. Objects declared with the value keyword have no object identity, must have only immutable fields, and may be freely copied by the implementation.

Tuples  A tuple is written as a comma-separated series of expressions within parentheses, for example, (a, 1, b + 2); they are convenient for passing multiple arguments and for returning multiple results. Tuples are first-class values but are not objects. Tuple types are covariant: if type A extends type B, and C extends D, and E extends F, then the tuple type (A, C, E) extends the tuple type (B, D, F). The type of a variable, of a field, or of array elements may be a tuple type. Tuples of variables may appear on the left-hand side of binding and assignment constructs.

Functions and Methods  Functions are also first-class values, and furthermore, they are objects. However, their types are not traits, but arrow types, constructed from other types using the arrow combinator →. A function that is declared to return an instance of type B whenever it is passed an instance of type A as an argument is an instance of the arrow type A → B. Because a function can be overloaded, having multiple definitions with different parameter types, it can be an instance of several different arrow types (Fig. ). Because arrow types are covariant in their return types and contravariant in their parameter types, every arrow type is a subtype of BottomType → Any and a supertype of Any → BottomType. Fortress provides two kinds of methods that may be declared in objects and traits: dotted methods, which are similar to methods in other object-oriented languages and are invoked using the syntax target.methodName(arguments), and functional methods, which are overloadings of global function names and therefore are invoked using the syntax methodName(arguments), but one of the arguments is treated as the target just as for a dotted method (Fig. ). Ignoring the possibility of automatic coercion, a function declaration is applicable to a call to that function if its parameter type is a supertype of (possibly equal to) the run-time tag of the argument, and it is more specific than another declaration if its parameter type is a strict subtype of the parameter type of the other. (If a declaration of a function or method takes more than one argument, then its "parameter type" (singular) is considered to be a tuple type.) The Java programming language allows any set of overloaded method declarations, resolves overloading based on the static types of the arguments and the
(* The function pad is an instance of the arrow type String → String and also of the arrow type Z → Z , neither of which is a subtype of the other. *) pad(x: String): String = “ ” x pad(y: Z): Z = y (* Arrow types can include the types of exceptions that can be thrown. The function pad′ in an instance of the arrow type String → String throws EmptyString and also of the arrow type Z → Z throws ZeroNumber . *) pad′ (x: String): String throws EmptyString = if x = “” then throw EmptyString else “ ” x “ ” end pad′ (y: Z): Z throws ZeroNumber = if y = then throw ZeroNumber else y end Fortress (Sun HPCS Language). Fig. Two examples of a function that belongs to more than one arrow type
(* This example object has one field, v, and four method definitions. The method add is a conventional “dotted method”; the method name maximum is overloaded with three functional method definitions. The parameter self indicates that a method is a functional method that is invoked using functional rather than dot syntax. *) object Example(v: Z) add(n: Z) = n + v maximum(self, n: Z) = if n > v then n else v end maximum(n: Z, self) = maximum(self, n) maximum(self, other: Example) = maximum(self, other.v) end
Dotted method (no self parameter) ⊛ Functional method ⊛
⊛
Functional method
⊛
Functional method
do o = Example() p = Example() ⊛ Examples of method invocation println o.add() ⊛ Prints “” println p.add() ⊛ Prints “” println maximum(o, ) ⊛ Prints “” println maximum(, o) ⊛ Prints “” println maximum(o, p) ⊛ Prints “” end Fortress (Sun HPCS Language). Fig. Example definition of a dynamically parameterized object
dynamic type of the target, and requires this resolution to be unambiguous for calls that actually appear in the program (i.e., ambiguity is separately checked at every call site). In contrast, when a call to an overloaded Fortress function or method is executed, the most specific applicable declaration is selected based on the tags of the arguments passed to the function; that is, Fortress performs full multimethod dynamic dispatch on the target and all arguments. Furthermore, Fortress imposes rules on the set of overloaded declarations so that dispatch is always unambiguous (i.e., ambiguity is checked for each set of declarations and no separate check is needed at each call site). In particular, any pair of declarations of overloaded functions or methods must satisfy one of three conditions:
Subtype rule: The parameter type of one declaration is a strict subtype of the parameter type of the other, and the return type of the first declaration is a subtype of (possibly the same as) the return type of the second declaration.
Exclusion rule: The parameter types of the two declarations exclude each other.
Meet rule: The parameter types of the two declarations are incomparable (there is neither a subtype nor an exclusion relationship between them) and there is another declaration of that function whose parameter type "covers" the intersection of the parameter types of the two incomparable declarations.
If all pairs of declarations satisfy one of these conditions, there can be no ambiguity for any particular call to that function or method: in the first case, one declaration is more specific than the other; in the second case, there is no call to which both declarations are applicable; and in the third case, whenever both declarations are applicable, there is another declaration that is more specific than both that is also applicable, so neither of the two incomparable declarations can be the most specific applicable declaration. These rules are how the design of Fortress solves the problem of multiple inheritance of methods. Access to fields of objects is abstract: the field access syntax x.fieldName actually invokes a getter method that takes no arguments but returns a value. Declaring a field by default provides a trivial getter method that simply fetches the contents of the field, but this method can be suppressed or displaced by
an explicit getter declaration. Similarly, the assignment operator := actually invokes a setter method, and the declaration of a mutable field by default provides a trivial setter method. In this manner, the behavior of field access and assignment is completely programmable.
Parametric polymorphism (or Generic types and functions) Fortress supports generic traits (and objects), as
well as generic functions (and methods). All these entities can have static parameters of various kinds: type parameters (for which any type can serve as an actual argument), nat and int and bool parameters (for which static arithmetic and Boolean expressions can serve as actual arguments), unit and dim parameters (for which declared physical dimensions and units can serve as actual arguments), and opr parameters (for which operator symbols can serve as actual arguments). Type parameters may be given one or more bounds, thereby requiring that they be instantiated only with subtypes of all the bounds. If no bound is given, it is implicitly assumed to be bound by Object. Note that Object is not the root of the Fortress type hierarchy; a type parameter that may be instantiated with a tuple type, for example, must be given an explicit bound (Fig. ). This unusual default for the bounds of type parameters helps to reduce confusion when functions and methods are overloaded with different numbers of parameters. The Java programming language has a system of generic types that erases type parameter information at run time. In contrast, Fortress types are not erased, and it is possible to perform the equivalent of an instanceof check to distinguish a list of strings from a list of booleans. Thus, there is no significant disparity between the static type system and the run-time tags associated with Fortress values: tags are simply a special class of types corresponding to leaves in the type hierarchy. In addition to standard parametric polymorphism, Fortress provides where clauses, in which hidden type variables can be declared and used within a trait (or object) or function declaration; they are called “hidden” because they are not static parameters of the trait or function – rather, they are universally quantified within their declared bounds.
(* The trait LinkedList can contain elements of type T. Because the declared bound on the type T is Any, the actual element type may be a tuple type. Because of the comprises clause, only instances of the object types ConsT and Empty can be instances of LinkedListT . *) trait LinkedList T extends Any comprises { ConsT, Empty } end (* A Cons instance has two fields, first and rest. This object declaration implicitly declares a constructor function called Cons that takes two arguments and creates a new instance with its two fields initialized to the given arguments. *) object Cons T extends Any (first: T, rest: ListT) extends LinkedListT end The single object Empty belongs to all LinkedList types. object Empty extends LinkedListT where { T extends Any } end ⊛
Fortress (Sun HPCS Language). Fig. Example definitions of a statically parameterized trait and two objects that implement it
Having to write out all static type arguments and complete type signatures for all functions and methods, especially local “helper” functions, can be tedious, and often results in code that is difficult to read and maintain. To mitigate these problems, Fortress (like ML and Haskell) has a type-inference mechanism that allows many types and static parameters to be elided in practice. Coercion A trait T may contain one or more coercion declarations; such a declaration specifies a computation that, when given any instance of another type U, will return a corresponding instance of type T. Such declarations should be used with care, only in situations where it is desirable that every instance of type U also be regarded as, in effect, an instance of type T in every computational context whatsoever. Typically this is used to convert among specialized numerical representations, and in particular to convert numeric literals to other numeric types. (Fortress in unusual in that the type of a literal such as is not any of the standard computational integer types such as Z or Z or Z , but a special numeric-literal trait.) Coercion comes into play during multimethod dispatch: if there is no applicable function or method declaration for a given call, then the net is cast wider by considering declarations that would be applicable if the arguments were automatically coerced to some other type; if there is a unique most specific applicable
declaration in this larger set, then that is the one that will be invoked. To help prevent surprises, coercions in Fortress are not transitive.
Parallelism
Many constructs in the Fortress language give rise to multiple implicit tasks to execute subexpressions, such as the components of a tuple, blocks of the do-also construct, function arguments, operands of binary operators, and the target and arguments of a method call (Fig. ). This is fork-join parallelism: all tasks for a construct execute to completion before execution of the construct itself can complete. Tasks are implicitly scheduled and provide no fairness guarantees; while tasks may execute concurrently, a Fortress compiler or run-time may choose to schedule tasks for sequential execution instead. This allows flexible allocation of hardware resources. Comprehensions, reduction expressions, and for loops are all syntactic sugar for uses of generators and reducers. A generator is an object capable of supplying some number of separate values to a reducer through one or more library-defined protocols; a reducer is an object that can accept values from a generator and produce a combined or aggregated result. For example, the nested for loop shown in a later figure contains two generator expressions, both counted ranges built with the : operator, representing parallel counted ranges; the loop desugars into a nested pair of calls to their respective loop methods.
(* Five examples of situations in which expressions such as f(x) and g(y) and h(z) may be executed in parallel as concurrent tasks. *)
(a, b, c) = (f(x), g(y), h(z))          ⊛ Elements of a tuple
do f(x) also do g(y) also do h(z) end   ⊛ Separate blocks of a do-also construct
p(f(x), g(y), h(z))                     ⊛ Arguments of a function call
f(x) + g(y)                             ⊛ Operands of a binary operator
f(x).m(g(y), h(z))                      ⊛ Target and arguments of a method call

Fortress (Sun HPCS Language). Fig. Implicit parallelism in Fortress
The actual parallelism is provided by using implicit tasks in the implementation of the loop method, generally using a divide-and-conquer coding style. Most Fortress data structures are generators and support structurerespecting parallel traversal. It is a library convention that generators are implicitly parallel by default; the seq(self) method conventionally constructs a sequential version of a generator. Definitions of BIG operators typically define the behavior of reductions (Fig. ) and comprehensions (Fig. ) in terms of an underlying monoid. Note that while generators run computations in parallel, there is still a well-defined spatial order to the results produced by a generator (Fig. ). Fortress also provides explicit parallelism using the spawn construct (Fig. ). Spawned threads are scheduled fairly with respect to one another, and implicit tasks of separately spawned threads are scheduled fairly with respect to one another. The programmer is responsible for wrapping concurrent accesses to shared data in an atomic block. The semantics of an atomic block is that either all the reads and writes to memory happen simultaneously, or none of them do; in effect, all the memory reads and memory writes of an atomic block appear to all other tasks to occur as a single atomic memory transaction. Such transactions are in fact strongly atomic (transactional and non-transactional accesses can coexist).
Implicit parallelism may be exploited inside of a transaction. Inner transactions may be retried independently of the outer transaction (“closed nesting”). Contention may be managed by the run-time system, or at the Fortress language level using the tryAtomic construct. The current implementation of Fortress uses an automatic dynamic load-balancing algorithm (based on work done at MIT as part of the Cilk project) that uses work-stealing to move pending tasks from one processor to another.
Programming in the Large
Fortress supports a number of features for programming in the large.

Components and APIs  Fortress source code is encapsulated in components, which contain definitions of program elements. Each component imports a collection of APIs (Application Programming Interfaces) that declare the program elements used by that component. Each component also exports a collection of APIs. If a component C exports an API A, then C must include a definition for every element declared in A. Components can be linked together into larger components. If a component C exports an API A that is imported by a component D, and C and D are linked, then references in D to the declarations in A are resolved to the corresponding definitions in C.
(* This for loop controls two index variables in nested fashion. The : operator, given two integers, produces a CountedRange object that can serve as a generator of integer index values. *) for i ← : , j ← (i + ) : do ai,j := i − j end (* The for loop is desugared into nested calls to higher-order methods that take functions as arguments. These functions bind the index variables. The fn syntax is analogous to Lisp’s lambda expression syntax for anonymous functions. *) ( : ).loop(fn (i) ⇒ ((i + ) : ).loop(fn ( j) ⇒ do ai,j := i − j end)) (* The loop method of the CountedRange generator object uses a recursive divide-and-conquer strategy that exploits implicit task parallelism (do -also ) to permit concurrent execution of loop iterations. *) object CountedRange(lo: Z, hi: Z) extends GeneratorZ ... loop(body: Z → ()): () = if lo = hi then body(lo) elif lo < hi then lo + hi ⌋ mid = ⌊ do CountedRange(lo, mid).loop(body) also do CountedRange(mid + , hi).loop(body) end end ... seq(self) = . . . end Fortress (Sun HPCS Language). Fig. Implementation of for loops through desugaring
Contracts  Fortress functions and methods can be annotated with contracts that declare preconditions and postconditions (Fig. ).
Test code, properties, and automated unit testing  Fortress includes support for defining tests inline in programs alongside the data and functionality they test. Similar to tests are property declarations, which document conditions that a program is expected to obey. The parameters in a property declaration denote the types of values over which the property is expected to hold.
When test data is provided with a property, the property is checked against all values in the test data when the program tests are run. One use of the design-by-contract features is to document and enforce algebraic relationships that are necessary for correct parallel execution of generators and reducers (Fig. ).
Innovations
● Fortress introduced comprises and excludes clauses that allow the type checker to deduce that two types are disjoint; this matters for certain kinds
Summation is one kind of reduction expression. ai,j ∑
i←: j←(i+):
The summation is desugared into nested calls to higher-order methods. ∑(( : ).nest(fn (i) ⇒ ((i + ) : ).map(fn ( j) ⇒ ai,j ))) ⊛
(* The unary ∑ operator takes a generator and gives its reduce method an object that implements the standard reduction protocol using addition. The reduce method, like the loop method, provides implicit parallelism. This ∑ operator is generic; it can be used with any data type T that implements the Additive trait, meaning that type T implements a binary + operator and can supply an identity value named zero. *) opr ∑[[T extends AdditiveT]](g: GeneratorT): T = g.reduce(SumReductionT) object SumReduction[[T extends AdditiveT]] extends ReductionT, T empty(): T = zero ⊛ AdditiveT means type T has a zero value join(a: T, b: T): T = a + b ⊛ that is the identity for the + operator on type T. singleton(a: T): T = a ⊛ The sum of one number is the number itself. end Fortress (Sun HPCS Language). Fig. Implementation of reduction operations through desugaring
List comprehension ⟨ . . . ∣ . . . ⟩ is one kind of comprehension expression. ⊛ It produces an ordered list of values. ⟨ ai,j ∣ i ← : , j ← (i + ) : ⟩ ⊛
(* Other kinds of comprehension brackets construct arrays, sets, or multisets. They are desugared in much the same manner as reduction expressions. *) (* The meaning of a comprehension is defined by library code in the same way as for a reduction operator.
*)
opr BIG ⟨T g : GeneratorT ⟩ : ListT = g.reduce(ListReductionT) (* List aggregation is essentially reduction of singleton lists using the list concatenation operator ∥ . The library definitions of the ∥ operator and the (noncomprehension) list constructors ⟨ ⟩ and ⟨ x ⟩ are not shown here. *) object ListReductionT extends Reduction[[T, ListT]] empty(): ListT = ⟨ ⟩ ⊛ Produce an empty list. join(a: ListT, b: ListT): ListT = a ∥ b ⊛ Concatenate two lists. singleton(a : T): ListT = ⟨ a ⟩ ⊛ Produce a singleton list. end Fortress (Sun HPCS Language). Fig. Implementation of comprehension expressions through desugaring
of code optimization. Such declarations can also improve code maintainability. ● Lexical syntax is inspired in part by Wiki notation: the visual appearance of “rendered” code depends
(* The squares of the nine values 1, 2, 3, 4, 5, 6, 7, 8, 9 may be computed concurrently or in any temporal order, but the logical or spatial order of the results within the ordered list will always be ⟨ 1, 4, 9, 16, 25, 36, 49, 64, 81 ⟩. *)
⟨ k² ∣ k ← 1:9 ⟩
Fortress (Sun HPCS Language). Fig. The distinction between logical order and temporal order
on whether an identifier is lowercase, uppercase, or mixed case, and whether underscore characters are used in particular ways. Many single-character operators and brackets can be represented by multi-character ASCII sequences, but there is no single "escape" character (such as backslash) that introduces such sequences. As a rule, if an operator has a standard name in TeX or LaTeX, then that same name converted to uppercase, omitting the backslash, is a Fortress name for that operator. See Fig. . Text in comments uses a markup syntax based on Wiki Creole 1.0 []. See Fig. .
● Most programming languages handle operator precedence by imposing a total order on operators (sometimes by mapping them to specific integer values). Fortress uses only precedence rules that would
do
   ⊛ The spawn construct forks a fair concurrent thread
   ⊛ and immediately returns a handle object for the thread.
   firstThread = spawn f(x)
   secondThread = spawn g(y)
   ⊛ Other computations here will run concurrently with the spawned threads.
   ⊛ The val method waits for a spawned thread to complete and then produces
   ⊛ whatever value or exception the thread may have produced.
   firstThread.val + secondThread.val
end
Fortress (Sun HPCS Language). Fig. Explicit spawning of fair concurrent threads
(* This definition of factorial has a contract that requires the argument to be nonnegative. The contract also guarantees that the result will be nonnegative. *) factorial (n) requires { n ≥ } ensures { outcome ≥ } = if n = then else n factorial (n − ) end (* This test-only code (indicated by the keyword test ) calls the factorial function for all values from to , and checks that the result for each is not less than the argument. *) test factorialNotLessThanInput [x ← : ] = (x ≤ factorial (x)) (* This property declaration requires that the result of factorial be not less than the argument for every argument of type Z . Test data for spot-checking a property may be provided separately. *) property factorialNeverLessThanInput = ∀(x: Z) (x ≤ factorial(x)) Fortress (Sun HPCS Language). Fig. Examples of contracts, test code, and declared properties
(* This trait imposes upon any trait T that extends BinaryOperator T, ⊙ the requirement to provide a concrete implementation of the specified binary operator ⊙ . Note that both T and ⊙ are parameter names, for which an actual type and actual operator are substituted by an invocation such as BinaryOperator Q, + or BinaryOperator Z, MAX . *) trait BinaryOperator T, opr ⊙ comprises T opr ⊙(self, other: T): T end An associative operator is a binary operator with the associative property. trait Associative T, opr ⊙ comprises T extends { BinaryOperator T, ⊙ , EquivalenceRelation T, = } property ∀(A; T, b: T, c: T) ((a ⊙ b) ⊙ c) = (a ⊙ (b ⊙ c)) end ⊛
A monoid is a type with an operator that is associative and has an identity. trait Monoid T, opr ⊙ comprises T extends { Associative T, ⊙ , HasIdentity T, ⊙ } end ⊛
Fortress (Sun HPCS Language). Fig. Use of declared properties to enforce algebraic relationships
be familiar from high-school or college mathematics courses. A consequence is that operator precedence in Fortress is not transitive (Fig. ). ● The “meet rule” eliminates the need for arbitrary “tie-break” or “ordering” rules found in other multiple-inheritance type systems. ● Fortress addresses the “self-type problem” common to many type systems with inheritance by using generic traits that each have a static parameter that is a type that must extend the generic trait; the comprises clause is used to enforce this requirement. The parameter name then serves as a self-type adequate to solve the “binary method problem.” ● The Fortress libraries use the self-type idiom to encode and enforce (through use of design-bycontract features) a hierarchy of algebraic properties of data types and operators ranging from commutativity and associativity all the way up to partial and total orders, monoids and groups, and Boolean algebras. Type-checking ensures that, for example, a comparison operator given to a sort method actually implements a correct ordering criterion; overloaded method dispatch allows the sort method to use different algorithms depending on whether the order is partial or total.
● Fortress supports units and dimensions with a relatively small amount of built-in mechanism. All the Fortress compiler knows is that units and dimensions may be declared (Fig. ), that they form a free Abelian group under multiplication, that these abstract values may be used as static arguments to generic types, and that such a generic type may optionally be declared to absorb units. That you can multiply apples and oranges, that apples may be added only to apples, and whether you can exclusive-or apples and apples, are all at the discretion of the library programmer. Dimensioned types can ensure that data is correctly scaled (Fig. ).
● The Maybe type (instances are either Just x or Nothing), borrowed from Haskell, can be used as a generator of at most one item. A syntactic convenience allows it to be used in an if statement so as to bind a variable to the item in just the then clause (Fig. ).
● The Java programming language allows a method of the superclass to be called using the super keyword. A language with multiple inheritance requires a more general feature. Using a type assumption expression x asif T as an argument causes method dispatch to use the specified static
ASCII         rendered
z             z          normal variable (italic)
z_            z          roman
_z            z          bold
_z_           z          bold italic
ZZ            ℤ          blackboard
ZZ_           Z          calligraphic
_ZZ           Z          sans serif
z_bar         z¯         math accent
z_hat         zˆ         math accent
z_dot         z˙         math accent
z'            z′         math accent
_z_vec        ⃗z          math accent
z_max         zmax       roman subscript
z13           z13        numeric subscript
_z13_bar'     z¯′        combination of features
and           and        normal variable (italic)
And           And        type name (roman)
AND           ∧          operator name
OPLUS         ⊕          operator name as in TeX
SQCAP         ⊓          operator name as in TeX
<=            ≤          less than or equal to
>=            ≥          greater than or equal to
<-            ←          left arrow
[\ \]         ⟦ ⟧        white brackets
<| |>         ⟨ ⟩        angle brackets

Fortress (Sun HPCS Language). Fig. Examples of rendering ASCII identifiers as mathematical symbols
This ASCII Fortress source code includes Wiki markup in the comment: (* Compute the logical exclusive OR of two Boolean values. =Arguments ; ‘self‘ :: a first Boolean value ; ‘other‘ :: a second Boolean value =Returns > The exclusive OR of the two arguments. This is ‘true‘ if **either** argument is ‘true‘, but not if **both** arguments are ‘true‘. > | ||||| ‘other‘ ||| | |||||-------------------|| | ‘OPLUS‘ ||||| false | true || |=========================================|| | ‘self‘ || false ||| false | true || | || true ||| true | false || |-----------------------------------------|| *) opr OPLUS(self, other: Boolean): Boolean and it is rendered as follows: (* Compute the logical exclusive OR of two Boolean values. Arguments self
a first Boolean value
other
a second Boolean value
Returns
The exclusive OR of the two arguments. This is true if either argument is true, but not if both arguments are true.

                      other
    ⊕             false    true
self   false      false    true
       true       true     false
*) opr ⊕(self, other: Boolean): Boolean Fortress (Sun HPCS Language). Fig. Example of rendering Wiki-style markup in Fortress comments
type T for the argument expression x during method lookup. This general mechanism solves the multiple supertype problem as a special case (Fig. ).
● Fortress provides an object-oriented implicit coercion feature. It differs from a conventional method such as toString in that it is declared in the type converted to rather than in the type converted from, and
a + b > c + d          Correct: + has higher precedence than >
p > q ∧ r > s          Correct: > has higher precedence than ∧
w + x ∧ y + z          Incorrect: no defined precedence between + and ∧
(w + x) ∧ (y + z)      Correct: parentheses specify desired grouping
w + (x ∧ y) + z        Correct: parentheses specify desired grouping
Fortress (Sun HPCS Language). Fig. Operator precedence in Fortress is not transitive
(* Declarations of several physical dimensions and their default units *)
dim Mass default kilogram
dim Length default meter
dim Time default second
dim ElectricCurrent default ampere

(* Declarations of derived physical dimensions *)
dim Velocity = Length/Time
dim Acceleration = Length/Time²
dim Force = Mass⋅Acceleration
dim Energy = Force⋅Length

(* Declarations of basic units and their abbreviations *)
unit kilogram kg: Mass
unit meter m: Length
unit second s: Time
unit ampere A: ElectricCurrent

(* Declarations of derived units *)
unit newton N: Force = meter⋅kilogram/second²
unit joule J: Energy = newton meter

(* Declarations of derived units that require scaling factors *)
unit gram g: Mass = 10⁻³ kilogram
unit kilometer km: Length = 10³ meter
unit centimeter cm: Length = 10⁻² meter
unit nanosecond ns: Time = 10⁻⁹ second
unit inch inches in: Length = 2.54 cm
unit foot feet ft: Length = 12 inches
Fortress (Sun HPCS Language). Fig. Example declarations of physical dimensions and units
in that it is invoked implicitly by method dispatch if necessary. Among method overloadings, one that requires no coercion to be applicable is always considered more specific than one that requires one or more coercions. Coercion declarations may also
be marked as widening; automatic rewriting rules on parts of the expression tree that include widening coercions provide the effect of “widest need evaluation” [], but in a manner that is fully under user (or library) control.
(* Declarations in terms of dimensions implicitly use default units. *)
kineticEnergy(m: ℝ Mass, v: ℝ Velocity): ℝ Energy = (m v²)/2

(* Alternatively, declarations may specify units explicitly. *)
kineticEnergy(m: ℝ kg, v: ℝ m/s): ℝ J = (m v²)/2

do
  mySpeed = … feet per second
  (* The function kineticEnergy requires a speed measured in meters per
     second, so mySpeed by itself is not a valid argument. The in operator
     multiplies by the necessary scale factor, which is determined
     automatically from the declarations of the units. *)
  kineticEnergy(… kg, mySpeed in m/sec)
end
Fortress (Sun HPCS Language). Fig. Examples of the use of declared physical dimensions and units
(* Assume the position method returns a value of type Maybe⟦ℤ⟧. *)
if p ← myString.position(‘=’) then
  (* Here p is bound to a value of type ℤ. *)
  myString[ : p ]
else
  myString
end
Fortress (Sun HPCS Language). Fig. Use of a generator expression in an if statement
● Fortress uses where clauses to bind additional type parameters and to constrain relationships among static type parameters. In particular, the constraint may be a subtype relationship. This general mechanism suffices to express covariance and contravariance (Fig. ), which are important type relationships that require more specialized mechanisms in other languages.
● MPI allows a program to inquire how many hardware processors there are, but little else about the execution environment. High Performance Fortran provides alignment and distribution directives that describe how arrays should be laid out across multiple processors, but the processors must be organized as a flat Cartesian grid, and the class of distributions (such as block, cyclic, and block-cyclic) is fixed.
Fortress defines a hierarchical data structure (the region) that can describe hardware resources (both processors and memory) and relative communication costs, and can be queried by a running program. A Fortress distribution maps the parts of an aggregate data structure to one or more regions. The class of distributions is open ended and library programmers can define new distributions.
Influences from Other Programming Languages
Aspects of the Fortress type system, including inheritance and overloaded multimethod dispatch, were influenced by the ML, Haskell, Java, NextGen, Scala, CLOS, Smalltalk, and Cecil programming languages, among others. Inspirations for the use of mathematical syntax and a character set beyond ASCII include
trait FloorWax
  advertise(self) = print “Great shine!”
end
trait DessertTopping
  advertise(self) = print “Great taste!”
end
trait Shimmer extends { FloorWax, DessertTopping }
  advertise(self) = do
    (* Type assumption controls which “super-method” to call. *)
    advertise(self asif FloorWax)
    advertise(self asif DessertTopping)
  end
end
Fortress (Sun HPCS Language). Fig. Example use of asif expressions
(* List is covariant in its type parameter T. *)
trait List⟦T⟧ extends List⟦U⟧ where { T extends U }
  ...
end

(* OutputStream is contravariant in its type parameter E. *)
trait OutputStream⟦E⟧ extends OutputStream⟦G⟧ where { G extends E }
  ...
end
Fortress (Sun HPCS Language). Fig. Use of where clauses to express covariance and contravariance of type parameters
APL, MADCAP, MIRFAC, the Klerer-May System, and COLASL. The idea of having a “publication” (beautifully rendered) textual form as well as one or more “implementation” forms goes back to Algol . Notions of data encapsulation, comprehensions, generators, and reducers can be traced to CLU, Alphard, KRC, Miranda, and Haskell. In the area of functional programming, higher-order functions, and divide-and-conquer parallelism, many languages had an influence, but especially Lisp, Haskell, and APL. The approach to data distribution was strongly influenced by experience with High Performance Fortran and its predecessors.
Future Directions
The current Fortress compiler targets the Java Virtual Machine (JVM) and uses a custom class loader to instantiate generic types as they are needed. The JVM provides good multiplatform native code generation, threads, and garbage collection, and hosts libraries implementing both work-stealing and transactions. The Fortress type system is embedded into the Java type system where it is convenient, but tuple types and arrow types are handled specially. Fortress value types are eligible for compilation into an unboxed (primitive) representation. Fortress traits compile to Java interfaces, and Fortress objects compile to Java final classes.
A straightforward analysis of traits comprising object singletons can demonstrate finiteness, which permits the general application of unboxing to new types. A specialized virtual machine (VM) and run-time system may someday be desirable, but at this time the benefits of targeting the JVM far outweigh the overhead. The design of regions and distributions for managing the mapping of data and threads to hardware resources has been worked out on paper, but has not yet been implemented. The original design of Fortress includes a form of where clause that conditionally gates such type relationships as extension, comprising, and exclusion. For example, one might define a generic aggregate data structure A⟦E⟧ with a multiplication operator that is defined in terms of multiplication for the element type E, and one might well wish to declare that the multiplication of aggregates is commutative if and only if multiplication of the elements is commutative. This may be achieved by declaring trait A⟦E⟧ extends { Commutative⟦A⟦E⟧, ×⟧ where { E extends Commutative⟦E, ×⟧ } }
The implications of such conditional type relationships for the implementation of a practical type checker are still a subject of research. An object-oriented pattern-matching facility for easily binding multiple variables to the fields of an object, similar to the pattern-matching facilities in Haskell and Scala but with more provisions for user extensibility, has been sketched out but only recently implemented.
Related Entries
Chapel (Cray Inc. HPCS Language)
HPF (High Performance Fortran)
PGAS (Partitioned Global Address Space) Languages

Bibliography
. Allen E, Chase D, Flood C, Luchangco V, Maessen JW, Ryu S, Steele Jr GL () Project Fortress Community website. http://projectfortress.java.net
. Allen E, Chase D, Hallett J, Luchangco V, Maessen JW, Ryu S, Steele Jr GL, Hochstadt ST () The Fortress Language Specification, Version .. http://research.sun.com/projects/plrg/fortress.pdf, March. See also http://research.sun.com/projects/plrg/Publications/fortress.beta.pdf, March
. Allen E, Culpepper R, Nielsen JD, Rafkind J, Ryu S () Growing a syntax. In: ACM SIGPLAN Foundations of Object-Oriented Languages workshop, Savannah
. Bailey DH, Barszcz E, Barton JT, Browning DS, Carter RL, Fatoohi RA, Frederickson PO, Lasinski TA, Simon HD, Venkatakrishnan V, Weeratunga SK () The NAS parallel benchmarks. Int J Supercomput Appl ():–
. Corbett RP () Enhanced arithmetic for Fortran. ACM SIGPLAN Notices ():–
. High Productivity Computer Systems program of the Defense Advanced Research Projects Agency (United States Department of Defense). http://www.highproductivity.org/; see also http://www.darpa.mil/IPTO/programs/hpcs/hpcs.asp
. Steele Jr GL () Growing a Language. Invited talk. Abstract in OOPSLA Addendum: Addendum to the Proceedings of the Conference on Object-Oriented Programming, Systems, Languages, and Applications. ACM, New York. Transcript in Higher-Order and Symbolic Computation. Video at http://video.google.com/videoplay?docid=-#
. The Unicode Consortium () The Unicode Standard. Addison-Wesley, Boston
. Wiki Creole project and website. Sponsored by the Wiki Symposium. http://www.wikicreole.org/wiki/Creole

Forwarding
Routing (Including Deadlock Avoidance)
Fujitsu Vector Computers Kenichi Miura National Institute of Informatics, Tokyo, Japan
Synonyms Fujitsu vector processors; Fujitsu VPP systems; SIMD (single instruction, multiple data) machines
Definition
Fujitsu Vector Machines are a series of supercomputers developed by Fujitsu from to . They are based on a pipeline architecture, and the VPP systems also incorporated a high degree of parallelism with distributed memory in the system architecture, utilizing a full crossbar network for non-blocking system interconnect.
Discussion
FACOM – Array Processor Unit (APU)
The first vector machine in Japan was the FACOM – Array Processing Unit (APU), which was installed at the National Aerospace Laboratory (which later became part of the Japan Aerospace Exploration Agency (JAXA) in October ) in []. This system was organized as an asymmetric multiprocessor system consisting of the mainframe computer FACOM – and the Array Processor Unit. The clock was ns, and the peak performance of the APU was Mflop/s for addition/subtraction and Mflop/s for multiplication. The system was a pipeline-based array processor with scalar data registers and one vector register of , words ( bits/word) for vector operations. The capacity of the main memory was M words with -way interleave. AP-Fortran, an extended Fortran language, was developed for the – APU to allow the description of parallelism.
FACOM VP/ Series
The vector processors FACOM VP and FACOM VP [–] were announced in July . The major goals of the VP/ Series were to achieve very high performance, ease of use, and good affinity with the general-purpose computing environment of the FACOM M-Series mainframe computers. State-of-the-art semiconductor technology was incorporated in the system: ECL LSIs with a gate delay time of ps and , gates per chip (Fig. ), and high-speed RAM with an access time of . ns for the vector registers. As for the memory technology, a main memory unit with high capacity and high data transfer capability was realized by using Kbit SRAM with an access time of ns. Figure illustrates the architecture of a VP/ Series system. It consists of the scalar unit, vector unit,
Fujitsu Vector Computers. Fig. VP LSI packaging technology. (Photo Credit: Fujitsu Limited)
and main storage unit. The scalar unit fetches and decodes all the instructions and executes the scalar instructions. The vector unit consists of six functional pipeline units, vector registers, and mask registers. The functional pipes are an add/logical pipe, a multiply pipe, a divide pipe, a mask pipe, and two load/store ports, which are also pipelined. The first three pipes are for arithmetic operations. The load/store pipes support contiguous access, strided access, and indirect addressing (gather/scatter) for flexible vector operations. The mask pipes are used to handle conditional branches within loops. One of the unique features of the VP/ Series is the dynamically reconfigurable vector registers: the total capacity of the vector registers for the VP is K words ( bits each), but they can be reconfigured into different combinations of vector length and number of vector registers. The major features of the VP/ Series are summarized in Table . Figure shows the cabinet of the VP system. The first delivery of a VP-Series system took place in January , that is, a VP- at the Institute for Plasma Physics in Nagoya. Later, the high-end model FACOM VP-, with improved performance, was added, and a processing performance exceeding Gflop/s was achieved. Also, the market was expanded by adding lower-end models: the FACOM VP and FACOM VP . The VP-Series E Models, enhanced versions of the VP/ Series, were announced in July .
Fujitsu Vector Computers. Fig. VP architectural block diagram: the scalar unit (scalar execution unit, 64 KB cache, 16 general-purpose and 8 floating-point registers) and the 570 Mflop/s vector unit (1 KB of mask registers, 64 KB of reconfigurable vector registers, add/logical, multiply, and divide pipes, and two 2.3 GB/s load/store pipes) are attached to the main storage unit (max. 256 MB) and the channels
Fujitsu Vector Computers. Table VP Series specifications. The table compares four VP Series models, giving for each the announcement date, machine cycle time, maximum performance, total vector register size, basic vector length, total mask register capacity, the logical number and multiplicity of the vector pipes (add/logical, multiply, divide, mask, and load/store), the ECL circuit technology (gates per chip and gate delay), the memory technology (SRAM capacity and access time), and the main memory size
Fujitsu Vector Computers. Fig. VP system. (Photo Credit: Fujitsu Limited)
Fujitsu Vector Computers. Fig. VP2000 architectural block diagram: the vector unit contains mask registers, vector registers, two load/store pipes, two multiplication & addition/logical calculation pipes, and a division pipe; the scalar unit contains the scalar execution unit, buffer storage, and the general and floating registers; both connect to the main storage unit, the system storage unit, and the channel processor. (Model VP2000 provides separate pipelines for loading/storing and addition/subtraction/logic calculations; uni-processor models have one scalar unit.)
Fujitsu Vector Computers. Table VP2000 Series specifications. For each pair of VP2000 Series models the table lists the announcement date, maximum performance, the number of scalar units per vector unit, the total capacity of the vector and mask registers, the number and multiplicity of the vector pipes, the ECL circuit technology (gates and delay per gate), the register circuit technology (SRAM capacity and delay), the capacity and technology of main memory, the capacity of the system storage, the I/O capabilities, the number of channels, and the data rates of the block multiplexer and optical channels
There were five models: VP E, VP E, VP E, VP E, and VP E. These E models achieved performance . times more powerful than that of the previous VP/ Series machines by installing high-speed vector memory and an additional arithmetic pipeline for fused multiply-add operations.
FUJITSU VP Series
The VP Series [–] was announced in December as the successor to Fujitsu's VP-E Series supercomputers. Figure illustrates the architecture of VP Series systems. It carries over most of the VP/ architectural features, with various enhancements in the vector unit, such as two sets of fused multiply-add pipelines, doubled mask pipelines, and a faster division circuit. The most notable feature of the VP Series is that it can be configured with one or two scalar units sharing one vector unit. This feature, called Dual Scalar Processors (DSP), may be regarded as a two-threaded system. The objective of DSP is to utilize the vector unit more efficiently when the vectorization ratio of users' application programs is not very high []. The key features of the VP Series system are summarized in Table .
Numerical Wind Tunnel and VPP-Series
Numerical Wind Tunnel
The National Aerospace Laboratory (NAL) of Japan developed the Numerical Wind Tunnel (NWT) [] parallel supercomputer system jointly with Fujitsu. The system architecture was a distributed-memory vector-parallel computer, in which vector supercomputers
Fujitsu Vector Computers. Fig. NWT system. (Photo Credit: JAXA)
were connected with the non-blocking crossbar network. The NWT's final specifications were processing elements, each with . Gflop/s of processing power, to give a peak performance of Gflop/s with . GB of total main memory. Each NWT processing element consisted of a CPU and memory. The CPU was a vector supercomputer by itself, comprising three types of LSIs: GaAs, ECL, and BiCMOS. It was one of the very few commercially successful GaAs-based systems. The primary cooling method was water cooling, but forced air cooling was also used for the memory units. Figure shows the NWT as installed at NAL. The NWT, which went into operation in January , was rated as the most powerful supercomputer site in the world in the TOP500 list from to . It should be noted that the NWT was the technology precursor to the VPP, which will be described in more detail in the next section.
FUJITSU VPP
The VPP was Fujitsu's third-generation supercomputer system [, –]. As stated in the previous section, it was a commercial version of the NWT. Fujitsu announced the VPP in September as the world's most powerful supercomputer at that point in
time, with a speed of Gflop/s. To obtain this performance Fujitsu developed processor elements (PEs), each capable of vector processing and able to operate in a highly parallel fashion. The maximum configuration, with processing elements, reached Gflop/s. As shown in Fig. , each processing element consisted of a scalar unit, a vector unit, a main storage unit, and a Data Transfer Unit (DTU), built with Gallium Arsenide LSIs, Bipolar LSIs, BiCMOS LSIs, and ultrahigh-density circuit boards, and was able to deliver a maximum vector performance of . Gflop/s. The scalar unit used a VLIW architecture and could execute up to four operations in parallel. In order to achieve very high system performance, the PEs were connected with a high-speed crossbar network which allows non-blocking data transfer. The data transfer rate was MB/s from one PE to another. Since a PE can simultaneously send and receive data, the aggregate data transfer rate per PE was MB/s. Figure illustrates the system configuration of the VPP, based around the crossbar network. In order to fully utilize the powerful capability of the interconnecting network, the Data Transfer Unit (DTU) was incorporated, which allows mapping between local and global addresses, very fast synchronization across PEs, and actual data transfer with a proprietary message-passing protocol. Various
Fujitsu Vector Computers. Fig. VPP processing-element architectural block diagram: a Data Transfer Unit (DTU) connects each element to the network; the vector unit contains mask registers, vector registers, and multiply, add/logical, divide, load, and store pipes; the scalar unit contains the scalar execution unit, cache, and the general-purpose and floating-point registers; all share the main storage unit
Fujitsu Vector Computers. Fig. VPP system configuration: processing elements PE0 … PEn, each containing a vector unit (VU), a scalar unit (SU), memory (MEM), and a data transfer unit (DTU), together with I/O processors (IOP), are all connected by the crossbar interconnect
data transfer modes were supported by the DTU. They were: contiguous, constant-strided, sub-array, and indirect addressing (random gather/scatter). Figure shows the inside of one cabinet of VPP, which contains PEs. Plumbing for water cooling is also visible in this photo. The specifications of the VPP system and the follow-on products are summarized in Table .
New Parallel Language
Fujitsu developed a new parallel language, called VPP Fortran, which can define a single name space across the distributed memory units of the VPP. The ability to define array data in Fortran programs, distributed across multiple processing elements as a single global space, greatly simplified the porting and tuning of existing codes to the VPP. VPP Fortran may be regarded as an early implementation of PGAS (Partitioned Global Address Space). More details are described in [].
VPP/VPP/VPP Series
After the great success of the VPP, Fujitsu developed CMOS versions of the vector-parallel system. The Fujitsu VPP Series and the VX Series (the single-CPU model), which used CMOS (complementary metal-oxide semiconductor) technology, were introduced in February , the Fujitsu VPP Series in March , and the VPP Series in April . All VPP Series supercomputers after the VPP Series used CMOS technology. The packaging technology of the VPP is shown in Fig. . The VPP was the successor to the former VPP/VPPE systems (with E for extended, i.e., a clock cycle of . ns instead of ns). The clock cycle has been halved, and the vector pipes are able to deliver fused multiply-add results. With a multiplicity of for these vector pipes, floating-point results per clock cycle can be generated. In this way, a fourfold increase in speed per processor can be attained with respect to the VPPE. The architecture of the VPP nodes is almost identical to that of the VPP, but one of the major enhancements is in the performance of noncontiguous memory access, by incorporating more address-matching circuits.
Fujitsu Vector Computers. Fig. VPP cabinet. (Photo Credit: Fujitsu Limited)
Conclusion
While vector processing with very sophisticated hardware pipelines together with a highly interleaved memory subsystem was the dominant supercomputer architecture in the s and s, the wave of highly (or massively) parallel scalar systems has been gaining
Fujitsu Vector Computers. Table VPP Series specifications (typical models). For three representative models the table lists the announcement date, circuit technology (GaAs/ECL/Bipolar for the first-generation VPP, CMOS for the later models), level of integration, delay time per gate, cooling method (liquid/water cooling for the first model, forced air for the CMOS models), clock frequency, maximum performance per PE, memory capacity per PE, memory technology (SRAM, later SDRAM and SSRAM), I/O capability per PE, maximum configuration in PEs, maximum system performance, and CPU architecture word width
Fujitsu Vector Computers. Fig. VPP PE/Memory board. (Photo Credit: Fujitsu Limited)
more popularity in this century. The major reasons may be stated as follows:
1. Pipelining has become a common technique even in scalar microprocessor design.
2. Larger market opportunities have arisen with the advent of high-volume production of powerful scalar microprocessors.
3. The highly interleaved memory structure of the vector architecture has become a more expensive implementation, compared with the cache-based approach of the scalar architecture, for coping with the widening speed gap between CPU and memory.
4. More and more application programs have been tailored toward scalar architecture.
Thus, Fujitsu decided to change direction from vector architecture to highly parallel scalar Symmetric Multiprocessor (SMP) systems and their constellations. The VPP system was Fujitsu's last vector product. But it should also be pointed out that the recent trends in SIMD instructions in general-purpose microprocessors and the popularity of GPGPU (General-Purpose Graphics Processing Unit) computing could be regarded as "the return of vectors," in the sense that the
programming models and compiler technology developed for the vector architecture continue to be vital technologies in the new approaches.
Related Entries
PGAS (Partitioned Global Address Space) Languages
TOP500
VLIW Processors
Bibliography . Uchida K, Seta Y, Tanakura Y () The FACOM - Array Processor System. In: Proc rd USA-JAPAN Computer Conference, San Francisco, CA. AFIPS Press, Montvale, NJ, pp – . Miura K, Uchida K () FACOM vector processor VP-/VP. In: Kowalik JS (ed) High-speed computation, NATO ASI Series F: computer and systems sciences, vol . Springer, Berlin, pp – . Miura K () Fujitsu’s supercomputer: FACOM vector processor system. In: Fernbach S (ed) Supercomputers: class VI systems, hardware and software. North-Holland, Amsterdam, pp – . Matsuura T, Miura K, Makino M () Supervector performance without toil: Fortran implemented vector algorithms on the VP/. Comput Phys Commun :–
. Miura K () Vectorization and parallelization of transport Monte Carlo simulation codes. In: NATO ASI Series vol F, pp – . Uchida N et al () Fujitsu VP Series. In: Proceedings of COMPCON Spring’, San Francisco, CA. IEEE Computer Society Press, Washington, DC, pp – . Miura K, Nagakura H, Tamura H () VP Series dual scalar and quadruple scalar models supercomputer systems – A new concept in vector processing. In: Proceedings of COMPCON Spring’, San Francisco, CA. IEEE Computer Society Press, Washington, DC, pp – . Hwang K () Section .. Fujitsu VP and VPP. In: Advanced computer architecture: parallelism, scalability, programmability. McGraw-Hill, Inc., New York, pp – . Miyoshi H et al () Development and achievement of NAL numerical wind tunnel (NWT) for CFD computations. In: Proceedings of ACM/IEEE conference on supercomputing (SC’), Washington, DC. IEEE Computer Society Press (), Washington, DC, pp – . Utsumi T, Ikeda M, Takamura M () Architecture of the VPP parallel supercomputer. In: Proceedings of supercomputing’, Washington, DC. IEEE Computer Society Press, Washington, DC, pp – . Nakashima Y, Kitamura T, Tamura H, Takiuchi M, Miura K () The scalar processor of the VPP parallel supercomputer. In: Proceedings of the th ACM international conference on supercomputing (ICS’), Manchester, England. ACM, New York, pp – . Nakanishi M, Ina H, Miura K () A high performance linear equation solver on the VPP parallel supercomputer. In: Proceedings of ACM/IEEE conference on supercomputing (SC’), Washington, DC. IEEE Computer Society Press, Washington DC, pp – . Iwashita H et al () VPP Fortran and parallel programming on the VPP supercomputer. In: Proceedings of international symposium on parallel architectures, algorithms and networks (ISPAN’), Kanazawa, Japan, – Dec , pp – . Nodomi A, Ikeda M, Takamura M, Miura K () Hardware performance of the VPP parallel supercomputer. In: Dongarra J, Grandinetti L, Joubert G, Kowalik J (eds) High performance computing: technology, method and applications, vol . Advances in Parallel Computing. Elsevier, Amsterdam, pp –
Fujitsu Vector Processors
Fujitsu Vector Computers

Fujitsu VPP Systems
Fujitsu Vector Computers

Functional Decomposition
Domain Decomposition
Functional Languages
Philip Trinder, Hans-Wolfgang Loidl, Kevin Hammond
Heriot-Watt University, Edinburgh, UK
University of St. Andrews, St. Andrews, UK
Definition Parallel functional languages are parallel variants of functional languages, that is, languages that treat computation as the evaluation of mathematical functions and avoid state and mutable data. Functional languages are founded on the lambda calculus. The majority of parallel functional languages add a small number of high-level parallel coordination constructs to some functional language, although some introduce parallelism without changing the language.
Discussion
Introduction
The potential of functional languages for parallelism has been recognized for over years, for example []. The key advantage of functional languages is that they mainly, or solely, contain stateless computations. Subject to data dependencies between the computations, the evaluation of stateless computations can be arbitrarily reordered or interleaved while preserving the sequential semantics. A parallel program must not only specify the computation, that is, a correct and efficient algorithm; it must also specify the coordination, for example, how the program is partitioned, or how parts of the program are placed on processors. As functional languages provide high-level constructs for specifying computation, for example, higher-order functions and sophisticated type systems, they typically provide correspondingly high-level coordination sublanguages. A wide range of parallel paradigms and constructs have been used,
for example, data-parallelism [], and skeleton-based parallelism []. As with computation, the great advantage of high-level coordination is that it frees the programmer from specifying low-level coordination details. Where low-level coordination constructs encourage programmers to construct static, simple, or regular coordination, higher-level constructs encourage more dynamic and irregular coordination. The challenges of high-level coordination are that automatic coordination management complicates the operational semantics, makes the performance of programs opaque, requires a sophisticated language implementation, and is frequently less effective than hand-crafted coordination. Despite these challenges, many new nonfunctional parallel programming languages, including Fortress [], Chapel [], and X10 [], also support high-level coordination.
Strict and Non-strict Parallel Functional Languages Most programming languages strictly evaluate the arguments to a function before evaluating the function. However, as purely functional languages have significant freedom of evaluation order, some use non-strict or lazy evaluation. Here the arguments to a function are evaluated only if and when they are demanded by the function body. In consequence more programs terminate, it is easy to program with infinite structures, and programming concerns are separated []. Like most languages, many parallel functional languages are strict, for example, NESL [] or Manticore []. However there are many parallel variants of lazy
languages like Haskell [] and Clean []. At first glance this is surprising as lazy evaluation is sequential and performs minimum work, but non-strict languages have a number of advantages for parallelism []. They can be easily married to many different coordination languages as the execution order of expressions is immaterial; they naturally support highly dynamic coordination where evaluation is performed and data communicated on demand; their implementations already have many of the mechanisms required for parallel execution. Laziness also exists in the language implementations, for example, distributing work and data on demand [].
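For instance (a plain GHC Haskell illustration, not tied to any of the languages cited), lazy evaluation allows an infinite structure to be defined once and only the demanded prefix to be computed:

```haskell
-- A lazy, infinite list of Fibonacci numbers: only the prefix that is
-- actually demanded (here the first ten elements) is ever computed.
fibs :: [Integer]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

main :: IO ()
main = print (take 10 fibs)   -- [0,1,1,2,3,5,8,13,21,34]
```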
Paradigms
Parallel functional languages have adopted a range of paradigms. For the main parallel paradigms this section outlines the correspondence with the functional paradigm, and describes a representative language. Some languages support more than one paradigm, for example, Manticore [] combines data and task parallelism, and Eden [] combines semi-explicit and skeleton parallelism.
Skeleton-Based Languages
Algorithmic skeletons [, ] are a popular parallel coordination construct and, as higher-order functions, fit naturally into the functional model. Often these higher-order functions work over compound data structures like lists or vectors, and consequently the resulting parallel code often resembles data parallel code, as discussed in the next section. Example skeleton-based functional languages include SCL [], PL [], Eden [], PSML [], and HDC. HDC [] is a strictly evaluated subset of the Haskell language with skeleton-based coordination. HDC programs are compiled using a set of skeletons for common higher-order functions, like fold and map, and several forms of divide-and-conquer. The language supports two divide-and-conquer skeletons and a parallel map, and the system relies on the use of these higher-order functions to generate parallel code.
Data Parallel Languages
Data parallel languages [] focus on the efficient implementation of the parallel evaluation of every element
in a collection. Functional languages fit well with the data parallel paradigm as they provide powerful operations on collection types, and in particular lists. Indeed, all of the languages discussed here use some parallel extension of list comprehensions and implicitly parallel higher-order functions such as map. Compared to other approaches to parallelism, the data parallel approach makes it easier to develop good cost models, although it is notoriously difficult to develop cost models for languages with a non-strict semantics. Example data parallel functional languages include NESL [] and Data Parallel Haskell []. NESL is a strict, strongly typed language with implicit parallelism and implicit thread interaction. It has been implemented on a range of parallel architectures, including several vector computers. A wide range of algorithms have been parallelized in NESL, including a Delaunay algorithm for triangulation [], several algorithms for the n-body problem [], and several graph algorithms.
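As a concrete sketch of the skeleton and data-parallel styles just described, a parallel map can be written as an ordinary higher-order function. The code below uses GHC Haskell with the Control.Parallel.Strategies library (from the parallel package) rather than any of the languages cited, and the name parallelMap is purely illustrative:

```haskell
import Control.Parallel.Strategies (parList, rdeepseq, using)
import Control.DeepSeq (NFData)

-- A map "skeleton": apply f to every element and evaluate the results
-- in parallel. The strategy `parList rdeepseq` sparks one parallel
-- evaluation per list element and fully evaluates each result.
parallelMap :: NFData b => (a -> b) -> [a] -> [b]
parallelMap f xs = map f xs `using` parList rdeepseq

main :: IO ()
main = print (sum (parallelMap (\n -> n * n) [1 .. 100000 :: Integer]))
```

Compiled with GHC's -threaded option and run with +RTS -N, the element evaluations are spread over the available cores; the computation itself is unchanged.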
Semi-explicit Languages
Semi-explicit parallel languages provide a few high-level constructs for controlling key coordination aspects, while automatically managing most coordination aspects statically or dynamically. Historically, annotations were commonly used for semi-explicit coordination, but more recent languages provide compositional language constructs. As a result, the distinction between semi-explicit coordination and coordination languages is now rather blurred, but the key difference in the approach is that semi-explicit languages aim for minimal explicit coordination and minimal change to language semantics. Example semi-explicit parallel functional languages include Eden [] and GpH []. GpH is a small extension of Haskell with a parallel (par) composition primitive. par returns its second argument and indicates that the first argument may be executed in parallel. In this model the programmer only has to expose expressions in the program that can usefully be evaluated in parallel. The runtime system manages the details of the parallel execution such as thread creation, communication, etc. Experience in implementing large programs in GpH shows that the unstructured use of par and seq operators often leads to
obscure programs. This problem is overcome by evaluation strategies: lazy, polymorphic, higher-order functions controlling the evaluation degree and the parallelism of a Haskell expression. They provide a clean separation between coordination and computation. The driving philosophy behind evaluation strategies is that it should be possible to understand the computation specified by a function without considering its coordination.
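A minimal GpH-style fragment, written against GHC's Control.Parallel module (the naive Fibonacci example is only an illustration, not code from the GpH literature):

```haskell
import Control.Parallel (par, pseq)

-- Semi-explicit parallelism: `par` sparks the evaluation of nf1 on
-- another core, while `pseq` makes the current thread evaluate nf2
-- first, so the two recursive calls can proceed at the same time.
pfib :: Int -> Integer
pfib n
  | n < 2     = fromIntegral n
  | otherwise = nf1 `par` (nf2 `pseq` (nf1 + nf2))
  where
    nf1 = pfib (n - 1)
    nf2 = pfib (n - 2)

main :: IO ()
main = print (pfib 30)
```

In practice a GpH program would bound the depth at which sparks are created (granularity control) and would usually phrase the coordination with evaluation strategies rather than raw par and pseq.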
Coordination Languages
Parallel coordination languages [] are separate from the computation language and thereby provide a clean distinction between coordination and computation. Historically, Linda [] and PCN [] have been the most influential coordination languages, and often a coordination language can be combined with many different computation languages, typically Fortran or C. Other systems such as SCL [] and PL [] focus on a skeleton approach for introducing parallelism and employ sophisticated compilation technology to achieve good resource management. Example coordination languages using functional computation languages include Haskell-Linda [] and Caliban [, ]. Caliban has constructs for explicit partitioning of the computation into threads, and for assigning threads to (abstract) processors in a static process network. Communication between processors works on streams, that is, eagerly evaluated lists, similar to Eden.
Dataflow Languages
Historically a number of functional languages adopted the dataflow paradigm and some, like SISAL [], achieved very impressive performance. Example dataflow functional languages include SISAL [] and the pHluid system []. SISAL [] is a first-order, strict functional language with implicit parallelism and implicit thread interaction. Its implementation is based on a dataflow model and it has been ported to a range of parallel architectures. Comparisons of SISAL code with parallel Fortran code show that its performance is competitive with Fortran, without adding the additional complexity of explicit coordination [].
Explicit Languages
In contrast to the high-level coordination found in the majority of parallel functional languages, a number of languages support low-level explicit coordination. For example, there are several bindings for explicit message passing libraries, such as PVM [] and MPI [], for languages like Haskell and OCaml [, ]. These languages use an open system model of explicit parallelism with explicit thread interaction. Since the coordination language is basically a stateful (imperative) language, stateful code is required at the coordination level. Although the high availability and portability of these systems are appealing, the language models suffer from the rigid separation between the stateful and purely functional levels. Other functional languages support explicit coordination, for example, Erlang []. Erlang is probably the most commercially successful functional language, and was developed in the telecommunications industry for constructing distributed, real-time, fault-tolerant systems. Erlang is strict, impure, weakly typed, and relatively simple, omitting features such as currying and higher-order functions. However, the language has a number of extremely useful features, including the OTP libraries, hot loading of new code into running applications, explicit time manipulation to support soft real-time systems, message authentication, and sophisticated fault tolerance.
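The flavour of explicit coordination can be sketched in Haskell with forkIO and channels from Control.Concurrent; this is neither Erlang nor one of the MPI bindings mentioned above, merely an illustration of explicit thread creation and explicit message passing:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import Control.Monad (forM_, replicateM)

-- Explicit coordination: the programmer creates the worker thread and
-- sends and receives every message by hand.
main :: IO ()
main = do
  requests  <- newChan
  responses <- newChan
  _ <- forkIO $ forM_ [1 .. 5 :: Int] $ \_ -> do
         n <- readChan requests
         writeChan responses (n * n)           -- reply to each request
  forM_ [1 .. 5 :: Int] (writeChan requests)
  print =<< replicateM 5 (readChan responses)  -- [1,4,9,16,25]
```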
Tools and Methodologies
Development methodologies for parallel functional programs exploit the amenability of the languages to derivation and analysis. For example, a programmer can reason about costs during program design using abstract cost models like BMF-PRAM []. Guided by the cost information, the program design can relatively easily be transformed to reduce resource consumption before implementation, for example, []. Similarly, automated static resource analysis can provide information to improve coordination; for example, predicted task execution time can improve scheduling. Parallel functional languages favour high-level and often dynamic coordination in contrast to detailed static coordination. As a consequence the parallel behaviour of programs is often far from obvious, and hence hard to tune. Hence suites of profiling and visualization tools are very important for many parallel functional languages, for example [, , ]. Implicit parallelism, often promised in the context of functional languages, offers the enticing vision of parallel execution without changes to the program. In reality, however, the program must be designed with parallelism in mind to avoid unnecessary sequentiality. In theory, program analyses such as granularity, sharing, and usage analysis can be used to automatically generate parallelism. In practice, however, almost all current systems rely on some level of programmer control. Current development methodologies, like [], have several interesting features. The combination of languages with high-level coordination and good profiling tools facilitates the prototyping of alternative parallelisations. Obtaining good coordination at an early stage of parallel software development avoids expensive redesigns. In later development stages, detailed control over small but crucial parts of the program may be required, and profiling tools can help locate expensive parallel computations. During performance tuning the high level of abstraction may obscure key low-level features that could be usefully controlled by the programmer.
Parallel functional languages are used in a range of domains including numeric, symbolic, and data intensive []. Commercially Erlang has been enormously successful for developing substantial telecoms and banking applications [] and parallel Haskell for providing standardized parallel computational algebra systems. The high-level coordination and computation constructs in parallel functional languages have been very influential. For example, algorithmic skeletons appear in MPI [] and are the interface for cloud parallel search engines like Google MapReduce. Similarly, stateless threads or comprehensions occur in modern parallel languages like Fortress [], Chapel [], or X []. Moreover the dominance of multicore and emergence
of many-core architectures has focused attention on the importance of stateless computation. Some trends are as follows. As clock speeds plateau but register widths increase, data parallelism becomes attractive for functional, and other, parallel languages. In the functional community there are often multiple parallelizations of a single base language, for example, GpH and Eden are both Haskell variants; moreover, a single compiler often supports more than one parallel version, for example, the Glasgow Haskell Compiler has four experimental multicore implementations [].
Related Entries
Data Flow Graphs
Data Flow Computer Architecture
MPI (Message Passing Interface)
Parallel Skeletons
Profiling
Languages
Chapel (Cray Inc. HPCS Language)
Concurrent ML
Eden
Fortress (Sun HPCS Language)
Glasgow Parallel Haskell
Multilisp
NESL
Bibliographic Notes and Further Reading
A sound introduction to parallel functional language concepts and research can be found in []. A comprehensive survey of parallel and distributed languages based on Haskell is available in []. A recent series of workshops studies declarative aspects of multicore programming, for example [].
Bibliography . Allen E, Chase D, Flood C, Luchangco V, Maessen J-W, Steele S, Ryu GL () Project fortress: a multicore language for multicore processors. Linux Magazine, September . Armstrong J () Programming Erlang: software for a concurrent world. The Pragmatic Bookshelf, Raleigh, NC
. Bacci B, Danelutto M, Orlando S, Pelagatti S, Vanneschi M () P L: a structured high level programming language and its structured support. Concurrency Practice Exp ():– . Berthold J, Loogen R () Visualizing parallel functional program runs – case studies with the eden trace viewer. In: ParCo’: Proceedings of the parallel computing: architectures, algorithms and applications, Jülich, Germany . Blelloch GE () Programming parallel algorithms. Commun ACM ():– . Breitinger S, Loogen R, Priebe S () Parallel programming with Haskell and MPI. In: IFL’ – International Workshop on the implementation of functional languages, University College London, UK, September . Draft proceedings, pp – . Blelloch GE, Miller GL, Talmor D () Developing a practical projection-based parallel Delaunay algorithm. In: Symposium on Computational Geometry. ACM, Philadelphia, PA . Blelloch GE, Narlikar G () A practical comparison of N-body algorithms. In: Parallel Algorithms, volume of Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, University of Tennessee, Knoxville, TN . Cann D () Retire Fortran? A debate rekindled. Commun ACM ():– . Chamberlain BL, Callahan D, Zima HP () Parallel programmability and the chapel language. Int J High Perform Comput Appl ():– . Charles P, Donawa C, Ebcioglu K, Grothoff C, Kielstra A, von Praun C, Saraswat V, Sarkar V () X: An object-oriented approach to non-uniform cluster computing. In: OOPSLA’. ACM Press, New York . Carriero N, Gelernter D () How to write parallel programs: a guide to the perplexed. ACM Comput Surv ():– . Chakravarty MMT (ed) () ACM SIGPLAN Workshop on Declarative Aspects of Multicore Programming (DAMP’). ACM Press, New York . Cole M () Algorithmic skeletons. In: Hammond K, Michaelson G (eds) Research directions in parallel functional programming. Springer-Verlag, New York, pp – . Darlington J, Guo Y, To HW () Structured parallel programming: theory meets practice. In Milner R, Wand I (eds) Research Directions in Computer Science. Cambridge University Press, Cambridge, MA . Ellmenreich N, Lengauer C () Costing stepwise refinements of parallel programs. Comp Lang Syst Struct (–):– (special issue on semantics and cost models for high-level parallel programming) . Flanagan C, Nikhil RS () pHluid: the design of a parallel functional language implementation on workstations. In: ICFP’ – International Conference on functional programming, ACM Press, Philadelphia, PA, pp – . Foster I, Olson R, Tuecke S () Productive parallel programming: The PCN approach. J Scientific Program ():– . Fluet M, Rainey M, Reppy J, Shaw A () Implicitly-threaded parallelism in Manticore. In: ICFP’ – International Conference on functional programming, ACM Press, Victoria, BC
. Glasgow Haskell Compiler. WWW page, . http://www.haskell.org/ghc/ . Herrmann C, Lengauer C () HDC: a higher-order language for divide-and-conquer. Parallel Processing Lett (–): – . Hammond K, Michaelson G () Research directions in parallel functional programming. Springer-Verlag, New York . Hughes J () Why functional programming matters. Computer J ():– . Jones Jr D, Marlow S, Singh S () Parallel performance tuning for haskell. In: Haskell ’: Proceedings of the nd ACM SIGPLAN symposium on Haskell. ACM Press, New York . Kelly PHJ () Functional programming for loosely-coupled multiprocessors. Research monographs in parallel and distributed computing. MIT Press, Cambridge, MA . Kelly P, Taylor F () Coordination languages. In: Hammond K, Michaelson G (eds) Research directions in parallel functional programming. Springer-Verlag, New York, pp – . Sisal Performance Data. WWW page, January . http://www.llnl.gov/sisal/PerformanceData. . Loogen R, Ortega-Mallen Y, Pena R () Parallel functional programming in Eden. J Funct Program ():– . MPI-: Extensions to the message-passing interface. Technical report, University of Tennessee, Knoxville, TN, July . Chakravarty MMT, Leshchinskiy R, Jones SP, Keller G () Partial vectorisation of haskell programs. In DAMP’: ACM SIGPLAN workshop on declarative aspects of multicore programming, San Francisco, CA . Michaelson G, Horiguchi S, Scaife N, Bristow P () A parallel SML compiler based on algorithmic skeletons. J Funct Program ():– . Nöcker EGJMH, Smetsers JEW, van Eekelen MCJD, Plasmeijer MJ () Concurrent clean. In: PARLE’ — parallel architectures and languages Europe, LNCS . Springer-Verlag, Veldhoven, the Netherlands, pp – . O’Donnell J () Data parallelism. In: Hammond K, Michaelson G (eds) Research directions in parallel functional programming. Springer-Verlag, New York, pp – . Jones SLP, Hughes J, Augustsson L, Barton D, Boutel B, Burton W, Fasel J, Hammond K, Hinze R, Hudak P, Johnsson T, Jones M, Launchbury J, Meijer E, Peterson J, Reid A, Runciman C, Wadler P () Haskell : a non-strict, purely functional language. Electronic document available on-line at http://www.haskell.org/, February . Peterson J, Trifonov V, Serjantov A () Parallel functional reactive programming. In: PADL’ – practical aspects of declarative languages, LNCS . Springer-Verlag, New York, pp – . Parallel Virtual Machine Reference Manual. University of Tennessee, August . Rabhi FA, Gorlatch S (eds) () Patterns and skeletons for parallel and distributed computing. Springer-Verlag, New York, . Skillicorn DB, Cai W () A cost calculus for parallel functional programming. J Parallel Distrib Comput ():–
. Taylor FS () Parallel functional programming by partitioning. PhD thesis, Department of Computing, Imperial College, London . Trinder PW, Hammond K, Loidl H-W, Jones SLP () Algorithm + Strategy = Parallelism. J Funct Program (): – . Trinder PW, Hammond K, Mattson Jr JS, Partridge AS, Jones SLP () GUM: a portable implementation of Haskell. In: PLDI’ — programming language design and implementation, Philadephia, PA, May . Trinder PW, Loidl H-W, Pointon RF () Parallel and distributed Haskells. JFP, . Special issue on Haskell . Weber M () hMPI — Haskell with MPI. WWW page, July . http://www-i.informatik.rwth-aachen.de/~michaelw/ hmpi.html . Wegner P () Programming languages, information structures and machine organisation. McGraw-Hill, New York
Futures Cormac Flanagan University of California at Santa Cruz, Santa Cruz, CA, USA
Synonyms Eventual values; Promises
Definition The programming language construct “future E” indicates that the evaluation of the expression “E” may proceed in parallel with the evaluation of the rest of the program. It is typically implemented by immediately returning a future object that is a proxy for the eventual value of “E.” Attempts to access the value of an undetermined future object block until that value is determined.
Discussion
Introduction
The construct "future E" permits the run-time system to evaluate the expression "E" in parallel with the rest of the program. To initiate parallel evaluation, the runtime system forks a parallel task that evaluates "E," and also creates an object known as a future object that will eventually contain the result of that computation of "E." The future object is initially undetermined, and becomes
determined or resolved once the parallel task evaluates “E” and updates (or resolves/fulfils/binds) the future object to contain the result of that computation. Parallel evaluation arises because the future object is immediately returned as the result of evaluating “future E,” without waiting for the future object to become determined. Thus, the continuation (or enclosing context) of “future E” executes in parallel with the evaluation of “E” by the forked task. The result of performing a primitive operation on a future object depends on whether that operation is considered strict or not. Some program operations, such as assignments, initializing a data structure, passing an argument to a procedure, returning a result from a procedure, are non-strict. These operations can manipulate (determined or undetermined) future objects in an entirely transparent fashion, and do not need to know the underlying value of that future object. In contrast, strict program operations, such as addition, need to know the value of their arguments, and so cannot be directly applied to future objects. Instead, the arguments to these operations must first be converted to a regular (non-future) value, by forcing or touching that argument. Forcing a future object that has been determined extracts its underlying value (which may in turn be a future and so the force operation must operate recursively). Forcing an undetermined future object blocks until that future is determined, and so force operations introduce synchronization between parallel tasks.
Implicit Versus Explicit Futures
A central design choice in a language with futures is whether forcing operations should be explicit in the source code, or whether they should be performed implicitly by strict operations, such as addition, that need to know the values of their arguments. Explicit futures can be implemented as a library that introduces an additional data type; for example, a polymorphic type Future can serve as a proxy for an eventual value that is accessed through an explicit force operation. Implicit futures, in contrast, require support from the language implementation and introduce additional overheads, since every strict primitive operation must now check whether its operands are futures. As an optimization, a static analysis can be used to translate from a language with implicit futures into one with explicit futures. Essentially, the translation introduces explicit touch operations on those primitive operations that, according to a static analysis, may be applied to futures. In the context of the Gambit compiler, this optimization reduces the overhead of implicit futures from roughly % to less than % []. As an orthogonal optimization, a copying garbage collector could replace a determined future object by its underlying value, saving space and reducing the cost of subsequent touch operations.
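A minimal sketch of such an explicit-futures library, in Haskell, using GHC's forkIO and MVar primitives; the names Future, future, and force are hypothetical, chosen to mirror the terminology of this entry, and the sketch ignores exceptions and assumes the forked expression is pure:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, readMVar)

-- A future object: a proxy for the value that the forked task will
-- eventually write into the MVar.
newtype Future a = Future (MVar a)

-- "future E": fork a task to evaluate the expression and return the
-- (initially undetermined) future object immediately.
future :: a -> IO (Future a)
future expr = do
  cell <- newEmptyMVar
  _ <- forkIO (putMVar cell $! expr)   -- resolve the future when done
  return (Future cell)

-- "force"/"touch": block until the future is determined, then return
-- its value. readMVar leaves the value in place, so a future may be
-- forced any number of times.
force :: Future a -> IO a
force (Future cell) = readMVar cell

main :: IO ()
main = do
  f <- future (sum [1 .. 1000000 :: Integer])   -- evaluated in parallel
  -- ... the continuation may do other work here ...
  print =<< force f
```

Note that $! only forces the expression to weak head normal form; a fuller library would let the caller choose the evaluation degree.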
Task Scheduling
Future annotations are often used to introduce parallel evaluation into recursive, divide-and-conquer style computations such as quicksort. In this situation, there may be exponentially more tasks than processors, and consequently task scheduling has a significant impact on overall performance. A naïve choice of round-robin scheduling would interleave the executions of all available tasks and result in an essentially breadth-first exploration of the computation tree, which may require exponentially more memory than an equivalent sequential execution. Unfair scheduling policies significantly reduce this memory overhead, by better matching the number of concurrently active tasks to the number of processors. In the Multilisp implementation of this technique [], each task has two possible states: active and pending, and each processor maintains a LIFO queue of pending tasks. When evaluating "future E," the newly forked task evaluating "E" is marked active, while the parent task is moved to the pending queue. In the absence of idle processors, the parent task remains in the pending queue until the task evaluating "E" terminates, resulting in the same evaluation order (and memory footprint) as sequential execution, and the pending task queue functions much like the control stack of a sequential execution. The benefit of the pending task queue is that if another processor becomes idle, it can steal and start evaluating one of these pending tasks. In this manner, the number of active tasks in the system is roughly
dynamically balanced with the number of available processors. Each processor typically has just one active task. Once that task terminates, it removes and makes active the next task in its pending queue. If the processor’s pending task queue becomes empty, it tries to steal and activate a pending task from one of the other processors in the system. An active task may block by trying to force an undetermined future object. In this situation, the task is typically put on a waiting list associated with that future. Once the future is later determined, all tasks on the waiting list are marked as pending, and will become active once an idle processor is available.
Lazy Task Creation
Lazy task creation [, ] provides a mechanism for reducing task creation overhead. Under this mechanism, the evaluation of "future E" does not immediately create a new task. Instead, the expression "E" is evaluated as normal, except that the control stack is marked to identify that the continuation of this future expression is implicitly a pending task. Later, an idle processor can inspect this control stack to find this implicit pending task, create the appropriate future object, and start executing this pending task. The control stack is also modified so that after the evaluation of "E" terminates, its value is used to resolve that future object. In the common case, however, the evaluation of "E" will terminate without the implicit pending task having been stolen by a different processor. In this situation, the result of "E" will be returned directly to the context of "future E" which still resides on the control stack, with little additional overhead and without any need to create a new task or future object.
Futures in Purely Functional Languages Programs in purely functional languages offer numerous opportunities for executing program components in parallel. For example, the evaluation of a function application could spawn a parallel task for the evaluation of each argument expression. Applying such a strategy indiscriminately, however, leads to many parallel tasks whose overhead outweighs any benefits of parallel execution.
Futures provide a simple method for taming the implicit parallelism of purely functional programs. A programmer who believes that the parallel evaluation of some expression outweighs the overhead of creating a separate task may annotate the expression with the keyword “future.” In a purely functional language, these future annotations are transparent in that they have no effect on the final values computed by the program, and only influence the amount of parallelism exposed during the program’s computation. Consequently, purely functional programs with future are deterministic, just like their sequential counterparts.
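The transparency property can be demonstrated with an explicit-futures library. The sketch below uses Python's concurrent.futures purely to show the semantics (it says nothing about speedup): because fib is pure, evaluating the argument expressions in separate tasks cannot change the result of the program.

from concurrent.futures import ThreadPoolExecutor

def fib(n):                       # a pure (side-effect-free) function
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def combine(a, b):
    return a * b

with ThreadPoolExecutor() as pool:
    # Sequential evaluation of the two argument expressions.
    sequential = combine(fib(25), fib(24))

    # "future"-annotated evaluation: each argument is computed by its own task.
    f1, f2 = pool.submit(fib, 25), pool.submit(fib, 24)
    parallel = combine(f1.result(), f2.result())   # touch both futures

assert sequential == parallel     # the annotations do not change the result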
Futures in Mostly Functional Languages Futures have been used effectively in “mostly functional” languages such as Multilisp [] and Scheme [], where most computation is performed in a functional style and assignment statements are limited to situations where they provide increased expressiveness. In this setting, futures are not transparent, as they introduce parallelism that may change the order in which assignments are performed. To preserve determinism, the programmer needs to ensure there are no race conditions between parallel computations, possibly by introducing additional forcing operations that block until a particular task has terminated and all of its side effects have been performed. Mostly functional languages include a side-effect-free subset with substantial expressive power, and so the future construct is still effective for exposing parallelism within functional sub-computations written in these languages.
Deterministic Futures in Imperative Languages Futures have been combined with transactional memory techniques to guarantee determinism even in the presence of side effects. Essentially, each task is considered to be a transaction, and these transactions must commit in a fixed, as-if-sequential order. Each transaction tracks its read and write sets. If one transaction accesses data that another transaction is mutating, then the later transaction is rolled back and restarted [].
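A minimal sketch of this idea follows, with hypothetical task and memory representations rather than the API of any particular safe-futures system: tasks execute speculatively against a snapshot while logging read and write sets, commits happen in program order, and a task whose reads conflict with an earlier commit is rolled back and re-executed.

def run_safe_futures(tasks, memory):
    # Phase 1: all tasks run speculatively (conceptually in parallel) against
    # the same initial snapshot, logging their read and write sets.
    def execute(task, state):
        reads, writes = set(), {}
        def read(k):
            reads.add(k)
            return writes.get(k, state[k])
        def write(k, v):
            writes[k] = v
        task(read, write)
        return reads, writes

    logs = [execute(t, dict(memory)) for t in tasks]

    # Phase 2: commit in as-if-sequential order; roll back and re-execute a
    # task if an earlier commit wrote a location it read speculatively.
    committed_writes = set()
    for task, (reads, writes) in zip(tasks, logs):
        if reads & committed_writes:                       # conflict detected
            reads, writes = execute(task, dict(memory))    # restart the task
        for k, v in writes.items():
            memory[k] = v
        committed_writes |= set(writes)

mem = {"x": 1, "y": 0}
run_safe_futures(
    [lambda rd, wr: wr("x", rd("x") + 1),    # task 1: x = x + 1
     lambda rd, wr: wr("y", rd("x") * 10)],  # task 2: y = x * 10 (reads x)
    mem)
print(mem)   # {'x': 2, 'y': 20}, the as-if-sequential outcome

Here the second task speculatively reads the stale value of x, is detected as conflicting when the first task commits, and is re-executed, reproducing the sequential result.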
Futures in Distributed Systems Future objects have been used in distributed systems to hide the latency associated with a remote procedure call (RPC). Instead of waiting for a message send and a matching response from a remote machine, an RPC can immediately return a future object as a proxy for the eventual result of that call. Note that this future could be returned even before the first RPC message is sent over the network. Consequently, if one machine performs multiple RPCs to the same remote machine, then these calls can all be batched into a single message; this technique is referred to as promise pipelining.
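The batching idea can be sketched as follows; every class and function name here is a hypothetical stand-in rather than part of any real RPC library. Each call returns a promise immediately, and a later flush ships all buffered requests to the "remote" side in a single message.

class Promise:
    def __init__(self):
        self.value = None

class BatchingClient:
    def __init__(self, transport):
        self.transport = transport   # assumed callable: list of requests -> list of results
        self.buffer = []             # (request, promise) pairs not yet sent

    def call(self, method, *args):
        promise = Promise()
        self.buffer.append(((method, args), promise))
        return promise               # returned before any message is sent

    def flush(self):
        requests = [req for req, _ in self.buffer]
        results = self.transport(requests)        # one round trip for N calls
        for (_, promise), result in zip(self.buffer, results):
            promise.value = result
        self.buffer.clear()

# A stand-in "remote machine" that evaluates the batch locally.
def fake_transport(requests):
    table = {"add": lambda a, b: a + b, "neg": lambda a: -a}
    return [table[m](*args) for m, args in requests]

client = BatchingClient(fake_transport)
p1 = client.call("add", 2, 3)
p2 = client.call("neg", 7)
client.flush()                       # both RPCs travel in a single message
print(p1.value, p2.value)            # 5 -7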
Thread-Specific Futures, Resolvable Futures, and I-vars The programming language construct “future E” described above does not expose the ability to resolve the future to an arbitrary value; it can only be resolved to the result of evaluating the expression “E.” An I-var is analogous to a future object in that it functions as a proxy for a value that may not yet be determined. The key difference is that an I-var does not have an explicit associated thread that is computing its value (as in the case for future objects). Instead, any thread can update or resolve the I-var. Thus, an I-var is essentially a single entry, one-shot queue. Attempts to access an I-var before it is determined block, as with future objects. Somewhat confusingly, the terms “future” and “promise” are sometimes used to mean an I-var object that also exposes this resolve operation. The terms “thread-specific future” and “resolvable future” help disambiguate these distinct meanings.
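A minimal I-var sketch in Python (an illustration, not a standard library facility): a single-assignment cell whose readers block until some thread resolves it, and which rejects a second resolution.

import threading

class IVar:
    """A single-assignment cell: any thread may resolve it exactly once;
    readers block until it has been determined."""
    def __init__(self):
        self._determined = threading.Event()
        self._lock = threading.Lock()
        self._value = None

    def put(self, value):                  # the "resolve" operation
        with self._lock:
            if self._determined.is_set():
                raise ValueError("I-var already determined")
            self._value = value
            self._determined.set()

    def get(self):                         # blocks, like touching a future
        self._determined.wait()
        return self._value

iv = IVar()
threading.Timer(0.1, iv.put, args=[42]).start()   # resolved by another thread
print(iv.get())                                    # blocks briefly, then prints 42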
Futures and Lazy Evaluation Futures are closely related to lazy evaluation, in that both immediately return a placeholder for the result of a pending computation. A major difference is that future expressions typically must be evaluated before program termination, but lazy expressions need not be evaluated before program termination. This semantic difference means that preemptive evaluation of lazy expressions is considered speculative, whereas evaluation of future expressions is considered mandatory. Sometimes, as in the language Alice ML, the term “lazy future” is used to denote a future whose expression
is evaluated lazily, that is, when the future object is first touched.
Futures and Exceptions In the construct “future E,” an interesting design choice is how to handle any exceptions that are raised during the evaluation of “E,” since enclosing exception handlers from the context of “future E” may no longer be active. A more general version of this question is how the future construct should interact with call/cc (call-with-current-continuation). This question has been studied by a number of researchers (see, e.g., Moreau []), with the goal of providing determinism guarantees in the presence of these constructs. Some languages take a more pragmatic approach, whereby if “E” raises an exception, then any attempt to touch the corresponding future object will also raise that exception.
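The pragmatic choice described last is, for example, the behavior of Python's concurrent.futures library: an exception raised inside the task is stored in the future and re-raised at the touch point, i.e., when result() is called.

from concurrent.futures import ThreadPoolExecutor

def risky():
    raise ValueError("failure inside the future's computation")

with ThreadPoolExecutor() as pool:
    fut = pool.submit(risky)      # the exception is captured, not raised here
    try:
        fut.result()              # touching the future re-raises it
    except ValueError as e:
        print("caught at the touch point:", e)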
Related Entries Cilk
Bibliographic Notes and Further Reading Friedman and Wise introduced the term “promise” in []. Peter Hibbard described a form of explicit futures, called “eventual values,” in the context of Algol []. Baker and Hewitt [] proposed implicit futures for the parallel evaluation of functional languages. Implicit futures were implemented in the Multilisp version of Scheme in the mid-s [], and futures have since been adapted to a variety of other languages and compilers, including the Gambit Scheme compiler [], Mul-T [], Butterfly Lisp, Portable Standard Lisp, Act , Alice ML, Habanero Java, and many others. Libraries supporting futures have been implemented for several languages, including (to name just a few of many examples) Java, C++x, OCaml, python, Perl, and Ruby. The single-assignment I-var structure originated in Id [] and was included in subsequent languages such as parallel Haskell and Concurrent ML.
Bibliography
1. Halstead R. Multilisp: A language for concurrent symbolic computation. ACM Trans Program Lang Syst
2. Feeley M. An efficient and general implementation of futures on large scale shared-memory multiprocessors. PhD thesis, Department of Computer Science, Brandeis University
3. Flanagan C, Felleisen M. The semantics of future and its use in program optimization. In: Proceedings of the ACM SIGPLAN-SIGACT symposium on the principles of programming languages (POPL)
4. Moreau L. Sound evaluation of parallel functional programs with first-class continuations. PhD thesis, Universite de Liege
5. Kranz D, Halstead R, Mohr E. Mul-T: A high performance parallel lisp. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation, Portland
6. Knueven P, Hibbard P, Leverett B. A language system for a multiprocessor environment. In: Proceedings of the international conference on design and implementation of algorithmic languages, Courant Institute of Mathematical Studies, New York
7. Friedman DP, Wise DS. The impact of applicative programming on multiprocessing. In: International conference on parallel processing
8. Welc A, Jagannathan S, Hosking A. Safe futures for Java. In: Proceedings of the annual ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications (OOPSLA), ACM
9. Baker H, Hewitt C. The incremental garbage collection of processes. In: Proceedings of the symposium on AI and programming languages, ACM SIGPLAN Notices
10. Moreau L. The semantics of scheme with future. In: Proceedings of the first ACM SIGPLAN international conference on functional programming
11. Mohr E, Kranz D, Halstead R. Lazy task creation: a technique for increasing the granularity of parallel programs. In: Proceedings of the ACM conference on LISP and functional programming
12. Arvind, Nikhil RS, Pingali KK. I-Structures: Data structures for parallel computing. ACM Trans Program Lang Syst
G GA Global Arrays Parallel Programming Toolkit
Gather Collective Communication
Gather-to-All Allgather
Gaussian Elimination Dense Linear System Solvers Sparse Direct Methods
GCD Test Banerjee’s Dependence Test Dependences
Gene Networks Reconstruction Systems Biology, Network Inference in
Generalized Meshes and Tori Hypercubes and Meshes
Genome Assembly Ananth Kalyanaraman Washington State University, Pullman, WA, USA
Synonyms Genome sequencing
Definition Genome assembly is the computational process of deciphering the sequence composition of the genetic material (DNA) within the cell of an organism, using numerous short sequences called reads derived from different portions of the target DNA as input. The term genome is a collective reference to all the DNA molecules in the cell of an organism. Sequencing generally refers to the experimental (wetlab) process of determining the sequence composition of biomolecules such as DNA, RNA, and protein. In the context of genome assembly, however, the term is more commonly used to refer to the experimental (wetlab) process of generating reads from the set of chromosomes that constitutes the genome of an organism. Genome assembly is the computational step that follows sequencing with the objective of reconstructing the genome from its reads.
Discussion Introduction
Deoxyribonucleic acid (or DNA) is a double-stranded molecule which forms the genetic basis in most known organisms. The DNA along with other molecules, such as the ribonucleic acid (or RNA) and proteins, collectively constitute the subject of study in various
branches of biology such as molecular biology, genetics, genomics, proteomics, and systems biology. A DNA molecule is made up of two equal-length strands with opposite directionality, with the ends labeled from 5′ to 3′ on one strand and from 3′ to 5′ on the other. Each strand is a sequence of smaller molecules called nucleotides, and each nucleotide contains one of the four possible nitrogenous bases – adenine (a), cytosine (c), guanine (g), and thymine (t). Therefore, for computational purposes, each strand can be represented in the form of a string over the alphabet {a, c, g, t}, always expressed in the direction from 5′ to 3′. Furthermore, the base at a given position in one strand is related to the base at the corresponding position in the other strand by the following base-pairing rule (referred to as “base complementarity”): a ↔ t, c ↔ g. Therefore, the sequence of one strand can be directly deduced from the sequence of the other. For example, if one strand is 5′ agaccagttac 3′, then the other is 5′ gtaactggtct 3′. The length of a genome is measured in the number of its base pairs (“bp”). Genomes range in length from just a few million base pairs in microbes to several billions of base pairs in many eukaryotic organisms. However, all sequencing technologies available to date, since the invention of Sanger sequencing in the late 1970s, have been limited to accurately sequencing DNA molecules no longer than ∼ Kbp. Consequently, scientists have had to deploy alternative sequencing strategies for extending the reach of technology to genome scale. The most popular strategy is the whole genome shotgun (or WGS) strategy, where multiple copies of a single long target genome are shredded randomly into numerous fragments of sequenceable length and the corresponding reads are sequenced individually using any standard technology. Another popular albeit more expensive strategy is the hierarchical approach, where a library of smaller molecules called Bacterial Artificial Chromosomes (or BACs) is constructed. Each BAC is ∼ Kbp in length, and the BACs collectively provide a minimum tiling path over the entire length of the genome. Subsequently, the BACs are sequenced individually using the shotgun approach. In both approaches, information about the relative ordering among the sequenced reads is lost during sequencing, either completely or almost completely. Therefore, the primary information that genome assemblers should rely upon is the end-to-end sequence
overlap preserved between reads that were sequenced from overlapping regions along the target genome. To increase the chance of overlap, the target genome is typically sequenced in a redundant fashion. This is referred to as genome coverage. A higher coverage typically tends to provide information for a more accurate assembly, although at increased costs of generation and analysis. The fact that reads could have originated from either strand adds another dimension to the complexity of the reconstruction process. The process is further complicated by other factors such as errors introduced during read sequencing and the presence of genomic repeats that can confound overlap detection and read placement. To partially aid in resolving genomic repeat regions, sequencing is sometimes performed in pairs from clonal insert libraries, where the genomic distance between the two reads of a pair can be estimated, typically in the – Kbp range. This technique is called pair-end sequencing, and genome assemblers can take advantage of this information to resolve repeats that are smaller than the specified range. Genome assembly is a classical computational problem in the field of bioinformatics and computational biology. More than two decades of research has yielded a number of approaches and algorithms, and yet the problem continues to be actively pursued. This is for several reasons. While there are different ways of formulating the genome assembly problem, all known formulations are NP-Hard. Therefore, researchers continue to work on developing improved, efficient approximation and heuristic algorithms. However, even the best serial algorithms take tens of thousands of CPU hours for eukaryotic genomes owing to their large genomic sizes, which increase the number of reads to be analyzed, and their high genomic repeat complexity, which adds substantial processing overhead. For instance, in Celera's WGS version of the human genome assembly, over million reads representing x coverage over the genome were assembled in about , CPU h. A paradigm shift in sequencing technologies has further aggravated the data-intensiveness of the problem and thereby the need for continued algorithmic development. Until the mid-2000s, the only technology available for use in genome assembly projects was Sanger sequencing, which produces reads of approximate length Kbp each. Since then, a slew of
Genome Assembly
high-throughput sequencing technologies, collectively referred to as the next-generation sequencing technologies, have emerged, significantly revitalizing the sequencing community. Examples of next-generation technologies include Roche Life Sciences system’s pyrosequencing (read length ∼ bp), Illumina Genome Analyzer and HiSeq (read length ∼– bp); Life Technologies SOLiD (read length ∼ bp) and Helicos HeliScope (read length ∼ bp). While these instruments generate much shorter reads than Sanger, they do so at a much faster rate effectively producing several hundred millions of reads in a single experiment, and at significantly reduced costs (about – times cheaper). These attractive features are essentially democratizing the sequencing process and broadening community contribution to sequenced data. From a genome assembly perspective, a shorter read length could easily deteriorate the assembly quality because the reads are more likely to exhibit false or insufficient overlaps. To offset this shortcoming, sequencing is required at a much higher coverage (x–x) than for Sanger sequencing. The higher coverage has another desirable effect in that, because of its built-in redundancy, it could aid in a more reliable identification of real genomic variations and their differentiation from experimental artifacts. Detecting genomic variations such as single nucleotide polymorphisms (SNPs) is of prime importance in comparative and population genomics. This combination of high coverage and short read lengths makes the short read genome assembly problem significantly more data-intensive than for Sanger reads. For instance, any project aiming to reconstruct a mammalian genome (e.g., human genome) de novo using any of the current next-generation technologies would have to contend with finding a way to assemble several billion short reads. This changing landscape in technology and application has led to the development of a new generation of short read assemblers. Although traditional developmental efforts have been targeting serial computers, the increasing data-intensiveness and the increasing complexity of genomes being sequenced have gradually pushed the community toward parallel processing, for what has now become an active branch within the area. In what follows, an overview of the assembly problem, with its different formulations and algorithmic solutions is presented. This entry is not intended to be
G
a survey of tools for genome assembly. Rather, it will focus on the parallelism in the problem and related efforts in parallel algorithm development. In order to set the stage, the key ideas from the corresponding serial approach will also be presented as necessary. The bulk of the discussion will be on de novo assembly, which is typically the harder task of reconstructing a genome from its reads assuming no prior knowledge about the sequence of the genome except for its reads and as available, their pair-end sequencing information.
Algorithmic Formulations of Genome Assembly The genome assembly problem: Let R = {r1, r2, . . . , rm} denote a set of m reads sequenced from an unknown target genome G of length g. The problem of genome assembly is to reconstruct the sequence of genome G from R. As a caveat, for nearly all input scenarios, the expected outcome is not a single assembled sequence but a set of assembled sequences, for a couple of reasons. Typically, the genome of a species comprises multiple chromosomes (e.g., 23 pairs in the human genome), and therefore each chromosome can be treated as an individual sequence target. Note, however, that the information about the source chromosome for a read is lost during a sequencing process such as WGS, and it is left to the assembler to detect and sort these reads by chromosome. Furthermore, any shotgun sequencing procedure tends to leave out “gaps” along the chromosomal DNA during sampling, and therefore it is possible to reconstruct the genome only for the sampled sections. It is expected that, through incorporation of pair-end information, at least a subset of these assembled products (called “contigs” in genome assembly parlance) can be partially ordered and oriented. Notation and terminology: Let s denote a sequence over a fixed alphabet Σ. Unless otherwise specified, a DNA alphabet is assumed – i.e., Σ = {a, c, g, t}. Let ∣s∣ denote the length of s, and s[i . . . j] the substring of s starting and ending at indices i and j respectively, with the convention that string indexing starts at 1. A prefix i (alternatively, suffix i) of s is s[1 . . . i] (alternatively, s[i . . . ∣s∣]). Let n = ∣r1∣ + ∣r2∣ + ⋯ + ∣rm∣ denote the total input length and ℓ = n/m the mean read length. The sequencing coverage c is given by c = n/g. The terms string and sequence are used interchangeably. Let p denote the number of processors in a parallel computer.
G
G
Genome Assembly
The most simplistic formulation of genome assembly is that of the Shortest Superstring Problem (SSP). A superstring is a string that contains each input read as a substring. The SSP formulation is NP-complete. Furthermore, its assumptions are not realistic in practice. Reads can contain sequencing errors and hence they need not always appear preserved as substrings in the genome; and genomes typically are longer than the shortest superstring due to presence of repeats and nearly identical genic regions (called paralogs). Among the more realistic models for genome assembly, graph theoretic formulations have been in the forefront. In broad terms, there are three distinct ways to model the problem (refer to Fig. ):
(1) Overlap graph: Construct a graph G(V, E) from R, where each read is represented by a unique vertex in V and an edge is drawn between (ri, rj) iff there is a significant suffix–prefix overlap between ri and rj. Overlap is typically defined using a pairwise sequence alignment model (e.g., semi-global alignment) and related metrics to assess the quality of an alignment. Given G, the genome assembly problem can be reduced to the problem of finding a Hamiltonian Path, if one exists, provided that there were no breaks (called sequencing gaps) along the genome during sequencing. If there were gaps during sequencing, then one Hamiltonian Path is sought for every connected component in G (if it exists) and the genome can be recovered as an unordered set of pieces corresponding to the sequenced sections. This formulation using the overlap graph model has been shown to be NP-Hard. (2) De Bruijn graph: Let a k-mer denote a string of length k. Construct a De Bruijn graph, where the vertex set is the set of all k-mers contained inside the reads of R, and a directed edge is drawn from vertex i to vertex j iff the (k−1)-length suffix of k-mer i is identical to the (k−1)-length prefix of k-mer j. Put another way, each edge in E represents a unique (k+1)-mer that is found in at least one of the reads in R. Given a De Bruijn graph G, genome assembly is the problem of finding a shortest Euler tour in which each read is represented by a sub-path in the tour. While finding an Euler tour is polynomially solvable, the optimization problem is still NP-Hard by reduction from SSP. (3) String graph: This model is a variant of the overlap graph in which the edges represent reads, and vertices represent branching of end-to-end overlaps of the adjoining reads. For example, if a read ri overlaps with both rj and rk, then it is represented by a vertex
Genome Assembly. Fig. Illustration of the three different models for genome assembly. The top section of the figure shows a hypothetical unknown genome and the reads sequenced from it. The dotted box shows a repetitive region within the genome. In the overlap graph, solid lines indicate true overlaps and dotted lines show overlaps induced by the presence of the repeat. The De Bruijn graph shows the graph for an arbitrary section involving three reads. The string graph shows an example of a branching vertex with multiple possible entry and exit paths
which has an inbound edge corresponding to ri and two outbound edges representing rj and rk. Furthermore, the graph is pruned of transitively inferable edges – e.g., if ri overlaps with rj, rj overlaps with rk, and ri also overlaps with rk, then the overlap from ri to rk is transitively inferable and is therefore removed. Given a string graph G, the assembly problem becomes equivalent to finding a constrained least-cost Chinese Postman tour of G. This formulation has also been shown to be NP-Hard. All three formulations incorporate features to handle sequencing errors. In the overlap graph model, the pairwise alignment formulation used to define edges automatically takes into account character substitutions, insertions, and deletions. In the other two models, an “error correction” procedure is run as a preprocessing step to mask off anomalous portions of the graph prior to performing the assembly tours.
Parallelization for the Overlap Graph Model Algorithms that use the overlap graph model deploy a three-stage assembly approach of overlap–layout–consensus. In the first stage, the overlap graph is constructed by performing an all-against-all pairwise comparison of the reads in R. In the second stage, a layout for the assembly is prepared using the pairwise overlaps, and in the third stage, a multiple sequence alignment (MSA) of the reads is performed as dictated by the layout. Given the large number of reads generated in a typical sequencing project, the overlap computation phase dominates the run-time and hence is the primary target for parallelization and performance optimization. A brute-force way of implementing this step is to compare all (m choose 2) possible pairs. This approach can also be parallelized easily, as individual alignment tasks can be distributed in a load-balanced fashion given the uniformity expected in the read lengths. However, such an implementation may not be practically feasible for larger numbers of reads due to the quadratic growth in alignment workload; each alignment task executed serially could take milliseconds of run-time. In fact, for the assembly problem, the nature of sampling done during sequencing guarantees that performing all-against-all comparisons is destined to be highly wasteful. This is because of the random shotgun approach to sequencing,
which is expected to sample a roughly equal number of reads covering each genomic base; it is typically only reads that originate from the same locus that tend to overlap with one another, with the exception of reads from repetitive regions of the genome. A more scalable approach to detecting pairwise overlaps is to first shortlist pairs of sequences based on the presence of sufficiently long exact matches (i.e., a filter built on a necessary-but-not-sufficient condition) and then compute pairwise comparisons only on them. Methods vary in the manner in which these “promising pairs” are shortlisted. One way of generating promising pairs is to use a string lookup table data structure that is built to record all k-mer occurrences within the reads in linear time. Sequence pairs that share one or more k-mers can subsequently be considered for alignment-based evaluation. While this approach supports a simple implementation, it poses a few notable limitations. The size complexity of the lookup table data structure contains an exponential term O(∣Σ∣^k). This restricts the value of k in practice; for the DNA alphabet, physical memory constraints place a small upper bound on k. In contrast, reads can share arbitrarily long matches, especially if the sequencing error rates are low. Another disadvantage of the lookup table is that it is only suited to detecting pairs with fixed-length matches. An exact match of an arbitrary length q will reveal itself as q − k + 1 smaller k-mer matches, thus increasing computation cost. Furthermore, the distribution of entries within the lookup table is highly data dependent and could scatter the same sequence pair across different locations of the lookup table, making it difficult to parallelize on distributed memory machines. Methods that use lookup tables simply use a high-end shared memory machine for serial generation and storage, and focus on parallelizing only the pairwise sequence comparison step. An alternative, more efficient way of enumerating promising pairs based on exact matches is to use a string indexing data structure such as a suffix tree or suffix array, which allows detection of variable-length matches. A match α between two reads ri and rj is called left-maximal (alternatively, right-maximal) if the characters preceding (alternatively, following) it in both strings, if any, are different. A match α between reads ri and rj is said to be a maximal match between the two reads if it is both left- and right-maximal. Pairs containing maximal matches of a minimum length cutoff, say
ψ, can be detected in optimal run-time and linear space using a suffix tree data structure. A generalized suffix tree of a set of strings is the compacted trie of all suffixes in the strings. The pair enumeration algorithm first builds a generalized suffix tree over all reads in R and then navigates the tree in a bottom-up order.
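As an illustration of the lookup-table filter described above, the following sketch (simplified: a single strand only, no reverse complements, and no positional information) shortlists read pairs that share at least one exact k-mer before any alignment is attempted. The reads and k value are made up for the example.

from collections import defaultdict
from itertools import combinations

def promising_pairs(reads, k):
    """Shortlist read pairs that share at least one exact k-mer; only these
    pairs are handed to the (expensive) alignment routine."""
    table = defaultdict(set)                  # k-mer -> set of read indices
    for idx, r in enumerate(reads):
        for i in range(len(r) - k + 1):
            table[r[i:i + k]].add(idx)
    pairs = set()
    for occupants in table.values():
        pairs.update(combinations(sorted(occupants), 2))
    return pairs

reads = ["agaccagttac", "cagttacggta", "tttttttttt"]
print(promising_pairs(reads, k=5))   # {(0, 1)}: only reads 0 and 1 share a 5-mer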
Parallel Generalized Suffix Tree Construction For constructing a suffix tree, there are several linear-time serial construction algorithms and optimal algorithms for the CRCW-PRAM model. However, optimal construction under the distributed memory machine model, where memory per processor is assumed to be too small to store the input sequences in R, is an open problem. Of the several approaches that have been studied, the following approach is notable, as it has demonstrated linear scaling to thousands of processors on distributed memory supercomputers. The major steps of the underlying algorithm are as follows. Each step is a parallel step; the steps marked comm are communication-bound and the steps marked comp are computation-bound.
S1. (comp) Load R from I/O in a distributed fashion such that each processor receives O(n/p) characters and no read is split across processor boundaries.
S2. (comp) Slide a window of length k ≤ ψ over the set of local reads and locally bucket sort all suffixes of the reads based on their k-length prefix. Note that for the DNA alphabet, even a value as small as 10 for k is expected to create over a million buckets, sufficient to support parallel distribution. An alternative to local generation of buckets is to use a master–worker paradigm in which the reads are scanned in small batches and a dedicated master distributes the batches to workers in a load-balanced fashion.
S3. (comm) Parallel sort the buckets such that each processor receives approximately n/p suffixes and no bucket is split across processor boundaries. While the latter cannot be theoretically guaranteed, real-world genomic data sets typically distribute the suffixes uniformly across buckets, thereby virtually ensuring that no bucket exceeds the O(n/p) bound. If a bucket becomes too big to fit in local memory, an alternative strategy can be deployed whereby the prefix length used for bucketing is iteratively extended until the bucket fits in local memory.
S4. (comm, comp) Each bucket in the output corresponds to a unique subtree of the generalized suffix tree rooted exactly k characters below the root. The subtree corresponding to each bucket is then constructed locally using a recursive bucket sort–based method that compares characters between suffixes. However, this step cannot be done strictly locally because it needs the sequences of the reads whose suffixes are being compared. This can be achieved by building the local set of subtrees in small batches and performing one round of communication using Alltoallv() to load the reads required to build each batch from other processors. To reduce the overhead due to read fetches, observe that the aggregate memory of a relatively small group of processors is typically sufficient to accommodate the read set for most inputs in practice. For example, even a set of million reads each of length Kbp needs only on the order of GB. Even assuming GB per processor, this means the read set will fit within processors; a higher number of processors is required only to speed up computation. One can take advantage of this observation by partitioning the processor space into subgroups of fixed, smaller size, determined by the input size, and then have the on-the-fly read fetches seek data from processors within each subgroup. This improved scheme distributes the load of any potential hot spots, and the overall communication is also faster, as the collective transportation primitive now operates on a smaller number of processors.
The above approach can be implemented to run in O(nℓ/p) time and O(n/p) space, where ℓ is the mean read length. The output of the algorithm is a distributed representation of the generalized suffix tree, with each processor storing roughly n/p suffixes (or leaves in the tree).
The cluster-then-assemble Approach A potential stumbling block in the overlap–layout– consensus approach is the large amount of memory required to store and retrieve the pairwise overlaps so that they can be used for generating an assembly layout. For genomes which have been sequenced at a uniform
coverage c, the expectation is that each base along the genome is covered by c reads on average, thereby implying (c choose 2) overlapping read pairs for every genomic base. However, this theoretical linear expectation holds for only a fraction of the target genome, and factors such as genomic repeats and oversampled genic regions can increase the number of overlapping pairs arbitrarily beyond the expected level. A case in point is the gene-enriched sequencing of the highly repetitive, multi-billion-bp maize genome. One way to overcome this scalability bottleneck is to use clustering as a preprocessing step to assembly. This approach, referred to as cluster-then-assemble, builds upon the assumption that any genome-scale sequencing effort is likely to undersample the genome, leaving sequencing gaps along the genome length and thereby allowing the assemblers to reconstruct the genome in pieces – one piece for every contiguous stretch of the genome (aka “contig” or “genomic island”) that is fully sampled through sequencing. This assumption is not unrealistic, as tens of thousands of sequencing gaps have remained in almost all eukaryotic genomes sequenced so far using the WGS approach. The cluster-then-assemble approach takes advantage of this assumption as follows: The set of m reads is first partitioned into groups using a single-linkage transitive closure clustering method, such that there exists a sequence of overlapping pairs connecting every read to every other read in the same cluster. Ideally, each cluster
should correspond to one genomic island, although the presence of genomic repeats could collapse reads belonging to different islands together. Once clustered, each cluster represents an independent subproblem for assembly, thereby breaking a large problem with m reads into numerous, disjoint subproblems of significantly reduced size small enough to fit in a serial computer. In practice, tens of thousands of clusters are generated making the approach highly suited for trivial parallelization after clustering. The primary challenge is to implement the clustering step in parallel, in a time- and space-efficient manner. In graph-theoretic terms, the output produced by a transitive closure clustering is representative of connected components in the overlap graph, thereby allowing the problem to be reduced to one of connected component detection. However, generating the entire graph G prior to detection would contradict the purpose of clustering from a space complexity standpoint. Figure outlines an algorithm to perform sequence clustering. The algorithm can be described as follows: Initialize each read in a cluster of its own. In an iterative process, generate promising pairs based on maximal matches in a non-increasing order of their lengths using suffix trees (as explained in the previous section). Before assigning a promising pair for further alignment processing, a check is made to see whether the constituent reads are in the same cluster. If they are part of the same cluster already, then the pair is discarded; otherwise it
Algorithm 1 Read Clustering
Input: Read set R = {r1, r2, . . . , rm}
Output: A partition C = {C1, C2, . . . , Ck} of R, 1 ≤ k ≤ m
1. Initialize clusters: C ← { {ri} | 1 ≤ i ≤ m }   (master)
2. FOR each pair (ri, rj) with a maximal match of length ≥ ψ, generated in non-increasing order of maximal match length   (worker)
      Cp ← Find(ri)   (master)
      Cq ← Find(rj)   (master)
      IF Cp ≠ Cq THEN   (master)
         overlap quality ← Align(ri, rj)   (worker)
         IF overlap quality is significant THEN   (master)
            Union(Cp, Cq)   (master)
3. Output C   (master)
Genome Assembly. Fig. Pseudocode for a read clustering algorithm suited for parallelization based on the master-worker model. The site of computation is shown in brackets. Operations on the set of clusters are performed using the Union-Find data structure
is aligned. If the alignment results show a satisfactory overlap (based on user-defined alignment parameters), the clusters containing the two reads are merged into one larger cluster. The process is repeated until all promising pairs are exhausted or until no more merges are possible. Using the union–find data structure ensures that the Find and Union calls run in amortized time proportional to the inverse Ackermann function – a small constant for all practical inputs. There are several advantages to this clustering strategy. Clustering can be achieved in at most m − 1 merging steps. Checking whether a read pair is already clustered before alignment reduces the number of pairs aligned. Generating promising pairs in non-increasing order of their maximal match lengths is a heuristic that helps identify pairs that are more likely to pass the alignment test early in the execution. Furthermore, promising pairs are processed as they are generated, obviating the need to store them. This, coupled with the use of the suffix tree data structure, implies an O(n) serial space complexity for the clustering algorithm. The parallel algorithm can be implemented using a master–worker paradigm. A dedicated master processor is responsible for initializing and maintaining the clusters and for distributing the alignment workload to the workers in a load-balanced fashion. The workers first generate a distributed representation of the generalized suffix tree in parallel (as explained in the previous section). Subsequently, they generate promising pairs from their local portions of the tree and send them to the master. To reduce communication overheads, pairs can be sent in arbitrary-sized batches, as demanded by the state of the work queue buffer at the master. The master processor checks the pairs against the current set of clusters, filters out pairs that do not need alignment, and adds only those pairs that need alignment to its work queue buffer. The pairs in the work queue buffer are redistributed to workers in fixed-size batches; the workers compute the alignments and send back the results. Communication overheads can be masked by overlapping alignment computation with communication waits using non-blocking calls. The PaCE software suite implements the above parallel algorithm and has demonstrated linear scaling to thousands of processors on a distributed memory supercomputer.
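A serial sketch of the clustering loop in Algorithm 1 follows, assuming a hypothetical candidate_pairs list (already ordered by decreasing maximal match length) and a stand-in aligns_well predicate in place of the real alignment test; the master–worker distribution is omitted.

# Union-Find with path compression; reads start in singleton clusters and are
# merged only when a candidate pair passes the alignment test.
parent = {}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]     # path compression
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

def cluster(reads, candidate_pairs, aligns_well):
    for r in reads:
        parent[r] = r                     # step 1: singleton clusters
    for ri, rj in candidate_pairs:        # step 2: pairs in decreasing match order
        if find(ri) != find(rj):          # skip pairs already clustered
            if aligns_well(ri, rj):       # expensive alignment (done by workers)
                union(ri, rj)
    clusters = {}
    for r in reads:                       # step 3: collect the final partition
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())

reads = ["r1", "r2", "r3"]
pairs = [("r1", "r2"), ("r2", "r3")]
print(cluster(reads, pairs, aligns_well=lambda a, b: True))   # [['r1', 'r2', 'r3']]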
Short Read Assembly The landscape of genome assembly tools has significantly transformed itself over the last few years with the advent of next-generation sequencing technologies. A plethora of new-generation assemblers that collectively operate under the banner of “short read assemblers” is continuing to emerge. Broadly speaking, these tools can be grouped into two categories: () those that follow or extend from the classical overlap graph model; and () those that deploy either the De Bruijn graph or the string graph formulation. The former category includes programs such as Edena, Newbler, PE-Assembler, QSRA, SHARCGS, SSAKE, and VCAKE. The programs that fall under the latter category are largely inspired by two approaches – the De Bruijn graph formulation used in the EULER assembler; and the string graph formulation proposed by Myers. As of this writing, these programs include EULER-SR, ALLPATHS, Velvet, SOAPdenovo, ABySS, and YAGA. Either of these lists is likely to expand in the coming years, as new implementations of short read assemblers are continuing to emerge as are new technologies. In principle, the techniques developed for assembling Sanger reads do not suit direct application for short read assembly due to a combination of factors that include shorter read length, higher sequencing coverage, an increased reliance on pair-end libraries, and idiosyncrasies in sequencing errors. A shorter read length implies that the method has to be more sensitive to differences to avoid potential mis-assemblies. This also means providing a robust support for incorporating the knowledge available from pair-end libraries. In case of next-generation sequencing, pair-end libraries are typically available for different clone insert sizes, which translates to a need to implement multiple sets of distance constraints. A high sequencing coverage introduces complexity at two levels. The number of short reads to assemble increases linearly with increased coverage, and this number can easily reach hundreds of millions even for modest-sized genomes because of shorter read length. For instance, a mid-size genome such as that of Arabidopsis (∼ Mbp) or fruit fly (∼ Mbp) sequenced at x coverage either using Illumina or SOLiD would generate a couple of hundred million reads. Secondly, the average number of overlapping read
pairs at every given genomic location is expected to grow quadratically (∝ (c choose 2)) with the coverage depth c. This particularly affects the time and memory scalability of methods that use the overlap graph model, not only because they operate at the level of pairwise read overlaps, but also because many tools assume that such overlaps can be stored in and retrieved from local memory. Consequently, such methods take hours of run-time and several gigabytes of memory even for assembling small bacterial genomes. From a parallelism perspective, a high coverage in sampling could also potentially mean fewer sequencing gaps, as the probability of a genomic base being captured by a read improves with coverage by the Lander–Waterman model. Therefore, divide-and-conquer–based techniques such as cluster-then-assemble may not be as effective in breaking the initial problem size down. Short read assemblers also have to deal with technology-specific errors. The error rates associated with next-generation technologies are typically in the –% range and cannot be ignored, as differentiating them from real differences could be key to capturing natural variations such as near-identical paralogs. In the current suite of tools specifically built for short read assembly, only a handful of tools support some degree of parallelism. These tools include PE-Assembler, ABySS, and YAGA. Others are serial tools that work on desktop computers and rely on high-memory nodes for larger inputs. Even the tools that support parallelism do so to varying degrees. PE-Assembler, which implements a variant of the overlap graph model, limits parallelism to node level and assumes that the number of parallel threads is small. It deploys a “seed-extend” strategy in which a subset of “reliable” reads is selected as seeds, and other reads that overlap with each seed are incrementally added to extend and build a consensus sequence. Parallelism is supported by selecting multiple seeds and launching multiple threads that assume responsibility for extending these different seeds in parallel. While the algorithm has the advantage of not having to build and manage large graphs, it performs a conservative extension, primarily relying on pair-end data to collapse reads into contigs. From a parallelism perspective, the algorithm relies on shared memory access for managing the read set, which works well if the number of threads is very small. Furthermore, not
all steps in the method are parallel and some steps are disk-bound. Experimental results show that the parallel efficiency drops drastically beyond three threads. ABySS and YAGA are two parallel methods that implement the De Bruijn graph model and work for distributed memory computers. The De Bruijn and string graph formulations are conceptually better suited for short read assembly. These formulations allow the algorithm to operate at a read or read’s subunit (i.e., k-mer) level rather than the pairwise overlap level. Repetitive regions manifest themselves in the form of special graph patterns, and error correction mechanism could detect and reconcile anomalous graph substructures prior to performing assembly tours, thereby reducing the chance of misassemblies. The assembly itself manifests in the form of graph traversals. Constructing these graphs and traversing them to produce assembly tours (although not guaranteed to be optimal due to intractability of the problem) are problems with efficient heuristic solutions on a serial computer. The methods ABySS and YAGA provide two different approaches to implement the different stages in parallel using De Bruijn graphs. While these methods differ in their underlying algorithmic details and in the degree offered for parallelism, the structural layout of their algorithms is similar consisting of these four major steps: () parallel graph construction; () error correction; () incorporation of distance constraints due to pair-end reads; and () assembly tour and output. In what follows, approaches to parallelize each of these major steps are outlined, with the bulk of exposition closely mirroring the algorithm in YAGA because its algorithm more rigorously addresses parallelization for all the steps, and takes advantage of techniques that are more standard and well understood among distributed memory processing codes (e.g., sorting, list ranking). By contrast, parallelization is described only for the initial phase of graph construction in ABySS. Therefore, where appropriate, variations in the ABySS algorithmic approach will be highlighted in the text. Other than differences in their parallelization strategies, the two methods also differ particularly in the type of De Bruijn graph they construct (directed vs. bidirected) and their ability to handle multiple read lengths. Such details are omitted in an attempt to keep the text focused on the parallel aspects of the problem. An interested reader can refer to the individual papers for details
pertaining to the nuances of the assembly procedure and the output quality of these assemblies.
Parallel De Bruijn Graph Construction and Compaction Given m reads in set R, the goal is to construct in parallel a distributed representation of the corresponding De Bruijn graph built out of k-mers, for a user-specified value of k. Note that, once a De Bruijn graph representation is generated, it can be transformed into a corresponding string graph representation by compressing paths whose concatenated labels spell out the characters in a read, and by accordingly introducing branch nodes that capture overlap continuation between adjoining reads. This can be done in a successive stage of graph compaction, a minor variant of which is described later in this section. The computing model assumed is p processors (or equivalently, processes), each with access to a local RAM, connected through a network interconnect, and with access to a shared file system where the input is made available. To construct the De Bruijn graph, the reads in R are initially partitioned and loaded in a distributed manner such that each processor receives O(n/p) input characters. Let Ri refer to the subset of reads in processor pi. Through a linear scan of the reads in Ri, each pi enumerates the set of k-mers present in Ri. However, instead of storing each such k-mer as a vertex of the De Bruijn graph, the processor equivalently generates and stores the corresponding edges connecting those vertices in the graph. In other words, there is a bijection between the set of edges and the set of distinct (k+1)-mers in Ri. In this edge-centric representation, the vertex information connecting each edge is implicit, and the count of reads containing a given (k+1)-mer is stored internally at that edge. Note that after this generation process, the same edge could potentially be generated at multiple processor locations. To detect and merge such duplicates, the algorithm simply performs a parallel sort of the edges using the (k+1)-mers as the key. Therefore, in one sorting step, a distributed representation of the De Bruijn graph is constructed. Standard parallel sort routines such as sample sort can be used here to ensure an even redistribution of the edges across processors after sorting.
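A serial sketch of the edge-centric construction follows, with a Counter standing in for the parallel sample sort that merges duplicate edges across processors; the reads and the value of k are made up for illustration, and reverse complements are ignored.

from collections import Counter

def debruijn_edges(reads, k):
    """Edge-centric construction: every (k+1)-mer in a read is one edge of the
    De Bruijn graph. Grouping identical (k+1)-mers merges duplicates and
    records their read support (done here with a Counter, in place of the
    parallel sort used on a distributed-memory machine)."""
    edges = Counter()
    for r in reads:
        for i in range(len(r) - k):
            edges[r[i:i + k + 1]] += 1    # (k+1)-mer = edge between two k-mers
    # Each edge implicitly connects its k-prefix vertex to its k-suffix vertex.
    return {(kp1[:-1], kp1[1:]): count for kp1, count in edges.items()}

for (u, v), c in debruijn_edges(["cagtcaac", "agtcaacg"], k=3).items():
    print(f"{u} -> {v}  support={c}")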
An alternative to this edge-centric representation is a vertex-centric representation, which is followed in ABySS and in an earlier version of YAGA. Here, the k-mer set corresponding to each Ri is generated by the corresponding pi. The vertices adjacent to a given vertex could be generated remotely by one or more processors. Therefore, this approach requires processors to communicate with one another in order to check the validity of all edges that could theoretically be drawn from their local sets of vertices. While the number of such edge validation queries is bounded by a constant per vertex ({a, c, g, t} extensions on either strand orientation), the method runs the risk of generating false edges: e.g., if a (k−1)-mer, say α, occurs in exactly two places along the genome as aαc and gαt, then the vertex-centric approach will erroneously generate edges for aαt and gαc, which are (k+1)-mers nonexistent in the input. Besides this risk, care must be taken to guarantee an even redistribution of vertices among processors by the end of the construction process. For instance, a static allocation scheme in which a hash function is used to map each k-mer to a destination processor, as is done in ABySS, runs the risk of producing an unbalanced distribution, as the k-mer concentration within reads is input dependent. Graph compaction: Once a distributed edge-centric representation of the De Bruijn graph is generated, the next step is to simplify it by identifying maximal segments of simple paths (or “chains”) and compacting each of them into a single longer edge. Edges that belong to a given chain could be distributed, and therefore this problem of removing chains becomes a two-step process: first, detect the edges (or equivalently, the vertices) that are part of chains, and then perform compaction on the individual chains. To mark the vertices that are part of some chain, assume without loss of generality that each edge is stored twice, as <u, v> and <v, u>. Sorting the edges in parallel by their first vertex brings all edges incident on a vertex together on some processor. Only those vertices that have a degree of two can be part of chains, and such vertices can be easily marked with a special label. In the next step, the problem of compacting the vertices along each chain can be treated as a variant of segmented parallel list ranking. A distributed version of the list ranking algorithm can be used to calculate the relative distances of each marked vertex from the start and end of its respective chain. The same procedure can also be used to determine the
vertex identifiers of those two boundary vertices for each marked vertex. Compaction then follows through the operations of concatenating the edge labels along the chain at one of the terminal edges, removing all internal edges, and aggregating an average k + -mer frequency of the constituent edges along the chain. All of these operators are binary associative to allow being implemented using calls to a segmented parallel prefix routine. The output of this step is a distributed representation of the compacted graph.
Error Correction and Variation Detection In the De Bruijn graph representation, errors due to sequencing manifest themselves as different subgraph motifs, which can be detected and pruned. For ease of exposition, informally call an edge in the compacted De Bruijn graph “strongly supported” (alternatively, “weakly supported”) if its average (k+1)-mer frequency is relatively high (alternatively, low). Examples of motifs are as follows: (1) Tips are weakly supported dead ends in the graph created by a base miscall occurring at one of the end positions of a read. Because of compaction, such tips occur as single edges branching out of a strongly supported path, and can be easily removed. (2) Bubbles are detours that provide alternative paths between two terminal vertices. Due to compaction, these detours will also be single edges. There are two types of bubbles: weakly supported bubbles are manifestations of a single base miscall occurring internal to a read and need to be removed; whereas bubbles that originate from the same vertex and are supported to roughly the same degree could be the result of natural variations such as near-identical paralogs (i.e., copies of the same gene occurring at different genomic loci). Such bubbles need to be retained. (3) Spurious links connect two otherwise disparate paths in the graph. Weakly supported links are manifestations of erroneous (k+1)-mers that happen to match the (k+1)-mer present at a valid genomic locus, and they can be severed by examining the supports of the other two paths. The first pass of error correction on the compacted graph could reveal new instances of these motifs, which can be pruned through subsequent iterative passes until no new instances are observed.
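A toy sketch of one such pruning pass, here for tips only, operates on a compacted graph represented as a dictionary from (u, v) edges to their average support; the graph and the support threshold are made up, and real assemblers also take the edge-label length into account.

from collections import defaultdict

def remove_tips(edges, min_support=3):
    """One error-correction pass over a compacted De Bruijn graph given as
    {(u, v): support}: drop weakly supported edges that dead-end (tips)."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    pruned = {}
    for (u, v), support in edges.items():
        is_dead_end = (out_deg[v] == 0 or in_deg[u] == 0)   # branches into a dead end
        if is_dead_end and support < min_support:
            continue                                        # prune the tip
        pruned[(u, v)] = support
    return pruned

graph = {("A", "B"): 9, ("B", "C"): 8, ("B", "X"): 1, ("C", "D"): 9}
print(remove_tips(graph))   # the weakly supported dead end ("B", "X") is gone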
Incorporation of Distance Constraints Using Pair-End Information The goal of this step is to map the information provided by read pairs that are linked by the pair-end library onto the compacted and error corrected De Bruijn graph. Once mapped, this information can be used to guide the assembly tour of the graph consistent (to the extent possible) with the distance constraints imposed by the pair-end information. To appreciate the value added by this step to the assembly procedure, recall that the pairend information consists of a list of read pairs of the form < ri , rj > that have originated from the same clonal insert during sequencing. In genomic distance parlance, this implies that the number of bases separating the two reads is bounded by a minimum and maximum. Also recall that an assembly from a De Bruijn graph corresponds to a tour of the graph (possibly multiple tours if there were gaps in the original sequencing). Now consider a scenario where a path in the De Bruijn graph branches into multiple separate subpaths. A correct assembly tour would have to decide which of those branches reflect the sequence of characters along the unknown genome. It should be easy to see how pair-end information can be used to resolve such situations. To incorporate the information provided in the form of read pairs by such pair-end libraries, the YAGA algorithm uses a cluster summarization procedure, which can be outlined as follows: First, the list of read pairs provided as input by the pair-end information is delineated into a corresponding list of constituent k + -mer pairs. Note that two k + -mers from two reads of a pair can map to two different edges on the distributed graph. Furthermore, different positions along the same edge could be paired with positions emanating from different edges. To this effect, the algorithm attempts to compute a grouping of edge pairs based on their best alignment with the imposed distance constraints. To achieve this, an observed distance interval is computed between every edge pair on the graph linked by a read pair, and then overlapping intervals that capture roughly the same distances along the same orientation are incrementally clustered using a greedy heuristic. This last step is achieved using a two-phase clustering step that primarily relies on several rounds of parallel sorting tuples containing edge pair and interval distance information. The formal details pertaining to this algorithmic step has been omitted here for brevity.
Completing the Assembly Tour An important offshoot of the above summarization procedure is that redundant distance information captured by edge pairs in the same cluster can be removed, thereby allowing significant compression in the level of information needed to store relative to the original graph. This compression, in most practical cases, would allow for the overall tour to be performed sequentially on a single node. The serial assembly touring procedure begins by using edges that have significantly longer edge labels than the original read length as “seeds,” and then by extending them in both directions as guided by the pair-end summarization traversal constraints. The YAGA assembler has demonstrated scaling on IBM BlueGene/L processors and performs assembly of over a billion synthetically generated reads in under h. The run-time is dominated by the pair-end cluster summarization phase. As of this writing, performancerelated information is not available for ABySS.
Future Trends
Genome sequencing and assembly is an actively pursued, constantly evolving branch of bioinformatics. Technological advancements in high-throughput sequencing have fueled algorithmic innovation, along with a compelling need to harness new computing paradigms. With the adoption of ever faster, cheaper, and massively parallel sequencing machines, this application domain is becoming increasingly data- and compute-intensive. Consequently, parallel processing is destined to play a critical role in genomic discovery. Numerous large-scale sequencing projects, from the assembly of the human genome to the most recent assembly of the maize genome, have benefited from the use of parallel processing, although in different ad hoc ways. Heterogeneous clusters comprising a mixture of a few high-end shared-memory machines along with numerous compute nodes have been used to "farm" out tasks and accelerate the overlap computation phase in particular. This is justified because nearly all of these large-scale projects used the more traditional Sanger sequencing. A few special-purpose projects such as the maize gene-enriched sequencing used more strongly coupled parallel codes such as PaCE. However, with an aggressive adoption of next-generation sequencing for genome sequencing
and more complex genomes in the pipeline for sequencing (e.g., wheat, pine, metagenomic communities), this scenario is about to change, and more strongly coupled parallel codes are expected to become part of mainstream computing in genome assembly. In addition to de novo sequencing, next-generation sequencing is also increasingly being used in genome resequencing projects, where the goal is to assemble the genome of a particular variant strain (or subspecies) of an already sequenced genome. The type of computation that arises is significantly different from that of de novo assembly. For resequencing, the reads generated from a new strain are compared against a fully assembled sequence of a reference strain. This process, sometimes called read mapping, only requires comparison of reads against a much larger reference. Approaches that capitalize on advanced string data structures such as suffix trees are likely to play an active role in these algorithms. Also, as this branch of science becomes more data-intensive, reaching the petascale range, different parallel paradigms such as MapReduce need to be explored, in addition to distributed-memory and shared-memory models. The developments in genome sequencing can be carried over to other related applications that also involve large-scale sequence analysis, e.g., in transcriptomics, metagenomics, and proteomics. Fine-grain parallelism in the area of string matching and sequence alignment has been an actively pursued topic over the last decade, and there are numerous hardware accelerators for performing sequence alignment on various platforms including General Purpose Graphical Processing Units, the Cell Broadband Engine, Field-Programmable Gate Arrays, and multicores. These advances are yet to take their place in mainstream sequencing projects. Genome sequencing is at the cusp of revolutionary possibilities. With a rapidly advancing technology base, the possibility of realizing landmark goals such as personalized medicine and the "$, genome" does not look distant or far-fetched. The well-advertised $ million Archon Genomics X PRIZE will be awarded to the first team that sequences human genomes in days, at a recurring cost of no more than $, per genome. In , the cost for sequencing a human genome plummeted below $, using technologies from Illumina and SOLiD. Going past these next-generation technologies, however, companies such as
Pacific Biosciences are now releasing a third-generation (“gen-”) sequencer that uses an impressive approach called single-molecule sequencing (or “SMS”) [], and have proclaimed grand goals such as sequencing a human genome in min for less than $ by . These are exciting times for genomics, and the field is likely to continue serving as a rich reservoir for new problems that pose interesting compute- and data-intensive challenges which can be addressed only through a comprehensive embrace of parallel computing.
Related Entries
Homology to Sequence Alignment, From
Suffix Trees
Bibliographic Notes and Further Reading
Even though DNA sequencing technologies have been available since the late s, it was not until the s that they were applied at a genome scale. The first genome to be fully sequenced and assembled was the ∼. Mbp bacterial genome of H. influenzae in []. The sequencing of the more complex, ∼ billion bp long human genome followed []. Several other notable large-scale sequencing initiatives followed in the new millennium, including those of chimpanzee, rice, and maize (to cite a few examples). All of these used either the WGS strategy or the hierarchical strategy coupled with Sanger sequencing, and their assemblies were performed using programs that followed the overlap–layout–consensus model. The National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov) maintains a comprehensive database of all sequenced genomes. More than a dozen programs exist to perform genome assembly under the overlap–layout–consensus model. Notable examples include PCAP [], Phrap (http://www.phrap.org/), Arachne [], Celera [], and the TIGR assembler []. For a detailed review of fragment assembly algorithms, refer to [, ]. The parallel algorithm that uses the cluster-then-assemble approach along with the suffix tree data structure for assembly was implemented in a program called PaCE and was first described in [] in the context of clustering Expressed Sequence Tag data, and then later adapted for genome assembly []. Experiments
conducted as part of the maize genome sequencing consortium demonstrated scaling of this method to over a million reads generated from gene-enriched fractions of the maize genome on a , node BlueGene/L supercomputer []. The parallel suffix tree construction algorithm described in this entry was first described in [] and a variant of this method was later presented in []. An optimal algorithm to detect maximal matching pairs of reads in parallel using the suffix tree data structure is presented in []. The NP-completeness of the Shortest Superstring Problem (SSP) was shown by Gallant et al. []. The De Bruijn graph formulation for genome assembly was first introduced by Idury and Waterman [] in the context of a sequencing technique called sequencing-by-hybridization, and later extended to WGS-based approaches in the EULER program by Pevzner et al. []. The string graph formulation was developed by Myers []. The proof of NP-Hardness for the overlap–layout–consensus model is due to Kececioglu and Myers []. The proofs of NP-Hardness for the De Bruijn and string graph models of genome assembly are due to Medvedev et al. []. Since the latter part of the s, various next-generation sequencing technologies such as Roche (http://www.genome-sequencing.com/), SOLiD (http://www.appliedbiosystems.com/), Illumina (http://www.illumina.com/), and HeliScope (http://www.helicosbio.com/) have emerged alongside serial assemblers. A "third" generation of machines that promise a brand new way of sequencing (by single-molecule sequencing) are also on their way (e.g., Pacific Biosciences (http://www.pacificbiosciences.com/)). Consequently, the development of short read assemblers continues to be actively pursued. Edena, Newbler (http://www..com), PE-Assembler [], QSRA [], SHARCGS [], SSAKE [], and VCAKE [] are all examples of programs that operate using the overlap graph model. EULER-SR [], ALLPATHS [], Velvet [], SOAPdenovo [], ABySS [], and YAGA [] are programs that use the De Bruijn graph formulation. Of these tools, PE-Assembler, ABySS, and YAGA are parallel implementations, although to varying degrees as described in the main text.
Acknowledgment Study supported by NSF grant IIS-.
Bibliography . Ariyaratne P, Sung W () PE-assembler: de novo assembler using short paired-end reads. Bioinformatics ():– . Batzoglou S, Jaffe DB, Stanley K, Butler J et al () ARACHNE: a whole-genome shotgun assembler. Genome Res ():– . Bryant D, Wong W, Mockler T () QSRA – a quality-value guided de novo short read assembler. BMC Bioinform (): . Butler J, MacCallum L, Kleber M, Shlyakhter IA et al () ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res :– . Chaisson MJ, Pevzner PA () Short read fragment assembly of bacterial genomes. Genome Res :– . Dohm J, Lottaz C, Borodina T, Himmelbaurer H () SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res ():– . Emrich S, Kalyanaraman A, Aluru S () Chapter : algorithms for large-scale clustering and assembly of biological sequence data. In: Handbook of computational molecular biology. CRC Press, Boca Raton . Fleischmann R, Adams M, White O, Clayton R et al () Wholegenome random sequencing and assembly of Haemophilus influenzae rd. Science ():– . Flusberg BA, Webster DR, Lee JH, Travers KJ et al () Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat Methods :– . Gallant J, Maier D, Storer J () On finding minimal length superstrings. J Comput Syst Sci :– . Ghoting A, Makarychev K () Indexing genomic sequences on the IBM blue gene. In: Proceedings ACM/IEEE conference on supercomputing. Portland . Huang X, Wang J, Aluru S, Yang S, Hiller L () PCAP: a wholegenome assembly program. Genome Res :– . Idury RM, Waterman MS () A new algorithm for DNA sequence assembly. J Comput Biol ():– . Jackson BG, Regennitter M, Yang X, Schnable PS, Aluru S () Parallel de novo assembly of large genomes from highthroughput short reads. In: IEEE international symposium on parallel distributed processing, pp – . Jeck W, Reinhardt J, Baltrus D, Hickenbotham M et al () Extending assembly of short DNA sequences to handle error. Bioinformatics :– . Kalyanaraman A, Aluru S, Brendel V, Kothari S () Space and time efficient parallel algorithms and software for EST clustering. IEEE Trans Parallel Distrib Syst ():– . Kalyanaraman A, Emrich SJ, Schnable PS, Aluru S () Assembling genomes on large-scale parallel computers. J Parallel Distrib Comput ():– . Kececioglu J, Myers E () Combinatorial algorithms for DNA sequence assembly. Algorithmica (–):– . Li R, Zhu H, Ruan J, Qian W et al () De novo assembly of human genomes with massively parallel short read sequencing. Genome Res ():– . Medvedev P, Georgiou K, Myers G, Brudno M () Computability of models for sequence assembly. Lecture notes in computer science, vol . Springer, Heidelberg, pp –
. Myers EW () The fragment assembly string graph. Bioinformatics, (Suppl ):ii–ii . Myers EW, Sutton GG, Delcher AL, Dew IM et al () A WholeGenome assembly of drosophila. Science ():– . Pevzner PA, Tang H, Waterman M () An eulerian path approach to DNA fragment assembly. In: Proceedings of the national academy of sciences of the United States of America, vol , pp – . Pop M () Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics ():– . Simpson J, Wong K, Jackman S, Schein J et al () ABySS: a parallel assembler for short read sequence data. Genome Res :– . Sutton GG, White O, Adams MD, Kerlavage AR () TIGR assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci Technol ():– . Venter C, Adams MD, Myers EW, Li P et al () The sequence of the human genome. Science ():– . Warren P, Sutton G, Holt R () Assembling millions of short DNA sequences using SSAKE. Bioinformatics :– . Zerbino DR, Velvet BE () Algorithms for de novo short read assembly using de bruijn graphs. Genome Res :–
Genome Sequencing
Genome Assembly
GIO
PCI Express
Glasgow Parallel Haskell (GpH)
Kevin Hammond
University of St. Andrews, St. Andrews, UK
Synonyms GpH (Glasgow Parallel Haskell)
Definition Glasgow Parallel Haskell (GpH) is a simple parallel dialect of the purely functional programming language, Haskell. It uses a semi-explicit model of parallelism, where possible parallel threads are marked by the programmer, and a sophisticated runtime system then
decides on the timing of thread creation, allocation to processing elements, and migration. There have been several implementations of GpH, covering platforms ranging from single multicore machines through to computational grids or clouds. The best known of these is the GUM implementation, which targets both shared-memory and distributed-memory systems using a sophisticated virtual shared-memory abstraction built over a common message-passing layer, but there is a newer implementation, GHC-SMP, which directly exploits shared-memory systems and which targets multicore architectures.
Discussion
Purely Functional Languages and Parallelism
Because of the absence of side effects in purely functional languages, it is relatively straightforward to identify computations that can be run in parallel: any sub-expression can be evaluated by a dedicated parallel task. For example, in the following very simple function definition, each of the two arguments to the addition operation can be evaluated in parallel.
f x = fibonacci x + factorial x
This property was already realized by the mid-s, when there was a surge of interest both in parallel evaluation in general, and in the novel architectural designs that it was believed could overcome the sequential “von-Neumann bottleneck.” In fact, the main issue in a purely functional language is not extracting enough parallelism – it is not unusual for even a short-running program to produce many tens of thousands of parallel threads – but is rather one of identifying sufficiently large-grained parallel tasks. If this is not done, then thread creation and communication overheads quickly eliminate any benefit that can be obtained from parallel execution. This is especially important on conventional processor architectures, which historically provided little, if anything, in the way of hardware support for parallel execution. The response of some parallel functional language designers has therefore been to provide very explicit parallelism mechanisms. While this usually avoids the problem of excessive parallelism, it places a significant burden on the programmer, who must understand the details of parallel execution as well as
the application domain, and it can lead to code that is specialized to a specific parallel architecture, or class of architectures. It also violates a key design principle for most functional languages, which is to provide as much isolation as possible from the underlying implementation. GpH is therefore designed to allow programmers to provide information about parallel execution while delegating issues of placement, communication, etc. to the runtime system.
History and Development of GpH Glasgow Parallel Haskell (GpH) was first defined in . GpH was designed to be a simple parallel extension to the then-new nonstrict, purely functional language Haskell [], adding only two constructs to sequential Haskell: par and seq. Unlike most earlier lazy functional languages, Haskell was always intended to be parallelizable. The use of the term “non-strict” rather than lazy reflects this: while lazy evaluation is inherently sequential, since it fixes a specific evaluation order for sub-expressions, under Haskell’s non-strict evaluation model, any number of sub-expressions can be evaluated in parallel provided that they are needed by the result of the program. The original GpH implementation targeted the GRIP novel parallel architecture, using direct calls to low-level GRIP communications primitives. GRIP was a shared-memory machine, using custom microcoded “intelligent memory units” to share global program data between off-the-shelf processing elements (Motorola processors, each with a private MB memory), developed in an Alvey research project which ran from to . Initially, GpH built on the prototype Haskell compiler developed by Hammond and Peyton Jones in . It was subsequently ported to the Glasgow Haskell Compiler, GHC [], that was developed at Glasgow University from onward, and which is now the de-facto standard compiler for Haskell (since the focus of the maintenance effort was moved to Microsoft Research in , GHC has also become known as the Glorious Haskell Compiler). In , the communications library was redesigned to give what became the highly portable GUM implementation []. This allowed a single parallel implementation to target both commercial shared-memory systems and the then-emerging class of cost-effective loosely coupled networks of workstations. Initially, GUM targeted
system-specific communication libraries, but it was subsequently ported to PVM. There are now UDP, PVM, MPI, and MPICH-G instances of GUM, as well as system-specific implementations, and the same GUM implementation runs on multicore machines, shared-memory systems, workstation clusters, computational grids, and is being ported to large-scale highperformance systems, such as the ,-core HECToR system at the Edinburgh Parallel Computer Centre. Key parts of the GUM implementation have also been used to implement the Eden [] parallel dialect of Haskell, and GpH is being incorporated into the latest mainstream version of GHC. Although the two parallelism primitives that GpH uses are very simple, it became clear that they could be packaged using higher-order functions to give much higher-level parallel abstractions, such as parallel pipelines, data-parallelism, etc. This led ultimately to the development of evaluation strategies []: high-level parallel structures that are built from the basic par and seq primitives using standard higher-order functions. Because they are built from simple components and standard language technology, evaluation strategies are highly flexible: they can be easily composed or nested; the parallelism structure can change dynamically; and the applications programmer can define new strategies on an as-need basis, while still using standard strategies.
The GpH Model of Parallelism
GpH is unusual in using only two parallelism primitives: par and seq. The design of the par primitive dates back to the late s [], where it was used in a parallel implementation of the Lazy ML (LML) compiler. The primitive higher-order function par is used by the programmer to mark a sub-expression as being suitable for parallel evaluation. For example, a variant of the function f above can be parallelized using two instances of par.
f x = let r1 = fibonacci x; r2 = factorial x
      in let result = (r1, r2)
      in r1 `par` (r2 `par` result)
r1 and r2 can now both be evaluated in parallel with the construction of the result pair (r1, r2). There is no need to specify any explicit communication, since
the results of each computation are shared through the variables r1 and r2. The runtime system also decides on issues such as when a thread is created, where it is placed, how much data is communicated, etc. Very importantly, it can also decide whether a thread is created. The parallelism model is thus semi-explicit: the programmer marks possible sites of parallelism, but the runtime system takes responsibility for the underlying parallel control based on information about system load, etc. This approach therefore eliminates significant difficulties that are commonly experienced with more explicit parallel approaches, such as raw MPI. The programmer needs to make sure there is enough scope for parallelism, but does not need to worry about issues of deadlock, communication, throttling, load-balancing, etc. In the example above, it is likely that only one of r1 or r2 (or neither) will actually be evaluated in parallel, since the current thread will probably need both their values. Which of r1 or r2 is evaluated first by the original thread will, however, depend on the context in which it is called. It is therefore left unspecified to avoid unnecessary sequentialization.
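A common way to address the granularity concern raised earlier is to stop sparking below a problem-size threshold, so that small computations remain in their parent thread. The function and the threshold value below are purely illustrative (the example also uses seq, which is described later in this entry):
import Control.Parallel (par)   -- from the parallel package

-- Illustrative only: spark parallel work only above a size threshold,
-- so that fine-grained calls are absorbed by their parent thread.
pfib :: Int -> Integer
pfib n
  | n < 2         = 1
  | n < threshold = pfib (n - 1) + pfib (n - 2)   -- too small to be worth a spark
  | otherwise     = x `par` (y `seq` x + y)       -- spark x; evaluate y, then combine
  where
    x = pfib (n - 1)
    y = pfib (n - 2)
    threshold = 25                                 -- illustrative cut-off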
Lazy Thread Creation
Several different versions of the par function have been described in the literature. The version used in GpH is asymmetric in that it marks its first argument as being suitable for possible execution (the expression is sparked), while continuing sequential execution of its second argument, which forms the result of the expression. For example, in par s e, the expression s will be sparked for possible parallel evaluation, and the value of e will be returned as the result of the current thread. Since the result expression is always evaluated by the sparking thread, and since there can be no side effects, it follows that it is completely safe to ignore any spark. That is, unlike many parallel systems, the creation of threads from sparks is entirely optional. This fact can be used to throttle the creation of threads from sparks in order to avoid swamping the parallel machine. This is a lazy thread creation approach (the term "lazy task creation" was coined later by Mohr et al. [] to describe a similar mechanism in MultiLisp). A corollary is that, since sparks do not carry any execution state, they can be very lightweight. Generally, a single pointer is adequate to record a sparked expression in most implementations of GpH.
Sparks may be chosen for conversion to threads using a number of different strategies. One common approach is to use the oldest spark first. If the application is a divide-and-conquer program, this means that the spark representing the largest amount of work will be chosen. Alternatively, an approach may be used where the youngest (smallest) spark is executed locally and the oldest (largest) spark is offloaded for remote execution. This will improve locality, but may increase thread creation costs, since many of the locally created threads might otherwise be subsumed into their parent thread.
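The two policies just described can be rendered schematically as follows, treating the spark pool as a sequence ordered from oldest to youngest. This is only a model of the policy, not the data structure actually used by GUM or GHC:
import Data.Sequence (Seq, ViewL (..), ViewR (..), viewl, viewr)

takeOldest, takeYoungest :: Seq spark -> Maybe (spark, Seq spark)

-- Oldest first: in a divide-and-conquer program the oldest spark usually
-- represents the largest amount of work, e.g., a good candidate for offloading.
takeOldest pool = case viewl pool of
  EmptyL  -> Nothing
  s :< ss -> Just (s, ss)

-- Youngest first (locally): small, recent sparks are likely to be subsumed
-- into the parent thread anyway, so running them locally improves locality.
takeYoungest pool = case viewr pool of
  EmptyR  -> Nothing
  ss :> s -> Just (s, ss)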
Parallel Graph Reduction A second key issue that must be dealt with is that of thread synchronization. Efficient sequential non-strict functional language implementations generally use an evaluation technique called graph reduction, where a program builds a graph data structure representing the work that is needed to give the result of a program and gradually rewrites this using the functional rules defined in the program until the graph is sufficiently complete to yield the required result. This rewriting process is known as reducing the graph to some normal form (in fact weak head normal form). Each node in the graph represents an expression in the original program. Initially, this will be a single unevaluated expression corresponding to the result of the program (a “thunk”). As execution proceeds, thunks are evaluated, and each graph node is overwritten with its result. In this way, results are automatically shared between several consumers. Only graph nodes that actually contribute to the final result need to be rewritten. This means that unnecessary work can be avoided, a process that, in the sequential world, allows lazy evaluation. The same mechanism naturally lends itself to parallel evaluation. Each shared node in the graph represents a possible synchronization point between two parallel threads. The first thread to evaluate a graph node will lock it. Any thread that evaluates the node in parallel with the thread that is evaluating it will then block when it attempts to read the value of the node. When the evaluating thread produces the result, the graph node will be updated and any blocked threads will be awoken. In this way, threads will automatically synchronize through the graph representing the computation, not
just at the root of each thread, but whenever they share any sub-expressions. Figure shows a simple example of a divide-andconquer parallel program, where the root of the computation, f 9, is rewritten using three threads: one for the main computation and two sub-threads, one to evaluate f 8 and one to evaluate f 7. These thunks are linked into the addition nodes in the main computation. Having evaluated the sparked thunks, the second and third threads update their root nodes with the corresponding result. Once the main thread has evaluated the remaining thunk (another call to f 8, which we have assumed is not shared with Thread #), it will incorporate these results into its own result. The final stage of the computation is to rewrite the root of the graph (the result of the program) with the value 89.
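The locking and blocking behaviour described above can be modelled at user level with one MVar per shared node; the following is only a sketch of the protocol, not GHC's actual heap representation of thunks:
import Control.Concurrent.MVar

newtype Node a = Node (MVar (NodeState a))
data NodeState a = Thunk (IO a)   -- unevaluated expression
                 | Value a        -- already overwritten with its result

-- The first thread to demand the node takes the MVar ("locks" the node);
-- any thread arriving while it is empty blocks in takeMVar and is woken
-- once the result has been written back.
demand :: Node a -> IO a
demand (Node ref) = do
  st <- takeMVar ref
  case st of
    Value v       -> putMVar ref (Value v) >> return v
    Thunk compute -> do
      v <- compute                 -- evaluate the thunk
      putMVar ref (Value v)        -- update the node with its result
      return v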
Evaluate-and-Die As a thread evaluates its program graph, it may encounter a node that has been sparked. There are three possible cases, depending on the evaluation status of the sparked node. If the thunk has already been evaluated, then its value can be used as normal. If the thunk has not yet been evaluated, then the thread will evaluate it as normal, first annotating it to indicate that it is under evaluation. Once a result is produced, the node will be overwritten with the result value, and any blocked threads will be reawakened. Finally, if the node is actually under evaluation, then the thread must be blocked until the result is produced and the node updated with this value. This is the evaluate-and-die execution model []. The advantage of this approach is that it automatically absorbs sparks into already executing threads, so increasing their granularity and avoiding thread creation overheads. If a spark refers to a thunk that has already been evaluated then it may be discarded without a thread ever being created. The seq Primitive While par is entirely adequate for identifying many forms of parallelism, Roe and Peyton Jones discovered [] that by adding a sequential combining function, called seq, much tighter control could be obtained over parallel execution. This could be used both to reduce the need for detailed knowledge of the underlying parallel implementation and to encode more sophisticated patterns of parallel execution. For example, one
Glasgow Parallel Haskell (GpH). Fig. Parallel graph reduction
of the par constructs in the example above can be replaced by a seq construct, as shown below.
f x = let r1 = fibonacci x; r2 = factorial x
      in let result = (r1, r2)
      in r1 `par` (r2 `seq` result)   -- was r1 `par` (r2 `par` result)
The seq construct acts like ; in a conventional language such as C: it first evaluates its first argument (here r2), and then returns its second argument (here result). So in the example above, rather than creating two sparks as before, now only one is created. Previously, depending on the order in which threads were scheduled, either the parent thread would have blocked when it evaluated r1 or r2, because these
were already under evaluation, or one or both of the sparked threads would have blocked, because they were being evaluated (or had been evaluated) by the parent thread. Now, however, the only possible synchronization is between the thread evaluating r1 and the parent thread. Note that it is not possible to use either of the simpler forms of par r1 (r1, r2) or par r2 (r1, r2) to achieve the same effect, as might be expected. Because the implementation is free to evaluate the pair (r1, r2) in whichever order it prefers, there is a % probability that the spark will block on the parent thread. This is not the only use of seq. For example, evaluation strategies (discussed below) also make heavy use of this primitive to give precise control over evaluation order. While not absolutely essential for parallel evaluation, it is thus very useful to provide a finer
degree of control than can be achieved using par alone. However, there are two major costs to the use of seq. The first is that the strictness properties of the program may be changed – this means that some thunks may be evaluated that were previously not needed, and that the termination properties of the program may therefore be changed. The second is that parallel programs may become speculative.
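A tiny illustration of the first cost: seq demands a value that lazy evaluation alone would never have touched.
safe, unsafe :: Int
safe   = snd (undefined, 42)        -- 42: the undefined component is never forced
unsafe = x `seq` y                  -- raises the error from undefined when evaluated
  where (x, y) = (undefined, 42)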
Speculative Evaluation
As described above, in GpH it is safe to spark any sub-expression that is needed by the result of the parallel computation. However, nothing prevents the programmer from sparking a sub-expression that is not known to be needed. This therefore allows speculative evaluation. Although there was some early experimentation with mechanisms to control speculation, these were found to be very difficult to implement, and current implementations of GpH do not provide facilities to kill created threads, change priorities dynamically, etc., as is necessary to support safe speculation. For example,
defines a new function spec that sparks x and continues execution of y. If x is not definitely needed by y, or is a completely independent expression, this will spark x speculatively. If a thread is created to evaluate x, it will terminate either when it has completed evaluation of x, or when the main program terminates, having produced its own result, or if it evaluates an undefined value. In the latter case, it may cause the entire program to fail. Moreover, it is theoretically possible to prevent any progress on the main computation by flooding the runtime system with useless speculative threads. For these reasons, any speculative evaluation has to be treated carefully: if the program is to terminate, speculative sub-expressions should terminate in finite time without an error, and there should not be too many speculative sparks.
Evaluation Strategies As discussed above, it is possible to construct higherlevel parallel abstractions from basic seq/par constructs using standard Haskell programming constructs. These parallel abstractions can be associated with computation functions using higher-order definitions. This is the evaluation strategy approach. The key idea of evaluation
Evaluation Strategies
As discussed above, it is possible to construct higher-level parallel abstractions from basic seq/par constructs using standard Haskell programming constructs. These parallel abstractions can be associated with computation functions using higher-order definitions. This is the evaluation strategy approach. The key idea of evaluation strategies is to separate what is to be evaluated from how it could be evaluated in parallel []. For example, a simple sequential ray tracing function could be defined as follows:
raytracer :: Int -> Int -> Scene -> [Light] -> [[Vector]]
raytracer xlim ylim scene lights = map traceline [0 .. ylim - 1]
  where traceline y = [ tracepixel scene lights x y | x <- [0 .. xlim - 1] ]
Given a visible image with maximum x and y dimensions of xlim and ylim, the raytracer function applies the traceline function to each line in the image (by mapping it across each value of y in the range 0..ylim-1), and hence applies the tracepixel function to each pixel using the specified scene and lighting model. The raytracer function may be parallelized, for example, by adding the parMap strategy to parallelize each line as follows:
raytracer :: Int -> Int -> Scene -> [Light] -> [[Vector]]
raytracer xlim ylim scene lights =
  map traceline [0 .. ylim - 1] `using` parMap rnf
Here, parMap is a strategy that specifies that each element of its value argument should be evaluated in parallel. It is parameterized on another strategy that specifies what to do with each element. Here, rnf indicates that each element should be evaluated as far as possible (rnf stands for "reduce to normal form"). The using function simply applies its second argument (an evaluation strategy) to its first argument (a functional value), and returns the value. It can be easily defined using higher-order functions and the seq primitive as follows:
using :: a -> Strategy a -> a
x `using` s = s x `seq` x
So, when a strategy is applied to an expression by the using operation, the strategy is first applied to the value of the expression, and once this has completed, the value is returned. This allows the use of both parallel and sequential strategies. In fact, any list strategy may be used to parallelize the raytracer function instead of parMap. For example, a task farm could be used, or the list could be divided into equally sized chunks as shown below.
raytracer :: Int -> Int -> Scene -> [Light] -> [[Vector]]
raytracer xlim ylim scene lights =
  ... `using` parListChunk chunkSize rnf
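For reference, in the original formulation of evaluation strategies a Strategy a is simply a function of type a -> (), and combinators such as parList and parMap are ordinary Haskell definitions over par, seq, and the using function shown above. The sketch below follows that style; the strategies library currently shipped with GHC uses a slightly different, Eval-monad-based formulation, and rnf additionally relies on a type class for deep evaluation (omitted here).
type Strategy a = a -> ()

rwhnf :: Strategy a                   -- reduce to weak head normal form
rwhnf x = x `seq` ()

parList :: Strategy a -> Strategy [a]
parList _     []       = ()
parList strat (x : xs) = strat x `par` parList strat xs

parMap :: Strategy b -> (a -> b) -> [a] -> [b]
parMap strat f xs = map f xs `using` parList strat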
Evaluation strategies have proved to be very powerful and flexible, having been successfully applied to several large programs, including symbolic programs with irregular patterns of parallelism []. Typically, only a few lines need to be changed at key points by adding appropriate using clauses. In abstracting parallel patterns from sequential computations, evaluation strategies have some similarities with algorithmic skeletons. The key differences between evaluation strategies and typical skeleton approaches are that evaluation strategies do not mandate a specific parallel implementation (since they rely on semi-explicit parallelism, they provide hints to the runtime system rather than directives); and that they are completely user-programmable – everything above the par and seq primitives is programmed using standard Haskell code. Unlike most skeleton approaches, they may also be easily composed to give irregular nested parallel structures or phased parallel computations, for example.
The GUM Implementation of GpH
GUM is the original and most general implementation of GpH. It provides a virtual shared-memory abstraction for parallel graph reduction, using a message-passing implementation to target both physically shared-memory and distributed-memory systems. In the GUM model, the graph representing the program that is being evaluated is distributed among the set of processor elements (PEs) that are available to execute the program. These PEs usually correspond to the cores or processors that are available to execute the parallel program, but it is possible to map PEs onto multiple processes on the same core/processor, if required. Each PE manages its own execution environment, with its own local pools of sparks and threads, which it schedules as required. Sparks and threads are offloaded on demand to maintain good work balance. A key feature of GUM is that it uses a two-level memory model: each PE has its own private heap that contains unshared program graph created by local threads. Within this heap, some graph nodes may be shared with other PEs. These nodes are given global addresses, which identify the owner of the graph node. The advantage of this model is that it allows completely independent memory management: by keeping tables of global in-pointers, which are used as garbage collection roots into the local heap, each PE can garbage-collect its own local heap independently of all other PEs. This local garbage collection is conservative, as required, since even if a global in-pointer is no longer referenced by any other PE, it will still be treated as a garbage collection root. In order to collect global garbage, a separate scheme is used, based on distributed reference counting. Since the majority of the program graph never needs to be shared, this approach brings major efficiency gains over typical single-level memory management. In addition to the reduced need for synchronization during garbage collection, there is also no need to maintain global locks across purely local graph. Within each PE, GUM uses the same efficient (and sequential) garbage collector as the standard GHC implementation. This is currently a stop-and-copy generational collector based on Appel's collector.

Visualizing the Behavior of GpH Programs
Good visualization is an essential part of understanding parallel behavior. Runtime information can be visualized at a variety of levels to give progressively more detailed information. For example, Fig. visualizes the overall and per-PE parallel activity for a -PE system running the soda application (a simple crossword-puzzle solver). Apart from the clear phase transition about % into the execution, very few threads are blocked; and there are very few runnable, but not running, threads. Those that are runnable are migrated to balance overall system load. The profile also reveals that relatively little time is spent fetching nonlocal data. The per-PE profile gives a more detailed view. It is obvious that the workload is well balanced and that the work-stealing mechanism is effective in distributing work. The example also shows where PEs are idle for repeated periods, and that only the main PE is active at the end of the computation (collating results to be written as the output of the program). It thus clearly identifies places where improvements could be made to the parallel program.
Glasgow Parallel Haskell (GpH). Fig. Sample overall and per-PE Activity Profiles for a -PE machine
The GHC-SMP Implementation of GpH A recent development is the GHC-SMP [] implementation of GpH from Microsoft Research Labs, Cambridge, UK, which is integrated into the standard GHC distribution. This provides an implementation of GpH that specifically targets shared-memory and multicore systems. It uses a similar model of PEs and threads to the GUM implementation, including spark pools and runnable thread pools, and implements a similar sparkstealing model. The key difference from GUM is that GHC-SMP uses a physically shared heap, rather than a virtual shared heap with an underlying message-passing implementation. This heap is garbage-collected using a global stop-and-copy collector that requires the synchronization of all PEs, but which may itself be executed in parallel using the available processor cores. Also, unlike GUM, GHC-SMP exports threads to remote PEs based on the load of the local core, rather than on demand.
GRID-GUM and the SymGrid System By replacing the low-level communications library with MPICH-G, it is possible to execute GUM (or Eden – see below) not only on standard clusters of workstations, but also on wide-area computational grids, coordinated by the standard Globus grid middleware. The GRID-GUM and grid-enabled Eden implementations form the basis for the SymGrid-Par middleware [], which aims to provide high-level support for computational grids for a variety of symbolic computing systems. The middleware coordinates symbolic computing engines into a coherent parallel system, providing high-level skeletons. The system uses a high-level data-exchange protocol (SCSCP) based on the standard OpenMath XML format for mathematical data. Results have been very promising to date, with superlinear performance being achievable for some mathematical applications, without changing any of the sequential symbolic computing engines.
The GranSim Simulator The GranSim parallel simulator was developed in [] as an efficient and accurate simulator for GpH running on a sequential machine, specifically to expose granularity issues. It is unusual in modifying the sequential GHC runtime system so that evaluating graph nodes also triggers the addition of sparks,
and simulates inter-PE communication. It thus follows the actual GUM implementation very precisely. It allows a range of communication costs to be simulated for a specific parallel application, ranging from zero (ideal communication) to costs that are similar to those of real parallel machines. Program execution times are also simulated, using a cost model that takes architectural characteristics into account. One of the key uses of GranSim is as the core of a parallel program development methodology, where a GpH program is first simulated under ideal parallel conditions, and then under communication cost settings for specific parallel machines: shared-memory, distributed-memory, etc. This allows the program to be gradually tuned for a specific parallel setting without needing access to the actual parallel machine, and in a way that is repeatable, can be easily debugged, and which provides detailed metrics.
GdH and Mobile Haskell
GdH [] is a distributed Haskell dialect that builds on GpH. It adds explicit constructs for task creation on specific processors, with inter-processor communication through explicit mutable shared variables, that can be written on one processor and read on another. Internally, each processor runs a number of threads. The GdH implementation extends the GUM implementation with a number of explicit constructs. The main construct is revalIO, which constructs a new task on a specific processor. In conjunction with mutable variables, this can be used to construct higher-level constructs. For example, the showSystem definition below prints the names of all available PEs. The getAllHostNames function maps the hostName function over the list of all processor identifiers, returning an IO action that obtains the environment variable HOST on the specified PE. This is used in the showSystem IO operation which first obtains a list of all host names, then outputs them as a sorted list, with duplicates eliminated using the standard nub function.
getAllHostNames :: IO [String]
getAllHostNames = mapM hostName allPEId

hostName :: PEId -> IO String
hostName pe = revalIO (getEnv "HOST") pe

showSystem = do { hostnames <- getAllHostNames
                ; showHostNames hostnames }
  where showHostNames names = putStrLn (show (sort (nub names)))
At the language level, GdH is similar to Concurrent Haskell, which is designed to execute explicit concurrent threads on a single PE. The key language difference is the inclusion of an explicit PEId to allocate tasks to PEs, supported by a distributed runtime environment. GdH is also broadly similar to the industrial Erlang language, targeting a similar area of distributed programming, but is non-strict rather than strict and, since it is a research language, does not have the range of telecommunications-specific libraries that Erlang supports in the form of the Erlang/OTP development platform. Mobile Haskell [] similarly extends Concurrent Haskell, adding higher-order communication channels (known as Mobile Channels), to support mobile computations across dynamically evolving distributed systems. The system uses a bytecode implementation to ensure portability, and serializes arbitrary Haskell values including (higher-order) functions, IO actions, and even mobile channels.
Eden
Eden [] (described in a separate encyclopedia entry) is closely related to GpH. Like GpH, it uses an implicit communication mechanism, sharing values through program graph nodes. However, it uses an explicit process construct to create new processes and explicit process application to pass streams of values to each process. Unlike GpH, all data that is passed to an Eden process must be evaluated before it is communicated. There is also no automatic work-stealing mechanism, task migration, granularity agglomeration, or virtual shared graph mechanism. Eden does, however, provide additional threading support: each output from the process is evaluated using its own parallel thread. It also provides mechanisms to allow explicit communication channels to be passed as first-class objects between processes. Eden has been used to implement both algorithmic skeletons and evaluation strategies, where it has the advantage of providing more controllable parallelism but the disadvantage of losing adaptivity. One
particularly useful technique is to use Eden to implement a master–worker evaluation strategy, where a set of dynamically generated worker functions is allocated to a fixed set of worker processors using an explicitly programmed scheduler. This allows Eden to deal with varying thread granularities, without using a lazy thread creation mechanism, as in GpH. While it does not need the advanced adaptivity mechanisms that GUM provides, the Eden implementation shares several basic components, in particular, the communication library and scheduler have very similar implementations.
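A minimal sketch of the Eden style described above is given below. The signatures follow the Eden papers but are simplified (no Trans constraints, no streaming), and Process, process, and (#) are defined here only as trivial placeholders so that the shape of an Eden-style parallel map can be shown.
-- Placeholder definitions, for illustration only; Eden's real constructs are:
--   process :: (Trans a, Trans b) => (a -> b) -> Process a b   -- process abstraction
--   (#)     :: (Trans a, Trans b) => Process a b -> a -> b     -- process instantiation
newtype Process a b = Process (a -> b)

process :: (a -> b) -> Process a b
process = Process

(#) :: Process a b -> a -> b
Process f # x = f x

parMapEden :: (a -> b) -> [a] -> [b]
parMapEden f xs = map (process f #) xs   -- one (conceptual) Eden process per list element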
Current Status
GpH has been adopted as a significant part of the Haskell community effort on parallelism. The evaluation strategies library has just been released as part of the mainstream GHC compiler release, building on the GHC-SMP implementation, and there has also been significant recent effort on visualization with the release of the EdenTV and ThreadScope visualizers for Eden and GHC-SMP, respectively. Work is currently underway to integrate GUM and GHC-SMP to give a wide-spectrum implementation for GpH, and Eden will also form part of this effort. The SymGrid-Par system is being developed as part of a major UK project to provide support for high-performance computing on massively parallel machines, such as HECToR.
Future Directions
The advent of multicore computers and the promise of manycore computers have changed the nature of parallel computing. The GpH model of lightweight multithreaded parallelism is well suited to this new order, offering the ability to easily generate large amounts of parallelism. The adaptivity mechanisms built into the GUM implementation mean that the parallel program can dynamically, and automatically, change its behavior to increase parallelism, improve locality, or throttle back parallelism as required. Future parallel systems are likely to be built more hierarchically than present ones, with processors built from heterogeneous combinations of general-purpose cores, graphics processing units, and other specialist units, using several communication networks and a complex memory hierarchy. Proponents of manycore architectures anticipate that a single processor
will involve hundreds or even thousands of such units. Because of the cost of maintaining a uniform memory model across large numbers of systems, when deployed on a large scale these processors are likely to be combined hierarchically into multiple levels of clusters. These systems may then be used to form high performance “clouds.” Future parallel languages and implementations must therefore be highly flexible and adaptable, capable of dealing with multiple levels of communication latency and internal parallel structure, and perhaps fault tolerance. The GpH model with evaluation strategies, supported by adaptive implementations such as GUM forms a good basis for this, but it will be necessary to extend the existing models and implementations to cover more heterogeneous processor types, to deal with multiple levels of parallelism and additional program structure, and to focus more directly on locality issues.
Related Entries
Eden
Fortress (Sun HPCS Language)
Functional Languages
Futures
MPI (Message Passing Interface)
MultiLisp
NESL
Parallel Skeletons
Processes, Tasks, and Threads
Profiling
PVM (Parallel Virtual Machine)
Shared-Memory Multiprocessors
Sisal
Speculation, Thread-Level

Bibliographic Notes and Further Reading
The main venues for publication on GpH and other parallel functional languages are the International Conference on Functional Programming (ICFP) and its satellite events including the Haskell Symposium; the International Symposium on Implementation and Application of Functional Languages (IFL); the International Conference on Programming Language Design and Implementation (PLDI); and the Symposium on Trends in Functional Programming (TFP). Papers also frequently appear in the Journal of Functional Programming (JFP). A survey of Haskell-based parallel languages and implementations, as of , can be found in [], and a general introduction to research in parallel functional programming, as of , can be found in []. Some examples of the use of GpH in larger applications can be found in []. Much of this entry is based on private notes, e-mails, and final reports on the various research projects that have used GpH. The main paper on evaluation strategies is []. The main paper on the GUM implementation is [], and the main paper on the GranSim simulator is []. Many subsequent papers have used these systems and ideas. For example, one recent paper describes the SymGrid-Par system []. Further material may be found on the GpH web page at http://www.macs.hw.ac.uk/~dsg/gph/, on the GdH web page at http://www.macs.hw.ac.uk/~dsg/gdh/, on the GHC web page at http://www.haskell.org/ghc, on the SCIEnce project web page at http://www.symbolic-computation.org, on the Eden web page at http://www.mathematik.uni-marburg.de/~eden, and on the contributor's web page at http://www-fp.cs.st-andrews.ac.uk/~kh.
Bibliography . Hammond K, Loidl H-W, Partridge A () Visualising granularity in parallel programs: a graphical winnowing system for Haskell. Proceedings of the HPFC’ – Conference on High Performance Functional Computing, Denver, pp –, – April . Hammond K, Michaelson G (eds) () Research directions in parallel functional programming. Springer, Heidelberg . Hammond K, Zain AA, Cooperman G, Petcu D, Trinder PW () SymGrid: a framework for symbolic computation on the grid. Proceedings of the EuroPar ’: th International EuroPar Conference, Springer LNCS , Rennes, France, pp – . Loidl H-W, Trinder P, Hammond K, Junaidu SB, Morgan RG, Peyton Jones SL () Engineering parallel symbolic programs in GpH. Concurrency Pract Exp ():– . Loogen R, Ortega-Mallén Y, Peña Mar R () Parallel functional programming in Eden. J Functional Prog ():– . Marlow S, Jones SP, Singh S () Runtime support for multicore Haskell. Proceedings of the ICFP ’: th ACM SIGPLAN International Conference on Functional Programming, ACM Press, New York . Mohr E, Kranz DA, Halstead RH Jr () Lazy task creation: a technique for increasing the granularity of parallel programs. IEEE Trans Parallel Distrib Syst ():– . Peyton Jones S, Clack C, Salkild J () High-performance parallel graph reduction. Proceedings of the PARLE’ – Conference
on Parallel Architectures and Languages Europe, Springer LNCS , pp – Peyton Jones S, Hall C, Hammond K, Partain W, Wadler P () The Glasgow Haskell compiler: a technical overview. Proceedings of the JFIT (Joint Framework for Information Technology) Technical Conference, Keele, UK, pp – Peyton Jones S (ed), Hughes J, Augustsson L, Barton D, Boutel B, Burton W, Fasel J, Hammond K, Hinze R, Hudak P, Johnsson T, Jones M, Launchbury J, Meijer E, Peterson J, Reid A, Runciman C, Wadler P () Haskell language and libraries. The revised report. Cambridge University Press, Cambridge Pointon R, Trinder P, Loidl H-W () The design and implementation of Glasgow distributed Haskell. Proceedings of the IFL’ – th International Workshop on the Implementation of Functional Languages, Springer LNCS , Aachen, Germany, pp – Rauber Du Bois A, Trinder P, Loidl H () mHaskell: mobile computation in a purely functional language. J Univ Computer Sci ():– Roe P () Parallel programming using functional languages. PhD thesis, Department of Computing Science, University of Glasgow Trinder P, Hammond K, Loidl H-W, Peyton Jones S () Algorithm + strategy = parallelism. J Funct Prog ():– Trinder P, Hammond K, Mattson J Jr, Partridge A, Peyton Jones S () GUM: a portable parallel implementation of Haskell. Proceedings of the PLDI’ – ACM Conference on Programming Language Design and Implementation, Philadelphia, pp – Trinder P, Loidl H-W, Pointon R () Parallel and distributed Haskells. J Funct Prog (&):–. Special Issue on Haskell
Global Arrays
Global Arrays Parallel Programming Toolkit
Global Arrays Parallel Programming Toolkit
Jarek Nieplocha†, Manojkumar Krishnan, Bruce Palmer, Vinod Tipparaju, Robert Harrison, Daniel Chavarría-Miranda
Pacific Northwest National Laboratory, Richland, WA, USA
Oak Ridge National Laboratory, Oak Ridge, TN, USA
† deceased.

Synonyms
Global arrays; GA

Definition
Global Arrays is a high-performance programming model for scalable, distributed-memory, parallel computer systems. Global Arrays is based on the concept of globally accessible dense arrays that are logically shared, yet physically distributed onto the memories of a parallel distributed computer system (Fig. illustrates this concept).

Global Arrays Parallel Programming Toolkit. Fig. Dual view of Global Arrays data structures (single, shared data structure; physically distributed data)

Discussion
Introduction
Global Arrays (GA) is a high-performance programming model for scalable, distributed-memory, parallel computer systems. GA is a library-based Partitioned Global Address Space (PGAS) programming model. The underlying supported sequential languages are Fortran, C, C++, and Python. GA provides global view access to very large dense arrays through API functions implemented for those languages, under a Single Program Multiple Data (SPMD) execution environment. GA was originally developed as part of the underlying software infrastructure for the US Department of Energy's NWChem computational chemistry software package. Over time, it has been developed into a standalone package with a rich set of API functions (+) that cater to many needs in scientific application development. GA has been used to enable scalable
parallel execution for several major scientific applications including NWChem (computational chemistry, specifically electronic structure calculation), STOMP (Subsurface Transport Over Multiple Phases, a subsurface flow and transport simulator), ScalaBLAST (a more scalable, higher-performance version of BLAST), Molpro (quantum chemistry), TETHYS (unstructured, implicit CFD and coupled fluid/solid mechanics finite volume code), Pagoda (Parallel Analysis of Geodesic Data), COLUMBUS (computational chemistry), and GAMESS-UK (computational chemistry). GA's development has occurred over the last two decades. For this reason, the number of people involved is large, as is the range of their contributions. GA's original development occurred as a co-design effort between the NWChem team and the computer science team focused on GA. The main designer and original developer of GA was Jarek Nieplocha. Robert Harrison led the main effort in the development of NWChem.
Basic Global Arrays There are three classes of operations in Global Arrays: core operations, task parallel operations, and dataparallel operations. These operations have multiple language bindings, but provide the same functionality independent of the language. The current GA library contains approximately operations that provide a rich set of functionality related to data management and computations involving distributed arrays. GA is interoperable with MPI, enabling the development of hybrid programs that use both programming models. The basic components of the Global Arrays toolkit are function calls to create global arrays, copy data to, from, and between global arrays, and identify and access the portions of the global array data that are held locally. There are also functions to destroy arrays and free up the memory originally allocated to them. The basic function call for creating new global arrays is nga_create. The arguments to this function include the dimension of the array, the number of indices along each of the coordinate axes, and the type of data (integer, float, double, etc.) that each array element represents. The function returns an integer handle that can be used to reference the array in all subsequent operations. The allocation of data can be left completely to the toolkit, but if it is desirable to control the distribution
of data for load balancing or other reasons, additional versions of the nga_create function are available that allow the user to specify in detail how data is distributed between processors. The basic nga_create call provides a simple mechanism to control data distribution via the specification of an array that indicates the minimum dimensions of a block of data on each processor. One of the most important features of Global Arrays is the ability to easily move blocks of data between global arrays and local buffers. The data in the global array can be referred to using a global indexing scheme and data can be moved in a single function call, even if it represents data distributed over several processors. The nga_get function can be used to move a block of distributed data from a global array to a local buffer. The arguments consist of the array handle for the array that data is being taken from, two integer arrays representing the lower and upper indices that bound the block of distributed data that is going to be moved, a pointer to the local buffer or a location in the local buffer that is to receive the data, and an array of strides for the local data. The nga_put call is similar and can be used to move data in the opposite direction. The number of basic GA operations is fairly small and many parallel programs can be written with just the following ten routines:
●
● ● ●
●
GA_Initialize(): Initialize the GA library. GA_Terminate(): Release internal resources and finalize execution of a GA program. GA_Nnodes(): Return the number of GA compute processes (corresponds to the SPMD execution environment). GA_Nodeid(): Return the GA process ID of the calling compute process, this is a number between and GA_Nnodes() – 1. NGA_Create(): Create an n-dimensional globally accessible dense array (global array instance). NGA_Destroy(): Deallocate memory and resources associated with a global array instance. NGA_Put(): Copy data from a local buffer to an array section within a global array instance in a onesided manner. NGA_Get(): Copy data from an array section within a global array instance to a local buffer in a
● GA_Sync(): Synchronize compute processes via a barrier and ensure that all pending GA operations are complete (in accordance with the GA consistency model).
● NGA_Distribution(): Return the array section owned by a specified compute process.
Global Arrays Parallel Programming Toolkit. Fig. Left: GA_Get flow chart (the calling process determines ownership and data locality, issues multiple non-blocking ARMCI_NbGet's, one for the data on each remote destination, and waits for completion of all data transfers). Right: An example: Process P issues GA_Get to get a chunk of data, which is distributed (partially) among P, P, P, and P (owners of the chunk)
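To complement the Fortran example presented in the next section, the following is a minimal sketch of these ten routines using GA's C bindings. It is illustrative only: the array sizes, the C_DBL type constant, and the omission of error handling and memory-allocator (MA) setup are simplifying assumptions rather than part of the text above.

/* Minimal sketch of the ten core GA routines, C bindings assumed. */
#include <stdio.h>
#include "mpi.h"
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);          /* SPMD execution environment          */
    GA_Initialize();                 /* start the GA library                */

    int nproc = GA_Nnodes();         /* number of GA compute processes      */
    int me    = GA_Nodeid();         /* my process ID, 0 .. nproc-1         */

    int dims[2]  = {100, 100};
    int chunk[2] = {-1, -1};         /* let the library choose distribution */
    int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);

    /* each process writes one row, then reads another process's row */
    double buf[100];
    int lo[2] = {me % dims[0], 0}, hi[2] = {me % dims[0], dims[1] - 1}, ld[1] = {dims[1]};
    for (int i = 0; i < 100; i++) buf[i] = (double) me;
    NGA_Put(g_a, lo, hi, buf, ld);   /* one-sided store into the global array */

    GA_Sync();                       /* barrier; completes pending GA operations */

    int lo2[2] = {(me + 1) % nproc % dims[0], 0};
    int hi2[2] = {(me + 1) % nproc % dims[0], dims[1] - 1};
    NGA_Get(g_a, lo2, hi2, buf, ld); /* one-sided load of (possibly remote) data */

    int plo[2], phi[2];
    NGA_Distribution(g_a, me, plo, phi);   /* which block do I own? */
    printf("process %d owns [%d:%d, %d:%d]\n", me, plo[0], phi[0], plo[1], phi[1]);

    NGA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}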
Example GA Program We present a parallel matrix multiplication program written in Global Arrays using the Fortran language interface. It uses most of the basic GA calls described before, in addition to some more advanced calls to create global array instances with specified data distributions. The program computes the result of C = A × B. Some variable declarations have been omitted for brevity.
Discussion on the Example Program The program is (mostly) a fully functional GA code, except for omitted variable declarations. It creates global array instances with specific data distributions, illustrates the use of the nga_put() and nga_get() primitives, as well as locality information through the nga_distribution() call. The code includes calls to initialize and terminate the MPI library, which are needed to provide the SPMD execution environment to the GA application. (It is possible to write a GA application that does not call the MPI library through the use of the TCGMSG simple message-passing environment included with GA.) The code includes the creation and use of local buffers to be used as sources and targets for put and get operations (lA, lB, lC), which in this case were allocated as Fortran dynamic arrays.
 1  program matmul
 2    integer :: sz
 3    integer :: i, j, k, pos, g_a, g_b, g_c
 4    integer :: nproc, me, ierr
 5    integer, dimension(2) :: dims, nblock, chunks
 6    integer, dimension(1) :: lead
 7    double precision, dimension(:,:), pointer :: lA, lB
 8    double precision, dimension(:,:), pointer :: lC
 9    call mpi_init(ierr)
10    call ga_initialize()
11    nproc = ga_nnodes()
12    me = ga_nodeid()
13    if (me .eq. 0) then
14      write (*, *) 'Running on: ', nproc, ' processors'
15    end if
16    dims(:) = sz
17    chunks(:) = sz/sqrt(dble(nproc))  ! only runs on a perfect square number of processors
18    nblock(1) = sz/chunks(1)
19    nblock(2) = sz/chunks(2)
20    allocate(dmap(nblock(1) + nblock(2)))
21    pos = 1
22    do i = 1, sz - 1, chunks(1)  ! compute beginning coordinate of each partition in the 1st dimension
23      dmap(pos) = i
24      pos = pos + 1
25    end do
26    do j = 1, sz - 1, chunks(2)  ! compute beginning coordinate of each partition in the 2nd dimension
27      dmap(pos) = j
28      pos = pos + 1
29    end do
30    ret = nga_create_irreg(MT_DBL, ubound(dims), dims, 'A', dmap, nblock, g_a)  ! create a global array instance with specified data distribution
31    ret = ga_duplicate(g_a, g_b, 'B')  ! duplicate same data distribution for array B
32    ret = ga_duplicate(g_a, g_c, 'C')  ! and C
33    allocate(lA(chunks(1), chunks(2)), lB(chunks(1), chunks(2)), lC(chunks(1), chunks(2)))
34    lA(:, :) = 1.0
35    lB(:, :) = 2.0
36    lC(:, :) = 0.0
37    lead(1) = chunks(2)
38    call nga_distribution(g_a, me, tcoordsl, tcoordsh)
39    ! initialize global array instances to respective values
40    call nga_put(g_a, tcoordsl, tcoordsh, lA(1, 1), lead)
41    call nga_put(g_b, tcoordsl, tcoordsh, lB(1, 1), lead)
42    call nga_put(g_c, tcoordsl, tcoordsh, lC(1, 1), lead)
43    ! obtain all blocks in the row of the A matrix
44    tcoordsl1(1) = 1
45    tcoordsl1(2) = tcoordsl(2)
46    tcoordsh1(1) = chunks(1)
47    tcoordsh1(2) = tcoordsh(2)
48    ! obtain all blocks in the column of the B matrix
49    tcoordsl2(1) = tcoordsl(1)
50    tcoordsl2(2) = 1
51    tcoordsh2(1) = tcoordsh(1)
52    tcoordsh2(2) = chunks(2)
53    do pos = 1, nblock(1)  ! matrix is square
54      call nga_get(g_a, tcoordsl1, tcoordsh1, lA(1, 1), lead)
55      call nga_get(g_b, tcoordsl2, tcoordsh2, lB(1, 1), lead)
56      do j = 1, n
57        do k = 1, n
58          do i = 1, n
59            lC(i, j) = lC(i, j) + lA(i, k) * lB(k, j)
60          end do
61        end do
62      end do
63      ! advance coordinates for blocks
64      tcoordsl1(1) = tcoordsl1(1) + chunks(1)
65      tcoordsh1(1) = tcoordsh1(1) + chunks(1)
66      tcoordsl2(2) = tcoordsl2(2) + chunks(2)
67      tcoordsh2(2) = tcoordsh2(2) + chunks(2)
68    end do
69    ! lC contains the final result for the block owned by the process
70    call nga_put(g_c, tcoordsl, tcoordsh, lC(1, 1), lead)
71    ! do something with the result
72    call ga_print(g_c)
73    deallocate(dmap, lA, lB, lC)
74    ret = ga_destroy(g_a)
75    ret = ga_destroy(g_b)
76    ret = ga_destroy(g_c)
77    call ga_terminate()
78    call mpi_finalize(ierr)
79  end program matmul
Lines – contain the principal part of the example code and illustrate several of the features of GA: data is being accessed using global coordinates (lines – and –) that correspond to the global size of the global array instances that were created previously; in addition, the access to the global array data for arrays g_a and g_b in lines and is done without requiring the participation of the process where that data is allocated (one-sided access). Figure illustrates the concept of one-sided access.
Global Arrays Concepts GA allows the programmer to control data distribution and makes the locality information readily available to be exploited for performance optimization.
Global Arrays Parallel Programming Toolkit. Fig. Any part of GA data can be accessed independently by any process at any time; for example, Process X may call ga_get(a, 100, 200, 17, 20, buf, 100) while Process Y calls ga_get(a, 180, 210, 23, 40, buf, 30) and Process Z calls ga_get(a, 175, 185, 19, 70, buf, 10)
For example, global arrays can be created by: (1) allowing the library to determine the array distribution, (2) specifying the decomposition for only one array dimension and allowing the library to determine the others, (3) specifying the distribution block size for all dimensions, or (4) specifying an irregular distribution as a Cartesian product of irregular distributions for each axis. The distribution and locality information is always available through interfaces that allow the application developer to query: (1) which data portion is held by a given process, (2) which process owns a particular array element, and (3) a list of processes and the blocks of data owned by each process corresponding to a given section of an array. The primary mechanisms provided by GA for accessing data are block copy operations that transfer data between layers of the memory hierarchy, namely, global memory (distributed array) and local memory. Further, the use of blocked data accesses, with remote locations copied into contiguous local memory, can improve cache performance by reducing both conflict and capacity misses []. In addition, each process is able to access directly the data held in a section of a Global Array that is locally assigned to that process. Data representing sections of the Global Array owned by other processes on SMP clusters can also be accessed directly using the GA interface, if desired. Atomic operations are provided that can be used to implement synchronization and assure correctness of an accumulate operation (floating-point sum reduction that combines local and remote data) executed concurrently by multiple processes and targeting overlapping array sections. GA is extensible as well. New operations can be defined exploiting the low-level interfaces dealing with distribution and locality and providing direct memory access (nga_distribution, nga_locate_region, nga_access, nga_release, nga_release_update). These, e.g., were used to provide
additional linear algebra capabilities by interfacing with third-party libraries, e.g., ScaLAPACK [].
Global Arrays Memory Consistency Model In shared-memory programming, one of the issues central to performance and scalability is memory consistency. Although the sequential consistency model [] is straightforward to use, weaker consistency models [] can offer higher performance on modern architectures, and they have been implemented on actual hardware. GA's nature as a one-sided, global-view programming model requires similar attention to memory consistency issues. The GA approach is to use a weaker-than-sequential consistency model that is still relatively straightforward for an application programmer to understand. The main characteristics of the GA approach include the following (a short code sketch illustrating these rules appears after the list):
● GA distinguishes two types of completion of the store operations (i.e., put, scatter) targeting global shared memory: local and remote. The blocking store operation returns after the operation is completed locally, i.e., the user buffer containing the source of the data can be reused. The operation completes remotely after either a memory fence operation or a barrier synchronization is called. The fence operation is required in critical sections of the user code, if the globally visible data is modified.
● The blocking operations (get/put) are ordered only if they target overlapping sections of global arrays. Operations that do not overlap or access different arrays can complete in arbitrary order.
● The nonblocking get/put operations complete in arbitrary order. The programmer uses wait/test operations to order completion of these operations, if desired.
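The sketch below illustrates these completion rules using the C bindings. GA_Init_fence() and GA_Fence() are assumed to be the fence calls of the C interface, and g_a, lo, hi, ld, and buf are assumed to describe an existing global array section and a local source buffer, as in the earlier sketch.

#include "ga.h"

/* Local vs. remote completion of blocking stores (illustrative). */
static void publish_block(int g_a, int lo[], int hi[], double *buf, int ld[]) {
    NGA_Put(g_a, lo, hi, buf, ld);  /* returns after local completion: buf is reusable */
    buf[0] += 1.0;                  /* safe, even though data may not yet be remote    */

    GA_Init_fence();                /* start tracking outstanding stores ...           */
    NGA_Put(g_a, lo, hi, buf, ld);
    GA_Fence();                     /* ... and wait for their remote completion        */

    GA_Sync();                      /* alternatively, a barrier completes all pending
                                       GA operations for every process                 */
}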
Global Arrays Extensions To allow the user to exploit data locality, the toolkit provides functions identifying the data from the global array that is held locally on a given processor. Two functions are used to identify local data. The first is the nga_distribution function, which takes a processor ID and an array handle as its arguments and returns a set of lower and upper indices in the global address space representing the local data block. The second is the nga_access function, which returns an array index and an array of strides to the locally held
data. In Fortran, this can be converted to an array by passing it through a subroutine call. The C interface provides a function call that directly returns a pointer to the local data. In addition to the communication operations that support task parallelism, the GA toolkit includes a set of interfaces that operate on either entire arrays or sections of arrays in the data-parallel style. These are collective data-parallel operations that are called by all processes in the parallel job. For example, movement of data between different arrays can be accomplished using a single function call. The nga_copy_patch function can be used to move a patch, identified by a set of lower and upper indices in the global index space, from one global array to a patch located within another global array. The only constraints on the two patches are that they contain equal numbers of elements. In particular, the array distributions do not have to be identical, and the implementation can perform, as needed, the necessary data reorganization (the so-called MxN problem []). In addition, this interface supports an optional transpose operation for the transferred data. If the copy is from one patch to another on the same global array, there is an additional constraint that the patches do not overlap.
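A small sketch of this locality interface in C follows, under the assumption that the global array holds a two-dimensional array of doubles; the function name and the choice of a read-only release are illustrative, not part of the GA interface described above.

#include "ga.h"

/* Sum the locally held block of a 2-D global array of doubles. */
static double sum_local_part(int g_a) {
    int me = GA_Nodeid();
    int lo[2], hi[2], ld[1];
    double *ptr, s = 0.0;

    NGA_Distribution(g_a, me, lo, hi);   /* which section do I own?          */
    NGA_Access(g_a, lo, hi, &ptr, ld);   /* direct pointer to the local data */

    for (int i = 0; i <= hi[0] - lo[0]; i++)
        for (int j = 0; j <= hi[1] - lo[1]; j++)
            s += ptr[i * ld[0] + j];     /* ld[0] is the leading dimension   */

    NGA_Release(g_a, lo, hi);            /* data was only read: release      */
    return s;
}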
Historical Development and Comparison with Other Programming Models The original GA package [–] offered basic one-sided communication operations, along with a limited set of collective operations on arrays in the style of BLAS []. Only two-dimensional arrays and two data types were supported. The underlying communication mechanisms were implemented on top of vendor-specific interfaces. Over the years, the package evolved substantially and the underlying code was completely rewritten. This included separation of the GA internal one-sided communication engine from the high-level data structure. A new portable, general, and GA-independent communication library called ARMCI was created []. New capabilities were later added to GA without the need to modify the ARMCI interfaces. The GA toolkit evolved in multiple directions:
● Adding support for a wide range of data types and virtually arbitrary array ranks.
● Adding advanced or specialized capabilities that address the needs of some new application areas, e.g., ghost cells or operations for sparse data structures.
● Expansion and generalization of the existing basic functionality. For example, mutex and lock operations were added to better support the development of shared-memory-style application codes. They have proven useful for applications that perform complex transformations of shared data in task parallel algorithms, such as compressed data storage in the multireference configuration interaction calculation in the COLUMBUS package [].
● Increased language interoperability and interfaces. In addition to the original Fortran interface, C, Python, and a C++ class library were developed.
● Developing additional interfaces to third-party libraries that expand the capabilities of GA, especially in the parallel linear algebra area: ScaLAPACK [] and SUMMA []. Interfaces to the TAO optimization toolkit have also been developed [].
● Support for multilevel parallelism based on processor groups in the context of a shared-memory programming model, as implemented in GA [, ].
These advances generalized the capabilities of the GA toolkit and expanded its appeal to a broader set of applications. At the same time, the programming model, with its emphasis on a shared-memory view of the data structures in the context of distributed memory systems with a hierarchical memory, is as relevant today as it was when the project started.
Comparison with Other Programming Models The two predominant classes of programming models for parallel computers are distributed-memory, shared-nothing models and Uniform Memory Access (UMA), shared-everything models. Both the shared-everything and fully distributed models have advantages and shortcomings. The UMA shared-memory model is easier to use but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in modern computers, this characteristic can have a negative impact on performance and scalability. Careful code restructuring to increase data reuse and replacing fine-grained load/stores with block access to shared data can address the problem and yield performance for shared memory that is competitive with message passing [].
However, this performance comes at the cost of compromising the ease of use that the UMA shared-memory model posits. Distributed, shared-nothing memory models, such as message-passing or one-sided communication, offer performance and scalability but they are more difficult to program. The classic message-passing paradigm not only transfers data but also synchronizes the sender and receiver. Asynchronous (nonblocking) send/receive operations can be used to diffuse the synchronization point, but cooperation between sender and receiver is still required. The synchronization effect is beneficial in certain classes of algorithms, such as parallel linear algebra, where data transfer usually indicates completion of some computational phase; in these algorithms, the synchronizing messages can often carry both the results and a required dependency. For other algorithms, this synchronization can be unnecessary and undesirable, and a source of performance degradation and programming complexity. The Global Arrays toolkit [–] attempts to offer the best features of both models. It implements a globalview programming model, based on one-sided communication, in which data locality is managed by the programmer. This management is achieved by calls to functions that transfer data between a global address space (a distributed array) and local storage. In this respect, the GA model has similarities to the distributed shared-memory models that provide, e.g., an explicit acquire/release protocol []. However, the GA model acknowledges that remote data is slower to access than local data and allows data locality to be specified by the programmer and hence managed. GA is related to the global address space languages such as UPC [], Titanium [], and, to a lesser extent, Co-Array Fortran []. In addition, by providing a set of dataparallel operations, GA is also related to data-parallel languages such as HPF [], ZPL [], and Data Parallel C []. However, the Global Array programming model is implemented as a library that works with most languages used for technical computing and does not rely on compiler technology for achieving parallel efficiency. It also supports a combination of task and data parallelism and is fully interoperable with the message-passing (MPI) model. The GA model exposes to the programmer the hierarchical memory of modern high-performance computer systems [], and by recognizing the communication overhead for remote data
transfers, it promotes data reuse and locality of reference. Virtually all scalable architectures possess nonuniform memory access characteristics that reflect their multilevel memory hierarchies. These hierarchies typically comprise processor registers, multiple levels of cache, local memory, and remote memory. Over time, both the number of levels and the cost (in processor cycles) of accessing deeper levels have been increasing. Scalable programming models must address memory hierarchy since it is critical to the efficient execution of applications.
Related Entries Coarray Fortran MPI (Message Passing Interface) PGAS (Partitioned Global Address Space) Languages UPC
Bibliographic Notes and Further Reading A more detailed version of this entry has been published in the International Journal of High Performance Computing Applications (SAGE Publications, Inc.).
Bibliography . Lam MS, Rothberg EE, Wolf ME () Cache performance and optimizations of blocked algorithms. In: Proceedings of the th international conference on architectural support for programming languages and operating systems, Santa Clara, – Apr . Blackford LS, Choi J, Cleary A, D’Azevedo E, Demmel J, Dhillon I, Dongarra J, Hammarling S, Henry G, Petitet A, Stanley K, Walker D, Whaley RC () ScaLAPACK: a linear algebra library for message-passing computers. In: Proceedings of eighth SIAM conference on parallel processing for scientific computing, Minneapolis . Scheurich C, Dubois M () Correct memory operation of cache-based multiprocessors. In: Proceedings of th annual international symposium on computer architecture, Pittsburgh . Dubois M, Scheurich C, Briggs F () Memory access buffering in multiprocessors. In: Proceedings of th annual international symposium on Computer architecture, Tokyo, Japan . CCA-Forum. Common component architecture forum. http:// www.cca-forum.org . Nieplocha J, Harrison RJ, Littlefield RJ () Global arrays: a portable shared memory programming model for distributed memory computers. In: Proceedings of Supercomputing, Washington, DC, pp –
. Nieplocha J, Harrison RJ, Littlefield RJ () Global arrays: A nonuniform memory access programming model for highperformance computers. J Supercomput :– . Nieplocha J, Harrison RJ, Krishnan M, Palmer B, Tipparaju V () Combining shared and distributed memory models: Evolution and recent advancements of the Global Array Toolkit. In: Proceedings of POHLL’ workshop of ICS-, New York . Dongarra JJ, Croz JD, Hammarling S, Duff I () Set of level basic linear algebra subprograms. ACM Trans Math Softw :– . Nieplocha J, Carpenter B () ARMCI: a portable remote memory copy library for distributed array libraries and compiler runtime systems. In: Proceedings of RTSPP of IPPS/SDP’, San Juan, Puerto Rico . Dachsel H, Nieplocha J, Harrison RJ () An out-of-core implementation of the COLUMBUS massively-parallel multireference configuration interaction program. In: Proceedings of high performance networking and computing conference, SC’, Orlando . VanDeGeijn RA, Watts J () SUMMA: Scalable universal matrix multiplication algorithm. Concurr Pract Exp :– . Benson S, McInnes L, Moré JJ Toolkit for Advanced Optimization (TAO). http://www.mcs.anl.gov/tao . Nieplocha J, Krishnan M, Palmer B, Tipparaju V, Zhang Y () Exploiting processor groups to extend scalability of the GA shared memory programming model. In: Proceedings of ACM computing frontiers, Italy . Krishnan M, Alexeev Y, Windus TL, Nieplocha J () Multilevel parallelism in computational chemistry using common component architecture and global arrays. In: Proceedings of Supercomputing, Seattle . Shan H, Singh JP () A comparison of three programming models for adaptive applications on the origin. In: Proceedings of supercomputing, Dallas . Zhou Y, Iftode L, Li K () Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems. In: Proceedings of operating systems design and implementation symposium, Seattle, pp – . Carlson WW, Draper JM, Culler DE, Yelick K, Brooks E, Warren K () Introduction to UPC and language specification. Center for Computing Sciences CCS-TR--, IDA Center for Computing Sciences, Bowie . Yelick K, Semenzato L, Pike G, Miyamoto C, Liblit B, Krishnamurthy A, Hilfinger P, Graham S, Gay D, Colella P, Aiken A () Titanium: A high-performance Java dialect. Concurr Pract Exp :– . Numrich RW, Reid JK () Co-array Fortran for parallel programming. ACM Fortran Forum :– . High Performance Fortran Forum () High Performance Fortran Language Specification, version .. Sci Program ():– . Snyder L () A programmer’s guide to ZPL. MIT Press, Cambridge . Hatcher PJ, Quinn MJ () Data-parallel programming on MIMD computers. MIT Press, Cambridge . Nieplocha J, Harrison RJ, Foster I () Explicit management of memory hierarchy. Adv High Perform Comput –
Gossiping Allgather
GpH (Glasgow Parallel Haskell) Glasgow Parallel Haskell (GpH)
GRAPE
Junichiro Makino
National Astronomical Observatory of Japan, Tokyo, Japan
Definition GRAPE (GRAvity PipE) is the name of a series of special-purpose computers designed for the numerical simulation of gravitational many-body systems. Most GRAPE machines consist of hardwired pipeline processors to calculate the gravitational interaction between particles and programmable computers to handle all other work. GRAPE-DR (Greatly Reduced Array of Processor Elements with Data Reduction) replaced the hardwired pipeline by simple SIMD programmable processors.
Discussion Introduction The improvement in the speed of computers has been a factor of in every decade, for the last years. In these years, however, the computer architecture has become more and more complex. Pipelined architectures were introduced in the s, and vector architectures became the mainstream in the s. In the s, a number of parallel architectures appeared, but in the s and s, distributed memory parallel computers built from microprocessors took over. The technological driving force of this evolution of computer architecture has been the increase of the number of available transistors in integrated circuits, at least after the invention of integrated circuits in the s. In the case of CMOS LSIs, the number of transistors in
a chip doubles in every months. Under the assumption of so-called CMOS scaling, this means that the switching speed doubles in every months. In the last years, the number of transistors available on an LSI chip increased by roughly a factor of one million. One way to make use of this huge number of transistors is to implement application-specific pipeline processors in a chip. The GRAPE series of special-purpose computers is one such effort to make efficient use of the large number of transistors available on LSIs. In many scientific simulations, it is necessary to solve N-body problems numerically. The gravitational N-body problem is one such example, which describes the evolution of many astronomical objects from the solar system to the entire universe. In some cases, it is important to treat non-gravitational effects such as the hydrodynamical interaction, radiation, and magnetic fields, but gravity is the primary driving force that shapes the universe. To solve the gravitational N-body problem, one needs to calculate the gravitational force on each body (particle) in the system from all other particles in the system. There are many ways to do so, and if relatively low accuracy is sufficient, one can use the Barnes–Hut tree algorithm [] or FMM []. Even with these schemes, the calculation of the gravitational interaction between particles (or particles and multipole expansions of groups of particles) is the most time-consuming part of the calculation. Thus, one can greatly improve the speed of the entire simulation just by accelerating the calculation of the particle–particle interactions. This is the basic idea behind GRAPE computers. The basic idea is shown in Fig. . The system consists of a host computer and special-purpose hardware, and the special-purpose hardware handles the calculation of gravitational interaction between particles. The host computer performs other calculations such as the time integration of particles, I/O, and diagnostics.
GRAPE. Fig. Basic structure of a GRAPE system: the host computer sends positions and masses to GRAPE, which returns accelerations and potentials
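The computation that the GRAPE hardware implements is, at its core, the direct pairwise force summation. The following C sketch of that kernel (softened Newtonian gravity in G = 1 units, with illustrative names) shows what is offloaded to the pipeline; the host-side work mentioned above (time integration, I/O, diagnostics) stays on the host.

#include <math.h>

/* Direct-summation gravitational kernel: the O(N^2) loop that the GRAPE
 * pipelines implement in hardware. pos is N x 3, mass is length N,
 * acc is N x 3 (output); eps is a softening length. Illustrative only. */
void compute_forces(int n, const double pos[][3], const double mass[],
                    double acc[][3], double eps) {
    for (int i = 0; i < n; i++) {
        double ax = 0.0, ay = 0.0, az = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = pos[j][0] - pos[i][0];
            double dy = pos[j][1] - pos[i][1];
            double dz = pos[j][2] - pos[i][2];
            double r2 = dx * dx + dy * dy + dz * dz + eps * eps;
            double rinv = 1.0 / sqrt(r2);
            double f = mass[j] * rinv * rinv * rinv;   /* m_j / r^3 */
            ax += f * dx;
            ay += f * dy;
            az += f * dz;
        }
        acc[i][0] = ax;  acc[i][1] = ay;  acc[i][2] = az;
        /* On a GRAPE system, the host ships pos/mass to the board and
         * receives acc (and the potential) back. */
    }
}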
This architecture accelerates not only the simple algorithm in which the force on a particle is calculated by taking the summation of forces from all other particles in the system, but also the Barnes–Hut tree algorithm and FMM. Moreover, it can be used with individual timestep algorithms [], in which particles have their own times and timesteps and are integrated in an event-driven fashion. The use of individual timesteps is critical in simulations of many systems including star clusters and planetary systems, where close encounters and physical collisions of two particles require very small timesteps for a small number of particles.
History The GRAPE project started in . The first machine completed, the GRAPE- [], was a single-board unit on which around IC and LSI chips were mounted and wire-wrapped. The pipeline processor of GRAPE- was implemented using commercially available IC and LSI chips. It was a natural consequence of the fact that project members lacked both money and experience to design custom LSI chips. In fact, none of the original design and development team of GRAPE- had knowledge of electronic circuits beyond what is learned in a basic undergraduate course for physics students. For GRAPE-, an unusually short word format was used to make the hardware as simple as possible. The input coordinates are expressed in a -bit fixed-point format. After subtraction, the result is converted to a -bit logarithmic format, in which bits are used for the "fractional" part. This format is used for all following operations except for the final accumulation. The final accumulation was done in -bit fixed point, to avoid overflow and underflow. The advantage of the short word format is that ROM chips can be used to implement complex functions that require two inputs. Any function of two -bit words can be implemented by one ROM chip with a -bit address input. Thus, all operations other than the initial subtraction of the coordinates and the final accumulation of the force were implemented by ROM chips. The use of an extremely short word format in GRAPE- was based on a detailed theoretical analysis of error propagation and on numerical experiments []. There are three dominant sources of error in numerical simulations of gravitational many-body systems. The first one
is the error in the numerical integration of orbits of particles. The second one is the error in the calculated accelerations themselves. The third one comes from the fact that in many cases, the number of particles used is much smaller than the number of stars in real systems such as a galaxy. Whether or not the third one should be regarded as a source of error depends on the problem one wants to study. If the problem is, for example, the merging of two galaxies, which takes place on a relatively short timescale (compared to the orbital period of typical stars in galaxies), the effect of the small number of particles can be, and should be, regarded as numerical error. On the other hand, if we want to study the long-term evolution of a star cluster, which takes place on a timescale much longer than the orbital timescale, the evolution of orbits of individual stars is driven by close encounters with other stars. In this case, the effect of the small (or finite) number of particles is not a numerical error but something that is present in real systems. Thus, the required accuracy of the pairwise force calculation depends on the nature of the problem. In the case of the study of the merging of two galaxies, the average error can be as large as % of the pairwise force, if the error is guaranteed to be random. The average error of interaction calculated by GRAPE- is less than %, which was good enough for many problems. In the number format used in GRAPE-, the positions of particles are expressed in the -bit fixed-point format, so that the force between two nearby particles, both far from the origin of coordinates, is still expressed with sufficient accuracy. Also, the accumulation of the forces from different particles is done in the -bit fixed-point format, so that there is no loss of effective bits during the accumulation. Thus, the primary source of error with GRAPE- is the low-accuracy pairwise force calculation. It can be regarded as a random error. Strictly speaking, the error due to the short word format cannot always be regarded as random, since it introduces correlation both in space and time. One could eliminate this correlation by applying a random coordinate transformation, but a quantitative study of such a transformation has not been done yet. GRAPE- used the GPIB (IEEE-) interface for the communication with the host computer. It was fast enough for use with simple direct summation. However, when combined with the tree algorithm, faster
communication was necessary. GRAPE-A used VME bus for communication, to improve the performance. The speed of GRAPE- and A was around Mflops, which is around / of the speed of fastest vector supercomputers of the time. Hardware cost of them was around /, of supercomputers, or roughly equal to the cost of low-end workstations. Thus, these firstgeneration GRAPEs offered price-performance two orders of magnitude better than that of general-purpose computers. GRAPE- is similar to GRAPE-A, but with much higher numerical accuracy. In order to achieve higher accuracy, commercial LSI chips for floating-point arithmetic operations such as TI SNACT and Analog Devices ADSP/ were used. The pipeline of GRAPE- processes the three components of the interaction sequentially. So it accumulates one interaction in every three clock cycles. This approach was adopted to reduce the circuit size. Its speed was around Mflops, but it is still much faster than workstations or minicomputers at that time. GRAPE- was the first GRAPE computer with a custom LSI chip. The number format was the combination of the fixed point and logarithmic format similar to what were used in GRAPE-. The chip was fabricated using μm design rule by National Semiconductor. The number of transistors on chip was K. The chip operated at MHz clock speed, offering the speed of about . Gflops. Printed-circuit boards with eight chips were mass-produced, for the speed of . Gflops per board. Thus, GRAPE- was also the first GRAPE computer to integrate multiple pipelines into a system. Also, GRAPE- was the first GRAPE computer to be manufactured and sold by a commercial company. Nearly copies of GRAPE- have been sold to more than institutes (more than outside Japan). With GRAPE-, a high-accuracy pipeline was integrated into one chip. This chip calculates the first time derivative of the force, so that fourth-order Hermite scheme [] can be used. Here, again, the serialized pipeline similar to that of GRAPE- was used. The chip was fabricated using μm design rule by LSI Logic. Total transistor count was about K. The completed GRAPE- system consisted of , pipeline chips ( PCB boards each with pipeline chips). It operated on MHz clock, delivering the speed of . Tflops. Completed in , GRAPE- was
the first computer for scientific calculation to achieve the peak speed higher than Tflops. Also, in and , it was awarded the Gordon Bell Prize for peak performance, which is given to a real scientific calculation on a parallel computer with the highest performance. Technical details of machines from GRAPE- through GRAPE- can be found in [] and references therein. GRAPE- [] was an improvement over GRAPE-. It integrated two full pipelines which operate on MHz clock. Thus, a single GRAPE- chip offered the speed eight times more than that of the GRAPE- chip, or the same speed as that of an eight-chip GRAPE- board. GRAPE- was awarded the Gordon Bell Prize for price-performance. The GRAPE- chip was fabricated with . μ m design rule by NEC. Table summarizes the history of GRAPE project. Figure shows the evolution of GRAPE systems and general-purpose parallel computers. One can see that evolution of GRAPE is faster than that of generalpurpose computers. The GRAPE- was essentially a scaled-up version of GRAPE- [], with the peak speed of around Tflops. The peak speed of a single pipeline chip was Gflops. In comparison, GRAPE- consists of , pipeline chips, each with Mflops. The increase of a factor of in speed was achieved by integrating six pipelines into one chip (GRAPE- chip has one pipeline which needs three cycles to calculate the force from one particle) and using three times higher clock frequency. The advance of the device technology (from μm to . μm) made these improvements possible. Figure shows the processor chip delivered in early . The six pipeline units are visible. Starting with GRAPE-, the concept of virtual multiple pipeline (VMP) is used. VMP is similar to simultaneous multithreading (SMT), in the sense that a single pipeline processor behaves as multiple processors. However, what is achieved is quite different. In the case of SMT, the primary gain is in the latency tolerance, since one can execute independent instructions with different threads. In the case of hardwired pipeline processors, there is no need to reduce the latency. With VMP, the bandwidth to the external memory is reduced, since the data of one particle which exerts force are shared by multiple virtual pipelines, each of which calculates the force on its own particle. This sharing of the
GRAPE. Table History of GRAPE project
GRAPE-     (/–/)    Mflops, low accuracy
GRAPE-     (/–/)    Mflops, high accuracy ( bit/ bit)
GRAPE-A    (/–/)    Mflops, low accuracy
GRAPE-     (/–/)    Gflops, high accuracy
GRAPE-A    (/–/)    Mflops, high accuracy
HARP-      (/–/)    Mflops, high accuracy, Hermite scheme
GRAPE-A    (/–/)    Gflops/board; some copies are used all over the world
GRAPE-     (/–/)    Tflops, high accuracy; some copies of small machines
MD-GRAPE   (/–/)    Gflops/chip, high accuracy, programmable interaction
GRAPE-     (/–/)    Gflops/chip, low accuracy
GRAPE-     (/–/)    Tflops, high accuracy
GRAPE. Fig. The evolution of GRAPE and general-purpose parallel computers. The peak speed (Tflops) is plotted against the year of delivery. Open circles, crosses, and stars denote GRAPEs, vector processors, and parallel processors, respectively
particle can be extended to physical multiple pipelines, as long as the total number of pipelines is not too large. Thus, special-purpose computers based on GRAPE-like pipelines have a unique advantage in that their requirement for external memory bandwidth is much smaller than that of general-purpose computers with similar peak performance.
GRAPE. Fig. The GRAPE- processor chip
In the case of GRAPE-, each of six physical pipelines is implemented as eight virtual pipelines. Thus, one GRAPE- chip calculates forces on particles in parallel. The required memory bandwidth was MB/s, for the peak speed of Gflops. A traditional vector processor with the peak speed of Gflops would require the memory bandwidth of
GB/s. Thus, GRAPE- requires the memory bandwidth around / of that of a traditional vector processor. One processor board of GRAPE- housed GRAPE- chips. Each GRAPE- chip has its own memory to store particles which exert the force. Thus, different processor chips on one processor board calculate the forces on the same particles from different particles, and the partial results are summed up by an hardwired adder tree when the result is sent back to the host. Thus, the summation over chips added only a small startup overhead (less than μs) per one force calculation, which typically requires several milliseconds. The completed GRAPE- system consisted of processor boards, grouped into clusters with boards each. Within a cluster, boards are organized in a × matrix, with host computers. They are organized so that the effective communication speed is proportional to the number of host computers. In a simple configuration, the effective communication speed becomes independent of the number of host computers. The details of the network used in GRAPE- are given in [].
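To make the memory-bandwidth argument behind VMP concrete, the following plain C sketch restructures the direct force loop so that each j-particle is loaded from memory once and reused for a block of i-particles, which is the sharing that the virtual (and physical) pipelines perform in hardware. This is only an illustration of the idea, not GRAPE code; the block size NVP and the assumption that i0 + NVP ≤ n are illustrative.

#include <math.h>

#define NVP 8   /* number of (virtual) pipelines sharing each loaded j-particle */

/* Forces on the NVP particles starting at index i0, from all n particles.
 * Each j-particle is read once and reused NVP times, cutting external
 * memory traffic per interaction by a factor of NVP. */
void block_forces(int n, const double pos[][3], const double mass[],
                  double acc[][3], double eps, int i0) {
    double ax[NVP] = {0}, ay[NVP] = {0}, az[NVP] = {0};
    for (int j = 0; j < n; j++) {
        /* one memory access to particle j ...                       */
        double xj = pos[j][0], yj = pos[j][1], zj = pos[j][2], mj = mass[j];
        /* ... shared by NVP force accumulations kept in registers   */
        for (int v = 0; v < NVP; v++) {
            int i = i0 + v;
            if (i == j) continue;            /* skip self-interaction */
            double dx = xj - pos[i][0];
            double dy = yj - pos[i][1];
            double dz = zj - pos[i][2];
            double r2 = dx * dx + dy * dy + dz * dz + eps * eps;
            double rinv = 1.0 / sqrt(r2);
            double f = mj * rinv * rinv * rinv;
            ax[v] += f * dx;  ay[v] += f * dy;  az[v] += f * dz;
        }
    }
    for (int v = 0; v < NVP; v++) {
        acc[i0 + v][0] = ax[v];  acc[i0 + v][1] = ay[v];  acc[i0 + v][2] = az[v];
    }
}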
Machines for Molecular Dynamics Classical MD calculation is quite similar to astrophysical N-body simulations since, in both cases, we integrate the orbits of particles (atoms or stars) which interact with other particles through simple pairwise forces. In the case of the Coulomb force, the force law itself is the same as that of the gravitational force, and the calculation of the Coulomb force can be accelerated by GRAPE hardware. However, in MD calculations, the calculation cost of the van der Waals force is not negligible, though the van der Waals force decays much faster than the Coulomb force (r− compared to r− ). It is straightforward to design a pipelined processor which can handle a particle–particle force given by some arbitrary function of the distance between particles. In GRAPE-A and its successors, a combination of table lookup and polynomial approximation is used. GRAPE-A and MD-GRAPE were developed at the University of Tokyo, following these ideas. Another difference between astrophysical simulations and MD calculations is that in MD calculations, usually the periodic boundary condition is applied. Thus, we need some way to calculate Coulomb forces from image particles. The direct Ewald method is rather well suited for the implementation in hardware. In , WINE- was developed. It is a pipeline to calculate the wave-space part of the direct Ewald method. The real-space part can be handled by GRAPE-A or MDGRAPE hardware. In , a group led by Toshikazu Ebisuzaki at RIKEN started to develop MDM [], a massively parallel machine for large-scale MD simulations. Their primary goal was the simulation of protein molecules. MDM consists of two pieces of special-purpose hardware, a massively parallel version of MD-GRAPE (MDGRAPE) and that of WINE (WINE-). The MDGRAPE- part consisted of , custom chips with four pipelines, for the theoretical peak speed of Tflops. The WINE- part consists of , custom pipeline chips, for the peak speed of Tflops. The MDM effort was followed up by the development of MDGRAPE- [], led by Makoto Taiji of RIKEN. MDGRAPE- achieved the peak speed of Pflops in .
Related Projects The GRAPE project is not the first project to implement the calculation of pairwise force in particle-based simulations in hardware. Delft Molecular Dynamics Processor [] (DMDP) is one of the earliest efforts. It was completed in early s. For the calculation of interaction between particles, it used the hardwired pipeline similar to that of GRAPE systems. However, in DMDP, time integration of orbits and other calculations are all done in the hardwired processors. Thus, in addition to the force calculation pipeline, DMDP had pipelines to update position, select particles for interaction calculation, and calculate diagnostics such as correlation function. FASTRUN [] has the architecture similar to that of DMDP, but designed to handle more complex systems such as protein molecule. To some extent, this difference in the designs of GRAPE computers and that of machines for molecular dynamics comes from the difference in the nature of the problem. In astronomy, the wide ranges in the number density of particles and timescale make it necessary to use adaptive schemes such as treecode and individual timesteps. With these schemes, the calculation cost per timestep per particle is generally higher than
that of fastest scheme optimized to shared timestep and near-uniform distribution of particles. The approach in which only the force calculation is done in hardware is more advantageous for astronomical N-body problems than for molecular dynamics. Anton [] is the latest effort to speed up the molecular dynamics simulation of proteins by specialized hardware. It is essentially the revival of the basic idea of DMDP, except that pipeline processors for operations other than force calculation were replaced by programmable parallel processors. It achieved the speed almost two orders of magnitude faster than that of general-purpose parallel computers for the simulation of protein molecules in water.
LSI Economics and GRAPE GRAPE has achieved the cost performance much better than that of general-purpose computers. One reason for this success is simply that with GRAPE architecture, one can use practically all transistors for arithmetic units, without being limited by the memory wall problem. Another reason is the fact that arithmetic units can be optimized to their specific uses in the pipeline. For example, in the case of GRAPE-, the subtraction of two positions is performed in -bit fixed point format, not in the floating-point format. Final accumulation is also done in fixed point. In addition, most of arithmetic operations to calculate the pairwise interactions are done in single precision. These optimizations made it possible to pack more than arithmetic units into a single chip with less than M transistors. The first microprocessor with fully pipelined double-precision floating-point unit, Intel , required . M transistors for two (actually one and half) operations. Thus, the number of transistors per arithmetic unit of GRAPE is smaller by more than a factor of . When compared with more recent processors, the difference becomes even larger. The Fermi processor from NVIDIA integrates arithmetic unit (adder and multiplier) with G transistors. Thus, it is five times less efficient than Intel , and nearly times less efficient than GRAPE-. The difference in the power efficiency is even larger, because the requirement for the memory bandwidth is lower for GRAPE computers. As a result, performance per watt of GRAPE- chip, fabricated with the nm design rule, is comparable to that of GPGPU chips fabricated with nm design rule. Thus, as silicon
technology advances, the relative advantage of specialpurpose architecture such as GRAPE becomes bigger. However, there is another economical factor. As the silicon semiconductor technology advances, the initial cost to design and fabricate custom chip increases. In , the initial cost for a custom chip was around K USD. By , it has become higher than M USD. By , the initial cost of a nm chip is around M USD. Roughly speaking, initial cost has been increasing as n. , where n is the number of transistors one can fit into a chip. The total budget for GRAPE- and GRAPE- projects is and M USD, respectively. Thus, a similar budget had become insufficient by early s. The whole point of special-purpose computer is to be able to outperform “expensive” supercomputers, with the price of – M USD. Even if a special-purpose computer is –, times faster, it is not practical to spend the cost of a supercomputer for a special-purpose computer which can solve only a narrow range of problems. There are several possible solutions. One is to reduce the initial cost by using FPGA (Field-Programmable Gate Array) chips. An FPGA chip consists of a number of “programmable” logic blocks (LBs) and also “programmable” interconnections. A LB is essentially a small lookup table with multiple inputs, augmented with one flip-flop and sometimes full-adder or more additional circuits. The lookup table can express any combinatorial logic for input data, and with flip-flop, it can be part of a sequential logic. Interconnection network is used to make larger and more complex logic, by connecting LBs. The design of recent FPGA chips has become much more complex, with large functional units like memory blocks and multiplier (typically × bit) blocks. Because of the need for the programmability, the size of the circuit that can be fit into an FPGA chip is much smaller than that for a custom LSI, and the speed of the circuit is also slower. Roughly speaking, the price of an FPGA chip per logic gate is around times higher than that of a custom chip with the same design rule. If the relative advantage of a specialized architecture is much larger than this factor of , its implementation based on FPGA chips can outperform general-purpose computers. In reality, there are quite a number of projects to use FPGAs for scientific computing, but most of them
turned out to be not competitive with general-purpose computers. The primary reason for this result is the relative cost of FPGAs discussed above. Since the logic gates of FPGAs are much more expensive than those of general-purpose computers, the design of a special-purpose computer with FPGAs must be very efficient in gate usage. FPGA-based systems which use standard double- or single-precision arithmetic are generally not competitive with general-purpose computers. In order to be competitive, it is necessary to use a much shorter word length. The GRAPE architecture with reduced accuracy is thus an ideal target for the FPGA-based approach. Several successful approaches have been reported [, ].
GRAPE-DR Another solution for the problem of the high initial cost is to widen the application range by some way to justify the high cost. GRAPE-DR project [] followed that approach. With GRAPE-DR, the hardwired pipeline processor of previous GRAPE systems was replaced by a collection of simple SIMD programmable processors. The internal network and external memory interface was designed so that it could emulate GRAPE processor efficiently and could be used for several other important applications, including the multiplication of dense matrices. GRAPE-DR is an acronym of “Greatly Reduced Array of Processor Elements with Data Reduction.” The last part, “Data Reduction,” means that it has an on-chip tree network which can do various reduction operations such as summation, max/min, and logical and/or. The GRAPE-DR project was started in FY , and finished in FY . The GRAPE-DR processor chip consists of simple processors, which can operate at the clock cycle of MHz, for the Gflops of single precision peak performance ( Gflops double precision). It was fabricated with TSMC nm process and the size is around mm . The peak power consumption is around W. The GRAPE-DR processor board houses four GRAPE-DR chips, each with its own local DRAM chips. It communicates with the host computer through Gen -lane PCI-Express interface. To some extent, the difference between GRAPE and GRAPE-DR is similar to that between traditional GPUs and GPGPUs. In both cases, hardwired pipelines are replaced by simple programmable processors. The main
differences between GRAPE-DR and GPGPUs are (a) processor element of GRAPE-DR is much simpler, (b) external memory bandwidth of GRAPE-DR is much smaller, and (c) GRAPE-DR is designed to achieve near-peak performance in real scientific applications such as gravitational N-body simulation and molecular dynamics simulation, and also dense matrix multiplication. These differences made GRAPE-DR significantly more efficient in both transistor usage and power usage. GRAPE-DR chip, which was fabricated with nm design rule and has mm area, integrates processing elements. The NVIDIA Fermi chip, which is fabricated with nm design rule and has > mm area, integrates the same processing elements. Thus, there is about a factor of difference in the transistor efficiency. This difference resulted in more than a factor of difference in the power efficiency. Whether or not the approach like GRAPE-DR will be competitive with other approaches, in particular GPGPUs, is at the time of writing rather unclear. The reason is simply that the advantage of a factor of is not quite enough, because of the difference in other factors, among which the most important is the development cycle. New GPUs are announced roughly every year, while it is somewhat unlikely that one develops the special-purpose computers every year, even if there is sufficient budget. In years, general-purpose computers become ten times faster, and GPGPUs will also become faster by a similar factor. Thus, a factor of advantage will disappear while the machine is being developed. On the other hand, the transistor efficiency of general-purpose computers, and that of GPUs, has been decreasing for the last years and probably will continue to do so for the next years or so. GRAPEDR can retain its efficiency when it is implemented with more advanced semiconductor technology, since, as in the case of GRAPE, one can use the increased number of transistors to increase the number of processor element. Thus, it might remain competitive.
Future Directions Future of Special-Purpose Processors In hindsight, s was a very good period for the development of special-purpose architecture such as GRAPE, because of two reasons. First, the semiconductor technology reached the point where many
floating-point arithmetic units can be integrated into a chip. Second, the initial design cost of a chip was still within the reach of fairly small research projects in basic science. By now, semiconductor technology reached to the point that one could integrate thousands of arithmetic units into a chip. On the other hand, the initial design cost of a chip has become too high. The use of FPGAs and the GRAPE-DR approach are two examples of the way to tackle the problem of increasing initial cost. However, unless one can keep increasing the budget, GRAPE-DR approach is not viable, simply because it still means exponential increase in the initial, and therefore total, cost of the project. On the other hand, such increase in the budget might not be impossible, since the field of computational science as a whole is becoming more and more important. Even though a supercomputer is expensive, it is still much less expensive compared to, for example, particle accelerators or space telescopes. Of course, computer simulation cannot replace the real experiments of observations, but computer simulations have become essential in many fields of science and technology. In addition, there are several technologies available in between FPGAs and custom chips. One is what is called “structured ASIC.” It requires customization of typically just one metal layer, resulting in large reduction in the initial cost. The number of gates one can fit into the given silicon area falls between those of FPGAs and custom chips. Another possibility is just to use the technology one or two generations older.
Application Area of Special-Purpose Computers The primary application area of GRAPE and GRAPE-DR has been particle-based simulation, in particular simulations that require the evaluation of long-range interactions. These are suited to special-purpose computers because they are compute-intensive. In other words, the necessary bandwidth to the external memory is relatively small. Grid-based simulations based on schemes like finite-difference or finite-element methods are less compute-intensive and thus not suited to special-purpose computers.
However, the efficiency of large-scale parallel computers based on general-purpose microprocessors for these grid-based simulation has been decreasing rather quickly. There are two reasons for this decrease. One is the lack of the memory bandwidth. Currently, the memory bandwidth of microprocessors normalized by the calculation speed is around . bytes/flops, which is not enough for most of grid-based simulations. Even so, this ratio will become smaller and smaller in the future. The other reason is the latency of the communication between processors. One possible solution for these two problems is to integrate the main memory, processor cores, and communication interface into a single chip. This integration gives practically unlimited bandwidth to the memory, and the communication latency is reduced by one or two orders of magnitude. Obvious disadvantage of this approach is that the total amount of memory would be severely limited. However, in many application of grid-based calculation, very long time integrations of relatively small systems are necessary. Many of such applications requires memory much less than one TB, which can be achieved by using several thousand custom processors each with around GB of embedded DRAM.
Bibliography . Aarseth SJ () Dynamical evolution of clusters of galaxies, I. MN : . Bakker AF, Gilmer GH, Grabow MH, Thompson K () A special purpose computer for molecular dyanamics calculations. J Comput Phys : . Barnes J, Hut P () A hiearchical O(NlogN) force calculation algorithm. Nature : . Fine R, Dimmler G, Levinthal C () FASTRUN: a special purpose, hardwired computer for molecular simulation. Proteins Struct Funct Genet : . Greengard L, Rokhlin V () A fast algorithm for particle simulations. J Comput Phys : . Hamada T, Fukushige T, Kawai A, Makino J () PROGRAPE-: a programmable, multi-purpose computer for many-body simulations. PASJ :– . Ito T, Makino J, Ebisuzaki T, Sugimoto D () A special-purpose N-body machine GRAPE-. Comput Phys Commun : . Kawai A, Fukushige T () $/GFLOP Astrophysical N-body simulation with a reconfigurable add-in card and a hierarchical tree algorithm. In: Proceedings of SC, ACM (Online) . Kawai A, Fukushige T, Makino J, Taiji M () GRAPE-: a special-purpose computer for N-body simulations. PASJ :
. Makino J, Aarseth SJ () On a Hermite integrator with AhmadCohen scheme for gravitational many-body problems. PASJ : . Makino J, Fukushige T, Koga M, Namura K () GRAPE-: massively-parallel special-purpose computer for astrophysical particle simulations. PASJ : . Makino J, Hiraki K, Inaba M () GRAPE-DR: -Pflops massively-parallel computer with -core, -Gflops processor chips for scientific computing. In: Proceedings of SC, ACM (Online) . Makino J, Ito T, Ebisuzaki T () Error analysis of the GRAPE- special-purpose N-body machine. PASJ : . Makino J, Taiji M () Scientific simulations with specialpurpose computers – The GRAPE systems. Wiley, Chichester . Makino J, Taiji M, Ebisuzaki T, Sugimoto D () GRAPE: a massively parallel special-purpose computer for collisional N-body simulations. ApJ : . Narumi T, Susukita R, Ebisuzaki T, McNiven G, Elmegreen B () Mol Simul : . Shaw DE, Denero MM, Dror RO, Kuskin JS, Larson RH, Salmon JK, Young C, Batson B, Bowers KJ, Chao JC, Eastwood MP, Gagliardo J, Grossman JP, Ho CR, Ierardi DJ, Kolossváry I, Klepeis JL, Layman T, McLeavey C, Moraes MA, Mueller R, Priest EC, Shan Y, Spengler J, Theobald M, Towles B, Wang SC () Anton: a special-purpose machine for molecular dynamics simulation. In: Proceedings of the th annual international symposium on computer architecture (ISCA ’), ACM, San Diego, pp – . Taiji M, Narumi T, Ohno Y, Futatsugi N, Suenaga A, Takada N, Konagaya A () Protein explorer: a petaflops special-purpose computer system for molecular dynamics simulations. In: The SC Proceedings, IEEE, Los Alamitos, CD–ROM
Graph Algorithms
David A. Bader, Georgia Institute of Technology, Atlanta, GA, USA
Guojing Cong, IBM, Yorktown Heights, NY, USA

Synonyms
Parallel Graph Algorithms

Discussion
Relationships in real-world situations can often be represented as graphs. Efficient parallel processing of graph problems has been a focus of many algorithm researchers. A rich collection of parallel graph algorithms has been developed for various problems on different models. The majority of them are based on the parallel random access machine (PRAM). PRAM is a shared-memory model where data stored in the global memory can be accessed by any processor. PRAM is synchronous, and in each unit of time, each processor either executes one instruction or stays idle.

Techniques
Graph problems are diverse. Given a graph G = (V, E), where ∣V∣ = n and ∣E∣ = m, several techniques are frequently used in designing parallel graph algorithms. The basic techniques are described as follows (detailed descriptions can be found in []).

Prefix Sum
Given a sequence of n elements s1, s2, ⋯, sn with a binary associative operator denoted by ⊕, the prefix sums of the sequence are the partial sums defined by Si = s1 ⊕ s2 ⊕ ⋯ ⊕ si, 1 ≤ i ≤ n. Using the balanced tree technique, the prefix sums of n elements can be computed in O(log n) time with O(n/log n) processors. Fast prefix sums are of fundamental importance to the design of parallel graph algorithms, as they are frequently used in algorithms for more complex problems.
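As a concrete illustration (added here, not part of the original entry), the following Python sketch shows the balanced-tree recursion executed sequentially; on a PRAM, each of the loops over i would be performed as one parallel step. The function name and the assumption that n is a power of two are choices made for this sketch.

from operator import add

def prefix_sums(s, op=add):
    # Balanced-tree prefix sums. Each loop below is independent across i and
    # corresponds to one parallel PRAM step; here the steps run sequentially.
    n = len(s)
    if n == 1:
        return list(s)
    # Combine adjacent pairs (assumes n is a power of two).
    t = [op(s[2 * i], s[2 * i + 1]) for i in range(n // 2)]
    T = prefix_sums(t, op)        # prefix sums of the half-length sequence
    S = [None] * n
    for i in range(n):
        if i == 0:
            S[i] = s[0]
        elif i % 2 == 1:          # odd positions are exactly the pair sums
            S[i] = T[i // 2]
        else:                     # even positions combine one more element
            S[i] = op(T[i // 2 - 1], s[i])
    return S

# prefix_sums([3, 1, 4, 1, 5, 9, 2, 6]) returns [3, 4, 8, 9, 14, 23, 25, 31]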
Pointer Jumping
Pointer jumping, also sometimes called path doubling, is useful for handling computations on rooted forests. For a rooted forest, there is a parent function P defined on the set of vertices; P(r) is set to r when r does not have a parent. When finding the root of each tree, the pointer jumping technique replaces the parent of each node with that node’s grandparent, that is, it sets P(r) = P(P(r)). The algorithm runs in O(log n) time with O(n) processors. Pointer jumping can also be used to compute the distance from each node to its root. Pointer jumping is used in several connectivity and tree algorithms (e.g., see [, ]).
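The root-finding step can be sketched as follows; this sequential Python rendering (an illustration added here, with made-up example data) applies the synchronous P(r) = P(P(r)) update until it reaches a fixed point.

def find_roots(parent):
    # Pointer jumping (path doubling): repeatedly replace each node's parent
    # with its grandparent. One pass over all v is a single PRAM round, so
    # O(log n) rounds suffice; here the rounds simply run one after another.
    parent = list(parent)
    while True:
        jumped = [parent[parent[v]] for v in range(len(parent))]  # synchronous update
        if jumped == parent:
            return parent
        parent = jumped

# A forest where parent[v] == v marks a root:
# find_roots([0, 0, 1, 2, 4, 4]) returns [0, 0, 0, 0, 4, 4]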
Divide and Conquer
The divide-and-conquer strategy is recognized as a fundamental technique in algorithm design (not limited to parallel graph algorithms). The frequently used quicksort algorithm is based on divide and conquer. The strategy partitions the input into partitions of roughly equal size and recursively works on each partition concurrently. There is usually a final step to combine the solutions of the subproblems into a solution for the original problem.
Pipelining
Like divide and conquer, the use of pipelining is not limited to parallel graph algorithm design. It is of critical importance in computer hardware and software design. Pipelining breaks up a task into a sequence of smaller tasks (subtasks). Once a subtask is complete and moves to the next stage, the subtasks following it can be processed in parallel. Insertion and deletion with – trees demonstrate the pipelining technique []. Deterministic Coin Tossing
Deterministic coin tossing was proposed by Cole and Vishkin [] to break the symmetry of a directed cycle without using randomization. Consider the problem of finding a three-coloring of graph G. A k-coloring of G is a mapping c : V → {0, 1, ..., k − 1} such that c(i) ≠ c(j) if ⟨i, j⟩ ∈ E. Initially, the cycle has a trivial coloring with n = ∣V∣ colors. The apparent symmetry in the problem (i.e., the vertices cannot be easily distinguished) presents the major difficulty in reducing the number of colors needed for the coloring. Deterministic coin tossing uses the binary representation of an integer i = ik ⋯ i1 i0 to break the symmetry. For example, suppose t is the least significant bit position in which c(i) and c(S(i)) differ (S(i) is the successor of i in the cycle); then the new coloring for i can be set to 2t + c(i)t, where c(i)t is the value of bit t of c(i).
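One recoloring round of this technique can be sketched in Python as follows (an illustration added here; the function name and data layout are assumptions). It presupposes a valid coloring, so adjacent colors always differ.

def recolor_round(color, succ):
    # One round of deterministic coin tossing on a directed cycle.
    # color[i] is the current color of vertex i, succ[i] its successor.
    # Each vertex keeps only (bit position, bit value) of the lowest bit in
    # which its color differs from its successor's color.
    new_color = [0] * len(color)
    for i in range(len(color)):
        diff = color[i] ^ color[succ[i]]        # nonzero for a valid coloring
        t = (diff & -diff).bit_length() - 1     # least significant differing bit
        new_color[i] = 2 * t + ((color[i] >> t) & 1)
    return new_color

# Starting from the trivial coloring color[i] = i on a cycle, repeating this
# round reduces the number of colors from n to a small constant in O(log* n) rounds.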
Accelerating Cascades
Accelerating cascades was presented together with deterministic coin tossing in [] for the design of faster algorithms for list ranking and other problems. This technique combines two algorithms for the same problem, one optimal and the other superfast, to obtain an optimal and very fast algorithm. The general strategy is to start with the optimal algorithm to reduce the problem size and then apply the fast but nonoptimal algorithm. Other Building Blocks for Parallel Graph Algorithms
List ranking determines the position, or rank, of the list items in a linked list. A generalized list-ranking problem is defined as follows. Let X be an array of n elements stored in arbitrary order. For each element i, let X(i).value be its value and X(i).next be the index of its successor. Then, for any binary associative operator ⊕, compute X(i).prefix such that X(head).prefix = X(head).value and X(i).prefix =
X(i).value ⊕ X(predecessor).prefix, where head is the first element of the list, i is not equal to head, and predecessor is the node preceding i in the list. Pointer jumping can be applied to solve the list-ranking problem. An optimal list-ranking algorithm is given in []. The Euler tour technique is another of the basic building blocks for designing parallel algorithms, especially for tree computations. For example, postorder/preorder numbering, computing the vertex level, computing the number of descendants, etc., can be done work-time optimally on EREW PRAM by applying the Euler tour technique. As suggested by its name, the power of the Euler tour technique comes from defining an Eulerian circuit on the tree. In Tarjan and Vishkin’s biconnected components paper [] that originally introduced the Euler tour technique, the input to their algorithm is an edge list with the cross-pointers between twin edges < u, v > and < v, u > established. With these cross-pointers it is easy to derive an Eulerian circuit. The Eulerian circuit can be treated as a linked list, and by assigning different values to the edges in the list, list ranking can be used for many tree computations. For example, when rooting a tree, the value is associated with each edge. After list ranking, simply inspecting the list ranks of < u, v > and < v, u > can set the correct parent relationship for u and v. Tree contraction systematically shrinks a tree into a single vertex by successively shrinking parts of the tree. It can be used to solve the expression evaluation problem. It is also used in other algorithms, for example, computing the biconnected components [].
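A pointer-jumping list-ranking sketch in Python is shown below (added here as an illustration; the names and the direction of accumulation are choices made for this sketch). It combines, for each element, the values from that element to the tail of the list; with unit values and addition this yields the classic ranks, and prefixes from the head are obtained symmetrically with predecessor pointers.

from operator import add

def list_rank(value, nxt, op=add):
    # Wyllie-style list ranking by pointer jumping. nxt[i] is the successor
    # index of element i, or None at the tail. Each while-iteration is one
    # synchronous PRAM round, so O(log n) rounds are executed.
    d, nxt = list(value), list(nxt)
    while any(j is not None for j in nxt):
        # Read the old d and nxt for every element, then commit the round.
        new_d = [d[i] if nxt[i] is None else op(d[i], d[nxt[i]]) for i in range(len(d))]
        new_nxt = [None if nxt[i] is None else nxt[nxt[i]] for i in range(len(d))]
        d, nxt = new_d, new_nxt
    return d

# The list 2 -> 0 -> 1 with unit values: list_rank([1, 1, 1], [1, None, 0])
# returns [2, 1, 3], i.e., the number of elements from each node to the tail.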
Classical Algorithms The use of the basic techniques is demonstrated in several classical graph algorithms for spanning tree, minimum spanning tree, and biconnected components. Various deterministic and randomized techniques have been given for solving the spanning tree problem (and the closely related connected components problem) on PRAM models. A brief survey of these algorithms can be found in []. The Shiloach–Vishkin (SV) algorithm is representative of several connectivity algorithms in that it adopts the widely used graft-and-shortcut approach. Through carefully designed grafting schemes, the algorithm achieves complexities of O(log n) time and O((m + n) log n)
work under the arbitrary CRCW PRAM model. The algorithm takes an edge list as input and starts with n isolated vertices and m processors. Each processor Pi ( ≤ i ≤ m) inspects edge ei = < vi , vi > and tries to graft vertex vi to vi under the constraint that vi < vi . Grafting creates k ≥ connected components in the graph, and each of the k components is then shortcut to a single super-vertex. Grafting and shortcutting are iteratively applied to the reduced graphs G′ = (V′, E′) (where V′ is the set of super-vertices and E′ is the set of edges among super-vertices) until only one super-vertex is left. A sketch of the graft-and-shortcut idea is given below. Minimum spanning tree (MST) is one of the most studied combinatorial problems with practical applications. While several theoretical results are known for solving MST in parallel, many are considered impractical because they are too complicated and have large constant factors hidden in the asymptotic complexity. See [] for a survey of MST algorithms. Many parallel algorithms are based on Borůvka’s algorithm, which is composed of Borůvka iterations that are used in many parallel MST algorithms. A Borůvka iteration is characterized by three steps: find-min, connected-components, and compact-graph. In find-min, for each vertex v the incident edge with the smallest weight is labeled to be in the MST; connected-components identifies connected components of the graph induced by the labeled MST edges; compact-graph compacts each connected component into a single super-vertex, removes self-loops and multiple edges, and relabels the vertices for consistency. A connected graph is said to be separable if there exists a vertex v such that removal of v results in two or more connected components of the graph. Given a connected, undirected graph G, the biconnected components problem finds the maximal induced subgraphs of G that are not separable. Tarjan [] presents an optimal O(n + m) algorithm that finds the biconnected components of a graph based on depth-first search (DFS). Eckstein [] gave the first parallel algorithm, which takes O(d log n) time with O((n + m)/d) processors on CREW PRAM, where d is the diameter of the graph. Tarjan and Vishkin [] present an O(log n) time algorithm on CRCW PRAM that uses O(n + m) processors. This algorithm utilizes many of the fundamental primitives including prefix sum, list ranking, sorting, connectivity, spanning tree, and tree computations.
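The following Python sketch (added here; run sequentially, with a simplified grafting rule rather than the full Shiloach–Vishkin scheme) illustrates the graft-and-shortcut idea: hook trees together across edges, then use pointer jumping to collapse each tree to its root.

def connected_components(n, edges):
    # D[v] is the current super-vertex (tree pointer) of vertex v.
    D = list(range(n))
    changed = True
    while changed:
        changed = False
        # Graft: across each edge, hook the larger root onto the smaller one.
        for u, v in edges:
            ru, rv = D[u], D[v]
            if D[ru] == ru and D[rv] == rv and ru != rv:
                D[max(ru, rv)] = min(ru, rv)
                changed = True
        # Shortcut: pointer jumping until every vertex points to a root.
        while any(D[D[v]] != D[v] for v in range(n)):
            D = [D[D[v]] for v in range(n)]
    return D  # vertices with equal labels are in the same component

# connected_components(6, [(0, 1), (1, 2), (3, 4)]) returns [0, 0, 0, 3, 3, 5]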
Communication Efficient Graph Algorithms Communication-efficient parallel algorithms were proposed to address the “bottleneck of processor-to-processor communication” (e.g., see []). Goodrich [] presented a communication-efficient sorting algorithm on weak-CREW BSP that runs in O(log n/ log(h + )) communication rounds (with at most h data transported by each processor in each round) and O((n log n)/p) local computation time, for h = Θ(n/p). Goodrich’s sorting algorithm is frequently used in communication-efficient graph algorithms. Dehne et al. designed an efficient list-ranking algorithm for coarse-grained multicomputers (CGM) and BSP that takes O(log p) communication rounds with O(n/p) local computation. In the same study, a series of communication-efficient graph algorithms such as connected components, ear decomposition, and biconnected components is presented using the list-ranking algorithm as a building block. On the BSP model, Adler et al. [] presented a communication-optimal MST algorithm. The list-ranking algorithm and the MST algorithm take similar approaches to reduce the number of communication rounds. They both start by simulating several (e.g., O(log p) or O(log log p)) steps of the PRAM algorithms on the target model to reduce the input size so that it fits in the memory of a single node. A sequential algorithm is then invoked to process the reduced input of size O(n/p), and finally the result is broadcast to all processors for computing the final solution.
Practical Implementation Many real-world graphs are large and sparse. These instances are especially hard to process due to the characteristics of the workload. Although fast theoretical algorithms exist in the literature, large and sparse graph problems are still challenging to solve in practice. There remains a significant gap between algorithmic models and architectures. The mismatch between memory access patterns and cache organization is the most prominent barrier to high-performance graph analysis on current systems.
Graph Workload Compared with traditional scientific applications, graph analysis is more memory intensive. Graph algorithms put tremendous pressure on the memory subsystem
to deliver data to the processor. Table shows the percentages of memory instructions executed in the SPLASH benchmarks [] and several graph algorithms. SPLASH represents typical scientific applications for shared-memory environments. On IBM Power, on average, about % more memory instructions are executed in the graph algorithms. For biconnected components, the number is over %. For memory-intensive applications, locality behavior is especially crucial to performance. Unfortunately, most graph algorithms exhibit erratic memory access patterns that result in poor performance, as shown in Table . Table gives the cycles per instruction (CPI) construction [] for three graph algorithms on IBM Power. The algorithms studied are betweenness centrality (BC), biconnected components (BiCC), and minimum spanning tree (MST). CPI construction attributes cycles, as percentages, to categories such as completion, instruction cache miss penalty, and stall. In Table , for all algorithms, a significant fraction of the cycles is spent on pipeline stalls. About –% of the cycles are wasted on load-store unit stalls, and more than % of the cycles are spent on stalls due to data cache misses. Table clearly shows that graph algorithms perform poorly on current cache-based architectures, and the culprit is the memory access pattern. The floating-point stalls (FPU STL) column reveals one other prominent feature of graph algorithms.
FPU STL is the percentage of cycles wasted on floating-point unit stalls. In Table , BiCC and MST do not incur any FPU stalls. Unfortunately, this does not mean that the workload fully utilizes the FPU. Instead, as there are no floating-point operations in these algorithms, the elaborately designed floating-point units lie idle. CPI construction shows that graph workloads spend most of their time waiting for data to be delivered, and there are not many other operations to hide the long latency to main memory. In fact, of the three algorithms, only BC executes even a few floating-point instructions.
Graph Algorithms. Table Percentages of load-store instructions for the SPLASH benchmark and the graph problems. ST, MST, BiCC, and CC stand for spanning tree, minimum spanning tree, biconnected components, and connected components, respectively

             SPLASH                               Graph problems
Benchmark    Barnes   Cholesky   Ocean   Raytrace   ST     MST    BiCC   CC
Load         .%       .%         .%      .%         .%     .%     .%     .%
Store        .%       .%         .%      .%         %      .%     .%     .%
Load+store   .%       .%         .%      .%         .%     .%     .%     .%

Graph Algorithms. Table CPI construction for three graph algorithms. Base cycles are for “useful” work. The “Stall” column shows the percentage of cycles spent on pipeline stalls, followed by stalls due to the load-store unit, data cache misses, the floating-point unit, and the fixed-point unit, respectively

Algorithm   Base   GCT   Stall   LSU STL   DCache STL   FPU STL   FXU STL
BC          .      .     .       .         .            .         .
BiCC        .      .     .       .         .            .         .
MST         .      .     .       .         .            .         .
Implementation on Shared-Memory Machines It is relatively straightforward to map PRAM algorithms onto shared-memory machines such as symmetric multiprocessors (SMPs), multicore, and manycore systems. While these systems have a shared-memory architecture, they are by no means the PRAM used in theoretical work – synchronization cannot be taken for granted, memory bandwidth is limited, and performance requires a high degree of locality. Practical design choices need to be made to achieve high performance on such systems. Adapting to the Available Parallelism
Nick’s Class (NC) is defined as the set of all problems that run in polylogarithmic time with a polynomial number of processors. Whether a problem P is in NC is a fundamental question. The PRAM model assumes an
unlimited number of processors and explores the maximum inherent parallelism of P. Acknowledging the practical restriction of limited parallelism in real computers, Kruskal et al. [] argued that non-polylogarithmic time algorithms (e.g., sublinear time algorithms) could be more suitable than polylog algorithms for implementation at practically large input sizes. EP (short for efficient parallel) algorithms, by the definition in [], are the class of algorithms that achieve a polynomial reduction in running time with a polylogarithmic inefficiency. With EP algorithms, the design focus is shifted from reducing the complexity factors to solving problems of realistic sizes efficiently with a limited number of processors. Reducing Synchronization
When adapting a PRAM algorithm to shared-memory machines, a thorough understanding of the algorithm usually suffices to eliminate unnecessary synchronization such as barriers. Reducing synchronization thus is an implementation issue. Asynchronous algorithm design, however, is more aggressive in reducing synchronization. It sometimes allows nondeterministic intermediate results but deterministic solutions. In a parallel environment, to ensure correct final results, a total ordering of all events is often not necessary, and a partial ordering generally suffices. Relaxed constraints on ordering reduce the number of synchronization primitives in the algorithm. Bader and Cong presented a largely asynchronous spanning tree algorithm in [] that employs a constant number of barriers. The spanning tree algorithm for shared-memory machines has two main steps: () stub spanning tree and () work-stealing graph traversal. In the first step, one processor generates a stub spanning tree, that is, a small portion of the spanning tree, by randomly walking the graph for O(p) steps. The vertices of the stub spanning tree are evenly distributed into each processor’s queue, and each processor in the next step will traverse from the first element in its queue. After the traversals in step , the spanning subtrees are connected to each other by this stub spanning tree. In the graph traversal step, each processor traverses the graph (by coloring the nodes) similarly to the sequential algorithm, in such a way that each processor finds a subgraph of the final spanning tree. Work-stealing is used to balance the load for graph traversal.
One problem related to synchronization is that portions of the graph could be traversed by multiple processors and end up in different subgraphs of the spanning tree. The immediate remedy is to synchronize using either locks or barriers. With locks, coloring a vertex becomes a critical section, and a processor can only enter the critical section when it gets the lock. Although the nondeterministic behavior is now prevented, this does not perform well on large graphs due to an excessive number of locking and unlocking operations. In the proposed algorithm, no barriers are introduced in graph traversal. The algorithm runs correctly without barriers, even when two or more processors color the same vertex. In this situation, each processor will color the vertex and set as its parent the vertex it has just colored. Only one processor succeeds at setting the vertex’s parent to a final value. For example, in Fig. , processor P colored vertex u, processor P colored vertex v, and at a certain time they both find w unvisited and are now in a race to color vertex w. It makes no difference which processor colored w last, because w’s parent will be set to either u or v (and it is legal to set w’s parent to either of them; this will not change the validity of the spanning tree, only its shape). Further, this event does not create cycles in the spanning tree under the sequential consistency model. Both P and P record that w is connected to each processor’s own tree. When each of w’s unvisited children is visited by some processor, its parent will be set to w, independent of w’s parent.
Graph Algorithms. Fig. Two processors P and P see vertex w as unvisited, so each is in a race to color w and set w’s parent pointer. The shaded area represents vertices colored by P , the black area represents those marked by P , and the white area contains unvisited vertices
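A sequential Python sketch of the per-processor traversal loop (added here as an illustration; the names and the "last writer wins" convention are assumptions of the sketch) shows why the race is benign: whichever processor writes w's parent last, the result is some colored neighbor, so the tree stays valid.

def traverse(adj, color, parent, my_id, stack):
    # One processor's traversal loop. color[w] == 0 means unvisited.
    # Two processors may both see a vertex as unvisited; both will color it
    # and set its parent, and the last write simply wins.
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if color[w] == 0:      # benign race: the test and the writes are not atomic
                color[w] = my_id
                parent[w] = u      # either racing processor's value gives a valid tree
                stack.append(w)

# Usage sketch: adj is an adjacency list, and each processor starts from the
# portion of the stub spanning tree placed in its own stack.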
Choosing Between Barriers and Locks
Locks and barriers are two major types of synchronization primitives. In practice, the choice between locks and barriers may not be very clear. Take the “graft and shortcut” spanning tree algorithm as an example. For a graph G = (V, E) represented as an edge list, the algorithm starts with n isolated vertices and m processors. For edge ei = ⟨u, v⟩, processor Pi ( ≤ i ≤ m) inspects u and v, and if v < u, it grafts vertex u to v and labels ei as a spanning tree edge. The problem here is that for a certain vertex v, its multiple incident edges could cause v to be grafted to different neighbors, and the resulting tree may not be valid. To ensure that v is grafted to only one of its neighbors, locks can be used. Associated with each vertex v is a flag variable, protected by a lock, that shows whether v has been grafted. In order to graft v, a processor has to obtain the lock and check the flag; thus race conditions are prevented. A different solution uses barriers [] in a two-phase election. No checking is needed when a processor grafts a vertex, but after all processors are done (ensured with barriers), a check is performed to determine which one succeeded, and the corresponding edge is labeled as a tree edge. Whether to use a barrier or a lock depends on the algorithm design as well as on the barrier and lock implementations. Locking typically introduces large memory overhead. When contention among processors is intense, the performance degrades significantly.
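The lock-based variant described above can be sketched as follows (an illustration added here; the class and function names are assumptions, and Python's threading.Lock stands in for whatever primitive a real SMP code would use).

import threading

class Vertex:
    def __init__(self):
        self.grafted = False
        self.lock = threading.Lock()

def try_graft(vertices, parent, u, v):
    # Lock-protected grafting: only the first processor to acquire u's lock
    # and find the flag clear may graft u onto v; later attempts are rejected.
    with vertices[u].lock:
        if not vertices[u].grafted:
            vertices[u].grafted = True
            parent[u] = v
            return True   # the edge (u, v) becomes a spanning tree edge
    return False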
Cache Friendly Design
The increasing speed difference between processor and main memory makes cache and memory access patterns important factors for performance. The fact that modern processors have multiple levels of memory hierarchy is generally not reflected by most parallel models. As a result, few parallel algorithm studies have touched on the cache performance issue. The SMP model proposed by Helman and JáJá is the first effort to model the impact of memory access and cache on an algorithm’s performance []. The model forces an algorithm designer to reduce the number of noncontiguous memory accesses. However, it does not give hints for the design of cache-friendly parallel algorithms. Chiang et al. [] presented a PRAM simulation technique for designing and analyzing efficient external-memory (sequential) algorithms for graph problems. This technique simulates the PRAM memory by
keeping a task array of O(N) on disk. For each PRAM step, the simulation sorts a copy of the contents of the PRAM memory based on the indices of the processors for which they will be operands, and then scans this copy and performs the computation for each processor being simulated. The following can be easily shown: Theorem Let A be a PRAM algorithm that uses N processors and O(N) space and runs in time T. Then A can be simulated in O(T ⋅ sort(N)) I/Os []. Here, sort(N) represents the optimal number of I/Os needed to sort N items striped across the disks, and scan(N) represents the number of I/Os needed to read N items striped across the disks. Specifically, sort(x) = (x/(DB)) log_{M/B}(x/B) and scan(x) = x/(DB), where M is the number of items that can fit into main memory, B is the number of items per disk block, and D is the number of disks in the system. A similar technique can be applied to the cache-friendly parallel implementation of PRAM algorithms for large inputs. I/O-efficient algorithms exhibit good spatial locality behavior that is critical to good cache performance. Instead of having one processor simulate the PRAM step, p ≪ n processors may perform the simulation concurrently. The simulated PRAM implementation is expected to incur few cache block transfers between different levels. For small input sizes, it would not be worthwhile to apply this technique as most of the data structures can fit into cache. As the input size increases, the cost to access memory becomes more significant, and applying the technique becomes beneficial. Algorithmic Optimizations
For most problems, parallel algorithms are inherently more complicated than their sequential counterparts, incurring large overheads with many algorithm steps. In many cases, reducing the constant factors, rather than lowering the asymptotic complexity, improves performance. Cong and Bader demonstrate the benefit of such optimizations with their biconnected components algorithm []. The algorithm eliminates edges that are not essential in computing the biconnected components. For any
input graph, edges are first eliminated before the computation of biconnected components is done, so that at most min(m, n) edges are considered. Although applying the filtering algorithm does not improve the asymptotic complexity, in practice the performance of the biconnected components algorithm can be significantly improved. An edge e is considered nonessential for biconnectivity if removing e does not change the biconnectivity of the component to which it belongs. Filtering out nonessential edges when computing biconnected components (these edges are put back in later) yields performance advantages. The Tarjan–Vishkin algorithm (TV) is all about finding the equivalence relation R′c* [, ]. Of the three conditions for R′c, it is trivial to check the condition that involves a tree edge and a non-tree edge. The other conditions, however, involve two tree edges, and checking them requires the computation of high and low values. To compute high and low, every non-tree edge of the graph is inspected, which is very time consuming when the graph is not extremely sparse. The fewer edges the graph has, the faster the Low-high step. Also, when building the auxiliary graph, fewer edges in the original graph mean a smaller auxiliary graph and faster Label-edge and Connected-components steps. Combining the filtering algorithm for eliminating nonessential edges with TV, the new biconnected components algorithm runs in max(O(d), O(log n)) time with O(n) processors on CRCW PRAM, where d is the diameter of the graph. Asymptotically, the new algorithm is not faster than TV. In practice, however, parallel speedups of up to with processors are achieved on a Sun Enterprise using the filtering technique.
Implementation on Multithreaded Architectures Graph algorithms have been observed to run well on multi-threaded architectures such as the CRAY MTA- [] and its successor, the Cray XMT. The Cray MTA [] is a flat, shared-memory multiprocessor system. All memory is accessible and equidistant from all processors. There is no local memory and no data caches. Parallelism, and not caches, is used to tolerate memory and synchronization latencies. An MTA processor consists of hardware streams and one instruction pipeline. Each stream can have up to outstanding memory operations. Threads from the
same or different programs are mapped to the streams by the runtime system. A processor switches among its streams every cycle, executing instructions from nonblocked streams in a fair manner. As long as one stream has a ready instruction, the processor remains fully utilized. Bader et al. compared the performance of list-ranking algorithms on SMPs and the MTA []. For list ranking, they used two classes of lists to test the algorithms: Ordered and Random. Ordered places each element in the array according to its rank; thus, node i is at the ith position of the array and its successor is the node at position (i + ). Random places successive elements randomly in the array. Since the MTA maps contiguous logical addresses to random physical addresses, the layout in physical memory for both classes is similar, and the performance on the MTA is independent of order. This is in sharp contrast to SMP machines, which rank Ordered lists much faster than Random lists. On the SMP, there is a factor of – difference in performance between the best case (an ordered list) and the worst case (a randomly ordered list). On the ordered lists, the MTA is an order of magnitude faster than the SMP, while on the random list, the MTA is approximately times faster.
Implementation on Distributed Memory Machines As data partitioning and explicit communication are required, implementing highly irregular algorithms is hard on distributed-memory machines. As a result, although many fast theoretic algorithms exist in the literature, few experimental results are known. As for performance, the adverse impact of irregular accesses is magnified in the distributed-memory environment when memory requests served by remote nodes experience long network latency. Two studies have demonstrated reasonable parallel performance with distributed-memory machines [, ]. Both studies implement parallel breadth-first search (BFS), one on BlueGene/L [] and the other on CELL/BE []. The CELL architecture resembles a distributed-memory setting as explicit data transfer is necessary between the local storage on an SPE and the main memory. Neither study establishes a strong evidence for fast execution of parallel graph algorithms on distributed-memory systems. The individual CPU in BlueGene and the PPE
of CELL are weak compared with other Power processors or the SPE. It is hard to establish a meaningful baseline to compare the parallel performance against. Indeed, in both studies, either only wall clock times or speedups relative to other reference architectures are reported. Partitioned global address space (PGAS) languages such as UPC and X [, ] have been proposed recently that present a shared-memory abstraction to the programmer for distributed-memory machines. They allow the programmer to control the data layout and work assignment for the processors. Mapping shared-memory graph algorithms onto distributed-memory machines is straightforward with PGAS languages. Performance-wise, however, a straightforward PGAS implementation of an irregular graph algorithm does not usually achieve high performance due to the aggregate startup cost of many small messages. Cong, Almasi, and Saraswat presented a study of optimizing UPC implementations of graph algorithms in []. They improve both the communication efficiency and the cache performance of the algorithms by improving their locality behavior.
Some Experimental Results Greiner [] implemented several connected components algorithms using NESL on the Cray Y-MP/C and TMC CM-. Hsu, Ramachandran, and Dean [] also implemented several parallel algorithms for connected components. They report that their parallel code runs times slower on a MasPar MP- than Greiner’s results on the Cray, but Hsu et al.’s implementation uses one-fourth of the total memory used by Greiner’s approach. Krishnamurthy et al. [] implemented a connected components algorithm (based on Shiloach–Vishkin []) for distributed-memory machines. Their code achieved a speedup of using a -processor TMC CM- on graphs with underlying D and D regular mesh topologies, but virtually no speedup on sparse random graphs. Goddard, Kumar, and Prins [] implemented a connected components algorithm for a mesh-connected SIMD parallel computer, the -processor MasPar MP-. They achieve a maximum parallel speedup of less than two on a random graph with , vertices and about one million edges. For a
random graph with , vertices and fewer than a half-million edges, the parallel implementation was slower than the sequential code. Chung and Condon [] implemented a parallel minimum spanning tree (MST) algorithm based on Borůvka’s algorithm. On a -processor CM-, for geometric graphs with , vertices and average degree and graphs with fewer vertices but higher average degree, their code achieved a parallel speedup of about , on -processors, over the sequential Borůvka’s algorithm, which was – times slower than their sequential Kruskal algorithm. Dehne and Götz [] studied practical parallel algorithms for MST using the BSP model. They implemented a dense Borůvka parallel algorithm, on a -processor Parsytec CC-, that works well for sufficiently dense input graphs. Using a fixed-size input graph with , vertices and , edges, their code achieved a maximum speedup of . using processors for a random dense graph. Their algorithm is not suitable for sparse graphs. Woo and Sahni [] presented an experimental study of computing biconnected components on a hypercube. Their test cases are graphs that retain and % of the edges of the complete graphs, and they achieved parallel efficiencies up to . for these dense inputs. The implementation uses an adjacency matrix as the input representation, and the size of the input graphs is limited to fewer than K vertices. Bader and Cong presented their studies [, , ] of spanning tree, minimum spanning tree, and biconnected components algorithms on SMPs. They achieved reasonable parallel speedups on large, sparse inputs compared with the best sequential implementations.
Bibliography . Adler M, Dittrich W, Juurlink B, Kutyłowski M, Rieping I () Communication-optimal parallel minimum spanning tree algorithms (extended abstract). In: SPAA’: proceedings of the tenth annual ACM symposium on parallel algorithms and architectures. ACM, New York, pp – . Allen F, Almasi G et al () Blue Gene: a vision for protein science using a petaflop supercomputer. IBM Syst J ():– . Bader DA, Cong G () A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs). In: Proceedings of the th international parallel and distributed processing symposium (IPDPS ), Santa Fe
. Bader DA, Cong G () Fast shared-memory algorithms for computing the minimum spanning forest of sparse graphs. In: Proceedings of the th international parallel and distributed processing symposium (IPDPS ), Santa Fe . Bader DA, Cong G, Feo J () On the architectural requirements for efficient execution of graph algorithms. In: Proceeding of the international conference on parallel processing, Oslo, pp – . Charles P, Donawa C, Ebcioglu K, Grothoff C, Kielstra A, Praun CV, Saraswat V, Sarkar V () X: an object-oriented approach to non-uniform cluster computing. In: Proceedings of the ACM SIGPLAN conference on object-oriented programming systems, languages and applications (OOPSLA), San Diego, pp – . Chiang Y-J, Goodrich MT, Grove EF, Tamassia R, Vengroff DE, Vitter JS () External-memory graph algorithms. In: Proceedings of the symposium on discrete algorithms, San Francisco, pp – . Chung S, Condon A () Parallel implementation of Bor˚uvka’s minimum spanning tree algorithm. In: Proceedings of the tenth international parallel processing symposium (IPPS’), Honolulu, pp – . Cole R, Vishkin U () Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In: STOC’: proceedings of the eighteenth annual ACM symposium on theory of computing. ACM, New York, pp – . Cole R, Vishkin U () Approximate parallel scheduling. part II: applications to logarithmic-time optimal graph algorithms. Info Comput :– . Cong G, Bader DA () An experimental study of parallel biconnected components algorithms on symmetric multiprocessors (SMPs). In: Proceedings of the th international parallel and distributed processing symposium (IPDPS ), Denver . Cong G, Almasi G, Saraswat V () Fast PGAS implementation of distributed graph algorithms. In: Proceedings of the ACM/IEEE international conference for high performance computing, networking, storage and analysis (SC), SC’. IEEE Computer Society, Washington, DC, pp – . CPI analysis on Power. On line, . http://www.ibm.com/ developerworks/linux/library/pacpipower/index.html . Cray, Inc. () The CRAY MTA- system. www.cray.com/ products/programs/mta_/ . Culler DE, Dusseau AC, Martin RP, Schauser KE () Fast parallel sorting under LogP: from theory to practice. In: Portability and performance for parallel processing. Wiley, New York, pp – (Chap ) . Dehne F, Götz S () Practical parallel algorithms for minimum spanning trees. In: Workshop on advances in parallel and distributed systems, West Lafayette, pp – . Eckstein DM () BFS and biconnectivity. Technical Report –, Department of Computer Science, Iowa State University of Science and Technology, Ames . Goddard S, Kumar S, Prins JF () Connected components algorithms for mesh-connected parallel computers. In: Bhatt
.
.
.
.
.
. .
. .
.
. . . . .
.
SN (ed) Parallel algorithms: third DIMACS implementation challenge, – October . DIMACS series in discrete mathematics and theoretical computer science, vol . American Mathematical Society, Providence, pp – Goodrich MT () Communication-efficient parallel sorting. In STOC’: proceedings of the twenty-eighth annual ACM symposium on theory of computing. ACM, New York, pp – Greiner J () A comparison of data-parallel algorithms for connected components. In: Proceedings of the sixth annual symposium on parallel algorithms and architectures (SPAA-), Cape, May, pp – Helman DR, JáJá J () Designing practical efficient algorithms for symmetric multiprocessors. In: Algorithm engineering and experimentation (ALENEX’). Lecture notes in computer science, vol . Springer, Baltimore, pp – Hofstee HP () Power efficient processor architecture and the cell processor. In: International symposium on high-performance computer architecture, San Francisco, pp – Hsu T-S, Ramachandran V, Dean N () Parallel implementation of algorithms for finding connected components in graphs. In: Bhatt SN (ed) Parallel algorithms: third DIMACS implementation challenge, – October . DIMACS series in discrete mathematics and theoretical computer science, vol . American Mathematical Society, Providence, pp – JáJá J () An introduction to parallel algorithms. AddisonWesley, New York Krishnamurthy A, Lumetta SS, Culler DE, Yelick K () Connected components on distributed memory machines. In Bhatt SN (ed) Parallel algorithms: third DIMACS implementation challenge, – October . DIMACS series in discrete mathematics and theoretical computer science, vol . American Mathematical Society, Providence, pp – Kruskal CP, Rudolph L, Snir M () Efficient parallel algorithms for graph problems. Algorithmica ():– Paul WJ, Vishkin U, Wagener H () Parallel dictionaries in – trees. In: Tenth colloquium on automata, languages and programming (ICALP), Barcelona. Lecture notes in computer science. Springer, Berlin, pp – Scarpazza DP, Villa O, Petrini F () Efficient breadth-first search on the Cell/BE processor. IEEE Trans Parallel Distr Syst ():– Shiloach Y, Vishkin U () An O(logn) parallel connectivity algorithm. J Algorithms ():– Tarjan RE () Depth-first search and linear graph algorithms. SIAM J Comput ():– Tarjan RE, Vishkin U () An efficient parallel biconnectivity algorithm. SIAM J Comput ():– Unified Parallel C, URL: http://en.wikipedia.org/wiki/Unified_ Parallel_C Woo J, Sahni S () Load balancing on a hypercube. In: Proceedings of the fifth international parallel processing symposium, Anaheim. IEEE Computer Society, Los Alamitos, pp – Woo SC, Ohara M, Torrie E, Singh JP, Gupta A () The SPLASH- programs: characterization and methodological
considerations. In: Proceedings of the nd annual international symposium computer architecture, pp – . Yoo A, Chow E, Henderson K, McLendon W, Hendrickson B, Çatalyürek ÜV () A scalable distributed parallel breadth-first search algorithm on BlueGene/L. In: Proceedings of supercomputing (SC ), Seattle
Graph Analysis Software
SNAP (Small-World Network Analysis and Partitioning) Framework
Graph Partitioning
Bruce Hendrickson
Sandia National Laboratories, Albuquerque, NM, USA
Definition Graph partitioning is a technique for dividing work amongst processors to make effective use of a parallel computer.
Discussion
When considering the data dependencies in a parallel application, it is very convenient to use concepts from graph theory. A graph consists of a set of entities called vertices, and a set of pairs of entities called edges. The entities of interest in parallel computing are small units of computation that will be performed on a single processor. They might be the work performed to update the state of a single atom in a molecular dynamics simulation, or the work required to compute the contribution of a single row of a matrix to a matrix-vector multiplication. Each such work unit will be a vertex in the graph that describes the computation. If two units have a data dependence between them (i.e., the output of one computation is required as input to the other), then there will be an edge in the graph that joins the two corresponding vertices. For a computation to perform efficiently on a parallel machine, each of the P processors needs to have about the same amount of work to perform, and the amount of inter-processor communication must be small. These two conditions can be viewed in terms of the computational graph. The vertices of the graph (signifying units of work) need to be divided into P sets with about the same number of vertices in each. Additionally, the number of edges that connect vertices in two different sets needs to be kept small since these will reflect the need for interprocessor communication. This problem is known as graph partitioning and is an important approach to the parallelization of many applications. More generally, the vertices of the graph can have weights associated with them, reflecting different amounts of computation, and the edges can also have weights corresponding to different quantities of communication. The graph partitioning problem involves dividing the set of vertices into P sets with about the same amount of total vertex weight, while keeping small the total weight of edges that cross between partitions. This problem is known to be NP-hard, but a number of heuristics have been devised that have proven to be effective for many parallel computing applications. Several software tools have been developed for this problem, and they are an important piece of the parallel computing ecosystem. Important algorithms and tools are discussed below.

Parallel Computing Applications of Graph Partitioning
Graph partitioning is a useful technique for parallelizing many scientific applications. It is appropriate when the calculation consists of a series of steps in which the computational structure and data dependencies do not vary much. Under such circumstances the expense of partitioning is rewarded by improved parallel performance for many computational steps. The partitioning model is most applicable for bulk synchronous parallel applications in which each step consists of local computation followed by a global data exchange. Fortunately, many if not most scientific applications exhibit this basic structure. Particle simulations are one such important class of applications. The particles could be atoms in a material science or biological simulation, stars in a simulation of galaxy formation, or units of charge in an electromagnetic application. But by far the most common uses of graph partitioning involve computational meshes for the solution of differential equations. Finite volume, finite difference, and finite element methods all involve the decomposition of a complex geometry into simple shapes that interact only with near neighbors. Various graphs can be
constructed from the mesh and a partition of the graph identifies subregions to be assigned to processors. The numerical methods associated with such approaches are often very amenable to the graph partitioning approach. These ideas have been used to solve problems from many areas of computational science including fluid flow, structural mechanics, electromagnetics, and many more.
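To make the partitioning objective concrete, the following Python sketch (added here for illustration; the function and argument names are assumptions) evaluates a candidate partition by its per-processor load and by the total weight of edges cut between parts, the two quantities the problem statement above asks to balance and to minimize.

def evaluate_partition(part, vertex_weight, edge_weight, P):
    # part[v]        -> index of the set (processor) vertex v is assigned to
    # vertex_weight  -> dict of per-vertex computation weights
    # edge_weight    -> dict mapping (u, v) edges to communication weights
    load = [0.0] * P
    for v, p in part.items():
        load[p] += vertex_weight[v]
    cut = sum(w for (u, v), w in edge_weight.items() if part[u] != part[v])
    imbalance = max(load) * P / sum(load)    # 1.0 means perfectly balanced
    return load, cut, imbalance

# A good partition keeps `imbalance` close to 1.0 while making `cut` small.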
Graph Partitioning Algorithms for Parallel Computing A wide variety of graph partitioning algorithms have been proposed for parallel computing applications. Here we review some of the more important approaches. Geometric partitioning algorithms are very fast techniques for partitioning sets of entities that have an underlying geometry. For parallel computing applications involving simulations of physical phenomena in two or three dimensions, the corresponding data structures typically have geometric coordinates associated with each entity. Examples include molecules in atomistic simulations, masses in gravitational models, or mesh points in a finite element method. Recursive coordinate partitioning is a method in which the elements are recursively divided by planar cuts that are orthogonal to one of the axes []. This has the advantage of producing geometrically simple subdomains – just rectangular parallelepipeds. Recursive inertial bisection also uses planar cuts, but instead of being orthogonal to an axis, they are orthogonal to the direction of greatest inertia []. Intuitively, this is a direction in which the point set is elongated, so cutting perpendicular to this direction is likely to produce a smaller cut. Yet another alternative is to cut with circles or spheres instead of planes []. Geometric methods tend to be very fast, but produce low-quality partitions. They can be improved via local refinement methods like the approach of Fiduccia-Mattheyses discussed below. A quite different set of approaches uses eigenvectors of a matrix associated with the graph. The most popular method in this class uses the second-smallest eigenvector of the Laplacian matrix of the graph []. A justification for this approach is beyond the scope of this article, but spectral methods generally produce partitions of fairly high quality. In a global sense, they find attractive regions for cutting a graph, but they are often poor
in the fine details. This can be rectified by the application of a local refinement method. The main drawback of spectral methods is their high computational cost. Local refinement methods are epitomized by the approach proposed by Fiduccia and Mattheyses [] (FM). This method works by iteratively moving vertices between partitions in a manner that maximally reduces the size of the set of cut edges. Moves are considered even if they make the cut size larger, since they may enable subsequent moves that lead to even better partitions. Thus, this method has a limited ability to escape from local minima to search for even better solutions. The key advance underlying FM is the use of clever data structures that allow all the moves and their consequences to be explored and updated efficiently. The FM algorithm is quite fast, and consistently improves results generated by other approaches. But since it only explores sets of partitions that are not far from the initial one, it is generally limited to making small changes and will not find better partitions that are quite different. The most widely used class of graph partitioning techniques is multilevel algorithms, as they provide a good balance between speed and quality. They were independently invented by several research groups more or less simultaneously [, , ]. Multilevel algorithms work by applying a local refinement method like FM at multiple scales. This largely overcomes the myopia that limits the effectiveness of local methods. This is accomplished by constructing a series of smaller and smaller graphs that roughly approximate the original graph. The most common way to do this is to merge small clusters of vertices within the original graph (e.g., combine two vertices sharing an edge into a single vertex). Once this series of graphs is constructed, the smallest graph is partitioned using any global method. Then the partition is refined locally and extended to the next larger graph. The refinement/extension process is repeated on larger and larger graphs until a partitioning of the original graph has been produced. A number of general-purpose global optimization approaches have been proposed for graph partitioning, including simulated annealing, genetic algorithms, and tabu search. These methods can produce high-quality partitions but are usually very expensive, and so are limited to niche applications within parallel computing.
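A core ingredient of FM-style refinement is the gain of moving a vertex to the other side of a bisection; the following Python sketch (added here as an illustration, for an unweighted graph and a two-way partition) computes these gains, which FM then keeps in bucket data structures so the best move can be found and updated efficiently.

def fm_gains(adj, part):
    # adj[v] is the set of neighbors of v; part[v] is 0 or 1.
    # gain(v) = (edges to the other side) - (edges to v's own side):
    # moving v reduces the cut by exactly gain(v).
    gains = {}
    for v, nbrs in adj.items():
        internal = sum(1 for u in nbrs if part[u] == part[v])
        external = len(nbrs) - internal
        gains[v] = external - internal
    return gains

# Vertices with positive gain immediately improve the cut; FM also accepts
# zero- or negative-gain moves to climb out of shallow local minima.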
Graph partitioning is often used as a preprocessing step to set up a parallel computation. The output of a graph partitioner determines which objects are assigned to which processors, and appropriate input files and data structures are prepared for a parallel run. However, there are several situations in which the partitioning must be done in parallel. For a very large problem, the memory of a serial machine may be insufficient to hold the graph that needs to be partitioned. Also, for some classes of applications the structure of the computation changes over time. Examples include adaptive mesh simulations, or particle methods in which the particles move significantly. For such problems the workload must be periodically redistributed across the processors, and a parallel partitioning tool is required. Simple geometric algorithms have the advantage of being easy to parallelize, but multilevel partitioners have also been parallelized to provide higher-quality solutions. Techniques for effectively parallelizing such methods are an ongoing area of research. A variety of open source graph partitioning tools have been developed in serial or parallel, including Chaco, METIS, Jostle, and SCOTCH. Several of these are discussed in companion articles.
Limitations of Graph Partitioning Although widely used to enable the parallelization of scientific applications, graph partitioning is an imperfect abstraction. For a parallel application to perform well, the work must be evenly distributed among processors and the cost of interprocessor communication must be minimized. Graph partitioning provides only a crude approximation for achieving these objectives. In the graph partitioning model, each vertex is assigned a weight that is supposed to represent the time required to perform a piece of computation. On modern processors with complex memory hierarchies it is very difficult to accurately predict the runtime of a piece of code a priori. Cache performance can dominate runtime, and this is very hard to predict in advance. So the weights assigned to vertices in the graph partitioning model are just rough approximations. An even more significant shortcoming of graph partitioning has to do with communication. For most applications, a vertex has data that needs to be known by all of its neighbors. If two of those neighbors are owned by the same processor, then that data need only be communicated once. In the graph partitioning model, two
edges would be cut, and so the actual volume of communication would be over-counted. Several alternatives to standard graph partitioning have been proposed to address this problem. In one approach, the number of vertices with off-processor neighbors is counted instead of the number of edges cut. A more powerful and elegant alternative uses hypergraphs and is sketched below. Yet another deficiency in graph partitioning is that it emphasizes the total volume of communication. In many practical situations, latency is the performance-limiting factor, so it is the number of messages that matters most, not the size of messages. As discussed above, graph partitioning is most appropriate for bulk synchronous applications. If the calculation involves complex interleaving of computations with communication, or partial synchronizations, then graph partitioning is less useful. An important application with this character is the factorization of a sparse matrix. Finally, graph partitioning is only appropriate for applications in which the work and communication patterns are predictable and stable. This happens to be the case for many important scientific computing kernels, but there are other applications that do not fit this model.
Hypergraph Partitioning A hypergraph is a generalization of a graph. Whereas a graph edge connects exactly two vertices, a hyperedge can connect any subset of vertices. This seemingly simple generalization leads to improved and more general partitioning models for parallel computing []. Consider a graph in which vertices represent computation and edges represent data dependencies. For each vertex, replace all the edges connected to it with a single hyperedge that joins the vertex and all of its neighbors. When the vertices are partitioned, if a particular vertex is separated from any of its neighbors, the corresponding hyperedge will be cut. For the common situation in which the vertex needs to communicate the same information to all of its neighbors, this single hyperedge will reflect the amount of data that needs to be shared with another processor. Thus, the number of cut hyperedges (or more generally the total weight of cut hyperedges) correctly captures the total volume of communication induced by a partitioning. In this way, the
hypergraph model resolves an important shortcoming of standard graph partitioning. Hypergraphs also address a second deficiency of the graph model. If the communication is not symmetric (e.g., vertex i needs to send data to j, but j does not need to send data to i), then the graph model has difficulty capturing the communication requirements. The hypergraph model does not have this problem. A hyperedge simply spans a vertex i and every other vertex that i needs to send data to. There is no implicit assumption of symmetry in the construction of the hypergraph model. Graph and hypergraph partitioning models, algorithms, and software continue to be active areas of research in parallel computing. PaToH and hMETIS are widely used hypergraph partitioning tools for parallel computing.
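The construction described above can be sketched in a few lines of Python (added here as an illustration; the function names are assumptions). Each vertex contributes one hyperedge spanning itself and the vertices it must send data to, and a hyperedge whose pins land in more than one part represents data that must be communicated.

def graph_to_hypergraph(sends):
    # sends[v] is the set of vertices v must send its data to (possibly
    # asymmetric). One hyperedge per vertex: the vertex plus its receivers.
    return {v: {v, *receivers} for v, receivers in sends.items()}

def cut_hyperedges(hyperedges, part):
    # A hyperedge is cut when its pins span more than one part; counting cut
    # hyperedges tracks the communication volume induced by the partition.
    return sum(1 for pins in hyperedges.values()
               if len({part[v] for v in pins}) > 1)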
Related Entries
Chaco
Load Balancing, Distributed Memory
Hypergraph Partitioning
METIS and ParMETIS
Bibliographic Notes and Further Reading Graph partitioning is a well-studied problem in theoretical computer science and is known to be difficult to solve optimally. For parallel computing, the challenge is to find algorithms that are effective in practice. Algebraic methods like Laplacian partitioning [] are an important class of techniques, but can be expensive. Local refinement techniques like FM are also important [], but get caught in local optima. Multilevel methods seem to offer the best trade-off between cost and performance [, , ]. Hypergraph partitioning provides an important alternative to graph partitioning in many instances []. A survey of different partitioning models can be found in the paper by Hendrickson and Kolda []. Several good codes for graph partitioning are available on the Internet, including Chaco, METIS, PaToH, and Scotch.
Acknowledgment Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the
US Department of Energy under contract DE-ACAL.
Bibliography . Berger MJ, Bokhari SH () A partitioning strategy for nonuniform problems on multiprocessors. IEEE Trans Comput C-():– . Bui T, Jones C () A heuristic for reducing fill in sparse matrix factorization. In: Proceedings of the th SIAM Conference on parallel processing for scientific computing, SIAM, Portsmouth, Virginia, pp – . Çatalyürek U, Aykanat C () Decomposing irregularly sparse matrices for parallel matrix-vector multiplication. In: Lecture notes in computer science , Proceedings Irregular’, SpringerVerlag, Heidelberg, pp – . Cong J, Smith ML () A parallel bottom-up clustering algorithm with application to circuit partitioning in VLSI design. In: Proceedings of the th Annual CAM/IEEE International Design Automation Conference, DAC’, ACM, San Diego, CA, pp – . Fiduccia CM, Mattheyses RM () A linear time heuristic for improving network partitions. In: Proceedings of the th ACM/IEEE Design Automation Conference, ACM/IEEE, Las Vegas, NV, June , pp – . Hendrickson B, Kolda T () Graph partitioning models for parallel computing. Parallel Comput :– . Hendrickson B, Leland R () A multilevel algorithm for partitioning graphs. In: Proceedings of Supercomputing ’, ACM, New York, December . Previous version published as Sandia Technical Report SAND–, Albuquerque, NM . Miller GL, Teng SH, Vavasis SA () A unified geometric approach to graph separators. In: Proceedings of the nd Symposium on Foundations of Computer Science, IEEE, Pittsburgh, PA, October , pp – . Simon HD () Partitioning of unstructured problems for parallel processing. In: Proceedings of the conference on parallel methods on large scale structural analysis and physics applications. Pergammon Press, Elmsford, NY
Graph Partitioning Software
Chaco
METIS and ParMETIS
PaToH (Partitioning Tool for Hypergraphs)
Graphics Processing Unit
NVIDIA GPU
Green Flash: Climate Machine (LBNL)
John Shalf, David Donofrio, Chris Rowen, Leonid Oliker, Michael Wehner
Lawrence Berkeley National Laboratory, Berkeley, CA, USA
CEO, Tensilica, Santa Clara, CA, USA
Synonyms LBNL climate computer; Manycore; Tensilica; View from Berkeley
Definition Green Flash is a research project focused on an application-driven manycore chip design that leverages commodity-embedded circuit designs and hardware/software codesign processes to create a highly programmable and energy-efficient HPC design. The project demonstrates how a multidisciplinary hardware/software codesign process that facilitates close interactions between application scientists, computer scientists, and hardware engineers can be used to develop a system tailored for the requirements of scientific computing. By leveraging the efficiency gained from an application-driven design philosophy, advanced processor synthesis tools from Tensilica, FPGA-accelerated architectural simulation from RAMP, and auto-tuning for rapid optimization of the software implementation, the project demonstrated how a hardware/software codesign process can achieve a × increase in energy efficiency over its contemporaries using cost-effective commodity-embedded building blocks. To demonstrate the application-driven design process, Green Flash was tailored for high-resolution global cloud-resolving models, which are the leading justification for exascale computing systems. However, the approach can be generalized to a broader array of scientific applications. As such, Green Flash represents a vision of a new design process that could be used to develop effective exascale-class HPC systems.
Discussion Introduction The scientific community is facing one of its greatest challenges in the prediction of global climate change –
a question whose answer has staggering economic, political, and sociological ramifications. The computational power required to inform such critical policy decisions demands a new breed of extreme-scale computers to accurately model the global climate. The "business as usual" approach of using commercial off-the-shelf (COTS) hardware to build ever-larger clusters is increasingly unsustainable beyond the petaflop scale due to the constraints of power and cooling. Some estimates indicate an exaflop-capable machine would consume close to MW of power. Such unreasonable power costs drive the need for a radically new approach to HPC system design. Green Flash is a theoretical system designed through an application-driven hardware and software codesign process for HPC systems that leverages the innovative low-power architectures and design processes of the low-power/embedded computing industry. Green Flash is the result of Berkeley Lab's research into energy-efficient system design – many details that are common to all system design, such as power, cooling, and mechanical design, are not addressed in this research as they are not unique to Green Flash and are challenges that would need to be overcome regardless of system architecture. The work presented here represents the energy efficiency gained through application-tailored architectures that leverage embedded processors to build energy-efficient manycore processors.
History In , a group of University of California researchers, with backgrounds ranging from circuit design, computer architecture, CAD, embedded hardware/software, programming languages, compilers, and applied math to HPC, met for a period of two years to consider how current constraints on device physics at the silicon level would affect CPU design, system architecture, and programming models for future systems. The results of the discussions are documented in the University of California Berkeley Technical Report entitled "The Landscape of Parallel Computing Research: A View from Berkeley." This report was the genesis of both the UC Berkeley ParLab, which was funded by Intel and Microsoft, and the Green Flash project. Whereas the ParLab carried the work of the View from Berkeley forward for desktop and handheld applications,
the Green Flash project took the same principles and applied them to the design of energy-efficient scientific computing systems. Hardware/software codesign, a methodology that allows both software optimization and semi-specialized processor design to be developed simultaneously, has long been a feature of power-sensitive embedded system designs, but thus far has seen very little application in the HPC space. However, now that power has become the leading design constraint of future HPC systems, codesign and other application-driven design processes have received considerably more attention. Green Flash leverages tools that were developed by Tensilica for rapid synthesis of application-optimized CPU designs, and retargets them to designing processors that are optimized for scientific applications. The project also created novel inter-processor communication mechanisms to enable an easier-to-program environment than its GPU contemporaries – providing hardware support for more natural programming environments based on partitioned global address space programming models.
Approach It is widely agreed that architectural specialization can significantly improve efficiency; however, creating full-custom designs of HPC systems has often proven impractical due to excessive design/verification costs and lead times. The embedded processor market relies on architectural customization to meet the demanding cost and power efficiency requirements of its products with a short turn-around time. With time to market a key element in profitability, sophisticated toolchains have been developed to enable rapid and cost-effective turn-around of power-efficient semi-custom implementations appropriate to each specific processor design. Green Flash leverages these same toolchains to design power-efficient exascale systems, tailoring embedded chips to target scientific applications and providing a feedback path from the application programmer to the hardware design, enabling a tight hardware/software codesign loop that is unprecedented in the HPC industry. Auto-tuning technologies are used to automate the software tuning process and maintain portability across the differing architectures produced inside the codesign loop. Auto-tuners can automatically search over a broad parameter space
of optimizations to improve the computational efficiency of application kernels and help produce a more balanced architecture. To enable fast, accurate performance evaluation, a Field Programmable Gate Array (FPGA) based hardware emulation platform will be used to allow an experimental architecture to be evaluated at speeds , × faster than typical software-based simulation methods – fast enough to allow execution of a real application rather than an arbitrary benchmark. High-resolution global cloud system resolving models are the target application that will motivate Green Flash's architectural decisions. A truly exascale problem, a . km scale model would decompose the earth's atmosphere into twenty-billion individual cells, and a machine with unprecedented performance would need to be realized in order for the model to run faster than real time. While using more power-efficient, off-the-shelf embedded processors is a crucial first step in meeting this challenge, it is still insufficient. Green Flash will offer many other novel optimizations, both hardware and software, including alternatives to cache coherence that enable far more efficient inter-processor communication than a conventional symmetric multiprocessing (SMP) approach and aggressive, architecture-specific software optimization through auto-tuning. All these specialization techniques will allow Green Flash to efficiently meet the exascale computation requirements of global climate change prediction.
Modeling the Earth's Climate System Current-generation climate models are comprehensive representations of the various systems that determine the Earth's climate. Models prepared for the fourth report of the Intergovernmental Panel on Climate Change coupled submodels of the atmosphere, ocean, and sea ice together to provide simulations of the past, present, and future climate. It is expected that the major remaining components of the climate system, the terrestrial and oceanic biosphere, the Greenland and Antarctic ice sheets, and certain aspects of atmospheric chemistry, will be represented in models currently being prepared for the next report. Each of the subsystem models has its own strengths and weaknesses, and each introduces a certain amount of uncertainty into projections of the future. Current computational resources limit the resolution of these submodels and are a contributor to these uncertainties. In
particular, resolution constraints on models of atmospheric processes do not allow clouds to be resolved, forcing model developers to rely on sub-grid-scale parameterizations based on statistical methods. However, simulations of the recent past produce cloud distributions that do not agree well with observations. These disagreements, traceable to the cumulus convection parameterizations, lead to other errors in patterns of the Earth's radiation and moisture budgets. Current global atmospheric models have resolutions of order km, obviously many times larger than individual clouds. Development of models at the limit of the validity of cumulus parameterization (∼ km) is now underway by a few groups, although the necessary century-scale integrations are just barely feasible on the largest current computing platforms. It is expected that many issues will be rectified by this increase in horizontal fidelity but that the fundamental limitations of cumulus parameterization will remain. The solution to this problem is to directly simulate cloud processes rather than attempt to model them statistically. At horizontal grid spacings of order ∼ km, cloud systems can be individually resolved, providing this direct numerical simulation. However, the computational burden of fluid dynamics algorithms scales nonlinearly with the number of grid points due to time step limitations imposed by numerical stability requirements. Hence, the computational resources necessary to carry out century-scale simulations of the Earth's climate dwarf any traditional machine currently under development.
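As a rough, illustrative check of that nonlinear scaling (the grid spacings below are stand-ins chosen for illustration, not figures from this entry, and the argument is the standard CFL-type one): halving the horizontal grid spacing quadruples the number of columns and roughly doubles the number of time steps, so the total cost grows approximately as the cube of the refinement ratio.

#include <stdio.h>
#include <math.h>

/* Illustrative cost-scaling estimate for refining a 2-D horizontal grid
   under an explicit (CFL-limited) time step.  The resolutions are
   placeholders, not values taken from the entry. */
int main(void) {
    double dx_coarse = 200.0;  /* km, hypothetical current global model spacing */
    double dx_fine   = 1.0;    /* km, cloud-system-resolving target */
    double ratio = dx_coarse / dx_fine;

    double grid_points = ratio * ratio;  /* horizontal points grow quadratically */
    double time_steps  = ratio;          /* explicit time step shrinks with dx */
    printf("grid points: x%.0f, time steps: x%.0f, total cost: about x%.0f\n",
           grid_points, time_steps, pow(ratio, 3.0));
    return 0;
}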
Climate Model Requirements Extrapolation from measured computational requirements of existing atmospheric models allow estimates of what would be necessary at resolutions of order km to support Global Cloud Resolving Models. To better make these estimates, the Green Flash project has partnered with Prof. David Randall’s group at the Colorado State University (CSU). In their approach, the globe is represented by a mesh based on an icosahedron as the starting point. By successively bisecting the sides of the triangles making up this object, a remarkably uniform mesh on the sphere can be generated. However, this is not the only way to discretize the globe at this resolution and it will be important to have a variety of independent cloud system–resolving models if projections of the future are to have any credibility. For this reason it
is important to emphasize that Green Flash will not be built to run only this particular discretization. Rather, this approach calls for optimizing a system for a class of scientific applications; therefore, Green Flash will be able to efficiently run most global climate models. Extrapolation based on today's cluster-based, general-purpose HPC systems produces estimates that the sustained computational rate necessary to simulate the Earth's climate , times faster than it actually occurs was Pflops. A tentative estimate from the CSU model is as much as Pflops. This difference can be regarded as one measure of the considerable uncertainty in making these estimates. As the CSU model matures, there will be the opportunity to determine this rate much more accurately. Multiple realizations of individual simulations are necessary to address the statistical complexities of the climate system. Hence, an exaflop-scale machine would be necessary to carry out this kind of science. The exact peak flop rate required depends greatly on the efficiency with which the machine could be used. These enormous sustained computational rates are not even imaginable if there is not enough parallelism in the climate problem. Fortunately, cloud system resolving models at the kilometer scale do offer plenty of opportunity to decompose the physical domain. Bisection of the triangles composing the icosahedron twelve successive times produces a global mesh with ,, vertices spaced between and km apart. A logically rectangular two-dimensional domain decomposition strategy can be applied horizontally to the icosahedral grid. Choosing square segments of the mesh containing grid points each ( × ) results in ,, horizontal domains. The vertical dimension offers additional parallelism. Assuming that layers could be decomposed in separate vertical domains, the total number of physical sub-domains could be ,,. Twenty-one-million-way parallelism may seem mind-boggling, but this particular strawman decomposition was devised with practical constraints on the performance of an individual core in an SMP in mind. Each of the -million cores in this system will be assigned a small group of sub-domains on which it will execute the full (physics, dynamics, etc.) climate model. Flops per watt is the key performance metric for designing the SMP for Green Flash, and the goal of × energy efficiency over existing machines will be achieved by
tailoring the architecture to the needs of the climate model. One example that drives the need for a high core count per socket is the model's communication pattern. While the model is nearest-neighbor dominated, the majority of the more latency-sensitive communication occurs between the vertical layers. By keeping this communication on-chip, the more latency-tolerant horizontal communication can be sent off-chip with less performance penalty. Looking at per-core features, by running at a lower ( MHz) clock speed relative to today's server-class processors, Green Flash gains a cubic improvement in power consumption. Removal of processor features that are nonoptimal for science, such as Translation Look-aside Buffers (TLBs) and Out-of-Order (OOO) processing, creates a smaller die, which reduces leakage to help further reduce power.
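The decomposition arithmetic described above can be reproduced from the standard geodesic-grid refinement formula: bisecting every triangle edge of an icosahedral mesh k times yields 10 · 4^k + 2 vertices. The sketch below uses that formula with the twelve bisections mentioned in the text; the segment size and the number of vertical domains are illustrative assumptions, since the entry's exact counts are not reproduced here.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Icosahedral geodesic grid: k successive edge bisections give
       10 * 4^k + 2 vertices (standard refinement formula). */
    int k = 12;
    uint64_t vertices = 10ull * (1ull << (2 * k)) + 2;

    /* Illustrative decomposition assumptions: an 8 x 8 patch of grid
       points per horizontal domain and 8 vertical domains. */
    uint64_t points_per_domain = 8 * 8;
    uint64_t vertical_domains  = 8;

    uint64_t horizontal_domains = vertices / points_per_domain;
    uint64_t total_subdomains   = horizontal_domains * vertical_domains;

    printf("vertices           : %llu\n", (unsigned long long)vertices);
    printf("horizontal domains : %llu\n", (unsigned long long)horizontal_domains);
    printf("total subdomains   : %llu\n", (unsigned long long)total_subdomains);  /* ~21 million */
    return 0;
}

With these illustrative assumptions the total lands near the twenty-one-million-way figure quoted above, which is what makes the strawman decomposition plausible.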
Hardware Design Flow Verification costs are quickly becoming the dominant force in custom hardware solutions. Large arrays of simple processors again hold a significant advantage here, as the work to verify a simple processing element and then replicate it across the die is significantly lower. The long lead times and high costs of doing custom designs have generally dissuaded the HPC community from custom solutions and pushed it toward clusters of COTS (commercial off-the-shelf) hardware. The traditional definition of COTS in the HPC space is typically at the board or socket level; Green Flash seeks to redefine this notion of COTS and asserts that a custom processor made up of pre-verified building blocks can still be considered COTS hardware. This fine-grained view permits Green Flash to benefit from both the architectural specialization afforded by these specialized processing elements and the shorter lead times and reduced verification costs that come with using a building-block approach. The constraints of power have long directed the development of embedded architectures, and so it is advantageous to begin with an embedded core and leverage the sophisticated toolchains developed to minimize time from architectural specification to ASIC. These toolchains start with a collection of pre-verified functional units and allow them to be combined in a myriad of ways, rapidly producing power-efficient semi-custom designs. For instance, starting with a base architecture, a designer may wish to add floating-point
support to a processor, or perhaps add a larger cache or local store. These functional units can be added to a processor design as easily as clicking a checkbox or dropdown menu. The tool will then select the correct functional unit from its library and integrate it into the design – all without the designer needing to intervene. These tools eliminate large amounts not only of boilerplate, but also of full-custom logic that once needed to be written and rewritten in order to change a processor's architecture. Of course, the tools are not boundless and are subject to the same design limitations as any other physical design process – for instance, one cannot efficiently have hundreds of read ports from a single memory, but the amount of flexibility created through these tools vastly outweighs any inherent limitations. The rapid generation of processor cores alone makes these tools very interesting; however, the overhead of generating a usable software stack for each processor would negate the time saved developing the hardware. While adding caches or changing bus widths has little effect on the ISA, and therefore a minimal software impact, adding a new functional unit such as floating-point or integer division has a large impact on the software flow. Building custom hardware creates significant work not only in the creation of a potentially complex software stack but also in a time-consuming verification process. As with the software stack, without a method to jump-start the verification process the tools would begin to lose their effectiveness. To address both of these critical issues, the tools generate optimizing compilers, test benches, and a functional simulator in parallel with the RTL for the design. Having the processor constructed of pre-verified building blocks, combined with the automatic generation of test benches, greatly reduces the risk and time required for formal verification. To help maintain backward and general-purpose compatibility, the processor's ISA is restricted to one that is functionally complete and allows for the execution of general-purpose code.
A Science-Optimized Processor Design The processor design for Green Flash is driven by efficiency, and the best way to reduce power consumption and increase efficiency is to reduce waste. With that in mind, the target architecture calls for a very simple, in-order core with no branch prediction. The heavy memory and communication requirements demanded by the
climate model have imparted the greatest influence on the design of the Green Flash core. Building on prior work from Williams et al., which showed that for memory-intensive applications cores with a local store, such as Cell, were able to utilize a higher percentage of the available DRAM bandwidth, the target processor architecture includes a local store. In the Green Flash SMP design there will be two on-chip networks, as illustrated in Fig. . As can be somewhat expected, the majority of communication that occurs between subdomains within the climate model is nearest neighbor. Building on work from Balfour and Dally [], a packet-switched network with a concentrated torus topology was chosen for the SMP, as it has been shown to provide superior performance and energy efficiency for codes where the dominant communication pattern is nearest neighbor. To further optimize the Green Flash processor for science, the programming model is being treated as a first-class citizen when designing the architecture. Traditional cache-coherent models found in many modern SMPs do not allow fine-grained synchronization between cores. In fact, benchmarking the current climate model on present-day machines shows that greater than % of execution time is spent in communication. By creating an architecture where an individual core does not pay a huge overhead penalty for sending or receiving a relatively small message, the amount of time spent in communication can be greatly reduced. The processing cores used in the Green Flash SMP have powerful, flexible streaming interfaces. Each processor can have multiple, designer-defined ports with a simple FIFO-like interface, with each port capable of sending and receiving a packet of data on each clock. This low-overhead streaming interface will bypass the cache and connect to one of the torus networks on chip. This narrow network can be used for exchange of addresses, while the wider torus network is used for exchange of data. Following a Partitioned Global Address Space (PGAS) model, the address space of each processor's local store is mapped into the global address space, and the data exchange is done as a DMA from local store to local store. This allows the communication between processors to map very well to an MPI send/receive model used by the climate model and many other scientific codes. The view to the programmer will be as though all processors are directly connected to their
neighbors. To further simplify programming, a traditional cache hierarchy is also in place to allow codes to be slowly ported to the more efficient local-store based interprocessor network. In order to minimize power, the use of photonic interlinks for the inter-core network is being investigated as an efficient method of transferring long messages. In the case of Green Flash, the data network is one cache line in width and will consist of several phases per message.
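The communication model described above can be approximated in ordinary cluster software today. As a rough analogy only (standard MPI one-sided operations standing in for the Green Flash local-store DMA hardware; the buffer size and the 1-D ring topology are hypothetical), a PGAS-style halo exchange might look like the following sketch:

#include <mpi.h>
#include <stdio.h>

#define HALO 64  /* hypothetical halo size, in doubles */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Stand-in for a local store: expose a receive buffer in a window
       that every rank can address, much as Green Flash maps local
       stores into a partitioned global address space. */
    double recv_halo[HALO], send_halo[HALO];
    for (int i = 0; i < HALO; i++) send_halo[i] = rank;  /* dummy payload */

    MPI_Win win;
    MPI_Win_create(recv_halo, HALO * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int right = (rank + 1) % size;  /* nearest neighbor in a 1-D ring */

    /* One-sided put: the sender deposits its halo directly into the
       neighbor's exposed buffer (the DMA-like step). */
    MPI_Win_fence(0, win);
    MPI_Put(send_halo, HALO, MPI_DOUBLE, right, 0, HALO, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    printf("rank %d received halo from rank %d: %.0f\n",
           rank, (rank - 1 + size) % size, recv_halo[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}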
Hardware/Software Codesign Strategy Conventional approaches to hardware design generally have a long latency between hardware design and software development/optimization, so designers frequently rely on benchmark codes to find a power-efficient architecture. However, modern compilers fail to generate even close to optimal code for target machines. Therefore, a benchmark-based approach to hardware design does not exploit the full performance potential of the architecture design points under consideration, leading to possibly sub-optimal hardware solutions. The success of auto-tuners has shown that it is still possible to generate efficient code using domain knowledge. In combination with the ability to rapidly produce semi-custom hardware designs, a tight, effective hardware/software codesign loop can be created. The codesign approach, as shown in Fig. , incorporates extensive software tuning into the process of hardware design. Hardware design space exploration is routinely done to tailor the hardware design parameters to the target applications. The auto-tuned software tailors the application to the hardware design point under consideration by empirically searching over a range of software implementations to find the best mapping of the software to the micro-architecture. One of the hindrances to the practical relevance of codesign is the large hardware/software design space exploration. Conventional hardware design approaches use software simulation of hardware to perform hardware design space exploration. Because codesign involves searching over a much larger design space (there is now a need to explore the software design space at each hardware design point), codesign is impractical if software simulation of hardware is used. Rather than be constrained by the limitations of a software simulation environment, it is possible instead to take advantage of the processor generation
Green Flash: Climate Machine (LBNL). Fig. A concentrated torus network fabric (Xtensa cores with caches and local stores, arbitrated access to DRAM, and links to the global network) yields the highest performance and most power efficient design for scientific codes
toolchain’s ability to create synthesizable RTL for any given processor. By loading this design onto an FPGA, a potential processor design can be emulated running × faster than a functional simulator. This speedup allows the benchmarking of true applications rather than being forced to rely on representative code snippets or statically defined benchmarks. Furthermore, this speed advantage does not come at the expense of accuracy; to the contrary, FPGA emulation is arguably much
more accurate than a software simulation environment as it truly represents the hardware design. This fast accurate emulation environment provides the ability to run and benchmark the actual climate model as it is being developed and allows the codesign infrastructure to quickly search a large design space. The speed and accuracy advantages of using FPGAs have typically been dwarfed by the increased complexity of coding in Verilog or VHDL versus C++ or Python
Green Flash: Climate Machine (LBNL). Fig. The result of combining an existing auto-tuning framework (a) with a rapid hardware design cycle and FPGA emulation is (b) a proposed hardware/software codesign flow. Panel (a): the auto-tuning flow parses a reference implementation into an internal abstract syntax tree, applies transformation and code generation with high-level knowledge, and uses strategy and search engines to benchmark a myriad of equivalent, optimized implementations (serial and parallel; FORTRAN, C with pthreads, CUDA) before selecting the best-performing implementation and configuration parameters. Panel (b): the proposed co-tuning flow starts from a reference hardware and software configuration, repeatedly generates new hardware configurations and code variants, benchmarks them and estimates area and power, and iterates until acceptable software performance and efficiency are reached, yielding optimized hardware and software configurations
as well as the ability to emulate large designs due to limitations in FPGA area/LUT count. The practicality of using FPGAs for large system emulation has increased dramatically over the past decade. The ability to access relatively large dynamic memories, such as DDR, has always been a difficult challenge with FPGAs due to the tight timing requirements. FPGA vendors such as Xilinx have eased this difficulty by providing IP through the Memory Interface Generator (MIG) tool and adding IO features to the Virtex- series. Freely available Verilog IP libraries – whether Xilinx CoreGen or the RAMP group's GateLib – allow for a modular, building-block approach to hardware design. Finally, while commercial microprocessors are experiencing a plateau in their clock rates and power consumption, FPGAs are not. FPGA LUT counts continue to increase, allowing the emulation of more complex designs, and FPGA clock rates, while traditionally significantly slower than those of commercial microprocessors, have been growing steadily, closing the gap between emulated and production clock rates. In the case of Green Flash, the relatively low target clock frequency ( MHz) of the final ASIC is an additional motivation to target an FPGA emulation environment. The current emulated processor design runs at MHz – a significant fraction of the target clock rate. This relatively high speed enables the efficient benchmarking of an entire application rather than a representative portion. While the steady growth in LUT count on FPGAs has enabled the emulation of more complex designs, with a strawman architecture of cores per socket it is necessary to emulate more than the two or four cores that will fit on a single FPGA. To scale beyond the cores that will fit on a single FPGA, a multi-FPGA system such as the Berkeley Emulation Engine (BEE) can be used. The BEE board has four Virtex- FPGAs connected in a ring with a cross-over connection. Each FPGA has access to two channels of DDR memory allowing GB of memory per FPGA. The BEE will allow effective emulation of eight cores with the appropriate NoC infrastructure per board. To scale beyond eight cores, the BEE includes Gb connections allowing the boards to be linked so that emulation of an entire socket becomes possible. There is significant precedent for emulation of massively multithreaded architectures across multiple FPGAs. One recent example was demonstrated by the Berkeley RAMP Blue project
where over , cores were emulated using a stack of BEE boards.
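The codesign loop described in this section, in which each candidate hardware design point is paired with auto-tuned software and judged by delivered efficiency rather than raw performance, can be summarized as a nested search. The sketch below is a self-contained toy rendering of that loop; the configuration fields, cost models, and numbers are invented placeholders, not Green Flash parameters or measurements.

#include <stdio.h>

typedef struct { int cache_kb; int local_store_kb; int cores; } hw_config;

/* Placeholder stand-ins for the real steps in the flow: benchmarking an
   auto-tuned code variant on the FPGA-emulated design point, and
   estimating power for that design point.  The formulas are arbitrary. */
static double benchmark(hw_config hw, int code_variant) {
    return hw.cores * (1.0 + 0.01 * hw.local_store_kb) / (1.0 + 0.1 * code_variant)
           + code_variant;
}
static double power_estimate(hw_config hw) {
    return 0.05 * hw.cores + 0.001 * (hw.cache_kb + hw.local_store_kb);
}

int main(void) {
    hw_config candidates[] = { {256, 0, 64}, {128, 128, 128}, {64, 256, 128} };
    hw_config best_hw = candidates[0];
    int best_variant = 0;
    double best_eff = 0.0;

    /* Co-tuning loop: for every hardware design point, auto-tune the
       software (search code variants), then compare design points by
       performance per watt. */
    for (int h = 0; h < 3; h++) {
        for (int v = 0; v < 8; v++) {
            double eff = benchmark(candidates[h], v) / power_estimate(candidates[h]);
            if (eff > best_eff) {
                best_eff = eff;
                best_hw = candidates[h];
                best_variant = v;
            }
        }
    }
    printf("best: %d cores, %d KB cache, %d KB local store, variant %d (%.1f perf/W)\n",
           best_hw.cores, best_hw.cache_kb, best_hw.local_store_kb,
           best_variant, best_eff);
    return 0;
}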
Hardware Support for New Programming Models Applications and algorithms will need to rely increasingly on fine-grained parallelism and strong scaling, and to support fault resilience, to accommodate the massive growth of explicit on-chip parallelism and constrained bandwidth anticipated for future chip architectures. History shows that the application-driven approach offers the most productive strategy for evaluating and selecting among the myriad choices for refactoring algorithms for full scientific application codes as the industry moves through this transitional phase. Green Flash functions as a testbed to explore novel programming models together with hardware support to express fine-grained parallelism to achieve performance, productivity, and correctness for leading-edge application codes in the face of massive parallelism and increasingly hierarchical hardware. The goal of this development thrust is to create a new software model that can provide a stable platform for software development for the next decade and beyond for all scales of scientific computing. The Green Flash design created direct hardware support for both the message passing interface (MPI) and partitioned global address space (PGAS) programming models to enable scaling of these familiar single program, multiple data (SPMD) programming styles to much larger-scale systems. The modest hardware support enables relatively well-known programming paradigms to utilize massive on-chip concurrency and to use hierarchical parallelism to enable use of larger messages for interchip communication. However, not all applications will be able to express parallelism through simple divide-and-conquer problem partitioning. So the message queues and software-managed memories that are used to implement PGAS are also being used to explore new asymmetric and asynchronous approaches to achieving strong-scaling performance improvements from explicit parallelism. Techniques that resemble classic static dataflow methods are garnering renewed interest because of their ability to flexibly schedule work and to accommodate state migration to correct load imbalances and failures. In the case of the climate code, dataflow techniques can be used to concurrently schedule the physics computations
with the dynamic core of the climate code, thereby doubling the effective concurrency without moving to a finer domain decomposition. This approach also benefits from the unique interprocessor communication interfaces developed for Green Flash.
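As a minimal illustration of that kind of concurrent scheduling (a toy pthreads sketch with stub physics and dynamics kernels; it omits the dataflow dependency tracking and is not the Green Flash runtime):

#include <pthread.h>
#include <stdio.h>

/* Stub kernels standing in for the climate model's physics and dynamics
   phases on one subdomain. */
static void *physics(void *arg)  { printf("physics on subdomain %d\n", *(int *)arg); return NULL; }
static void *dynamics(void *arg) { printf("dynamics on subdomain %d\n", *(int *)arg); return NULL; }

int main(void) {
    int subdomain = 0;
    pthread_t t_phys, t_dyn;

    /* Schedule the two phases concurrently instead of in sequence,
       doubling the available concurrency without a finer decomposition. */
    pthread_create(&t_phys, NULL, physics,  &subdomain);
    pthread_create(&t_dyn,  NULL, dynamics, &subdomain);
    pthread_join(t_phys, NULL);
    pthread_join(t_dyn,  NULL);
    return 0;
}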
Fault Resilience A question that comes up when proposing a million-processor computing system is how to deal with fault resilience. While trying not to trivialize the issue, it should be noted that this is a problem for everyone designing large-scale machines. The proposed approach of using many simpler cores does not introduce any unique challenges that are different from the challenges faced when aggregating conventional server chips into large-scale systems, provided the total number of discrete chips in the system is not dramatically different. The following observations are made to qualify this point. For a given silicon process (e.g., a nm process and the same design rules):
1. Hard failure rates are primarily proportional to the number of sockets in a system (e.g., solder joint failures, weak wire bonds, and a variety of mechanical and electrical issues) and secondarily related to the total chip surface area (probability of defect vs tolerance of the design rules to process variation). They are not proportional to the number of processor cores per se, given that cores come in all shapes and sizes.
2. Soft error rates caused by cosmic rays are roughly proportional to chip surface area when comparing circuits that employ the same process technology (e.g., nm).
3. Bit error rates for data transfer tend to increase proportionally with clock rate.
4. Thermal stress is also a source of hard errors.
For hard errors:
1. Spare cores can be designed into each ASIC to tolerate defects due to process variation. This approach is already used by the core Cisco Metro chip, which incorporates spare cores ( cores in total) to cover chip defects.
2. Each chip is expected to dissipate a relatively small – W (or that is the target), subjecting it to less mechanical/thermal stress.
3. It has been demonstrated that Green Flash can achieve more delivered performance out of fewer
sockets, which reduces exposure to hard failures due to bad electrical connections or other mechanical/electrical defects.
4. Like BlueGene, memory and CPUs can be flow-soldered onto the board to reduce hard and soft failure rates for electrical connections, given that removable sockets are far more susceptible to both kinds of faults. So eliminating removable sockets can greatly reduce error rates.
For soft errors:
1. All of the basics for reliability and error recovery in the memory subsystem, including full ECC (error correcting code) protection for caches and memory interfaces, are included in the design.
2. Using many simpler cores allows fewer sockets to be used and less silicon surface area to achieve the same delivered performance. That is to say, Green Flash has less exposure to major sources of failure than a conventional high-frequency core design: fewer sockets and fewer random bit-flips due to mechanical noise and other stochastic error sources.
3. The core clock frequency of MHz improves the signal-to-noise ratio for on-chip data transfers.
4. Incorporation of a Nonvolatile Random Access Memory (NVRAM) memory controller and channel on each System on Chip (SoC). Each node can copy the image of memory to the NVRAM periodically to do local checkpoints. If there is a soft error (e.g., an uncorrectable memory error), then the node can initiate a roll-back to the last available checkpoint. For hard failures (e.g., a node dies and cannot be revived), the checkpoint image will be copied to neighboring nodes on a periodic basis to facilitate localized state recovery. Both strategies enable much faster roll-back when errors are encountered than the conventional user-space checkpointing approach.
Therefore, the required fault resilience strategies will bear similarity to those of other systems that employ a similar number of sockets (∼, ), which are not unprecedented. The BlueGene system at Lawrence Livermore National Laboratory contains a comparable number of sockets and achieves a – day Mean Time Between Failures (MTBF), which is far longer than that of systems that contain a fraction of the number of processor cores. Therefore, careful application of well-known fault-resilience techniques together with a few
novel “extended fault resilience” mechanisms such as localized NVRAM checkpoints can achieve an acceptable MTBF for extreme-scale implementations of this approach to system design.
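A toy sketch of the localized NVRAM checkpoint-and-rollback idea listed above (an ordinary array stands in for the NVRAM channel, the fault is injected deliberately, and the periodic copy of the image to a neighboring node is omitted; this is not the actual Green Flash mechanism):

#include <stdio.h>
#include <string.h>

#define STATE_WORDS 1024  /* hypothetical size of a node's state */

static double state[STATE_WORDS];
static double nvram_checkpoint[STATE_WORDS];  /* stand-in for NVRAM */

static void checkpoint(void) { memcpy(nvram_checkpoint, state, sizeof(state)); }
static void rollback(void)   { memcpy(state, nvram_checkpoint, sizeof(state)); }

int main(void) {
    for (int step = 0; step < 100; step++) {
        if (step % 10 == 0) checkpoint();        /* periodic local checkpoint */

        state[step % STATE_WORDS] += 1.0;        /* stand-in for real computation */

        int soft_error_detected = (step == 57);  /* injected fault for illustration */
        if (soft_error_detected) {
            rollback();                          /* resume from last checkpoint */
            printf("soft error at step %d: rolled back\n", step);
        }
    }
    return 0;
}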
Conclusions Green Flash proposes a radical approach of application-driven computing design to break through the slow pace of incremental changes, and to foster a sustainable hardware/software ecosystem with broad-based support across the IT industry. Green Flash has enabled the exploration of practical advanced programming models together with lightweight hardware support mechanisms that allow programmers to utilize massive on-chip concurrency, thereby creating the market demand for massively concurrent components that can also be the building blocks of midrange and extreme-scale computing systems. New programming models must be part of a new software development ecosystem that spans all scales of systems, from midrange to the extreme scale, to facilitate a viable migration path from development to large-scale production computing systems. The use of FPGA-based hardware emulation platforms, such as RAMP, to prototype and run hardware designs at near-real-time speeds before the hardware is built allows testing of full-fledged application codes and advanced software development to commence many years before the final hardware platform is constructed. These tools have enabled a tightly coupled software/hardware codesign process that can be applied effectively to the complex HPC application space. Rather than ask "what kind of scientific applications can run on our HPC cluster after it arrives," the question should be turned around to ask "what kind of system should be built to meet the needs of the most important science problems." This approach is able to realize its most substantial gains in energy efficiency by peeling back the complexity of the high-frequency microprocessor design point to reduce sources of waste (wasted opcodes, wasted bandwidth, waste caused by orienting architectures toward serial performance). BlueGene and SiCortex have demonstrated the advantages of using simpler low-power embedded processing elements to create energy-efficient computing platforms. However, the Green Flash codesign approach goes beyond the traditional embedded-core design point of BlueGene and SiCortex by using explicit message queues and software-controlled memories to further optimize
data movement, while still retaining a smaller conventional cache hierarchy only to support incremental porting to the more energy- and bandwidth-efficient design point. Furthermore, simple hardware support for lightweight on-chip interprocessor synchronization and communication makes it much simpler, more straightforward, and more efficient to program massive arrays of processors than more exotic programming models such as CUDA and streaming. Green Flash has been a valuable research vehicle to understand how the evolution of massively parallel chip architectures can be guided by close-coupled feedback with the design of the application, algorithms, and hardware together. Application-driven design ensures hardware design decisions do not evolve in reaction to hardware constraints, without regard to programmability and delivered application performance. The design study has been driven by a deep dive into the climate application space, but enables explorations that cut across all application areas and have ramifications for the next generation of fully general-purpose architectures. Ultimately, Green Flash should consist of an architecture that can maximally leverage reusable components from the mass market of the embedded space while improving the programmability for the manycore design point. The building blocks of a future HPC system must be the preferred solution in terms of performance and programmability for everything from the smallest high-performance energy-efficient embedded system, to midrange departmental systems, to the largest-scale systems.
Bibliography . Asanovic K, Bodik R, Catanzaro BC, Gebis JJ, Husbands P, Keutzer K, Patterson DA, Plishker WL, Shalf J, Williams SW, Yelick KA () The landscape of parallel computing research: a view from Berkeley. Technical report no. UCB/EECS--, EECS Department University of California, Berkeley . Wehner M, Oliker L, Shalf J () Towards ultra-high resolution models of climate and weather. Int J High Perform Comput Appl :– . Shalf J () The new landscape of parallel computer architecture. J Phys: Conf Ser : . Donofrio D, Oliker L, Shalf J, Wehner MF, Rowen C, Krueger J, Kamil S, Mohiyuddin M () Energy-efficient computing for extreme-scale science. IEEE Computer, Los Almitos . Wehner M, Oliker L, Shalf J () Low-power supercomputers. IEEE Spectrum ():– . Shalf J, Wehner M, Oliker L, Hules J () Green flash project: the challenge of energy-efficient HPC. SciDAC Review, Fall
. Kamil SA, Shalf J, Oliker L, Skinner D () Understanding ultra-scale application communication requirements. In: IEEE international symposium on workload characterization (IISWC) Austin, – Oct (LBNL-) . Kamil S, Chan Cy, Oliker L, Shalf J, Williams S () An autotuning framework for parallel multicore stencil computations. In: IPDPS , Atlanta . Hendry G, Kamil SA, Biberman A, Chan J, Lee BG, Mohiyuddin M, Jain A., Bergman K, Carloni LP, Kubiatocics J, Oliker L, Shalf J () Analysis of photonic networks for chip multiprocessor using scientific applications. In: NOCS , San Diego . Mohiyuddin M, Murphy M, Oliker L, Shalf J, Wawrzynek J, Williams S () A design methodology for domain-optimized power-efficient supercomputing, In: SC , Portland . Balfour J, Dally WJ () Design tradeoffs for tiled cmp on-chip networks. In ICS ’: Proceedings of the th annual international conference on supercomputing
Grid Partitioning Domain Decomposition
Gridlock Deadlocks
Group Communication Collective Communication
Gustafson’s Law John L. Gustafson Intel Labs, Santa Clara, CA, USA
Synonyms Gustafson–Barsis Law; Scaled speedup; Weak scaling
Definition Gustafson’s Law says that if you apply P processors to a task that has serial fraction f , scaling the task to take the
same amount of time as before, the speedup is Speedup = f + P(1 − f) = P − f(P − 1). It shows more generally that the serial fraction does not theoretically limit parallel speed enhancement, if the problem or workload scales in its parallel component. It models a different situation from that of Amdahl's Law, which predicts time reduction for a fixed problem size.
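For illustration (the serial fraction below is an arbitrary choice, not a value from this entry), the following sketch evaluates the scaled speedup alongside Amdahl's fixed-size speedup for several processor counts:

#include <stdio.h>

/* Gustafson's Law (fixed-time, scaled speedup) for an illustrative serial
   fraction, with Amdahl's fixed-size speedup shown for contrast. */
int main(void) {
    double f = 0.02;  /* illustrative observable serial fraction */
    int counts[] = {16, 64, 256, 1024};
    for (int i = 0; i < 4; i++) {
        double P = counts[i];
        double gustafson = f + P * (1.0 - f);          /* = P - f*(P - 1) */
        double amdahl    = 1.0 / (f + (1.0 - f) / P);  /* fixed problem size */
        printf("P = %4d   scaled speedup = %7.1f   fixed-size speedup = %5.1f\n",
               counts[i], gustafson, amdahl);
    }
    return 0;
}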
Discussion Graphical Explanation Figure explains the formula in the Definition: The time the user is willing to wait to solve the workload is unity (lower bar). The part of the work that is observably serial, f, is unaffected by parallelization. The remaining fraction of the work, 1 − f, parallelizes perfectly so that a serial processor would take P times longer to execute it. The ratio of the top bar to the bottom bar is thus f + P(1 − f). Some prefer to rearrange this algebraically as P − f(P − 1). The diagram resembles the one used in the explanation of Amdahl's Law (see Amdahl's Law) except that Amdahl's Law fixes the problem size and answers the question of how parallel processing can reduce the execution time. Gustafson's Law fixes the run time and answers the question of how much longer the present workload would take in the absence of parallelism []. In both cases, f is the experimentally observable fraction of the current workload that is serial. The similarity of the diagram to the one that explains Amdahl's Law has led some to "unify" the two laws by a change of variable. It is an easy algebraic exercise to set the upper bar to unit time and express the f of Gustafson's Law in terms of the variables of Amdahl's Law, but this misses the point that the two laws proceed from different premises. Every attempt at unification begins by applying the same premise, resulting in a circular argument that the two laws are the same. The fundamental underlying observation of Gustafson's Law is that more powerful computer systems usually solve larger problems, not the same size problem in less time. Hence, a performance enhancement like parallel processing expands what a user can do with a computing system to match the time the user is willing to wait for the answer. While computing power has increased by many orders of magnitude over the last
half-century (see Moore's Law), the execution time for problems of interest has been constant, since that time is tied to human timescales.

Gustafson's Law. Fig. Graphical derivation of Gustafson's Law: the present workload executes in unit time, of which f is the serial fraction and 1 − f the parallel fraction; if only serial processing were available, the parallel fraction would have to be executed in P serial stages, requiring time f + P(1 − f)
History In a conference debate over the merits of parallel computing, IBM's Gene Amdahl argued that a considerable fraction of the work of computers was inherently serial, from both algorithmic and architectural sources. He estimated the serial fraction f at about .–.. He asserted that this would sharply limit the approach of parallel processing for reducing execution time []. Amdahl argued that even the use of two processors was less cost-effective than a serial processor. Furthermore, the use of a large number of processors would never reduce execution time by more than 1/f, which by his estimate was a factor of about –. Despite many efforts to find a flaw in Amdahl's argument, "Amdahl's Law" held for over years as justification for the continued use of serial computing hardware and serial programming models.
The Rise of Microprocessor-Based Systems By the late s, microprocessors and dynamic random-access memory (DRAM) had dropped in price to the point where academic researchers could afford them as components in experimental parallel designs. Work in by Charles Seitz at Caltech using a message-passing collection of microprocessors []
showed excellent absolute performance in terms of floating-point operations per second, and seemed to defy Amdahl’s pessimistic prediction. Seitz’s success led John Gustafson at FPS to drive development of a massively parallel cluster product with backing from the Defense Advanced Research Projects Agency (DARPA). Although the largest configuration actually sold of that product (the FPS T Series) had only processors, the architecture permitted scaling to , processors. The large number of processors led many to question: What about Amdahl’s Law? Gustafson formulated a counterargument in April , which showed that performance is a function of both the problem size and the number of processors, and thus Amdahl’s Law need not limit performance. That is, the serial fraction f is not a constant but actually decreases with increased problem size. With no experimental evidence to demonstrate the idea, the counterargument had little impact on the computing community. An idea for a source of experimental evidence arose in the form of a challenge that Alan Karp had publicized the year before []. Karp had seen announcements of the ,-processor CM- from Thinking Machines and the ,-processor NCUBE from nCUBE, and believed Amdahl’s Law made it unlikely that such massively parallel computers would achieve a large fraction of their rated performance. He published a skeptical challenge and a financial reward for anyone who could demonstrate a parallel speedup of over times on
Gustafson’s Law
three real applications. Karp suggested computational fluid dynamics, structural analysis, and econometric modeling as the three application areas and gave some ground rules to insure that entries focused on honest parallel speedup without tricks or workarounds. For example, one could not cripple the serial version to make it artificially times slower than the parallel system. And the applications, like the three suggested, had to be ones that had interprocessor communication throughout their execution as opposed to “embarrassingly parallel” problems that had communication only at the beginning and end of a run. By , no one had met Karp’s challenge, so Gordon Bell adopted the same set of rules and suggested applications as the basis for the Gordon Bell Award, softening the goal from times to whatever the best speedup developers could demonstrate. Bell expected the initial entries to achieve about tenfold speedup []. The purchase by Sandia National Laboratories of the first ,-processor NCUBE system created the opportunity for Gustafson to demonstrate his argument on the experiment outlined by Karp and Bell, so he joined Sandia and worked with researchers Gary Montry and Robert Benner to demonstrate the practicality of high parallel speedup. Sandia had real applications in fluid dynamics and structural mechanics, but none in econometric modeling, so the three researchers substituted a wave propagation application. With a few weeks of tuning and optimization, all three applications were running at over -fold speedup with the fixed-size Amdahl restriction, and over ,-fold speedup with the scaled model proposed by Gustafson. Gustafson described his model to Sandia Director Edwin Barsis, who suggested explaining scaled speedup using a graph like that shown in Fig. . Barsis also insisted that Gustafson publish this concept, and is probably the first person to refer to it as “Gustafson’s Law.” With the large experimental speedups combined with the alternative model, Communications of the ACM published the results in May []. Since Gustafson credited Barsis with the idea of expressing the scaled speedup model as graphed in Fig. , some refer to Gustafson’s Law as the Gustafson–Barsis Law. The three Sandia researchers
published the detailed explanation of the application parallelizations in a Society of Industrial and Applied Mathematics (SIAM) journal [].

Gustafson's Law. Fig. Speedup possible with processors, by Gustafson's Law and Amdahl's Law, plotted against the observable parallel fraction of the existing workload
Parallel Computing Watershed Sandia’s announcement of ,-fold parallel speedups created a sensation that went well beyond the computing research community. Alan Karp announced that the Sandia results had met his Challenge, and Gordon Bell gave his first award to the three Sandia researchers. The results received publicity beyond that of the usual technical journals, appearing in TIME, Newsweek, and the US Congressional Record. Cray, IBM, Intel, and Digital Equipment began work in earnest developing commercial computers with massive amounts of parallelism for the first time. The Sandia announcement also created considerable controversy in the computing community, partly because some journalists sensationalized it as a proof that Amdahl’s Law was false or had been “broken.” This was never the intent of Gustafson’s observation. He maintained that Amdahl’s Law was the correct answer but to the wrong question: “How much can parallel processing reduce the run time of a current workload?”
Observable Fraction and Scaling Models As part of the controversy, many maintained that Amdahl’s Law was still the appropriate model to use in all situations, or that Gustafson’s Law was simply
a corollary to Amdahl’s Law. For scaled speedup, the argument went that one simply works backward to determine what the f fraction in Amdahl’s Law must have been to yield such performance. This is an example of circular reasoning, since the proof that Amdahl’s Law applies begins by assuming it applies. For many programs, it is possible to instrument and measure the fraction of time f spent in serial execution. One can place timers in the program around serial regions and obtain an estimate of f . This fraction then allows Amdahl’s Law estimates of time reduction, or Gustafson’s Law estimates of scaled speedup. Neither law takes into account communication costs or intermediate degrees of parallelism. (When communication costs are included in Gustafson’s fixed-time model, the speedup is again limited as the number of processors grows, because communication costs rise to the point where there is no way to increase the size of the amount of work without increasing the execution time.) A more common practice is to measure the parallel speedup as the number of processors is varied, and fit the resulting curve to derive f . This approach confuses serial fraction with communication overhead, load imbalance, changes in the relative use of the memory hierarchy, and so on. Some refer to the requirement to keep the problem size the same yet use more processors as “strong scaling.” Still, a common phenomenon that results from “strong scaling” is that it is easier, not harder, to obtain high amounts of speedup. When spreading a problem across more and more processors, the memory per processor goes down to the point where the data fits entirely in cache, resulting in superlinear speedup []. Sometimes, the superlinear speedup effects and the communication overheads partially cancel out, so what appears to be a low value of f is actually the result of the combination of the two effects. In modern parallel systems, performance analysis with either Amdahl’s Law or Gustafson’s Law will usually be inaccurate since communication costs and other parallel processing phenomena have large effects on the speedup. In Fig. , Amdahl’s Law governs the Fixed-Sized Model line, Gustafson’s Law governs the Fixed-Time Model line, and what some call the Sun–Ni Law governs the Memory Scaled Model []. The fixed-time model line is an irregular curve in general, because of
the communication cost effects and because the percentage of the problem that is in each memory tier (mass storage, main RAM, levels of cache) changes with the use of more processors.

Gustafson's Law. Fig. Different scaling types and communication costs: the fixed-size model, fixed-time model, and memory-scaled model plotted as log of problem size versus log of number of processors, with regions marked insufficient main memory and communication bound
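A minimal sketch of the timer-based estimate of f described above, with placeholder loops standing in for the serial and parallelizable phases of an existing (sequential) workload and a hypothetical processor count plugged into both laws:

#include <stdio.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double t0 = now();
    volatile double s = 0.0;
    for (long i = 0; i < 20000000; i++) s += i * 1e-9;    /* serial phase */
    double t1 = now();
    volatile double w = 0.0;
    for (long i = 0; i < 400000000; i++) w += i * 1e-9;   /* parallelizable phase */
    double t2 = now();

    double f = (t1 - t0) / (t2 - t0);  /* observed serial fraction */
    double P = 1024.0;                 /* hypothetical processor count */

    printf("estimated f = %.4f\n", f);
    printf("Amdahl (fixed size): %.1f\n", 1.0 / (f + (1.0 - f) / P));
    printf("Gustafson (fixed time): %.1f\n", f + P * (1.0 - f));
    return 0;
}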
Analogies There are many aspects of technology where an enhancement for time reduction actually turns out to be an enhancement for what one can accomplish in the same time as before. Just as Amdahl’s Law is an expression of the more general Law of Diminishing Returns, Gustafson’s Law is an expression of the more general observation that technological advances are used to improve what humans accomplish in the length of time they are accustomed to waiting, not to shorten the waiting time.
Commuting Time As civilization has moved from walking to horses to mechanical transportation, the average speed of getting to and from work every day has gone up dramatically. Yet, people take about half an hour to get to or from work as a tolerable fraction of the day, and this amount of time is probably similar to what it has been for centuries. Cities that have been around for hundreds or thousands of years show a concentric pattern that reflect the increasing distance people could commute for the amount of time they were able to tolerate.
Transportation provides many analogies for Gustafson's Law that expose the fallacy of fixing the size of a problem as the control variable in discussing large performance gains. A commercial jet might be able to travel miles per hour, yet if one asks "How much will it reduce the time it takes me presently to walk to work and back?" the answer would be that it does not help at all. It would be easy to apply an Amdahl-type argument to the time to travel to an airport as the serial fraction, such that the speedup of using a jet only applies to the remaining fraction of the time and thus is not worth doing. However, this does not mean that commercial jets are useless for transportation. It means that faster devices are for larger jobs, which in this case means longer trips. Here is another transportation example: If one takes a trip at miles per hour and immediately turns around, how fast does one have to go to average miles per hour? This is a trick question that many people incorrectly answer, " miles per hour." To average miles per hour, one would have to travel back at infinite speed, that is, return instantly. Amdahl's Law applies to this fixed-distance trip. However, suppose the question were posed differently: "If one travels for an hour at miles per hour, how fast does one have to travel in the next hour to average miles per hour?" In that case, the intuitive answer of " miles per hour" is the correct one. Gustafson's Law applies to this fixed-time trip.
The US Census In the early debates about scaled speedup, Heath and Worley [] provided an example of a fixed-sized problem that they said was not appropriate for Gustafson’s Law and for which Amdahl’s Law should be applied: the US Census. While counting the number of people in the USA would appear to be a fixed-sized problem, it is actually a perfect example of a fixed-time problem since the Constitution mandates a complete headcount every years. It was in the late nineteenth century, when Hollerith estimated that the population had grown to the point where existing approaches would take longer than years that he developed the card punch tabulation methods that made the process fast enough to fit the fixed-time budget. With much faster computing methods now available, the Census process has grown to take into
account many more details about people than the simple head count that the Constitution mandates. This illustrates a connection between Gustafson’s Law and the jocular Parkinson’s Law: “Work expands to fill the available time.”
Printer Speed In the s, when IBM and Xerox were developing the first laser printers that could print an entire page at a time, the goal was to create printers that could print several pages per second so that printer speed could match the performance improvements of computing speed. The computer printouts of that era were all of monospaced font with a small character set of uppercase letters and a few symbols. Although many laser printer designers struggled to produce such simple output with reduced time per page, the product category evolved to produce high quality output for desktop publishing instead of using the improved technology for time reduction. People now wait about as long for a page of printout from a laser printer as they did for a page of printout from the line printers of the s, but the task has been scaled up to full color, high resolution printing encompassing graphics output, and a huge collection of typeset fonts from alphabets in all the world's languages. This is an example of Gustafson's Law applied to printing technology.

Biological Brains Kevin Howard, of Massively Parallel Technologies Inc., once observed that if Amdahl's Law governed the behavior of biological brains, then a human would have about the same intelligence as a starfish. The human brain has about billion neurons operating in parallel, so for us to avoid passing the point of diminishing returns for all that parallelism, the Amdahl serial fraction f would have to be about − . The fallacy of this seeming paradox is in the underlying assumption that a human brain must do the same task a starfish brain does, but must reduce the execution time to nanoseconds. There is no such requirement, and a human brain accomplishes very little in a few nanoseconds no matter how many neurons it uses at once. Gustafson's Law says that on a time-averaged basis, the human brain will accomplish vastly more complex tasks than what a starfish can attempt, and thus avoids the absurd conclusion of the fixed-task model.
Perspective
The concept of scaled speedup had a profound enabling effect on parallel computing, since it showed that simply asking a different (and perhaps more realistic) question renders the pessimistic predictions of Amdahl's Law moot. Gustafson's announcement of 1,000-fold parallel speedup marked a turning point in the attitude of computer manufacturers toward massively parallel computing, and now all major vendors provide platforms based on the approach. Most (if not all) of the computer systems in the TOP500 list of the world's fastest supercomputers comprise many thousands of processors, a degree of parallelism that computer builders regarded as sheer folly before the introduction of scaled speedup in 1988. A common assertion countering Gustafson's Law is that "Amdahl's Law still holds for scaled speedup; it's just that the serial fraction is a lot smaller than had been previously thought." However, this requires inferring the small serial fraction from the measured speedup. That is circular reasoning: it chooses a conclusion and then works backward to determine the data that make the conclusion valid. Gustafson's Law, by contrast, is a simple formula that predicts scaled performance from experimentally measurable properties of a workload. Some have misinterpreted "scaled speedup" as simply increasing the amount of memory for variables, or increasing the fineness of a grid. It is more general than that. It applies to every way in which a calculation can be improved (in accuracy, reliability, robustness, etc.) with the addition of more processing power, and then asks how much longer the enhanced problem would have taken to run without the extra processing power. Horst Simon, in his keynote talk at the International Conference on Supercomputing, "Progress in Supercomputing: The Top Three Breakthroughs of the Last 20 Years and the Top Three Challenges for the Next 20 Years," declared the invention of Gustafson's scaled speedup model the number one achievement in high-performance computing over that period.
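The formula referred to above is commonly written in the following form, where N is the number of processors and s is the fraction of the scaled (parallel) run spent in serial work:

scaled speedup = s + (1 − s) N = N − (N − 1) s

For example, with s = 0.01 and N = 1024, the predicted scaled speedup is 1024 − 1023 × 0.01 ≈ 1013.8, whereas the fixed-size Amdahl bound for a 1% serial fraction is at most 100.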
Related Entries
Amdahl's Law
Distributed-Memory Multiprocessor
Metrics
Bibliographic Entries and Further Reading
Gustafson's two-page paper in the Communications of the ACM [] outlines his basic idea of fixed-time performance measurement as an alternative to Amdahl's assumptions. It contains the rhetorical question, "How can this be, in light of Amdahl's Law?" which some misinterpreted as a serious plea for the resolution of a paradox. Readers may find a flurry of responses in Communications and elsewhere, as well as attempts to "unify" the two laws. An objective analysis of Gustafson's Law and its relation to Amdahl's Law can be found in many modern textbooks on parallel computing, such as [], [], or []. Much as some physicists in the early twentieth century refused to accept the concepts of relativity and quantum mechanics, for reasons more intuition-based than scientific, there are computer scientists who refuse to accept the idea of scaled speedup and Gustafson's Law and who insist that Amdahl's Law suffices for all situations. Pat Worley analyzed the extent to which one can usefully scale up scientific simulations by increasing their resolution []. In related work, Xian-He Sun and Lionel Ni built a more complete mathematical framework for scaled speedup [], in which they promote the idea of memory-bounded scaling, even though execution time generally increases beyond human patience when the memory used by a problem scales linearly (or faster) with the number of processors. In a related vein, Vipin Kumar proposed "isoefficiency," in which the memory increases as much as necessary to keep the efficiency of the processors at a constant level even when communication and other impediments to parallelism are taken into account.
Bibliography
1. Amdahl GM (1967) Validity of the single-processor approach to achieving large scale computing capabilities. In: AFIPS Spring Joint Computer Conference Proceedings, Atlantic City, NJ. AFIPS Press, Reston, VA
2. Bell G (interviewed) An interview with Gordon Bell. IEEE Software
3. Gustafson JL (1990) Fixed time, tiered memory, and superlinear speedup. In: Proceedings of the Fifth Distributed Memory Computing Conference
4. Gustafson JL, Montry GR, Benner RE (1988) Development of parallel methods for a 1024-processor hypercube. SIAM Journal on Scientific and Statistical Computing 9(4)
5. Gustafson JL (1988) Reevaluating Amdahl's Law. Communications of the ACM 31(5):532-533
6. Heath M, Worley P (1989) Once again, Amdahl's Law. Communications of the ACM 32(2)
7. Hwang K, Briggs F (1984) Computer architecture and parallel processing. McGraw-Hill, New York
8. Karp A. http://www.netlib.org/benchmark/karp-challenge
9. Lewis TG, El-Rewini H (1992) Introduction to parallel computing. Prentice Hall
10. Quinn M (1994) Parallel computing: theory and practice, 2nd edn. McGraw-Hill
11. Seitz CL () Experiments with VLSI ensemble machines. Journal of VLSI and Computer Systems
12. Sun X-H, Ni L (1993) Scalable problems and memory-bounded speedup. Journal of Parallel and Distributed Computing 19:27-37
13. Worley PH () The effect of time constraints on scaled speedup. Report ORNL/TM, Oak Ridge National Laboratory
Gustafson–Barsis Law Gustafson’s Law
H Half Vector Length Metrics
Hang Deadlocks
Harmful Shared-Memory Access Race Conditions
Haskell Glasgow Parallel Haskell (GpH)
Hazard (in Hardware) Dependences
HDF

Quincey Koziol
The HDF Group, Champaign, IL, USA

Synonyms
Hierarchical data format

Definition
HDF [] is a data model, software library, and file format for storing and managing data; this entry describes the current major version, commonly known as HDF5.

Discussion

Introduction
The HDF technology suite is designed to organize, store, discover, access, analyze, share, and preserve diverse, complex data in continuously evolving heterogeneous computing and storage environments. It supports an unlimited variety of datatypes and is designed for flexible and efficient I/O and for high-volume, complex data. The HDF library and file format are portable and extensible, allowing applications to evolve in their use of HDF. The HDF technology suite also includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF format. Originally designed within the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign [], HDF is now primarily developed and maintained by The HDF Group [], a nonprofit organization dedicated to ensuring the sustainable development of HDF technologies and the ongoing accessibility of data stored in HDF files. HDF builds on lessons learned from other data storage libraries and file formats, such as the original HDF file format (now known as HDF4) [], netCDF [], TIFF [], and FITS [], while adding unique features and extending the boundaries of prior data storage models.

Data Model
HDF implements a simple but versatile data model, which has two primary components: groups and datasets. Group objects in an HDF file contain a collection of named links to other objects in the file. Dataset objects in HDF files store arrays of arbitrary element types and are the main method for storing application data. Groups, which are analogous to directories in a traditional file system, can contain an arbitrary number of uniquely named links. A link can connect a group to another object in the same HDF file; include a named
path to an object in the HDF file, which may not currently exist; or refer to an object in another HDF file. Unlike links in a traditional file system, HDF links can be used to create fully cyclic directed graph structures. Each group contains one or more B-tree data structures as indices to its collection of links, which are stored in a heap structure within the HDF file. Dataset objects store application data in an HDF file as a multidimensional array of elements. Each dataset is primarily defined by the description of how many dimensions its array has and the size of those dimensions, called a "dataspace," and the description of the type of element to store at each location in the array, called a "datatype." An HDF dataspace describes the number of dimensions for an array, as well as the current and maximum number of elements in each dimension. The maximum number of elements in an array dimension can be specified as "unlimited," allowing an array to be extended over time. An HDF dataspace can have multiple dimensions with unlimited maximum sizes, allowing the array to be extended in any or all of those dimensions. An HDF datatype describes the type of data to store in each element of an array and can belong to one of the following classes: integer, floating-point, string, bitfield, opaque, compound, reference, enum, variable-length sequence, and array. These classes generally correspond to the analogous computer science concepts, but the reference and variable-length sequence datatypes are unusual. Reference datatypes contain references, or pointers, to HDF objects, allowing HDF applications to create datasets that act as lookup tables or indices, or that behave much like groups. Variable-length sequence datatypes allow a dynamic number of elements of a base datatype to be stored as a single element and are one mechanism for creating datasets that represent ragged arrays. All of the HDF datatypes can be combined in arbitrary ways, allowing for great flexibility in how an application stores its data. The elements of an HDF dataset can be stored in different ways, allowing an application to choose between various I/O access performance trade-offs. Dataset elements can be stored as a single sequence in the HDF file, called "contiguous" storage, which
allows for constant-time access to any element in the array and no storage overhead for locating the elements in the dataset. However, contiguous data storage does not allow a dataset to use a dataspace with unlimited dimensions or to compress the dataset elements. The dataspace for a dataset can also be decomposed into fixed-size sub-arrays of elements, called "chunks," which are stored individually in the file. This "chunked" data storage requires an index for locating the chunks that store the data elements. Datasets that have a dataspace with unlimited dimensions must use chunked data storage for storing their elements. Using chunked data storage allows an application that will be accessing sub-arrays of the dataset to tune the chunk size to its sub-array size, allowing for much faster access to those sub-arrays than would be possible with contiguous data storage. Additionally, the elements of datasets that use chunked data storage can be compressed or have other operations, like checksums, applied to them. The advantages of chunked data storage are balanced by some limitations, however. Using an index for mapping dataset element coordinates to chunk locations in the file can slow down access to dataset elements if the application's I/O access pattern does not line up with the chunks' sub-array decomposition. Furthermore, there is additional storage overhead for storing an index for the dataset, along with extra I/O operations to access the index data structure. Datasets with very small amounts of element data can store their elements as part of the dataset description in the file, avoiding any extra I/O accesses to retrieve the dataset elements, since the HDF library will read them when accessing the dataset description. This "compact" data storage must be very small (less than 64 KB), and may not be used with a dataspace that has unlimited dimensions or when dataset elements are compressed. Finally, a dataset can store its elements in a different, non-HDF, file. This "external" data storage method can be used to share dataset elements between an HDF application and a non-HDF application. As with contiguous data storage, external data storage does not allow a dataset to use a dataspace with unlimited dimensions or compress the dataset elements. HDF also allows application-defined metadata to be stored with any object in an HDF file.
These “attributes” are designed to store information about the object they are attached to, such as input parameters to a simulation, the name of an instrument gathering data, etc. Attributes are similar to datasets in that they have a dataspace, a datatype, and elements values. Attributes require a name that is unique among the attributes for an object, similar to link names within groups. Attributes are limited to dataspaces that do not use unlimited maximum dimensions and cannot have their data elements compressed, but can use any datatype for their elements.
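The chunked storage layout and attributes described above are both configured through the C library interface. The following fragment is a minimal sketch rather than one of this entry's complete examples: the dataset name, chunk size, and attribute value are invented for illustration, an already open file identifier file_id is assumed, and error checking is omitted.

hid_t dcpl_id, space_id, dset_id, aspace_id, attr_id;
hsize_t dims[3]       = {1024, 1024, 1024};
hsize_t chunk_dims[3] = {64, 64, 64};          /* illustrative chunk size */
double  timestep      = 0.0005;                /* illustrative metadata value */

space_id = H5Screate_simple(3, dims, NULL);

/* Dataset creation property list: chunked storage with gzip compression. */
dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl_id, 3, chunk_dims);
H5Pset_deflate(dcpl_id, 6);
dset_id = H5Dcreate(file_id, "/Chunked_data", H5T_NATIVE_FLOAT, space_id,
                    H5P_DEFAULT, dcpl_id, H5P_DEFAULT);

/* Attach a small scalar attribute describing the data to the dataset. */
aspace_id = H5Screate(H5S_SCALAR);
attr_id = H5Acreate(dset_id, "timestep", H5T_NATIVE_DOUBLE, aspace_id,
                    H5P_DEFAULT, H5P_DEFAULT);
H5Awrite(attr_id, H5T_NATIVE_DOUBLE, &timestep);

H5Aclose(attr_id); H5Sclose(aspace_id);
H5Dclose(dset_id); H5Pclose(dcpl_id); H5Sclose(space_id);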
Examples of Using HDF
The following C code example shows how to use the HDF library to store a large array. In the example below, the data to be written to the file is a three-dimensional array of single-precision floating-point values; for concreteness, the array here has 1,024 elements along each axis (about 4 GB of data):

float data[1024][1024][1024];
hid_t file_id, dataspace_id, dataset_id;
hsize_t dims[3] = {1024, 1024, 1024};

< ...acquire or assign data values... >

file_id = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
dataspace_id = H5Screate_simple(3, dims, NULL);
dataset_id = H5Dcreate(file_id, "/Float_data", H5T_NATIVE_FLOAT, dataspace_id,
                       H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
H5Dwrite(dataset_id, H5T_NATIVE_FLOAT, dataspace_id, dataspace_id, H5P_DEFAULT, data);
H5Dclose(dataset_id);
H5Sclose(dataspace_id);
H5Fclose(file_id);

In this example, the first three lines declare the variables needed, including the 3-D array of data to store, and the placeholder that follows represents the application's process of filling the data array with information. The next three calls create a new HDF file, a new dataspace describing a fixed-size three-dimensional array of 1,024 x 1,024 x 1,024 elements, and a new dataset using a single-precision floating-point datatype and the dataspace just created. The H5Dwrite call writes the entire 4 GB array to the file in a single I/O operation, and the remaining calls close the objects created earlier. Several of the calls use H5P_DEFAULT as a parameter, which is a placeholder for an HDF property list object; property lists can control more complicated properties of objects or operations. The next C code example creates an identically structured file, but adds the necessary calls to open the file with eight processes in parallel and to perform a collective write to the dataset created.

float data[128][1024][1024];
hid_t file_id, file_dataspace_id, mem_dataspace_id, dataset_id, fa_plist_id, dx_plist_id;
hsize_t file_dims[3] = {1024, 1024, 1024};
hsize_t mem_dims[3] = {128, 1024, 1024};

< ...acquire or assign data values... >

fa_plist_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fa_plist_id, MPI_COMM_WORLD, MPI_INFO_NULL);
file_id = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fa_plist_id);
H5Pclose(fa_plist_id);

file_dataspace_id = H5Screate_simple(3, file_dims, NULL);
dataset_id = H5Dcreate(file_id, "/Float_data", H5T_NATIVE_FLOAT, file_dataspace_id,
                       H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
mem_dataspace_id = H5Screate_simple(3, mem_dims, NULL);

< ...select process's elements in file dataspace... >

dx_plist_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dx_plist_id, H5FD_MPIO_COLLECTIVE);
H5Dwrite(dataset_id, H5T_NATIVE_FLOAT, mem_dataspace_id, file_dataspace_id, dx_plist_id, data);
H5Pclose(dx_plist_id);

H5Dclose(dataset_id);
H5Sclose(mem_dataspace_id);
H5Sclose(file_dataspace_id);
H5Fclose(file_id);

In this updated example, the size of the data array has been reduced to one eighth of the total array size, to allow each of the eight processes to write a portion of the total array in the file. A file access property list is created and its file driver set to MPI-IO, and the file is then created and opened collectively by all processes using that property list. The dataset is created in the file in the same way as in the previous example. An additional dataspace describes each process's portion of the dataset in the file, and the second placeholder marks a section of code for selecting the part of the file's dataspace that each process will write to (omitted here due to space constraints). A dataset transfer property list is then created, the I/O operation is set to be collective, and a collective write operation is performed in which each process writes a different portion of the dataset in the file. Finally, the remaining calls release the resources used in the example. This example shows some of the ways that property lists can be used to modify the operation of HDF API calls, as well as demonstrating a simple example of parallel I/O using HDF API calls.

Higher-Level Data Models Built on HDF
HDF provides a set of generic higher-level data models that describe how to store images and tables as datasets and describe the coordinates of dataspace elements. A scientific user community can also use HDF as the basis for exchanging data among its members by creating a standardized domain-specific data model that is relevant to its area of interest. Domain-specific data models specify the names of HDF groups, datasets, and attributes, the dataspace and datatype for the datasets and attributes, and so on. Frequently, a domain's user community also creates a "wrapper library" that calls HDF library routines while enforcing the domain's standardized data model.

Library Interface
Software applications create, modify, and delete HDF objects through an object-oriented library interface that manipulates the objects in HDF files. Applications can use the HDF library to operate directly on the base HDF objects or use a domain-specific wrapper library that operates at a higher level of abstraction. The core software library for accessing HDF files is written in C, but library interfaces for the HDF data model have been created for many programming languages, including Fortran, C++, Java, Python, Perl, Ada, and C#.

File Format
Objects in the HDF data model created by the library interface are stored in files whose structure is defined by the HDF file format. The HDF file format has many unique aspects, some of which are: a mechanism for storing non-HDF formatted data at the beginning of a file, a method of "micro-versioning" file data structures that makes incremental changes to the format possible, and data structures that enable constant-time lookup of data within the file in situations that previously required a logarithmic number of operations. The HDF file format is designed to be flexible and extensible, allowing for evolution and expansion of the data model in an incremental and structured way. This allows new releases of the HDF software library to continue to access all previous versions of the HDF file format. This capability empowers application developers to create HDF files and access the data contained within them over very long periods of time.

Tools
HDF is distributed with command-line utilities that can inspect and operate on HDF files. Operations provided by command-line utilities include copying HDF
objects from one file to another, compacting internally fragmented HDF files to reduce their size, and comparing two HDF files to determine differences in the objects contained within them. The latter differencing utility, called "h5diff", is also provided as a parallel computing application that uses the MPI programming interface to quickly compare two files using multiple processes. Many other applications, both commercial and open source, can access data stored in HDF files. Some of these applications include MATLAB [], Mathematica [], HDFView [], VisIt [], and EnSight []. Some of these applications provide generic browsing and modification of HDF files, while others provide specialized visualization of domain-specific data models stored in HDF files.
Parallel File I/O
Applications that use the MPI programming interface [] can use the HDF library to access HDF files in parallel from multiple concurrently executing processes. Internally, the HDF library uses the MPI interface for coordinating access to the HDF file as well as for performing parallel operations on the file. Efficiently accessing an HDF file in parallel requires storing the file on a parallel file system designed for access through the MPI interface. Two methods of accessing an HDF file in parallel are possible: "independent" and "collective." Independent parallel access to an HDF file is performed by a process in a parallel application without coordination with or cooperation from the other processes in the application. Collective parallel access to an HDF file is performed with all the processes in the parallel application cooperating and possibly communicating with each other. The following discussion of HDF library capabilities describes the parallel I/O features in the current release at the time of this writing. The parallel I/O features in the HDF library are continually improving and evolving to address the ongoing changes in the landscape of parallel computing. Unless otherwise stated, limitations in the capabilities of the HDF library are not inherent to the HDF data model or file format and may be addressed in future library releases. The HDF library requires that operations that create, modify, or delete objects in an HDF file be performed collectively. However, operations that only open
objects for reading can be performed independently. Requiring collective operations for modifying the file’s structure is currently necessary so that all processes in the parallel application keep a consistent view of the file’s contents. Reading or writing the elements of a dataset can be performed independently or collectively. Accessing the elements of a dataset with independent or collective operations involves different trade-offs that application developers must balance. Using independent operations requires a parallel application to create the overall structure of the HDF file at the beginning of its execution, or possibly with a nonparallel application prior to the start of the parallel application. The parallel application can then update elements of a dataset without requiring any synchronization or coordination between the processes in the application. However, the application must subdivide the portions of the file accessed from each process to avoid race conditions that would affect the contents of the file. Additionally, accessing a file independently may cause the underlying parallel file system to perform very poorly, due to its lack of a global perspective on the overall access pattern, which prevents the file system from taking advantage of many caching and buffering opportunities available with collective operations. Accessing HDF dataset elements collectively requires that all processes in the parallel application cooperate when performing a read or write operation. Collective operations use the MPI interface within the HDF library to describe the region(s) of the file to access from each process, and then use collective MPI I/O operations to access the dataset elements. Collectively accessing HDF dataset elements allows the MPI implementation to communicate between processes in the application to determine the best method of accessing the parallel file system, which can greatly improve performance of the I/O operation. However, the communication and synchronization overhead of collective access can also slow down an application’s overall performance. To get good performance when accessing an HDF dataset, it is important to choose a storage method that is compatible with the type of parallel access chosen. For example, compact data storage requires all writes to dataset elements be performed collectively, while external data storage requires all dataset element accesses be performed independently.
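The parallel example earlier left the per-process selection as a placeholder. One possible way to fill it in, assuming the eight processes split the first dimension of the 1,024 x 1,024 x 1,024 dataset into 128-plane slabs as in that example, is sketched below; mpi_rank and an initialized MPI environment are assumed:

int mpi_rank;
hsize_t start[3], count[3];

MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
start[0] = (hsize_t)mpi_rank * 128;  start[1] = 0;    start[2] = 0;
count[0] = 128;                      count[1] = 1024; count[2] = 1024;
/* Each process selects its own slab of the file dataspace; the selections
   are disjoint, so both independent and collective writes are race-free. */
H5Sselect_hyperslab(file_dataspace_id, H5S_SELECT_SET, start, NULL, count, NULL);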
Collective and independent element access also involves the MPI and parallel file system layers, and those layers add their own complexities to the equation. HDF application developers must carefully balance the trade-offs of collective and independent operations to determine when and how to use them. High performance access to HDF files is a strongly desired feature of application developers, and the HDF library has been designed with the goal of providing performance that closely matches the performance possible when an application accesses unformatted data directly. Considerable effort has been devoted to enhancing the HDF library’s parallel performance, and this effort continues as new parallel I/O developments unfold.
Significant Parallel Applications and Libraries That Use HDF Many applications and software libraries that use HDF have become significant software assets, either commercially or as open source projects governed by a user community. HDF’s stability, longevity, and breadth of features have attracted many developers to it for storing their data, both for sequential and parallel computing purposes. The first software library to use HDF was developed collaboratively by developers at the US Department of Energy’s (DOE) Lawrence Livermore, Los Alamos and Sandia National Laboratories. This effort was called the “Sets and Fields” (SAF) library [] and was developed to give the parallel applications that dealt with large, complex finite element simulations on the highest performing computers of the time a way to efficiently store their data. Many large scientific communities worldwide have adopted HDF for storing data with parallel applications. Some significant examples include the FLASH software for simulating astrophysical nuclear flashes from the University of Chicago [], the Chombo package for solving finite difference equations using adaptive mesh refinement from Lawrence Berkeley National Laboratory [], and the open source NeXus software and data format for interchanging data in the neutron, x-ray, and muon science communities [].
netCDF, PnetCDF, and HDF Another significant software library that uses HDF is the netCDF library [], developed at the Unidata
Program Center for the University Corporation for Atmospheric Research (UCAR). Originally designed for the climate modeling community, netCDF has since been embraced by many other scientific communities. netCDF has adopted HDF as its principal storage method, as of version 4, in order to take advantage of several features in HDF that its previous file format did not provide, including data compression, a wider array of types for storing data elements, hierarchical grouping structures, and more flexible parallel operations. Created prior to the development of netCDF-4, parallel-netCDF or "PnetCDF" [] was developed by Argonne National Laboratory. PnetCDF allows parallel applications to access netCDF format files with collective data operations. PnetCDF does not extend the netCDF data model or format beyond allowing larger objects to be stored in netCDF files. Files written by PnetCDF and versions of the netCDF library prior to netCDF-4 do not use the HDF file format and instead store data in the "netCDF classic" format [].
Future Directions Both the primary development team at The HDF Group and the user community that has formed around the HDF project are constantly improving it. HDF continues to be ported to new computer systems and architectures and has its performance improved and errors corrected over time. Additionally, the HDF data model is expanding to encompass new developments in the field of high performance storage and computing. Some improvements being designed or implemented as of this entry’s writing include: increasing the efficiency of small file I/O operations through advanced caching mechanisms, finding ways to allow parallel applications to create HDF objects in a file with independent operations, and implementing new chunked data storage indexing methods to improve collective access performance. Additionally, HDF continues to lead the scientific data storage field in its adoption of asynchronous file I/O for improved performance, journaled file updates for improved resiliency, and methods for improving concurrency by allowing different applications to read and write to the same HDF file without using a locking mechanism.
Related Entries
File Systems
High-Performance I/O
MPI (Message Passing Interface)
NetCDF I/O Library, Parallel
Bibliographic Notes and Further Reading
HDF has been under development for many years; a short history of its development is recorded at The HDF Group web site []. Datasets in HDF files are analogous to sections of trivial fiber bundles [], where the HDF dataspace corresponds to a fiber bundle's base space, the HDF datatype corresponds to the fiber, and the dataset, a variable whose value is the totality of the data elements stored, represents a section through the total space (which is the Cartesian product of the base space and the fiber). HDF datatypes can be very complex, and many more details are found in reference []. The HDF file format is documented in []. Many more applications and libraries use HDF than are discussed in this entry. A partial list can be found in [].
17. NeXus, http://www.nexusformat.org/Main_Page
18. PnetCDF, http://trac.mcs.anl.gov/projects/parallel-netcdf
19. netCDF file formats, http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/File-Format.html
20. A history of the HDF group, http://www.hdfgroup.org/about/history.html
21. Fiber bundle definition, http://mathworld.wolfram.com/FiberBundle.html
22. HDF user guide, chapter on datatypes, http://www.hdfgroup.org/HDF/doc/UG/UG_frameDatatypes.html
23. HDF file format specification, http://www.hdfgroup.org/HDF/doc/H.format.html
24. Summary of software using HDF, http://www.hdfgroup.org/products/hdf_tools/SWSummarybyName.htm
H HEP, Denelcor Denelcor HEP
Heterogeneous Element Processor Denelcor HEP
Bibliography
1. HDF, http://www.hdfgroup.org/HDF/
2. National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, http://www.ncsa.illinois.edu/
3. The HDF Group, http://www.hdfgroup.org/
4. HDF, http://www.hdfgroup.org/products/hdf/
5. netCDF, http://www.unidata.ucar.edu/software/netcdf/
6. TIFF, http://partners.adobe.com/public/developer/tiff/index.html
7. FITS, http://fits.gsfc.nasa.gov/
8. MATLAB, http://www.mathworks.com/products/matlab/
9. Mathematica, http://www.wolfram.com/products/mathematica/index.html
10. HDFView, http://www.hdfgroup.org/hdf-java-html/hdfview/
11. VisIt, https://wci.llnl.gov/codes/visit/
12. EnSight, http://www.ensight.com/
13. MPI, http://www.mpi-forum.org/
14. Miller M et al () Enabling interoperation of high performance, scientific computing applications: modeling scientific data with the sets & fields (SAF) modeling system. In: ICCS, San Francisco. Lecture Notes in Computer Science. Springer, Heidelberg
15. FLASH, http://flash.uchicago.edu/web/
16. Chombo, https://seesar.lbl.gov/anag/chombo/index.html
Hierarchical Data Format HDF
High Performance Fortran (HPF) HPF (High Performance Fortran)
High-Level I/O Library NetCDF I/O Library, Parallel
High-Performance I/O I/O
Homology to Sequence Alignment, From

Wu-Chun Feng, Heshan Lin
Virginia Tech, Blacksburg, VA, USA
Wake Forest University, Winston-Salem, NC, USA
Discussion
Two sequences are considered to be homologous if they share a common ancestor. Sequences are either homologous or nonhomologous, but never something in between []. Determining whether two sequences are actually homologous can be a challenging task, as it requires inferences to be made between the sequences. Further complicating this task is the possibility that the sequences appear to be related through chance similarity rather than through common ancestry. One approach toward determining homology entails the use of sequence-alignment algorithms that maximize the similarity between two sequences. For homology modeling, these alignments can be used to obtain the likely amino-acid correspondence between the sequences.
Introduction Sequence alignment identifies similarities between a pair of biological sequences (i.e., pairwise sequence alignment) or across a set of multiple biological sequences (i.e., multiple sequence alignment). These alignments, in turn, enable the inference of functional, structural, and evolutionary relationships between sequences. For instance, sequence alignment helped biologists identify the similarities between the SARS virus and the more well-studied coronaviruses, thus enhancing the biologists’ ability to combat the new virus.
Pairwise Sequence Alignment There are two types of pairwise alignment: global alignment and local alignment. Global alignment seeks to align a pair of sequences entirely to each other, i.e., from one end to the other. As such, it is suitable for comparing sequences with roughly the same length, e.g., two closely homologous sequences. Local alignment seeks to identify significant matches between parts of the sequences. It is useful to analyze partially related
sequences, e.g., protein sequences that share a common domain. Many approaches have been proposed for aligning a pair of sequences. Among them, dynamic programming is a common technique that can find optimal alignments between sequences. Dynamic programming can be used for both local alignment and global alignment, and the algorithms in both cases are quite similar. The scoring model of an alignment algorithm is given by a substitution matrix and a gap-penalty function. A substitution matrix stores a matching score for every possible pair of letters. A matching score is typically measured by the frequency with which a pair of letters occurs in known homologous sequences according to a certain statistical model. Popular substitution matrices include PAM and BLOSUM, an example of which is shown in Fig. . A gap-penalty function defines how gaps in the alignments are weighed in alignment scores. For instance, with a linear gap-penalty function, the penalty score grows linearly with the length of a gap. With an affine gap-penalty function, the penalty factors are differentiated for the opening and the extension of a gap. The following discussion will focus on the basic Smith–Waterman algorithm and its parallelization; Smith–Waterman is a popular local-alignment tool based on dynamic programming. (For more advanced techniques to compute pairwise sequence alignment, the reader should consult the "Bibliographic Notes and Further Reading" section at the end of this entry.)
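As a small illustration of the two gap-penalty schemes just described, the functions below use invented penalty constants (a per-position cost of -2, and an opening cost of -10 with an extension cost of -2):

/* Linear gap penalty: a constant cost per gap position. */
int gap_linear(int gap_len) { return -2 * gap_len; }

/* Affine gap penalty: a larger cost to open a gap plus a smaller cost
   to extend it by each additional position. */
int gap_affine(int gap_len) { return gap_len > 0 ? -10 - 2 * (gap_len - 1) : 0; }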
Case Study: Smith–Waterman Algorithm
Given two sequences S1 = a1a2...am and S2 = b1b2...bn, the Smith–Waterman algorithm uses an m by n scoring matrix H to calculate and track the alignments. A cell Hi,j stores the highest similarity score that can be achieved by any possible alignment ending at ai and bj. The Smith–Waterman algorithm has three phases: initialization, matrix filling, and traceback. The initialization phase simply assigns a value of 0 to each of the matrix cells in the first row and the first column. In the matrix-filling phase, the problem of aligning the two whole sequences is broken into smaller subproblems, i.e., aligning partial sequences. Accordingly, a cell Hi,j is updated based on the values of its preceding neighbors. For the sake of illustration, the rest of the discussion assumes a linear gap-penalty function
[Figure: BLOSUM substitution matrix, a 20 x 20 table of matching scores for every amino-acid pair; diagonal (identity) scores range from 4 up to 11 for tryptophan (W).]
Homology to Sequence Alignment, From. Fig. BLOSUM substitution matrix
where the penalty of a gap is equal to a constant factor g (typically negative) times the length of the gap. There are three possible alignment scenarios in which Hi,j is derived from its neighbors: (1) ai and bj are aligned with each other, (2) ai is aligned with a gap (a gap in S2), and (3) bj is aligned with a gap (a gap in S1). As such, the scoring matrix can be filled according to (1). The first three terms in (1) correspond to the three scenarios; the zero value ensures that there are no negative scores. S(ai, bj) is the matching score derived by looking up the substitution matrix.

Hi,j = max { Hi-1,j-1 + S(ai, bj),  Hi-1,j + g,  Hi,j-1 + g,  0 }        (1)

When a cell is updated, the direction from which the maximum score is derived also needs to be stored (e.g., in a separate matrix). After the matrix is filled, a traceback process is used to recover the path of the best alignment. It starts from the cell with the highest score in the matrix and ends at a cell with a value of 0, following the direction information recorded earlier. The majority of the execution time is spent on the matrix-filling phase of Smith–Waterman. Algorithm 1 shows a straightforward implementation of the matrix filling.

Algorithm 1  Matrix Filling in Smith–Waterman
for i = 1 to m do
  for j = 1 to n do
    max = Hi-1,j-1 + S(ai, bj) > Hi-1,j + g ? Hi-1,j-1 + S(ai, bj) : Hi-1,j + g
    if Hi,j-1 + g > max then
      max = Hi,j-1 + g
    end if
    if max < 0 then
      max = 0
    end if
    Hi,j = max
  end for
end for

In the inner loop of the algorithm, the cell calculated in one iteration depends on the value updated in the previous iteration, resulting in a "read-after-write" hazard (see [] for details), which can reduce the instruction-level parallelism that can be exploited by pipelining and hence adversely impact performance. In addition, this algorithm is difficult to parallelize directly because of the data dependency between iterations of the inner loop. As depicted in Fig. , the calculation of a particular cell depends on its west, northwest, and north neighbors. However, the updates of individual cells along an anti-diagonal are independent. This observation
Homology to Sequence Alignment, From. Fig. Data dependency of matrix filling
motivates a wavefront-filling algorithm [], where the matrix cells on an anti-diagonal can be updated simultaneously. That is, because there is no dependency between two adjacent cells along an anti-diagonal, this algorithm greatly reduces read-after-write hazards and, in turn, increases the execution efficiency and the ease of parallelization. For example, in a shared-memory environment, individual threads can each compute a subset of the cells along an anti-diagonal. However, synchronization between threads must occur after computing each anti-diagonal. Since a scoring matrix is typically stored in row-major order, as in Algorithm 1, the above wavefront algorithm may have a large memory footprint when computing an anti-diagonal, thus limiting the benefits of processor caches. One improvement entails partitioning the matrix into tiles and having each parallel processing unit fill a subset of the tiles, as shown in Fig. . By carefully choosing the tile size, the data processed by a thread can fit in the processor cache. Furthermore, when parallelized in distributed environments, the tiled approach can effectively reduce internode communication, as compared with the fine-grained wavefront approach, because only elements at the borders of individual tiles need to be exchanged between different compute nodes. With the wavefront approach, the initial amount of parallelism in the algorithm is low. It gradually increases
Homology to Sequence Alignment, From. Fig. Tiled implementation
along each successive anti-diagonal until reaching the maximum parallelism along the longest anti-diagonal, and then monotonically decreases thereafter. A trade-off needs to be made in choosing the tile size. If the tile size is too large, there is not sufficient parallelism to exploit at the beginning of the wavefront computation, which results in idle resources on systems with a large number of processing units. On the other hand, too small a tile size will incur much more synchronization and communication overhead. Nonetheless, the wavefront approach may generate imbalanced workloads on different processors, especially at the beginning and end of the computation. It is worth noting that there is an alternative parallel algorithm that uses prefix sums to compute the scoring matrix row by row (or column by column) [], which can generate a uniform task distribution among all processors. The above discussion assumes a simple linear gap-penalty function. In practice, the Smith–Waterman algorithm uses an affine gap-penalty scheme, which requires maintaining three scoring matrices in order to track gap opening and extension. Consequently, both the time and space usage increase by a factor of three in implementations using affine gap penalties.
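A minimal sketch of the anti-diagonal (wavefront) traversal described above follows. It assumes an (m + 1) x (n + 1) matrix H whose first row and column are already initialized to zero, a substitution-score function S, and a negative linear gap penalty g; the OpenMP directive simply marks the loop whose iterations are independent. It is an illustration, not a tuned or tiled implementation.

/* Wavefront filling of the scoring matrix: cells with the same anti-diagonal
   index d = i + j have no dependencies on each other. */
void sw_fill_wavefront(int m, int n, int **H, const char *a, const char *b,
                       int (*S)(char, char), int g)
{
    for (int d = 2; d <= m + n; d++) {
        int ilo = (d - n > 1) ? d - n : 1;     /* valid row range on this diagonal */
        int ihi = (d - 1 < m) ? d - 1 : m;
        #pragma omp parallel for               /* independent cells on one diagonal */
        for (int i = ilo; i <= ihi; i++) {
            int j = d - i;
            int best = H[i - 1][j - 1] + S(a[i - 1], b[j - 1]);
            if (H[i - 1][j] + g > best) best = H[i - 1][j] + g;
            if (H[i][j - 1] + g > best) best = H[i][j - 1] + g;
            H[i][j] = (best > 0) ? best : 0;   /* clamp at zero, as in Eq. (1) */
        }
    }
}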
Sequence Database Search With the proliferation of public sequence data, sequence database search has become an important task in sequence analysis. For example, newly discovered
sequences are typically searched against a database of sequences with known genes and known functions in order to help predict the functions of the newly discovered sequences. A sequence database-search tool compares a set of query sequences against all sequences in a database with a pairwise alignment algorithm and reports the matches that are statistically significant. Although dynamic-programming algorithms can be used for sequence database search, the algorithms are too computationally demanding to keep up with the database growth. Consequently, heuristic-based algorithms, such as BLAST [, ] and FASTA [], have been developed for rapidly identifying similarities in sequence databases. BLAST is the most widely used sequence databasesearch tool. It reduces the complexity of alignment computation by filtering potential matches with common words, called k-mers. Specifically, there are four stages in comparing a query sequence and a database sequence.
● Stage 1: The query and the database sequences are parsed into words of length k (k is 3 for protein sequences and 11 for DNA sequences by default). The algorithm then matches words between the query sequence and the database sequence and calculates an alignment score for each matched word, based on a substitution matrix (e.g., BLOSUM). Only matched words with alignment scores higher than a threshold are kept for the next stage.
● Stage 2: For a high-scoring matched word, ungapped alignment is performed by extending the matched word in both directions. An alignment score is calculated along the extension. The extension stops when the alignment score stops increasing and drops slightly off from the maximum alignment score (controlled by another threshold).
● Stage 3: Ungapped alignments with scores larger than a given threshold obtained from stage 2 are chosen as seed alignments. Gapped alignments are then performed on the seed alignments using a dynamic-programming algorithm, following both forward and backward directions.
● Stage 4: Traceback is performed to recover the paths of gapped alignments.
BLAST calculates the significance of the resulting alignments using Karlin–Altschul statistics. The Karlin–Altschul theory uses a statistic called the e-value (E) to measure the likelihood that an alignment results from matches by chance (i.e., matches between random sequences) rather than from a true homologous relationship. The e-value can be calculated according to (2):

E = K m n e^(-lambda S)        (2)

where K and lambda are the Karlin–Altschul parameters, m and n are the query length and the total length of all database sequences, and S is the alignment score. The e-value indicates how many alignments with a score higher than S can be found by chance in the search space, i.e., the product of the query sequence length and the database length. The lower the e-value, the more significant an alignment is. The alignment results of a query sequence are sorted in order of e-value.
A sequence database-search job needs to compute M x N pairwise sequence alignments, where M and N are the numbers of query and database sequences, respectively. This computation can be parallelized with a coarse-grained approach, where the alignments of individual pairs of sequences are assigned to different processing units.
Early parallel sequence-search software adopted a query segmentation approach, where a sequence-search job is parallelized by having individual compute nodes concurrently search disjoint subsets of query sequences against the whole sequence database. Since the searches of individual query sequences are independent, this embarrassingly parallel approach is easy to implement and scales well. However, the size of sequence databases is growing much faster than the memory size of a typical single computer. When the database cannot fit in memory, data is frequently swapped in and out of memory when searching multiple queries, causing significant performance degradation, because disk I/O is several orders of magnitude slower than memory access. Query segmentation can improve search throughput, but it does not reduce the response time taken to search a single query sequence.
Database segmentation is an alternative parallelization approach, where large databases are partitioned and cached in the aggregate memory of a group of compute nodes. By doing so, the repeated I/O overhead of searching large databases is avoided. Database segmentation also improves the search response time since a
compute node searches only a portion of the database. However, database segmentation introduces computational dependencies between individual nodes because the distributed results generated at different nodes need to be merged and sorted to produce the final output. The parallel overhead of merging and sorting increases as the system size grows. With the astronomical growth of sequence databases, today’s large-scale sequence search jobs can be very resource demanding. For instance, BLAST searching in a metagenomics project can consume several millions of processor hours. Massively parallel sequence-search tools, such as mpiBLAST [, , ] and ScalaBLAST [], have been developed to accelerate large-scale sequence search jobs on state-of-the-art supercomputers. These tools use a combination of query segmentation and database segmentation to offer massive parallelism needed to scale on a large number of processors.
equal-sized partitions, which are supervised by a dedicated supermaster process. The supermaster is responsible for assigning tasks to different partitions and handling inter-partition load balancing. Within each partition, there is one master process and many worker processes. The master is responsible for coordinating both computation and I/O scheduling in a partition. The master periodically fetches a subset of query sequences from the supermaster and assigns them to workers, and it coordinates output processing of queries that have been processed in the partition. The sequence database is partitioned into fragments and replicated to workers in the system. This hierarchical design avoids creating scheduling bottlenecks in large systems by distributing scheduling workloads to multiple masters. Large-scale sequence searches can be highly dataintensive, and as such, the efficiency of data management is critical to the program scalability. For the input data, having thousands of processors simultaneously load database fragments from shared storage may Case Study: mpiBLAST mpiBLAST is an open-source parallelization of NCBI overwhelm the I/O subsystem. To address this, mpiBLAST that has been designed for petascale deploy- BLAST designates a set of compute nodes as I/O proxment. Adopting a scalable, hierarchical design, mpiBLAST ies, which read database fragments from the file system parallelizes a search job via a combination of query in parallel and replicate them to other workers using segmentation and database segmentation. As shown the broadcasting mechanism in MPI [, ] libraries. in Fig. , processors in the system are organized into In addition, mpiBLAST allows workers to cache
Homology to Sequence Alignment, From. Fig. mpiBLAST hierarchical design. Qi and Qj are query batches fetched from the supermaster by the masters, and qi and qj are query sequences that are assigned by masters to their workers. In this example, the database is segmented into two fragments f1 and f2 and replicated twice within each partition
assigned database fragments in the memory or local storage, and it uses a task-scheduling algorithm that takes into account data locality to minimize repeated loading of database fragments. With database segmentation, result alignments of different database fragments are usually interleaved in the global output because those alignments need to be sorted by e-values. Consequently, the output data generated at each worker is noncontiguous in the output file. Straightforward noncontiguous I/O with many seek-and-write operations is slow on most file systems. This type of I/O can be optimized with collective I/O [, , ], which is available in parallel I/O libraries such as ROMIO []. Collective I/O uses a two-phase process. In the first phase, involved processes exchange data with each other to form large trunks of contiguous data, which are stored as memory buffers in individual processes. In the second phase, the buffered data is written to the actual file system. Collective I/O improves I/O performance because continuous data accesses are much more efficient than noncontiguous ones. Traditional collective I/O implementations require synchronization between all involved processes for each I/O operation. This synchronization overhead will adversely impact sequence-search performance when computation is
imbalanced across different processes. To address this issue, mpiBLAST introduces a parallel I/O technique called asynchronous, two-phase I/O (ATIO), which allows worker processes to rearrange I/O data without synchronizing with each other. Specifically, mpiBLAST appoints a worker as the write leader for each query sequence. The write leader aggregates output data from other workers via nonblocking MPI communication and carries out the write operation to the file system. ATIO overlaps I/O reorganization and sequence-search computation, thus improving the overall application performance. Figure shows the difference between collective I/O and ATIO within the context of mpiBLAST.
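For readers unfamiliar with collective I/O, the fragment below sketches the basic pattern using the standard MPI-IO interface. It is not mpiBLAST's actual code, and the file name, offset, and buffer are placeholders; ATIO replaces the synchronization implicit in such collective calls with asynchronous aggregation by a write leader.

#include <mpi.h>

MPI_File fh;
MPI_Status status;
MPI_Offset my_offset = 0;   /* offset of this worker's results (computed elsewhere) */
char buf[1] = {0};          /* this worker's formatted result alignments */
int len = 0;                /* number of bytes this worker contributes */

MPI_File_open(MPI_COMM_WORLD, "blast_output.txt",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
/* Collective write: all workers call this together, allowing the MPI library
   to merge the noncontiguous pieces into large contiguous file accesses. */
MPI_File_write_at_all(fh, my_offset, buf, len, MPI_CHAR, &status);
MPI_File_close(&fh);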
Multiple Sequence Alignment Multiple sequence alignment (MSA) identifies similarities among three or more sequences. It can be used to analyze a family of related sequences to reveal phylogenetic relationships. Other usages of multiple sequence alignment include detection of conserved biological features and genome sequencing. Multiple sequence alignment can also be global or local. Like pairwise sequence alignment, multiple sequence alignment can be computed using dynamic
Homology to Sequence Alignment, From. Fig. Collective I/O and ATIO. Collective I/O requires synchronization and introduces idle waiting in worker processes. ATIO uses a write leader to aggregate noncontiguous data from other workers in an asynchronous manner
programming algorithms. Although these algorithms can find optimal alignments, the required resources grow exponentially as the number of sequences increases. Suppose there are N sequences and the average sequence length is L; the time and space complexities are both O(L^N). Thus, it is computationally impractical to use dynamic programming to align a large number of sequences. Many heuristic approaches have been proposed to reduce the computational complexity of multiple sequence alignment. Progressive alignment methods [, , , , ] use guided pairwise sequence alignment to rapidly construct an MSA for a large number of sequences. These methods first build a phylogenetic tree based on all-to-all pairwise sequence alignments using neighbor-joining [] or UPGMA [] techniques. Guided by the phylogenetic tree, the most similar sequences are aligned first, and the less similar sequences are then progressively added to the initial MSA. One problem with progressive methods is that errors that occur in the early aligning stages are propagated to the final alignment results. Iterative alignment methods [, ] address this problem by introducing "correction" mechanisms during the MSA construction process. These methods incrementally align sequences as progressive methods do, but continuously adjust the structure of the phylogenetic tree and previously computed alignments according to a certain objective function.

Case Study: ClustalW
ClustalW [] is a widely used MSA program based on progressive methods. The ClustalW algorithm includes three stages: (1) distance matrix computation, (2) guided tree construction, and (3) progressive alignment. Due to the popularity of ClustalW, its parallelization has been well studied on clusters [, ] and multiprocessor systems [, ]. In the first stage, ClustalW computes a distance matrix by performing all-to-all pairwise alignment over the input sequences. This requires a total of N(N - 1)/2 comparisons for N sequences, since the alignment of a pair of sequences is symmetric. ClustalW allows users to choose between two alignment algorithms: a faster k-mer matching algorithm and a slower but more accurate dynamic-programming algorithm. This stage of the algorithm is embarrassingly parallel. Alignments of individual pairs of sequences can be statically assigned, or dynamically assigned with a greedy algorithm, to different processing units. Additional parallelism can be exploited by parallelizing the alignment of a single pair of sequences, similar to the Smith–Waterman algorithm.
In the second stage, a guided tree is constructed using a neighbor-joining algorithm based on the alignment scores in the distance matrix. Initially all sequences are leaf nodes of a star-like tree. The algorithm iteratively selects a pair of nodes and joins them into a new internal node until there are no nodes left to join, at which point a bifurcated tree has been created. Algorithm 2 gives the pseudocode of this process. Suppose there are a total of n nodes available at an iteration; there are then n(n - 1)/2 possible pairs of nodes. The pair of nodes that results in the smallest length of branches is selected to join. An example of the neighbor-joining process with four sequences is given in Fig. . In Algorithm 2, the outermost loop cannot be parallelized because one iteration of neighbor joining depends on the previous one. However, within an iteration of neighbor joining, the calculation of the branch lengths of all pairs of nodes can be performed in parallel. Since the number of available nodes is reduced after each joining operation, the level of parallelism decreases as the iteration advances in the outermost loop.

Algorithm 2  Construction of Guided Tree
while there are nodes to join do
  Let n be the number of currently available nodes
  Let D be the current distance matrix
  Let Lmin be the smallest length of the tree branches
  for i = 1 to n do
    for j = 1 to i - 1 do
      L(i, j) = (n - 2) Di,j - Sum(k=1..n) Di,k - Sum(k=1..n) Dj,k
      if L(i, j) < Lmin then
        Lmin = L(i, j)
      end if
    end for
  end for
  Combine the nodes i and j that result in Lmin into a new node
  Update the distance matrix with the new node
end while
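The pair-selection step inside the loop of Algorithm 2 can be sketched in C as follows. This is an illustration rather than ClustalW's actual implementation; D is assumed to be the current n x n distance matrix, and the row sums are precomputed so that each L(i, j) evaluation takes constant time.

#include <stdlib.h>
#include <float.h>

/* Find the pair of active nodes with the smallest branch-length criterion. */
void select_pair(int n, double **D, int *best_i, int *best_j)
{
    double *rowsum = malloc(n * sizeof(double));
    double Lmin = DBL_MAX;
    for (int i = 0; i < n; i++) {
        rowsum[i] = 0.0;
        for (int k = 0; k < n; k++) rowsum[i] += D[i][k];
    }
    for (int i = 1; i < n; i++) {
        for (int j = 0; j < i; j++) {   /* these evaluations are independent */
            double L = (n - 2) * D[i][j] - rowsum[i] - rowsum[j];
            if (L < Lmin) { Lmin = L; *best_i = i; *best_j = j; }
        }
    }
    free(rowsum);
}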
Homology to Sequence Alignment, From. Fig. Neighbor-joining example
In the third stage, the sequences are progressively aligned according to the guided tree. For instance, in the tree shown in Fig. (d), sequences 1 and 2 are aligned first, followed by 3 and 4. Finally, the alignment results of <1-2> and <3-4> are aligned. Note that the alignment at a node can only be performed after both of its children have been aligned, but the alignments at the same level of the tree can be performed simultaneously. The level of parallelism of the algorithm depends heavily on the tree structure. In the best case, the guided tree is a balanced tree. At the beginning, all the leaves can be aligned in parallel. The number of concurrent alignments then decreases by half at each higher level toward the root of the tree.
Case Study: T-Coffee
T-Coffee [] is another popular progressive alignment algorithm. Compared to other typical progressive alignment tools, T-Coffee improves alignment accuracy by adopting a consistency-based scoring function, which uses similarity information among all input sequences to guide the alignment process. T-Coffee consists of two main steps: library generation and progressive alignment. The first step constructs a library that contains a mixture of global and local alignments between every pair of input sequences. In the library, an alignment is represented as a list of pairwise constraints. Each constraint stores a pair of matched residues along with an alignment weight, which is essentially a three-tuple <S_x^i, S_y^j, w>, where S_x^i is the ith residue of sequence S_x, S_y^j is the jth residue of sequence S_y, and w is the weight. T-Coffee incorporates pairwise global alignments generated by ClustalW as well as the top ten nonintersecting local alignments reported by Lalign from the FASTA package []. T-Coffee can also take alignment information from other MSA software. The pairwise constraints for the same pair of matched residues from various sources (e.g., global and local alignments) will be combined to
remove duplication. Constructing the library requires N(N−1)/2 global and local alignments and is thus highly compute-intensive. After all pairwise alignments are incorporated in the library, T-Coffee performs library extension, a procedure that incorporates transitive alignment information into the weighting of the pairwise constraints. Basically, for a pair of sequences x and y, if there is a sequence z that aligns to both x and y, then the constraints between x and y are reweighted by combining the weights of the corresponding constraints between x and z as well as between y and z. Suppose the average sequence length is L. Since for each pair of sequences the extension algorithm needs to scan the remaining N − 2 sequences, and there are at most L constraints between a pair of sequences, the worst-case computation complexity of library extension is O(N³L). In the second step, T-Coffee performs progressive alignment guided by a phylogenetic tree built with the neighbor-joining method, similar to the ClustalW algorithm. However, T-Coffee uses the weights in the extended library to align residues when grouping sequences/alignments. Since those weights bear complete alignment information from all input sequences, the progressive alignment in T-Coffee can reduce errors caused by the greediness of classic progressive methods. Parallel T-Coffee (PTC) is a parallel implementation of T-Coffee in cluster environments []. PTC adopts a master–worker architecture and uses MPI to communicate between different processes. As mentioned earlier, the library generation in T-Coffee needs to compute all-to-all global and local alignments. The computation tasks of these alignments are independent of each other. PTC uses guided self-scheduling (GSS) [] to distribute alignment tasks to different worker nodes. GSS first assigns a portion of the tasks to the workers. Each worker monitors its performance when processing the initial assignments; this performance information is
then sent to the master and used for dynamic scheduling of subsequent assignments. After the pairwise alignments are finished, duplicated constraints generated on distributed workers need to be combined. PTC implements this with parallel sorting. Each constraint is assigned to a bucket resident at a worker, and each worker can then concurrently combine duplicated constraints within its own bucket. The constraints in the library are then transformed into a three-dimensional lookup table, with rows and columns indexed by sequences and residues, respectively. Each element in the lookup table stores all constraints for a residue of a sequence. The lookup table is accessed by all processors during the progressive alignment. PTC evenly distributes the lookup table by rows to all processors and allows table entries to be accessed by other processors through one-sided remote memory access. An efficient caching mechanism is also implemented to improve lookup performance. During the progressive alignment, PTC schedules tree nodes to processors according to their readiness: a tree node with fewer unprocessed child nodes has a higher scheduling priority, and among tree nodes whose children have all been processed, PTC gives higher priority to the ones with shorter estimated execution time. As with ClustalW, the parallelism of the progressive alignment in PTC can be limited if the guided tree is unbalanced. To address this issue, Orobitg et al. proposed a heuristic approach that constructs a more balanced guided tree by allowing a pair of nodes to be grouped if their similarity value is smaller than the average similarity value between all sequences in the distance matrix [].
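Guided self-scheduling, used above to distribute the independent pairwise-alignment tasks, hands each idle worker a chunk roughly proportional to the work still remaining, about ⌈R/p⌉ tasks when R tasks remain and p workers are active, so chunks start large and shrink as the pool drains. A hedged sketch of the master-side chunk computation (names are illustrative, not taken from the PTC source):

```c
#include <stddef.h>

/* Guided self-scheduling (Polychronopoulos & Kuck): give the requesting
 * worker about a 1/p share of whatever work is still left. */
size_t gss_next_chunk(size_t remaining, size_t num_workers, size_t min_chunk)
{
    if (remaining == 0)
        return 0;
    size_t chunk = (remaining + num_workers - 1) / num_workers;  /* ceil(R/p) */
    if (chunk < min_chunk)
        chunk = min_chunk;
    if (chunk > remaining)
        chunk = remaining;
    return chunk;
}

/* Example: 10 alignment tasks, 4 workers, min_chunk = 1
 * -> successive chunks of 3, 2, 2, 1, 1, 1 tasks. */
```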
Related Entries
Bioinformatics
Genome Assembly
Bibliographic Notes and Further Reading As discussed in the case study of Smith–Waterman, typical sequence-alignment algorithms require O(mn) space and time, where m and n are the lengths of compared sequences. Such a space requirement can be impractical for computing alignments of large sequences (e.g., those with a length of multiple
megabytes) on commodity machines. To address this issue, Myers and Miller introduced a space-efficient alignment algorithm [] adapted from the Hirschberg technique [], which was originally developed for finding the longest common subsequence of two strings. By recursively dividing the alignment problem into subproblems at a "midpoint" along the middle column of the scoring matrix, Myers and Miller's approach can find the optimal alignment within O(m + n) space but still O(mn) time. Huang showed that a straightforward parallelization of the Hirschberg algorithm would require more than linear aggregate space [], i.e., each processor needed to store more than O((m + n)/p) data, where p is the number of concurrent processors. In turn, Huang proposed an improved algorithm that recursively divides the alignment problem at the midpoint along the middle anti-diagonal of the scoring matrix. By doing so, Huang's algorithm requires only O((m + n)/p) space per processor but with an increased time complexity of O((m + n)²/p). Aluru et al. presented an alternative space-efficient parallel algorithm [] that is more time efficient (O(mn/p)) but also consumes more space (O(m + n/p), m ≤ n) than Huang's approach. In their approach, an O(mn) algorithm is first used to partition the scoring matrix into p vertical slices, and the last column of each slice as well as its intersection with the optimal alignment is stored. Each processor then takes a slice and uses a Hirschberg-based algorithm to compute the optimal alignment within the slice. In a subsequent study, Aluru et al. proposed an improved parallel algorithm [] that requires O(mn/p) time and O((m + n)/p) space when p = O(n/log n) processors are used. In other words, such a parallel algorithm achieves "optimal" time and space complexities, because it delivers a linear speedup with respect to the best known sequential algorithm. The computational intensity of sequence-alignment algorithms has motivated studies in parallelizing these algorithms on accelerators. Various algorithms have been accelerated using the SIMD instruction extensions of commodity processors [, ], field-programmable gate arrays (FPGAs) [, ], the Cell Broadband Engine [, , ], and graphics processing units (GPUs) [–, , , , ]. Developing and optimizing applications on traditional accelerators is much more
difficult than on CPUs, which may partially explain why accelerator-based solutions have not been widely adopted even though they have demonstrated very promising performance results. However, the continuing improvement of software environments on commodity GPUs has made them increasingly popular for accelerating sequence alignments. To cope with the astronomical growth of sequence data, cloud-based solutions [, ] have also been developed to enable users to tackle large-scale problems with elastic compute resources from public clouds such as Amazon EC2.
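The space savings discussed above all rest on the same building block: computing alignment scores one row (or anti-diagonal) at a time instead of storing the full O(mn) matrix. The following is a minimal, illustrative C sketch of that O(n)-space score-only recurrence for global alignment with a linear gap penalty; the scoring constants are arbitrary and not taken from any of the cited tools.

```c
#include <stdlib.h>

#define MATCH     2
#define MISMATCH -1
#define GAP      -2

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Fill the global-alignment score matrix row by row, keeping only the
 * previous and current rows.  Returning the last row is exactly the
 * primitive that Hirschberg-style divide-and-conquer uses to locate the
 * optimal split point while staying within linear space. */
int *last_dp_row(const char *x, size_t m, const char *y, size_t n)
{
    int *prev = malloc((n + 1) * sizeof *prev);
    int *curr = malloc((n + 1) * sizeof *curr);

    for (size_t j = 0; j <= n; j++)
        prev[j] = (int)j * GAP;                   /* first row: all gaps */

    for (size_t i = 1; i <= m; i++) {
        curr[0] = (int)i * GAP;
        for (size_t j = 1; j <= n; j++) {
            int s = (x[i - 1] == y[j - 1]) ? MATCH : MISMATCH;
            curr[j] = max3(prev[j - 1] + s,       /* match/mismatch      */
                           prev[j] + GAP,         /* gap in y            */
                           curr[j - 1] + GAP);    /* gap in x            */
        }
        int *tmp = prev; prev = curr; curr = tmp; /* reuse the two rows  */
    }
    free(curr);
    return prev;                                  /* row m of the matrix */
}
```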
Bibliography . Aji AM, Feng W, Blagojevic F, Nikolopoulos DS () Cell-SWat: modeling and scheduling wavefront computations on the cell broadband engine. In: CF ’: Proceedings of the th conference on computing frontiers. ACM, New York, pp – . Altschul S, Gish W, Miller W, Myers E, Lipman D () Basic local alignment search tool. J Mol Biol ():– . Altschul S, Madden T, Schffer A, Zhang J, Zhang Z, Miller W, Lipman D () Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res ():– . Aluru S, Futamura N, Mehrotra K () Parallel biological sequence comparison using prefix computations. J Parallel Distrib Comput :– . Chaichoompu K, Kittitornkun S, Tongsima S () MTClustalW: multithreading multiple sequence alignment. In: International parallel and distributed processing symposium. Rhodes Island, Greece, p . Darling A, Carey L, Feng W () The design, implementation, and evaluation of mpiBLAST. In: Proceedings of the ClusterWorld conference and Expo, in conjunction with the th international conference on Linux clusters: The HPC revolution , San Jose . Di Tommaso P, Orobitg M, Guirado F, Cores F, Espinosa T, Notredame C () Cloud-Coffee: implementation of a parallel consistency-based multiple alignment algorithm in the T-Coffee package and its benchmarking on the Amazon Elastic-Cloud. Bioinformatics ():– . Do CB, Mahabhashyam MS, Brudno M, Batzoglou S () ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res ():– . Ebedes J, Datta A () Multiple sequence alignment in parallel on a workstation cluster. Bioinformatics ():– . Edgar R () MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics (): . Edmiston EE, Core NG, Saltz JH, Smith RM () Parallel processing of biological sequence comparison algorithms. Int J Parallel Program :–
. Feng DF, Doolittle RF () Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol ():– . Fitch W, Smith T () Optimal sequences alignments. Proc Natl Acad Sci :– . Hennessy JL, Patterson DA () Computer architecture: a quantitative approach, th edn. Morgan Kaufmann Publishers, San Francisco . Hirschberg DS () A linear space algorithm for computing maximal common subsequences. Commun ACM : – . Huang X () A space-efficient parallel sequence comparison algorithm for a message-passing multiprocessor. Int J Parallel Program :– . Li K () W-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics ():– . Li I, Shum W, Truong K () -fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA). BMC Bioinformatics (): . Lin H, Ma X, Chandramohan P, Geist A, Samatova N () Efficient data access for parallel BLAST. In: Proceedings of the th IEEE international parallel and distributed processing symposium (IPDPS’). IEEE Computer Society, Los Alamitos . Lin H, Ma X, Feng W, Samatova NF () Coordinating computation and I/O in massively parallel sequence search. IEEE Trans Parallel Distrib Syst :– . Lipman D, Pearson W () Improved toolsW, HT for biological sequence comparison. Proc Natl Acad Sci ():– . Liu W, Schmidt B, Voss B, Müller-Wittig W () GPUClustalW: using graphics hardware to accelerate multiple sequence alignment, Chapter In: Robert Y, Parashar M, Badrinath R, Prasanna VK (eds) High performance computing – HiPC . Lecture notes in computer science, vol . Springer, Berlin/Heidelberg, pp – . Liu Y, Maskell D, Schmidt B () CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Res Notes (): . Liu Y, Schmidt B, Maskell DL () MSA-CUDA: multiple sequence alignment on graphics processing units with CUDA. In: ASAP ’: Proceedings of the th IEEE international conference on application-specific systems, architectures and processors, Washington, DC. IEEE Computer Society, Los Alamitos, California, USA, pp – . Lu W, Jackson J, Barga R () AzureBlast: a case study of cloud computing for science applications. In: st workshop on scientific cloud computing, co-located with ACM HPDC (High performance distributed computing). Chicago, Illinois, USA . Mahram A, Herbordt MC () Fast and accurate NCBI BLASTP: acceleration with multiphase FPGA-based prefiltering. In: Proceedings of the th ACM international conference on supercomputing. Tsukuba, Ibaraki, Japan . May J () Parallel I/O for high performance computing. Morgan Kaufmann Publishers, San Francisco . Message Passing Interface Forum () MPI: message-passing interface standard
. Message Passing Interface Forum () MPI- extensions to the message-passing standard . Mikhailov D, Cofer H, Gomperts R () Performance optimization of Clustal W: parallel Clustal W, HT Clustal, and MULTICLUSTAL. White Papers, Silicon Graphics, Mountain View . Myers EW, Miller W () Optimal alignments in linear space. Comput Appl Biosci (CABIOS) ():– . Notredame C () T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol ():– . Oehmen C, Nieplocha J () ScalaBLAST: a scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis. IEEE Trans Parallel Distrib Syst ():– . Orobitg M, Guirado F, Notredame C, Cores F () Exploiting parallelism on progressive alignment methods. J Supercomput –. doi: ./s--- . Pei J, Sadreyev R, Grishin NV () PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics ():– . Polychronopoulos CD, Kuck DJ () Guided self-scheduling: a practical scheduling scheme for parallel supercomputers. IEEE Trans Comput :– . Rajko S, Aluru S () Space and time optimal parallel sequence alignments. IEEE Trans Parallel Distrib Syst :– . Rognes T, Seeberg E () Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics ():– . Sachdeva V, Kistler M, Speight E, Tzeng TK () Exploring the viability of the cell broadband engine for bioinformatics applications. Parallel Comput ():– . Saitou N, Nei M () The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol ():– . Sandes EFO, de Melo ACMA () CUDAlign: using GPU to accelerate the comparison of megabase genomic sequences. SIGPLAN Not ():– . Sarje A, Aluru S () Parallel genomic alignments on the cell broadband engine. IEEE Trans Parallel Distrib Syst (): – . Sneath PH, Sokal RR () Numerical taxonomy. Nature :– . Thakur R, Choudhary A () An extended two-phase method for accessing sections of out-of-core arrays. Sci Program (): – . Thakur R, Gropp W, Lusk W () Data sieving and collective I/O in ROMIO. In: Symposium on the frontiers of massively parallel processing. Annapolis, Maryland, USA, p . Thakur R, Gropp W, Lusk E () On implementing MPI-IO portably and with high performance. In: Proceedings of the sixth workshop on I/O in parallel and distributed systems. Atlanta, Georgia, USA . Thompson JD, Higgins DG, Gibson TJ () CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res ():–
. Vouzis PD, Sahinidis NV () GPU-BLAST: using graphics processors to accelerate protein sequence alignment. Bioinformatics ():– . Wallace IM, Orla O, Higgins DG () Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics ():– . Wozniak A () Using video-oriented instructions to speed up sequence comparison. Comput Appl Biosci ():– . Xiao S, Aji AM, Feng W () On the robust mapping of dynamic programming onto a graphics processing unit. In: ICPADS ’: proceedings of the th international conference on parallel and distributed systems, Washington, DC. IEEE Computer Society, Los Alamitos, California, USA, pp – . Xiao S, Lin H, Feng W () Characterizing and optimizing protein sequence search on the GPU. In: Proceedings of the th IEEE international parallel and distributed processing symposium Anchorage, Alaska. IEEE Computer Society, Los Alamitos, California, USA . Zola J, Yang X, Rospondek A, Aluru S () Parallel-TCoffee: a parallel multiple sequence aligner. In: ISCA international conference on parallel and distributed computing systems (ISCA PDCS ), pp –
Horizon Tera MTA
HPC Challenge Benchmark
Jack Dongarra, Piotr Luszczek
University of Tennessee, Knoxville, TN, USA
Definition
HPC Challenge (HPCC) is a benchmark that measures computer performance on various computational kernels that span the memory access locality space. HPCC includes tests that are able to take advantage of nearly all available floating-point performance: High Performance LINPACK and matrix–matrix multiply allow for data reuse that is bound only by the size of the large register file and fast cache. The twofold benefit of these tests is the ability to answer the question of how well the hardware is able to work around the "memory wall" and how today's machines compare to the systems of the past as they are cataloged by the LINPACK Benchmark Report [] and TOP500. HPCC also includes other tests, STREAM, PTRANS, FFT, and RandomAccess; when they are combined together they span the memory access locality space. They are able to reveal the growing inefficiencies of the memory subsystem and how they are addressed in new computer infrastructures. HPCC also offers scientific rigor to the benchmarking effort. The tests stress double-precision floating-point accuracy: the absolute prerequisite in the scientific world. In addition, the tests include careful verification of the outputs – undoubtedly an important fault-detection feature at extreme computing scales.

Discussion
The HPC Challenge benchmark suite was initially developed for the DARPA HPCS program [] to provide a set of standardized hardware probes based on commonly occurring computational software kernels. The HPCS program involves a fundamental reassessment of how to define and measure performance, programmability, portability, robustness and, ultimately, productivity across the entire high-end domain. Consequently, the HPCC suite aimed both to give conceptual expression to the underlying computations used in this domain, and to be applicable to a broad spectrum of computational science fields. Clearly, a number of compromises needed to be embodied in the current form of the suite, given such a broad scope of design requirements. HPCC was designed to provide approximate bounds on computations that can be characterized by either high or low spatial and temporal locality (see Fig. , which gives the conceptual design space for the HPCC component tests). In addition, because the HPCC tests consist of simple mathematical operations, HPCC provides a unique opportunity to look at language and parallel programming model issues. As such, the benchmark is designed to serve both the system user and designer communities. Figure shows a generic memory subsystem in the leftmost column and how each level of the hierarchy is tested by the HPCC software (the second column from the left), along with the design goals for the future systems that originated in the HPCS program (the third column from the right). In other words, these are the projected target performance numbers that are to come out of the winning HPCS vendor designs. The last column shows the relative improvement in performance that needs to be achieved in order to meet the goals.

HPC Challenge Benchmark. Fig. The application areas targeted by the HPCS Program are bound by the HPCC tests in the memory access locality space (axes: spatial locality vs. temporal locality; tests: HPL/DGEMM, PTRANS/STREAM, FFT, RandomAccess; application areas: CFD, radar cross-section, TSP, RSA, DSP)
The TOP500 Influence
The most commonly known ranking of supercomputer installations around the world is the TOP500 list []. It uses the equally well-known LINPACK benchmark [] as a single figure of merit to rank the world's most powerful supercomputers. The often-raised question about the relation between the TOP500 list and HPCC can be addressed by recognizing the positive aspects of the former. In particular, the longevity of the TOP500 list gives an unprecedented view of the high-end arena across the turbulent era of Moore's law [] rule and the emergence of today's prevalent computing paradigms. The predictive power of the TOP500 list is likely to have a lasting influence in the future, as it has had in the past. HPCC extends the TOP500 list's concept of exploiting a commonly used kernel and, in the context of the HPCS goals, incorporates a larger, growing suite of computational kernels. HPCC has already begun to serve as a valuable tool for performance analysis. Table shows an example of how the data from the HPCC database can augment the TOP500 results (for the current version of the table please visit the HPCC website).

HPC Challenge Benchmark. Fig. HPCS program benchmarks and performance targets (memory hierarchy: registers/operands, cache/lines, local memory/messages, remote memory/pages, disk; recoverable targets: HPL 2 Pflop/s, 800% improvement required; STREAM 6 Pbyte/s, 4000%; FFT 0.5 Pflop/s, 20 000%; RandomAccess 64000 GUPS, 200 000%; b_eff also listed)

HPC Challenge Benchmark. Table All of the top entries of the TOP500 list that have results in the HPCC database (columns: Rank, Name, Rmax, HPL, PTRANS, STREAM, FFT, RandomAccess, RandomRing latency and bandwidth; systems listed include BG/L, BG W, ASC Purple, Columbia, and Red Storm; the numeric results are not recoverable here)
Short History of the Benchmark
The first reference implementation of the HPCC suite of codes was released to the public in . The first optimized submission came in April from Cray, using the then-recent X1 installation at Oak Ridge National Lab. Ever since, Cray has championed the list of optimized HPCC submissions. By the time of the first HPCC birds-of-a-feather session at the Supercomputing conference in Pittsburgh, the public database of results already featured major
supercomputer makers – a sign that vendors were participating in the new benchmark initiative. At the same time, behind the scenes, the code was also being tried out by government and private institutions for procurement and marketing purposes. A milestone was the announcement of the HPCC Awards contest. The two complementary categories of the competition emphasized performance and productivity – the same goals as the sponsoring HPCS program. The performance-emphasizing Class 1 award drew the attention of many of the biggest players in the supercomputing industry, which resulted in populating the HPCC database with most of the top entries of the TOP500 list (some exceeding the performances they reported on the TOP500 – a tribute to HPCC's continuous results update policy). The contestants competed to achieve the highest raw performance in one of the four tests: HPL, STREAM, RandomAccess, and FFT.
HPC Challenge Benchmark. Fig. Detail description of the HPCC component tests (A, B, C – matrices; a, b, c, x, z – vectors; α, β – scalars; T – array of 64-bit integers):
HPL – compute x from the system of linear equations Ax = b.
DGEMM – compute the update C ← βC + αAB of matrix C with the product of matrices A and B.
STREAM – perform simple operations on vectors a, b, and c (e.g., a ← b + αc).
PTRANS – compute the update A ← A^T + B of matrix A with the sum of its transpose and another matrix B.
RandomAccess – perform integer updates of random locations of vector T using a pseudo-random sequence.
FFT – compute vector z to be the Fast Fourier Transform (FFT) of vector x.
b_eff – perform ping-pong and various communication ring exchanges.
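As a concrete illustration of one of the kernels just listed, the STREAM component measures sustainable memory bandwidth with simple vector updates such as the Triad operation a ← b + αc. A minimal OpenMP version (array allocation and timing omitted, so this is a sketch rather than the reference implementation) might look like:

```c
#include <stddef.h>

/* STREAM-style Triad: a[i] = b[i] + alpha * c[i].  The arrays are streamed
 * through once with no reuse, so the loop is bandwidth-bound (high spatial,
 * low temporal locality) - the opposite corner of the locality space from HPL. */
void stream_triad(double *restrict a, const double *restrict b,
                  const double *restrict c, double alpha, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + alpha * c[i];
}
```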
At the SC conference in Portland, Oregon, HPCC listed its first Pflop/s machine – the Cray XT5 called Jaguar from Oak Ridge National Laboratory. The Class 2 award, by solely focusing on productivity, introduced a subjectivity factor into the judging and also into the submission criteria regarding what was appropriate for the contest. As a result, a wide range of solutions were submitted, spanning various programming languages (interpreted and compiled) and paradigms (with explicit and implicit parallelism). The Class 2 contest featured openly available as well as proprietary technologies, some of which were arguably confined to niche markets and some of which were widely used. The financial incentives for entering turned out to be all but needless, as HPCC seemed to have gained enough recognition within the high-end community to elicit entries even without the monetary assistance. (HPCwire provided both press coverage and cash rewards for the four winning contestants in Class 1 and the single winner in Class 2.) At HPCC's second birds-of-a-feather session during the SC conference in Seattle, the former class was dominated by IBM's BlueGene/L at Lawrence Livermore National Lab, while the latter class was split among MTA pragma-decorated C and UPC codes from Cray and IBM, respectively.
The Benchmark Tests' Details
Extensive discussion and various implementations of the HPCC tests are available elsewhere [, , ]. However, for the sake of completeness, this section provides the most important facts pertaining to the HPCC tests' definitions. All calculations use double-precision floating-point numbers as described by the IEEE 754 standard [], and no mixed-precision calculations [] are allowed. All the tests are designed so that they will run on an arbitrary number of processors (usually denoted as p). Figure shows a more detailed definition of each of the seven tests included in HPCC. In addition, it is possible to run the tests in one of three testing scenarios to stress various hardware components of the system. The scenarios are shown in Fig. ; a short MPI-level sketch follows the figure. In the "Single" scenario, only one process is chosen to run the test; accordingly, the remaining processes remain idle and so does the interconnect (shown with strike-out font in the figure). In the "Embarrassingly Parallel" scenario, all processes run the tests simultaneously but they do not communicate with each other. And finally, in the "Global" scenario, all components of the system work together on all tests.
HPC Challenge Benchmark. Fig. Testing scenarios of the HPCC components (Single: only one of the processes P1…PN is active; Embarrassingly Parallel: all processes active but no interconnect traffic; Global: all processes and the interconnect active)
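In MPI terms, the three scenarios differ only in which ranks execute a kernel and whether they cooperate. The following is a hedged sketch (the kernel functions are placeholders, and MPI is assumed to be initialized already), not code from the HPCC reference implementation:

```c
#include <mpi.h>

extern void run_local_kernel(void);            /* placeholder: per-process work */
extern void run_global_kernel(MPI_Comm comm);  /* placeholder: cooperative work */

void run_scenarios(void)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Single: one process works, the others (and the interconnect) stay idle. */
    if (rank == 0)
        run_local_kernel();
    MPI_Barrier(MPI_COMM_WORLD);

    /* Embarrassingly Parallel: every process runs the kernel independently. */
    run_local_kernel();
    MPI_Barrier(MPI_COMM_WORLD);

    /* Global: all processes cooperate on one instance of the test. */
    run_global_kernel(MPI_COMM_WORLD);
}
```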
Benchmark Submission Procedures and Results The reference implementation of the benchmark may be obtained free of charge at the benchmark’s web site (http://icl.cs.utk.edu/hpcc/). The reference implementation should be used for the base run: it is written in a portable subset of ANSI C [] using a hybrid programming model that mixes OpenMP [, ] threading with MPI [–] messaging. The installation of the software requires creating a script file for Unix’s make() utility. The distribution archive comes with script files for many common computer architectures. Usually, a few changes to any of these files will produce the script file for a given platform. The HPCC rules allow only standard system compilers and libraries to be used through their supported and documented interface, and the build procedure should be described at submission time. This ensures repeatability of the results and serves as an educational tool for end users who wish to use a similar build process for their applications. After a successful compilation, the benchmark is ready to run. However, it is recommended that changes be made to the benchmark’s input file that describes the sizes of data to use during the run. The sizes should reflect the available memory on the system and the number of processors available for computations. There must be one baseline run submitted for each computer system entered in the archive. An optimized run for each computer system may also be submitted. The baseline run should use the reference implementation of HPCC, and in a sense it represents the scenario when an application requires use of legacy code – a code that cannot be changed. The optimized run allows the submitter to perform more aggressive optimizations and use system-specific programming techniques (languages, messaging libraries, etc.), but at the same time still includes the verification process enjoyed by the base run. All of the submitted results are publicly available after they have been confirmed by email. In addition to the various displays of results and exportable raw data, the HPCC website also offers a kiviat chart display to visually compare systems using multiple performance numbers at once. A sample chart that uses actual HPCC results data is shown in Fig. .
HPC Challenge Benchmark. Fig. Sample Kiviat diagram of results for three different interconnects (RapidArray, Quadrics, GigE) that connect the same processors (64 AMD Opteron 2.2 GHz processors; axes: G-HPL, G-PTRANS, G-FFT, G-RandomAccess, S-DGEMM, S-STREAM Triad, RandomRing Bandwidth, RandomRing Latency)
Related Entries
Benchmarks
LINPACK Benchmark
Livermore Loops
TOP500
Bibliography . ANSI/IEEE Standard – () Standard for binary floating point arithmetic. Technical report, Institute of Electrical and Electronics Engineers, . Chandra R, Dagum L, Kohr D, Maydan D, McDonald J, Menon R () Parallel programming in OpenMP. Morgan Kaufmann Publishers, San Francisco, . Dongarra JJ () Performance of various computers using standard linear equations software. Computer Science Department. Technical Report, University of Tennessee, Knoxville, TN, April . Up-to-date version available from http://www.netlib.org/ benchmark/ . Dongarra JJ, Luszczek P, Petitet A () The LINPACK benchmark: past, present, and future. In: Dou Y, Gruber R, Joller JM () Concurrency and computation: practice and experience, vol , pp –
. Dongarra J, Luszczek P () Introduction to the HPC challenge benchmark suite. Technical Report UT-CS--, University of Tennessee, Knoxville . Kepner J () HPC productivity: an overarching view. Int J High Perform Comput Appl ():– . Kernighan BW, Ritchie DM () The C Programming Language. Prentice-Hall, Upper Saddle River, New Jersey . Luszczek P, Dongarra J () High performance development for high end computing with Python Language Wrapper (PLW). Int J High Perfoman Comput Appl. Accepted to Special Issue on High Productivity Languages and Models . Langou J, Langou J, Luszczek P, Kurzak J, Buttari A, Dongarra J () Exploiting the performance of bit floating point arithmetic in obtaining bit accuracy. In: Proceedings of SC, Tampa, Florida, Nomveber – . See http://icl.cs.utk.edu/ iter-ref . Moore GE () Cramming more components onto integrated circuits. Electronics ():– . Message Passing Interface Forum () MPI: A Message-Passing Interface Standard. The International Journal of Supercomputer Applications and High Performance Computing (/):– . Message Passing Interface Forum () MPI: A Message-Passing Interface Standard (version .), . Available at: http://www. mpi-forum.org/
. Message Passing Interface Forum () MPI-: Extensions to the Message-Passing Interface, July . Available at http://www. mpi-forum.org/docs/mpi-.ps . Meuer HW, Strohmaier E, Dongarra JJ, Simon HD () TOP Supercomputer Sites, th edn. November . (The report can be downloaded from http://www.netlib.org/ benchmark/top.html) . Nadya Travinin and Jeremy Kepner () pMatlab parallel Matlab library. International Journal of High Perfomance Computing Applications ():– . OpenMP: Simple, portable, scalable SMP programming. http:// www.openmp.org/
HPF (High Performance Fortran)
High Performance Fortran (HPF) is an extension of Fortran for parallel programming. In HPF programs, parallelism is represented as data-parallel operations in a single thread of execution. The HPF extensions included statements to specify data distribution, data alignment, and processor topology, which were used for the translation of HPF codes into an SPMD message-passing form.
Bibliography . Kennedy K, Koelbel C, Zima H () The rise and fall of high performance Fortran: an historical object lesson. In: Proceedings of the third ACM SIGPLAN conference on History of programming languages (HOPL III), ACM, New York, pp. -–-, doi:./., http://doi.acm.org/./ . . High Performance Fortran Forum Website. http://hpff.rice.edu
HPS Microarchitecture
Yale N. Patt
The University of Texas at Austin, Austin, TX, USA
Synonyms The high performance substrate
Definition
The microarchitecture specified by Yale Patt, Wen-mei Hwu, Stephen Melvin, and Michael Shebanow for implementing high-performance microprocessors. It achieves high performance via aggressive branch prediction, speculative execution, wide issue, and out-of-order execution, while retaining the ability to handle precise exceptions via in-order retirement.
Discussion
The High Performance Substrate (HPS) was the name given to the microarchitecture conceived by Professor Yale Patt and his three PhD students, Wen-mei Hwu, Michael Shebanow, and Stephen Melvin, at the University of California, Berkeley, and first published at Micro in October [, ]. Its goal was high-performance processing of single-instruction streams by combining aggressive branch prediction, speculative execution, wide issue, dynamic scheduling (out-of-order execution), and retirement of instructions in program order (i.e., in order). Out-of-order execution had appeared in previous machines from Control Data [] and IBM [], but had pretty much been dismissed as a nonviable mechanism due to its lack of in-order retirement, which prevented the processor from implementing precise exceptions. The checkpoint retirement mechanisms of HPS removed that problem [, ]. It should also be noted that solutions to the precise exception problem were being developed simultaneously and independently by James Smith and Andrew Pleszkun []. HPS was first targeted for the VAX instruction set architecture and demonstrated that an HPS implementation of the VAX could process instruction streams at a rate of three cycles per instruction (CPI), as compared to the VAX-/, which processed at the rate of . CPI []. Instructions are processed as follows. Using an aggressive branch predictor and a wide-issue fetch/decode mechanism, multiple instructions are fetched each cycle, decoded into data flow graphs (one per instruction), and merged into a global data flow graph containing all instructions in process. Instructions are scheduled for execution when their flow dependencies (RAW hazards) have been satisfied, and they execute speculatively and out of order with respect to program order. Results produced by these
instructions are stored temporarily in a results buffer (aka reorder buffer) until they can be retired in order. The essence of the paradigm is that the global data flow graph consists of nodes corresponding to micro-operations and edges corresponding to linkages between micro-ops that produce operands and micro-ops that source them. The edges of the data flow graph produced as a result of decode correspond to internal linkages within an instruction. Edges created as a result of merging an individual instruction's data flow graph into the global data flow graph correspond to linkages between live-outs of one instruction and live-ins of a subsequent instruction. A Register Alias Table was conceived to maintain correct linkages between live-outs and live-ins. A node, corresponding to a micro-op, is available for execution when all its flow dependencies are resolved. The HPS research group refers to the paradigm as Restricted Data Flow (RDF), since at no time does the data flow graph for the entire program exist. Rather, the size of the global data flow graph is increased every cycle as a result of new instructions being decoded and merged, and decreased every cycle as a result of old instructions retiring. At every point in time, only those instructions in the active window – the set of instructions that have been fetched but not yet retired – are present in the data flow graph. The instructions in the active window are often referred to as "in-flight" instructions. The number of in-flight instructions is orders of magnitude smaller than the size of a data flow graph for the entire program. The result is data flow processing of a program without incurring any of the problems of classical data flow. Since , the HPS microarchitecture has seen continual development and improvement by many research groups at many universities and industrial labs. The basic paradigm has been adopted for most cutting-edge high-performance microprocessors, starting with Intel on its Pentium Pro microprocessor in the late 1990s [].
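A highly simplified sketch of the renaming step the Register Alias Table performs when an instruction's data flow graph is merged into the global one is shown below. The structures and field names are invented for illustration; a real HPS-style design tracks considerably more state (ready bits, checkpoints, reorder-buffer entries).

```c
#define NUM_ARCH_REGS 32

/* Maps each architectural register to the tag of the in-flight micro-op
 * that will produce its newest value (-1 means the value is already in
 * the committed register file). */
typedef struct {
    int producer_tag[NUM_ARCH_REGS];
} RegisterAliasTable;

typedef struct {
    int tag;          /* unique id of this micro-op in the active window     */
    int src_tag[2];   /* producers of the two source operands (-1 = ready)   */
    int dst_reg;      /* architectural destination register                  */
} MicroOp;

/* Merge one decoded micro-op into the global data flow graph: its sources
 * are linked to the live-outs recorded in the RAT, and the RAT is updated
 * so later micro-ops see this one as the live-out for dst_reg. */
void rename_micro_op(RegisterAliasTable *rat, MicroOp *u,
                     int src_reg0, int src_reg1)
{
    u->src_tag[0] = rat->producer_tag[src_reg0];
    u->src_tag[1] = rat->producer_tag[src_reg1];
    rat->producer_tag[u->dst_reg] = u->tag;
}
```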
Bibliography
. Patt YN, Hwu W, Shebanow M () HPS, a new microarchitecture: rationale and introduction. In: Proceedings of the th microprogramming workshop, Asilomar, CA
. Patt YN, Melvin S, Hwu W, Shebanow M () Critical issues regarding HPS, a high performance microarchitecture. In: Proceedings of the th microprogramming workshop, Asilomar, CA
. Thornton JE () Design of a computer – the Control Data . Scott, Foresman and Co. Glenview, IL
. Anderson DW, Sparacio FJ, Tomasulo RM () The IBM system/ model : machine philosophy and instruction-handling. IBM J Res Development ():–
. Hwu W, Patt Y () HPSm, a high performance restricted data flow architecture having minimal functionality. In: Proceedings, th annual international symposium on computer architecture, Tokyo
. Hwu W, Patt Y () Checkpoint repair for high performance out-of-order execution machines. IEEE Trans Computers ():–
. Smith JE, Pleszkun A () Implementing precise interrupts. In: Proceedings, th annual international symposium on computer architecture, Boston, MA
. Hwu W, Melvin S, Shebanow M, Chen C, Wei J, Patt Y () An HPS implementation of VAX; initial design and analysis. In: Proceedings of the Hawaii international conference on systems sciences, Honolulu, HI
. Colwell R () The pentium chronicles: the people, passion, and politics behind intel's landmark chips. Wiley-IEEE Computer Society Press, NJ, ISBN: ----

Definition
Most high performance computing systems are clusters of shared-memory nodes. Hybrid parallel programming handles distributed-memory parallelization across the nodes and shared-memory parallelization within a node.
HT HyperTransport
HT. HyperTransport
Hybrid Programming With SIMPLE
Guojing Cong (IBM, Yorktown Heights, NY, USA)
David A. Bader (Georgia Institute of Technology, Atlanta, GA, USA)
SIMPLE refers to the joining of the SMP and MPI-like message passing paradigms [] and to the simple programming approach. It provides a methodology for programming clusters of SMP nodes. It advocates a hybrid methodology which maps directly to underlying architectural aspects: SIMPLE combines shared-memory programming on the shared-memory nodes with message-passing communication between these nodes. SIMPLE provides (1) a complexity model and a set of efficient communication primitives for SMP nodes and clusters; (2) a programming methodology for clusters of SMPs which is both efficient and portable; and (3) high-performance algorithms for sorting integers, constraint-satisfied searching, and computing the two-dimensional FFT.
The SIMPLE Computational Model
A simple paradigm is used in SIMPLE for designing efficient and portable parallel algorithms. The architecture consists of a collection of SMP nodes interconnected by a communication network (as shown in Fig. ) that can be modeled as a complete graph on which communication is subject to the restrictions imposed by the latency and the bandwidth properties of the network. Each SMP node contains several identical processors, each typically with its own on-chip cache and a larger off-chip cache, which have uniform access to a shared memory and other resources such as the network interface. The parameter r is used to represent the number of symmetric processors per node (see Fig. for a diagram of a typical node). Notice that each CPU typically has its own on-chip cache (L1) and a larger off-chip level-two cache (L2), which can be tightly integrated into the memory system to provide fast memory accesses and cache coherence. The shared-memory programming of each SMP node is based on threads which communicate via coordinated accesses to shared memory. SIMPLE provides several primitives that synchronize the threads at a barrier, enable one thread to broadcast data to the other threads, or calculate reductions across the threads. In SIMPLE, only the CPUs of a given node have access to that node's configuration. In this manner, there is no restriction that all nodes must be identical, and certainly a configuration can be constructed from SMP nodes of different sizes. Thus, the number of threads on a specific remote node is not globally available. Because of this, SIMPLE supports only node-oriented communication, meaning that communication is restricted such that, given any source node s and destination node d, with s ≠ d, only one thread on node s can send (receive) a message to (from) node d at any given time.
Complexity Model
In the SIMPLE complexity model, each SMP is viewed as a two-level hierarchy for which good performance requires both good load distribution and the minimization of secondary memory accesses. The cluster is viewed as a collection of powerful processors connected by a communication network. Maximizing performance on the cluster requires both efficient load balancing and regular, balanced communication. Hence, our performance model combines two separate but complementary models.
Hybrid Programming With SIMPLE. Fig. Cluster of processing elements (p processing elements P0, P1, …, Pp−1 connected by an interconnection network)
Hybrid Programming With SIMPLE. Fig. A typical symmetric multiprocessing (SMP) node used in a cluster (r CPUs and a network interface attached to a shared memory through a bus or switching network). L1 is on-chip level-one cache, and L2 is off-chip level-two cache
The SIMPLE model recognizes that efficient algorithm design requires the efficient decomposition of the problem among the available processors, and so, unlike some other models for hierarchical memory, the cost of computation is included in the complexity. The cost model also encourages the exploitation of temporal and spatial locality. Specifically, memory at each SMP is seen as consisting of two levels: cache and main memory. A block of m contiguous words can be read from or written to main memory in (є + mr/α) time, where є is the latency of the bus, r is the number of processors competing for access to the bus, and α is the bandwidth. By contrast, the transfer of m noncontiguous words would require m(є + r/α) time. A parallel algorithm is viewed as a sequence of local SMP computations interleaved with communication steps, where computation and communication are allowed to overlap. Assuming no congestion, the transfer of a block consisting of m words between two nodes takes (τ + m/β) time, where τ is the latency of the network and β is the bandwidth per node. SIMPLE assumes that the bisection bandwidth is sufficiently high to support block permutation routings among the p nodes at the rate of β. In particular, for any subset of q nodes, a block permutation among the q nodes takes (τ + m/β) time, where m is the size of the largest block. Using this cost model, the communication time Tcomm(n, p) of an algorithm can be viewed as a function of the input size n, the number of nodes p, and the parameters τ and β. The overall complexity of an algorithm for the cluster, T(n, p), is given by the sum of Tsmp and Tcomm(n, p).
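For reference, the pieces of the cost model can be collected as follows; this is only a restatement of the expressions above (with є written as epsilon), not an extension of the model.

```latex
\begin{align*}
T_{\mathrm{mem}}(m) &= \epsilon + \frac{m\,r}{\alpha}
  && \text{read/write of $m$ contiguous words, $r$ CPUs sharing the bus}\\
T_{\mathrm{net}}(m) &= \tau + \frac{m}{\beta}
  && \text{transfer of an $m$-word block between two nodes}\\
T(n,p) &= T_{\mathrm{smp}}(n,p) + T_{\mathrm{comm}}(n,p)
  && \text{overall complexity of a SIMPLE algorithm}
\end{align*}
```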
Communication Primitives
The communication primitives are grouped into three modules: the Internode Communication Library (ICL), SMP Node, and SIMPLE. ICL communication primitives handle internode communication, SMP Node primitives aid shared-memory node algorithms, and SIMPLE primitives combine SMP Node with ICL on SMP clusters. The ICL communication library services internode communication and can use any vendor-supplied or freely available thread-safe implementation of MPI. The ICL libraries are based upon a reliable, application-layer send and receive primitive, as well as a send-and-receive primitive which handles the exchanging of messages between
sets of nodes where each participating node is the source and destination of one message. The library also provides a barrier operation based upon the send and receive which halts the execution at each node until all nodes check into the barrier, at which time, the nodes may continue execution. In addition, ICL includes collective communication primitives, for example, scan, reduce, broadcast, allreduce, alltoall, alltoallv, gather, and scatter. See [].
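A hedged sketch of the node-oriented discipline the ICL layer enforces is shown below: internode traffic between a given pair of nodes goes through one designated thread on each side, here expressed with a thread-safe MPI. The function and tag names are illustrative, not the actual ICL interface.

```c
#include <mpi.h>
#include <stddef.h>

#define ICL_TAG 0

/* Node-oriented exchange: for a given pair of nodes, only one thread on
 * each side takes part, so at most one message per node pair is in flight.
 * MPI is assumed to be initialized with MPI_THREAD_MULTIPLE (or called
 * from a single communicating thread per node). */
void icl_sendrecv(void *sendbuf, void *recvbuf, size_t bytes, int peer_node)
{
    MPI_Sendrecv(sendbuf, (int)bytes, MPI_BYTE, peer_node, ICL_TAG,
                 recvbuf, (int)bytes, MPI_BYTE, peer_node, ICL_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```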
for ( ≤ j ≤ L − ), each of kj processors writes k unique copies of the data for its k children. The time complexity of this SMP replication algorithm given an m-word buffer is kj m ) α j= m = k(L − )є + (r − ) α L−
Tsmp = ∑ k(є +
≤ k(logk (r(k − ) + ) − ) є + ≤ k(logk (r + /k)) є +
SMP Node The SMP Node Library contains important primitives for an SMP node: barrier, replicate, broadcast, scan, reduce, and allreduce, whereby on a single node, barrier synchronizes the threads, broadcast ensures that each thread has the most recent copy of a shared memory location, scan (reduce) performs a prefix (reduction) operation with a binary associative operator (e.g., addition, multiplication, maximum, minimum, bitwiseAND, and bitwise-OR) with one datum per thread, and allreduce replicates the result from reduce. Each of these collective SMP primitives can be implemented using a fan-out or fan-in tree constructed as follows. A logical k-ary balanced tree is built for an SMP node with r processors, which serves as both a fanout and fan-in communication pattern. In a k-ary tree, level has one processor, level has k processors, level has k processors, and so forth, with level j containing kj processors, for ≤ j ≤ L − ), where there are L levels in the tree. If logk r is not an integer, then the last level (L − ) will hold less than kL− processors. Thus, the number of processors r is bounded by L−
L−
j=
j=
j j ∑ k < r ≤ ∑k .
()
Solving for the number of levels L, it is easy to see that L = ⌈logk (r(k − ) + )⌉
()
where the ceiling function ⌈x⌉ returns the smallest integer greater than or equal to x. An efficient algorithm for replicating a data buffer B such that each processor i, ( ≤ i ≤ r − ), receives a unique copy Bi of B makes use of the fanout tree, with the source processor situated as the root node of the k-ary tree. During step j of the algorithm,
= O((log r)є +
mr ). α
m (r − ) α
m (r − ) α
()
The best choice of k, ( ≤ k ≤ r − ), depends on SMP size r and machine parameters є and α, but can be chosen to minimize Eq. . An algorithm which barrier synchronizes r SMP processors can use a fan-in tree followed by a fanout tree, with a unit message size m, taking twice the replication time, namely, O((log r)є + r/α). For certain SMP algorithms, it may not be necessary to replicate data, but to share a read-only buffer for a given step. A broadcast SMP primitive supplies each processor with the address of the shared buffer by replicating the memory address in Tsmp = O((log r)є + r/α). A reduce primitive, which performs a reduction with a given binary associate operator, uses a fan-in tree, combining partial sums during each step. For initial data arrays of size m per processor, this takes O((log r)є + mr/α). The allreduce primitive performs a reduction followed by replicate so that each processor receives a copy of the result with a cost ). of O((log r)є + mr α Scans (also called prefix-sums) are defined as follows. Each processor i, ( ≤ i ≤ r − ), initially holds an element ai , and at the conclusion of this primitive, holds the prefix-sum bi = a ∗ a ∗ . . . ∗ ai , where ∗ is any binary associative operator. An SMP algorithm similar to the PRAM algorithm (e.g., []) is employed which uses a binary tree for finding the prefix-sums. Given an array of elements A of size r = d where d is a nonnegative integer, the output is array C such that C(, i) is the ith prefix-sum, for ( ≤ i ≤ r − ). In fact, arrays A, B, and C in the SMP prefix-sum algorithm (Alg. ) can be the same array. The analysis log r is as follows. The first for loop takes ∑h= (є + rh mα ) ,
Hybrid Programming With SIMPLE log r
) (є + rh mα ) and the second for loop takes ∑h= ( + for a total complexity of Tsmp ≤ є log r + mr = α ). O((log r)є + mr α
Simple Finally, the SIMPLE communication library, built on top of ICL and SMP Node, includes the primitives for the SIMPLE model: barrier, scan, reduce, broadcast, allreduce, alltoall, alltoallv, gather, and scatter. These hierarchical layers of our communication libraries are pictured in Fig. . The SMP Node, ICL, and SIMPLE libraries are implemented at a high level, completely in user space
Algorithm SMP scan algorithm for processor i, ( ≤ i ≤ r − ) and binary associative operator ∗ set B(, i) = A(i). for h = to log r do if ≤ i ≤ r/h then set B(h, i) = B(h − , i − ) ∗ B(h − , i). for h = log r downto do if ≤ i ≤ r/h then ⎧ ⎪ ⎪ If i even, Set C(h, i) = C(h + , i/); ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨If i = , Set C(h, ) = B(h, ); ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ If i odd, Set C(h, i) = C(h + , (i − )/) ∗ B(h, i). ⎪ ⎩ SIMPLE Communication Library Barrier, Scan, Reduce, Broadcast, Allreduce, Alltoall, Alltoallv, Gather, Scatter Internode Communication Library
SMP Node Library
Scan, Reduce, Broadcast, Allreduce, Alltoall, Alltoallv, Gather, Scatter
Broadcast
Barrier
Reduce
Scan Barrier Send
Recv
SendRecv
Replicate
Hybrid Programming With SIMPLE. Fig. Hierarchy of SMP, message passing, and SIMPLE communication libraries
H
(see Fig. ). Because no kernel modification is required, these libraries easily port to new platforms. As mentioned previously, the number of threads per node can vary, along with machine size. Thus, each thread has a small set of context information which holds such parameters as the number of threads on the given node, the number of nodes in the machine, the rank of that node in the machine, and the rank of the thread both () on the node and () across the machine. Table describes these parameters in detail. Because the design of the communication libraries is modular, it is easy to experiment with different implementations. For example, the ICL module can make use of any of the freely available or vendorsupplied thread-safe implementations of MPI, or a small communication kernel which provides the necessary message passing primitives. Similarly, the SMP Node primitives can be replaced by vendor-supplied SMP implementations.
The Alltoall Primitive One of the most important collective communication events is the alltoall (or transpose) primitive which transmits regular sized blocks of data between each pair of nodes. More formally, given a collection of p nodes each with an m element sending buffer, where p divides m, the alltoall operation consists of each node i sending its jth block of mp data elements to node j, where node j stores the data from i in the ith block of its receiving buffer, for all ( ≤ i, j ≤ p − ). An efficient message passing implementation of alltoall would be as follows. The notation “var i ” refers to memory location “var + ( mp ∗ i),” and src and dst point to the source and destination arrays, respectively. To implement this algorithm (Alg. ), multiple threads (r ≤ p) per node are used. The local memory copy trivially can be performed concurrently by one thread while the remaining threads handle the internode communication as follows. The p − iterations of the loop are partitioned in a straightforward manner to the remaining threads. Each thread has the information necessary to calculate its subset of loop indices, and thus, this loop partitioning step requires no synchronization overheads. The complexity of this primitive is twice (є + mr αr ) for the local memory read
H
H
Hybrid Programming With SIMPLE
User Program
SIMPLE
Message Passing ICL (e.g. MPI)
SMP Node Library POSIX threads
User Libraries User Space Kernel Space
Kernel
Hybrid Programming With SIMPLE. Fig. User code can access SIMPLE, SMP, message passing, and standard user libraries. Note that SIMPLE operates completely in user space Hybrid Programming With SIMPLE. Table The local context parameters available to each SIMPLE thread Parameter
Description
NODES = p
Total number of nodes in the cluster
MYNODE
My node rank, from to NODES −
THREADS = r
Total number of threads on my node
MYTHREAD
The rank of my thread on this node, from to THREADS −
TID
Total number of threads in the cluster
ID
My thread rank, with respect to the cluster, from to TID −
Algorithm SIMPLE Alltoall primitive copy the appropriate mp elements from srcMYNODE to dstMYNODE. for i = to NODES − do set k = MYNODE ⊕ i; send mp elements from srck to node k, and receive mp elements from node k to dstk .
and writes, and (τ +
m ) β
for a total cost of O(τ +
for internode communication, m β
+є+
m ). α
Computation Primitives SIMPLE computation primitives do not communicate data but affect a thread’s execution through () loop parallelization, () restriction, or () shared memory management. Basic support for data parallelism, that is, “parallel do” concurrent execution of loops across processors on one or more nodes, is provided.
Data Parallel The SIMPLE methodology contains several basic “pardo” directives for executing loops concurrently on one or more SMP nodes, provided that no dependencies exist in the loop. Typically, this is useful when an independent operation is to be applied to every location in an array, for example, in the element-wise addition of two arrays. Pardo implicitly partitions the loop to the threads without the need for coordinating overheads such as synchronization or communication between processors. By default, pardo uses block partitioning of the loop assignment values to the threads, which typically results in better cache utilization due to the array locations on left-hand side of the assignment being owned by local caches more often than not. However, SIMPLE explicitly provides both block and cyclic partitioning interfaces for the pardo directive. Similar mechanisms exist for parallelizing loops across nodes. The all_pardo_cyclic (i, a, b) directive will cyclically assign each iteration of the loop
Hybrid Programming With SIMPLE
across the entire collection of processors. For example, i = a will be executed on the first processor of the first node, i = a + on the second processor of the first node, and so on, with i = a + r − on the last processor of the first node. The iteration with i = a + r is executed by the first processor on the second node. After r ⋅ p iterations are assigned, the next index will again be assigned to the first processor on the first node. A similar directive called all_pardo_block, which accepts the same arguments, assigns the iterations in a block fashion to the processors; thus, the first b−a iterrp ations are assigned to the first processor, the next block of iterations are assigned to the second processor, and so forth. With either of these SIMPLE directives, each processor will execute at most ⌈ rpn ⌉ iterations for a loop of size n.
H
Memory Management Finally, shared memory allocations are the third category of SIMPLE computation primitives. Two directives are used: . node_malloc for dynamically allocating a shared structure . node_free for releasing this memory back to the heap The node_malloc primitive is called by all threads on a given node, and takes as a parameter the number of bytes to be allocated dynamically from the heap. The primitive returns to each thread a valid pointer to the shared memory location. In addition, a thread may allow others to access local data by broadcasting the corresponding memory address. When this shared memory is no longer required, the node_free primitive releases it back to the heap.
Control The second category of SIMPLE computation primitives control which threads can participate in the context by using restrictions. Table defines each control primitive and gives the largest number of threads able to execute the portion of the algorithm restricted by this statement. For example, if only one thread per node needs to execute a command, it can be preceded with the on_one_thread directive. Suppose data has been gathered to a single node. Work on this data can be accomplished on that node by preceding the statement with on_one_node. The combination of these two primitives restricts execution to exactly one thread, and can be shortcut with the on_one directive.
SIMPLE Algorithmic Design Programming Model The user writes an algorithm for an arbitrary cluster size p and SMP size r (where each node can assign possibly different values to r at runtime), using the parameters from Table . SIMPLE expects a standard main function (called SIMPLE_main() ) that, to the user’s view, is immediately up and running on each thread. Thus, the user does not need to make any special calls to initialize the libraries or communication channels. SIMPLE makes available the rank of each thread on its node or across the cluster, and algorithms can use these ranks in a straightforward fashion to break symmetries and
Hybrid Programming With SIMPLE. Table Subset of SIMPLE control primitives Control Primitives Max number of participating threads
MYNODE restriction
Primitive
Definition
on_one_thread
only one thread per node p
on_one_node
all threads on a single node
r
on_one
only one thread on a single node
on_thread(i)
one thread (i) per node
p
on_node(j)
all threads on node j
r
MYTHREAD restriction
i
j
H
Hybrid Programming With SIMPLE
partition work. The only argument of SIMPLE_main() is “THREADED,” a macro pointing to a private data structure which holds local thread information. If the user’s main function needs to call subroutines which make use of the SIMPLE library, this information is easily passed via another macro “TH” in the calling arguments. After all threads exit from the main function, the SIMPLE code performs a shut down process.
Runtime Support When a SIMPLE algorithm first begins its execution, the SIMPLE runtime support has already initialized parallel mechanisms such as barriers and established the network-based internode communication channels which remain open for the life of the program. The various libraries described in Sect.“The SIMPLE
Thread (p-1, r-1)
Thread (p-1, 1)
Thread (p-1, 0)
Node p-1
Thread (1, r-1)
Thread (1, 1)
Thread (1, 0)
Master Thread 1
Node 1
Thread (0, r-1)
Thread (0, 1)
Thread (0, 0)
Node 0
Computational Model” have runtime initializations which take place as follows. The runtime start-up routines for a SIMPLE algorithm are performed in two steps. First, the ICL initialization expands computation across the nodes in a cluster by launching a master thread on each of the p nodes and establishing communication channels between each pair of nodes. Second, each master thread launches r user threads, where each node is at least an r-way SMP. (A rule of thumb in practice is to use r threads on an r + -way SMP node, which allows operating system tasks to fully utilize at least one CPU.) It is assumed that the r CPUs concurrently execute the r threads. The thread flow of an example SIMPLE algorithm is shown in Fig. . As mentioned previously, our methodology supports
Master Thread p-1
H
Master Thread 0
Communication Phase (Collective)
Node Barrier
Time
Irregular Communication using sends and receives
Hybrid Programming With SIMPLE. Fig. Example of a SIMPLE algorithm flow of master and compute-based user threads. Note that the only responsibility of each master thread is only to launch and later join user threads, but never to participate in computation or communication
Hybrid Programming With SIMPLE
only node-oriented communication, that is, given any source node s and destination node d, with s ≠ d, only one thread on node s can send (receive) a message to (from) node d during a communication step. Also note that the master thread does not participate in any computation, but sits idle until the completion of the user code, at which time it coordinates the joining of threads and exiting of processes. The programming model is simply implemented using a portable thread package called POSIX threads (pthreads).
A Possible Approach The latency for message passing is an order of magnitude higher than accessing local memory. Thus, the most costly operation in a SIMPLE algorithm is internode communication, and algorithmic design must attempt to minimize the communication costs between the nodes. Given an efficient message passing algorithm, an incremental process can be used to design an efficient SIMPLE algorithm. The computational work assigned to each node is mapped into an efficient SMP algorithm. For example, independent operations such as those arising in functional parallelism (e.g., independent I/O and computational tasks, or the local memory copy in the SIMPLE alltoall primitive presented in the previous section) or loop parallelism typically can be threaded. For functional parallelism, this means that each thread acts as a functional process for that task, and for loop parallelism, each thread computes its portion of the computation concurrently. Loop transformations may be applied to reduce data dependencies between the threads. Thread synchronization is a costly operation when implemented in software and, when possible, should be avoided.
Example: Radix Sort Consider the problem of sorting n integers spread evenly across a cluster of p shared-memory r-way SMP nodes, where n ≥ p . Fast integer sorting is crucial for solving problems in many domains, and as such, is used as a kernel in several parallel benchmarks such as NAS (Note that the NAS IS benchmark requires that the integers be ranked and not necessarily placed in sorted order.) [] and SPLASH []. Consider the problem of sorting n integer keys in the range [, M − ] that are distributed equally in the
H
shared memories of a p-node cluster of r-way SMPs. Radix Sort decomposes each key into groups of ρ-bit digits, for a suitably chosen ρ, and sorts the keys by applying a counting sort routine on each of the ρ-bit digits beginning with the digit containing the least significant bit positions []. Let R = ρ ≥ p. Assume (w.l.o.g.) that the number of nodes is a power of two, say p = k , and hence Rp is an integer = ρ−k ≥ .
SIMPLE Counting Sort Algorithm Counting Sort algorithm sorts n integers in the range [, R − ] by using R counters to accumulate the number of keys equal to the value i in bucket Bi , for ≤ i ≤ R − , followed by determining the rank of each key. Once the rank of each key is known, each key can be moved into its correct position using a permutation ( np -relation) routing [, ], whereby no node is the source or destination of more than np keys. Counting Sort is a stable sorting routine, that is, if two keys are identical, their relative order in the final sort remains the same as their initial order.
Algorithm SIMPLE Counting Sort Algorithm Step (): For each node i, ( ≤ i ≤ p − ), count the frequency of its np keys; that is, compute Hi[k] , the number of keys equal to k, for ( ≤ k ≤ R − ). Step (): Apply the alltoall primitive to the H arrays using the block size Rp . Hence, at the end of this step, each node will hold
R p
consecutive rows of H.
Step (): Each node locally computes the prefix-sums of its rows of the array H. Step (): Apply the (inverse) alltoall primitive to the R corresponding prefix-sums augmented by the total count for each bin. The block size of the alltoall primitive is Rp . Step (): On each node, compute the ranks of the np local elements using the arrays generated in Steps () and (). Step (): Perform a personalized communication of keys to rank location using an np -relation algorithm.
H
H
Hypercube
The pseudocode for our Counting Sort algorithm (Alg. ) uses six major steps and can be described as follows. In Step (), the computation can be divided evenly among the threads. Thus, on each node, each of r threads (A) histograms r of the input concurrently, and (B) merges these r histograms into a single array for node i. For the prefix-sum calculations on each node in Step (), since the rows are independent, each of r threads can compute the prefix-sum calculations for R rows concurrently. Also, the computation of ranks rp on each node in Step () can be handled by r threads, where each thread calculates rpn ranks of the node’s local elements. Communication can also be improved by replacing the message passing alltoall primitive used in Steps () and () with the appropriate SIMPLE primitive. The h-relation used in the final step of Counting Sort is a permutation routing since h = np , and was given in the previous section. Histogramming in Step (A) costs O(є + np α ) to read the input and O(є + R αr ) for each processor to write its histogram into main memory. Merging in Step (B) uses an SMP reduce with cost O((log r)є + R αr ). SIMPLE alltoall in Step () and the inverse alltoall in Step () take O(τ + Rβ + є + Rα ) time. Computing local prefix-sums in Step () costs O(є +
R r p ). Ranking each element in pr α Step () takes O(є + np α ) time. Finally, the SIMPLE permutation with h = np costs O(τ + np ( β + α + єr )), for np ≥ r ⋅ max(p, log r). Thus, the total com-
plexity for Counting Sort, assuming that n ≥ p ⋅ max(R, r⋅max(p, log r)) is O(τ + np ( β +
r α
+ єr )).
()
works on ρ-bit digits of the input keys, starting from the least significant digit of ρ bits to the most significant digit. Radix Sort easily can be adapted for clusters of SMPs by using the SIMPLE Counting Sort routine. Thus, the total complexity for Radix Sort, assuming that n ≥ p⋅max(R, r⋅max(p, log r)) is O( bρ (τ + np ( β +
+ єr ))).
()
Bibliography . Bader DA () On the design and analysis of practical parallel algorithms for combinatorial problems with applications to image processing, PhD thesis, Department of Electrical Engineering, University of Maryland, College Park . Bader DA, Helman DR, JáJá J () Practical parallel algorithms for personalized communication and integer sorting. Technical Report CS-TR- and UMIACS-TR--, UMIACS and Electrical Engineering. University of Maryland, College Park . Bader DA, Helman DR, JáJá J () Practical parallel algorithms for personalized communication and integer sorting. ACM J Exp Algorithmics ():–. www.jea.acm.org// BaderPersonalized/ . Bailey D, Barszcz E, Barton J, Browning D, Carter R, Dagum L, Fatoohi R, Fineberg S, Frederickson P, Lasinski T, Schreiber R, Simon H, Venkatakrishnan V, Weeratunga S () The NAS parallel benchmarks. Technical Report RNR--, Numerical Aerodynamic Simulation Facility. NASA Ames Research Center, Moffett Field . JáJá J () An Introduction to Parallel Algorithms. AddisonWesley Publishing Company, New York . Knuth DE () The Art of Computer Programming: Sorting and Searching, vol . Addison-Wesley Publishing Company, Reading . Message Passing Interface Forum. MPI () A message-passing interface standard. Technical report. University of Tennessee, Knoxville. Version . . Woo SC, Ohara M, Torrie E, Singh JP, Gupta A () The SPLASH- programs: Characterization and methodological considerations. In Proceeding of the nd Annual Int’l Symposium Computer Architecture, pp –
SIMPLE Radix Sort Algorithm The message passing Radix Sort algorithm makes several passes of the previous message passing Counting Sort in order to completely sort integer keys. Counting Sort can be used as the intermediate sorting routine because it provides a stable sort. Let the n integer keys fall in the range [, M − ], and M = b . Then bρ passes of Counting Sort is needed; each pass
r α
Hypercube Hypercubes and Meshes Networks, Direct
Hypercubes and Meshes
H
(vertices) and interconnect wires (edges) that are connecting the processors. Since high-speed data transfers normally require an unidirectional point-to-point Thomas M. Stricker connection, the resulting interconnection graphs are Zürich, Switzerland directed. In addition to the abstract connectivity graph, the optimal layout of processing elements and wires in physical space must be studied. In reference to the branch of mathematics dealing with invariant geometSynonyms Distributed computer; Distributed memory computers; ric properties in relation to different spaces, this specGeneralized meshes and tori; Hypercube; Interconnection ification is called the topology of a parallel computer network; k-ary n-cube; Mesh; Multicomputers; Multi- system. In geometry, the classical term cube refers to the processor networks regular convex hexahedron as one of the five platonic solids in three-dimensional space. Cube alike strucDefinition tures, beneath and beyond the three dimensions of a In the specific context of computer architecture, a hyper- physical cube can be listed as follows and are drawn in cube refers to a parallel computer with a common Fig. . regular interconnect topology that specifies the layout ● Zero dimensional: A dot in geometry or a nonof processing elements and the wiring in between them. connected uniprocessor in parallel computing. The etymology of the term suggests that a hypercube ● One dimensional: A geometric line segment or a is an unbounded, higher dimensional cube alike geotwin processor system, connected with two unidimetric structure, that is scaled beyond (greek “hyper”) rectional wires. the three dimensions of a platonic cube (greek “kubos”). ● Two dimensional: A geometric square or a four In its broader meaning, the term is also commonly processor array, connected with four unidirectional used to denote a genre of supercomputer-prototypes wires in the horizontal and four wires in the vertical and supercomputer-products, that were designed and direction. built in the time period of –, including the Cos● Three dimensional: A geometric cube (hexahedron) mic Cube, the Intel iPSC hypercube, the FPS T-Series, or eight processors connected with wires. and a similar machine manufactured by nCUBE corpo● Four dimensional: A tesseract in geometry or sixteen ration. A mesh-connected parallel computer is using a processors, arranged as two cubes with the eight regular interconnect topology with an array of multiple corresponding vertices linked by two unidirectional processing elements in one, two, or three dimensions. wires each. In a generalization from hypercubes and meshes to the ● n-dimensional: A hypercube as an abstract geometbroader class of k-ary n-cubes, the concept extends to ric figure, that becomes increasingly difficult to draw many more distributed-memory multi-computers with on paper or the arrangement of n processor with highly regular, direct networks. each processor having n wires to its n immediate neighbors in all n dimensions.
Hypercubes and Meshes
Discussion Introduction and Technical Background
History
In a distributed memory multicomputer with a direct network, the processing elements also serve as switching nodes in the network of wires connecting them. For a mathematical abstraction, the arrangement of processors and wires in a parallel machine is commonly expressed as a graph of nodes of processing elements
With the evolution of supercomputers from a single vector processor toward a collection of high performance microprocessors during the mid-s, a large amount of research work focused on the best possible topology for massively parallel machines. In that time period the majority of parallel systems were built from a large
H
H
Hypercubes and Meshes
1
1
3
0
0
2
0
Dot 0-cube uniprocessor
5 1
Line 1-cube twin proc.
Square 2-cube quad proc.
7 3
4 0
6 2
Cube 3-cube
... Hypercubes
Tesseract binary 4-cube
64-node hypercube binary 6-cube
Hypercubes and Meshes. Fig. Graphical rendering of dot ( dimensional), line (D), square (D), cube (D), and hypercubes (>D)
number of identical processing elements, each comprising a standard microprocessor, a dedicated semiconductor random access memory, a network interface and – in some cases – even a dedicated local storage disk in every node. All processing elements also served as switching points in the interconnect fabric. This evolution of technology in highly parallel computers resulted in the class of distributed memory parallel computers or multi-computers. The extensive research activity goes back to a special focus on geometry in the theory of parallel algorithms pre-dating the development of the first practical parallel systems. This led to a widespread belief, that the physical topology of parallel computing architectures had to reflect the communication pattern of the most common parallel algorithms known at the time. The obvious mathematical treatment of the question through graph theory resulted in countless theoretical papers that established many new topologies, mappings, emulations, and entire equivalence classes of topologies including meshes, hypercubes, and fat trees. The suggestions for a physical layout to connect the nodes of a parallel multicomputer range from simple one-dimensional arrays of processors to some fully interconnected graphs with direct wires from every node to every other node in the system. Hierarchical rings and busses were also considered. The many results
of the effort are compiled into a comprehensive text book []. During the golden age of the practical hypercubes (roughly during the years of –), it was assumed that the topology of high-dimensional binary hypercubes would result in the best network for the direct mapping of many well-known parallel algorithms. In a binary hypercube each dimension is populated with just two processing nodes as shown in Fig. . The wiring of P ⋆ log P unidirectional wires to connect P processors in hypercube topology seemed to be an optimal compromise between the minimal interconnect of just P wires for P processors in a ring and the complete interconnect with P⋆ (P-) wires in a fully connected graph (clique). In a hypercube or a mesh layout for a parallel system, the processors can only communicate data directly to a subset of neighboring processors and therefore require a mechanism for indirect communication through data forwarding at the intermediate processing nodes. Indirect communication can take place along pre-configured virtual channels or through the use of forwarded messages. Accordingly the practical hypercube and mesh systems are primarily designed for message passing as their programming model. However, most stream-based programming models can be supported efficiently with channel virtualization. Shared
Hypercubes and Meshes
memory programming models are often supported on top of a low level message passing software layer or by hardware mechanisms like directory-based caches that rely on messages for communication. Binary hypercube parallel computers like the Cosmic Cube [], the iPSC [, ] the FPS T Series [] or NCube [] were succeeded around by a series of distributed memory multi-computers with a two- or three-dimensional mesh/torus topology, such as the Ametek,Warp/iWarp, MassPar, JMachine, the Intel Paragon, Intel ASCI Red, Cray TD, TE, Red Storm, and XT/Seastar. Scaling to larger machines, the early binary hypercubes were at a severe disadvantage, because they required a high number of dimensions resulting in an unbounded number of ports for the switches at every node (due to the unbounded in/out degree of the hypercube topology). Something all the hypercubes and the more advanced mesh/torus architectures have in common is that they rely heavily on the message routing technologies developed for the early hypercubes. They are classified as generalized hypercubes (k-ary n-cubes) in the literature and therefore included in this discussion. By the turn of the century the concept of hypercubes was definitely superseded by the network of workstations (NOWs) and by the cluster of personal computers (COPs) that used a more or less irregular interconnect fabric built with dedicated network switches that are readily available from the global working technology of the Internet.
Significance of Parallel Algorithms and Their Communication Patterns A valid statement about the optimal interconnect layout in a parallel system can only be made under the assumption that there is a best suitable mapping between the M data elements of a large simulation problem and the P processing elements in a parallel computer. For simple, linear mappings (e.g., M/P elements on every processor in block or block/cyclic distribution) the communication patterns of operations between data elements translate into roughly the same communication pattern of messages between the processors in the parallel machine. For simplicity, we also assume M >> P and that M and P are powers of two. In an algorithm with
H
a hypercube communication pattern all communication activities take place between nodes that are direct neighbors in the layout of a hypercube. For the common number scheme of nodes this is between node pairs that differ by exactly one bit in their number. Every communication in a hypercube pattern therefore crosses just one wire in the hypercube machine.
Algorithms with Pure Hypercube Communication Patterns Several well-known and important algorithms do require a sequence of inter-node communication steps that follow a pure hypercube communication pattern. In most of these algorithms, the communication happens in steps along increasing or decreasing dimensions, e.g., the first communication step between neighbors whose number differs in the least significant bit followed by a next step until a last step with communication to neighbors differing in the most significant bit. Most parallel computers cannot use more than one or two links for communication at the same time due to bandwidth constraints within the processor node. Therefore, a true pipeline with more than one or two overlapping hypercube communication steps is rarely encountered in practical programs. One of the earliest and best-known algorithms with a pure hypercube communication pattern is the bitonic sorting algorithm that goes back to a generalized merge sort algorithm for parallel computers published in []. The communication graph in Fig. shows all the hypercube communication steps that are required to merge locally sorted subsequences into a globally sorted sequence on processors. The most significant and most popular algorithm communicating between nodes in a pure hypercube pattern is the classic calculation of a Fast Fourier Transform (FFT) over an one-dimensional array, with its data distributed among P processors in a simple block partitioning. During the Fourier transformation of a large distributed array a certain number of butterfly calculations can be carried out in local memory up to the point when the distance between the elements in the butterfly becomes large enough that processor boundaries are crossed and some inter-processor communication over
H
H
Hypercubes and Meshes
P0
P0
P1 P2
P1 P2
P3 P4
P3 P4
P5 P6
P5 P6
P7 P8
P7 P8
P9 P10
P9 P10
P11 P12
P11 P12
P13 P14
P13 P14 P15
P15 Step 1
Step 2,3
Step 4,5 and 6
Step 7,8,9 and 10
Hypercubes and Meshes. Fig. Batcher bitonic parallel sorting algorithm. A series of hypercube communication steps will merge locally sorted subsequences into one globally sorted array. The arrows indicate the direction of the merge, i.e., one processor gets the upper and the other processor gets the lower half of the merged sequence
the hypercube links is required. The distance of butterfly operations is always a power of two and therefore it maps directly to a hypercube. In the more advanced and multi-dimensional FFT library routines the communication operations between elements at the larger distances might eventually be reduced by dynamically rearranging the distribution of array elements during the calculation. Such a redistribution is typically performed by global data transpose operation that requires an all-to-all personalized communication (AAPC), i.e., individual and distinct data blocks are exchanged among all P processors. In general, fully efficient and scalable AAPC does require a scalable bandwidth interconnect, like the binary hypercube. In practice a talented programmer can optimize such collective communication primitives to run at maximal link speed for sufficiently large mesh- and torus-connected multi-computers (i.e., Cray TD/TE tori with up to nodes) [].
Algorithms with Next Neighbor Communication Pattern Most simulations in computational physics, chemistry, and biology are calculating some physical interactions (forces) between a large number of model nodes (i.e., particles or grid points) in three-dimensional physical space. Therefore, they only require a limited amount
of communication between the model nodes in at most three dimensions because the longer range forces can be summarized or omitted. Consequently the communication in the parallel code of an equation solver, based on relaxation techniques is also limited to nearby processor in two or three dimensions. Many calculations in natural science do not require communication in dimensions that are higher than three and there is no need to arrange processing elements of a parallel computer in hypercube topologies. Users in the field of scientific computing have recently introduced some optimizations that require increased connectivity. Modern physical simulations are carried out on irregular meshes, that are modeling the physical space more accurately by adapting the density of meshing points to some physical key parameter like the density of the material, the electric field, or a current flow. Irregular meshes grid points and sparse matrices are more difficult to partition into regular parallel machines. Advanced simulation methods also involve the summation of certain particle interactions and forces in Fourier and Laplace spaces to achieve a better accuracy in fewer simulation steps (e.g., the particlemesh Ewald summation in molecular dynamics). Both improvements result in denser communication patterns that require higher connectivity. In the end it remains fairly hard to judge for the systems architect, whether
Hypercubes and Meshes
these codes alone can justify the more complex wiring of a high-dimensional hypercube within the range of the machine sizes, that are actually purchased by the customers.
Meshes as a Generalization of Binary Hypercubes into k-ary n-Cubes In the original form of a binary hypercube, the wires connect only two processing nodes in every dimension. In linear arrays and rings a larger number of processors can be connected in every dimension. The combination of multiple dimensions and multiple processors connected along a line in one dimension leads to a natural generalization of the hypercube concept that includes the popular D, D, and D arrays of processing elements. In computer architecture, the term “mesh” is used for a two- or higher-dimensional processor array. A linear array of processing nodes can be wired into a ring by a wrap-around link. A two or higher dimensional mesh, that is equipped with wrap-around links is called a torus. The generalized form of hypercubes and meshes is characterized by two parameters, the maximal number of elements found along one dimension, k, and the total number of dimensions used, n. The binary hypercubes described in the introduction are classified as a -ary n-cubes. A k × k, two-dimensional torus can be classified as a k-ary -cube. With the popularity of the hypercube topology in the beginning of distributed memory computing, many key technologies for message forwarding and routing were developed in the early hypercube machines. These ideas extend quite easily to the generalized k-ary n-cubes and were incorporated quickly into meshes and tori. In hypercubes, meshes and tori alike, the messages traveling between arbitrary processing nodes have to traverse multiple links to reach their destination. They have to be forwarded in a number of intermediate nodes, raising the important issues of message forwarding policies and deadlock avoidance. In a general setting it would be very hard to establish a global guarantee that every message can travel right through without any delay along a given route. Therefore, some low level flow control mechanisms are used to stop and hold the messages, if another message is already occupying a channel along the path. The most
H
common strategies for forwarding the messages in a hypercube network are wormhole or cut-through routing. If not done carefully, blocking and delaying messages along the path can lead to tricky deadlock situations, in particular when the messages are indefinitely blocking each other within an interconnect fabric. During the era of the first hypercube multicomputers two key technologies for deadlock avoidance were described [] and applied to the construction of prototypes and products: ●
Dimension order routing: In binary hypercubes routes between two nodes must be chosen in a way that the distance in the highest dimension is traveled first, before any distance of a lesser dimension is traveled. Therefore messages can only be stopped due to a busy channel in a dimension lower than the one that they are traveling. The channels in the lowest dimension always lead to a final destination. The messages occupying these channels can make progress immediately and free the resource for other messages blocked at the same and higher dimensions. This is sufficient to prevent a routing deadlock. ● Dateline-based torus routing: With the wrap-around link of rings and tori, the case of messages blocking each other in a single dimension around the back-loops has to be addressed. This is done by replicating the physical wires into some parallel virtual channels forming a higher and a lower virtual ring. A dateline is established at a particular position in the ring. The dateline can only be crossed if the message switches from the higher to the lower ring at that position. With this simple trick, all connections along the ring remain possible, but the messages in the ring can no longer block each other in a circular deadlock. Both deadlock avoiding techniques are normally combined to permit deadlock-free routing in the generalized k-ary n-cube topologies. Extending the well-known dimension-order and dateline routing strategies to slightly irregular network topologies is an additional challenge. The two abstract rules for legal message routes can be translated into a simple channel numbering scheme providing a total order of all channels within an interconnect, including all irregular nodes. A simple
H
H
Hypercubes and Meshes
Disk 1.02.02
Disk 1.04.02
3.02.02
1.02.01 3.01.02
1.01.01 1.01.02
2.01.07
2.01.17
2.01.15
2.12.17
2.14.03 2.16.17
2.03.03
2.03.05
2.03.07
2.03.17
2.03.15
2.03.13
2.12.05 2.14.15
2.14.05 2.16.15
2.16.05 2.18.15
Video I/O
2.05.03
2.12.13
1.03.02
2.01.05
2.12.03 2.14.17
2.12.15
11.18
2.01.03
3.01.01
Terminal I/O
3.03.02
3.04.01
3.02.01
1.03.01 3.03.01 1.02.02
2.05.05
2.01.13 2.16.03 2.18.17
3.04.02
1.04.01 3.01.02 1.01.02 2.18.03
2.18.05
2.05.07
2.05.17
2.05.15
2.05.13
2.12.07 2.14.13
2.14.07 2.16.13
2.16.07 2.18.13
2.07.03
2.07.05
2.07.07
3.03.02
2.07.17 3.02.02
2.07.15
2.07.13 1.04.02
1.03.02 3.04.02
2.18.07
Hypercubes and Meshes. Fig. Generalized hypercubes. Channel numbering based on the original hyperube routing rules for the validation of a deadlock-free router in a slightly irregular two-dimensional torus with several IO nodes (an iWarp system)
rule that channels must be followed in a strictly increasing/decreasing order to form a legal route is sufficient to prevent deadlock. With the technique, illustrated in Fig. , it becomes possible to code and validate the router code for any k-ary -cube configuration that was offered as part of the Intel/CMU iWarp product line, including the service nodes that are arbitrarily inserted into the otherwise regular torus network [].
Mapping Hypercube Algorithms into Meshes and Tori By design the binary hypercube topology provides a fully scalable worst case bisection bandwidth. Regardless of how the machine is scaled or partitioned, the algorithm can count on the same amount of bandwidth per processor for the communication between the two
halves. In meshes and tori the total bisection bandwidth does not scale up linearly with the number of nodes. The global operations, that are usually communication bound, become increasingly costly in larger machines. The concept of the scalable bisection bandwidth was thought to be a key advantage of the hypercube designs that is not present in meshes for a long time. In the theory of parallel algorithms the asymptotic lower bounds on communication in binary hypercubes usually differ from the bounds holding for communication in an one-, two- or three-dimensional mesh/torus. The lower bound on the number of steps (i.e., the time complexity) is determined by the amount of data that has to be moved between the processing elements and by the number of wires that are available for this task (i.e., the bisection bandwidth). The upper bound is
Hypercubes and Meshes
usually determined by the sophistication of the algorithm. One of the most studied algorithms is parallel sorting in a – comparison based model [, ]. Similar considerations can be made for matrix transposes or redistributions of block/cyclic arrays distributed across the nodes of a parallel machine. In the practice of parallel computing, a meshconnected machine must be fairly large until an algorithmic limitation due to bisection bandwidth is encountered and a slowdown over a properly hypercube connected machine can be observed []. As an example, the bandwidth of an -node ring, a -node two - dimensional torus or a -node three-dimensional torus is sufficient to permit arbitrary array transpose operations without any slowdown over a hypercube of the same size. The comparison assumes that the processing nodes of both machines can transfer data from and to the network at about the speed of a bidirectional link – as this is the case for most parallel machines. Therefore, a few simple mapping techniques are used in practice to implement higher dimensional binary hypercubes on k-ary meshes or tori. A threedimensional binary hypercube can be mapped into an eight node ring with a small performance penalty using a Gray code mapping as illustrated in Fig. .
In this simple mapping some links have to be shared or multiplexed by a factor of two and next-neighbor communication is extended to a distance of up to three hops. The mapping of larger virtual hypercube networks comes at the cost of an increasing dilation (increase in link distance) and congestion (a multiplexing factor of each link) for higher dimensions. The congestion factor can be derived using the calculations of bisection bandwidth for both topologies. A binary -cube can be mapped into a D torus and the -cube into a D torus by extending the linear scheme to multiple dimensions. In their VLSI implementation, a two- or threedimensional mesh/torus-machine is much easier to lay out in an integrated circuit than an eight- or tendimensional hypercube with its complicated wiring structure. Therefore its simple next-neighbor links can be built with faster wires and wider channels for multiple wires. The resulting advantage for the implementation is known to offset the congestion and dilation factors in practice. It is therefore highly interesting to study these technology trade-offs in practical machines. For – sorting, the algorithmic complexity and the measured execution times on a virtual hypercube mapped to mesh match the known complexities and
(1,1,1)
(1,1,0) 6
4
7 5
(1,0,0)
4
5
7 (0,1,1)
6
...
3
2
2 0 (0,0,0)
1
H
3 (0,0,1) 1 0
Hypercubes and Meshes. Fig. Gray code mapping of a binary three-cube into an eight node ring. Virtual network channels were used to implement virtual hypercube network in machines that are actually wired as meshes or tori
H
H
Hypercubes and Meshes
running times of the algorithm for direct execution on a mesh fairly closely [, ].
Limitations of the Hypercube Machines The ideal topology of a binary hypercube comes at the high cost of significant disadvantages and restrictions regarding their practical implementation in the VLSI circuits of a parallel computer. Therefore, a number of alternative options to build multicomputers were developed beyond hypercubes, meshes, and their superclass of the k-ary n-cubes. Among the alternatives studied are: hierarchical rings, that differ slightly from multi-dimensional tori, cube-connected cycles, that successfully combine the scalable bisection bandwidth of hypercubes with the bounded in/out degrees of lower - dimensional meshes and finally fat trees that are provably optimal in their layout of wiring for VLSI. Those interconnect structures have in common, that they are still fairly regular and can leverage of the same practical technical solutions for message forwarding, routing and link level flow control, the way the traditional hypercubes do. None of these regular interconnects requires a spanning routing protocol or TCP/IP connections to function properly. Towards the end of the golden age of hypercubes, the regular, direct networks were facing an increasing competition by new types of networks built either from physical links with different strength (i.e., different link bandwidths implemented by a multiplication of wires or in a different technology). In , the Thinking Machines Corporation announced the Connection Machine CM using a fat tree topology and this announcement contributed to end of the hypercube age. After the year , the physical network topologies became less and less important in supercomputer architecture. The detailed structure of the networks became hidden and de-emphasized due to many higher-level abstractions of parallelizing compilers and message passing libraries available to the programmers. The availability of high performance commodity networking switches for use in the highly irregular topology of the global Internet accelerated this trend. In most newer designs, a fairly large multistage network of switches (e.g., a Clos network) has replaced the older direct networks of hypercubes and meshes. At this time only a small fraction of PC Clusters is still including
the luxury of a low latency, high performance interconnect. Despite all trends to abstraction and de-emphasis of network structure, the key figures of message forwarding latency and the bisection bandwidth remain the most important measure of scalability in a parallel computer architecture.
Hypercube Machine Prototypes and Products One of the earliest practical computer systems using a hypercube topology is the Cosmic Cube prototype designed by the Group of C. Seitz at the California Institute of Technology []. The system was developed in the first half of the s using the VLSI integration technologies newly available at the time. The first version of the system comprised nodes with an Intel / microprocessor/floating point coprocessors combination running at MHz and using kB of memory in each node. The nodes were connected with point-to-point links running at Mbit/s nominal speed. The original system of nodes was allegedly planned as a × × three-dimensional torus with bidirectional links – but this particular topology is fully equivalent to a binary six-dimensional hypercube under a gray code mapping and became therefore known as a first hypercube multicomputer, rather than as a first three-dimensional torus. Subsequently a small number of commercial machines were built in a collaboration with a division of AMETEK Corporation []. The Caltech prototypes lead to the development of iPSC at the Intel Supercomputer Systems Division, a commercial product series based on hypercube topology [, ]. An example of the typical technical specifications is the iPSC/ system as released in : nodes, Intel / processor/floating point coprocessor running at MHz, – MB of main memory in each node, the nodes connected with links running Mbit/s. iPSC systems were programmed using the NX message passing library, similar in the functionality, but pre-dating the popular portable message passing systems like PVM or MPI. The development of the product line was strongly supported by (D)ARPA, the Advanced Research Project Agency of the US Dept. of Defense. During roughly the same time period a hypercube machine was also offered by NCube Corporation, a super-computing vendor that was fully dedicated to the development of hypercube systems at the time. The
Hypercubes and Meshes
NCube , model, available to Nasa AMES in was a -node system with a MHz full custom bit CPU/FPU with up to MB main memory in every node and Mbit/s interconnects. The largest configuration commercially offered was a binary -cube with , processing nodes [, ]. The FPS (Floating Point Systems) T Series Parallel Vector Supercomputer was also made of multiple processing nodes interconnected as a binary n-cube. Each node is implemented on a single printed circuit board, contains MB of data storage memory, and has a peak vector computational speed of MFLOPS. Eight vector nodes are grouped around a system support node and a disk storage system to form a module. The module is the basic building block for larger system configurations in hypercube topology. The T series was named after the tesseract, a geometric cube in four-dimensional space []. The Thinking Machine CM also included a hypercube communication facility for data exchanges between the nodes as well as a mesh (called NEWS grid) []. As an SIMD machine with a large number of bit slice processors, its architecture differed significantly from the classical hypercube multicomputer. In its successor, the Thinking Machine CM, the direct networks were given up in favor of a fat tree using a variable number of (typically N-) of additional switches to form a network in fat tree topology. A similar trend was followed in the machines of Meiko/Quadrics. The delay of the T transputer chips with processing and communication capabilities on a single chip forced the designers of the CS system to build their communication system with a multi-stage fabric of switches instead of a hypercube interconnect. The multicomputer line by IBM, the IBM SP and its follow-on products also used a multistage switch fabric. Therefore, these architectures do no longer qualify as generalized hypercubes with direct networks. It is worth mentioning that the first Beowulf clusters of commodity PCs were equipped with three to five network interfaces that could be wired directly in hypercube topology. Routing of messages between the links was achieved by system software in the ethernet drivers of the LINUX operating system. The communication benchmark presented in the first beowulf paper were measured on an -node system, wired as ˆ cube []. The author also remembers encountering a
H
Beowulf system on display at the Supercomputing trade show wired as a -node binary -cube using multiple BaseT network interface cards and simple CAT crossover cables. But as mentioned before, the designs of processors with their own communication hardware on board and direct networks were quickly replaced by commodity switches manufactured by a large number of network equipment vendors in the Internet world. So the topology of the parallel system suddenly became a secondary issue. A large variety of different network configurations rapidly succeeded the regular Beowulf systems in the growing market for clusters of commodity PCs.
H Research Conferences on Hypercubes The popularity of hypercube architectures in the second half of the s led to several instances of a computer architecture conference dedicated solely to distributed memory multicomputers with hypercube and mesh topologies. The locations and dates of the conferences are as remembered or collected from citations in subsequent papers: ●
First Hypercube Conference in Knoxville, August –, , with proceedings published by SIAM after the conference. ● Second Conference on Hypercube Multiprocessors, Knoxville, TN, September –October , , with proceedings published by SIAM after the conference. ● Third Conference on Hypercube Concurrent Computers and Applications, Pasadena, CA, January –, with proceedings published by the ACM, New York. ● Fourth Conference on Hypercube Concurrent Computers and Applications, Monterrey, CA, March –, , proceedings publisher unknown. and later giving in to the trend that the architecture as distributed memory computer is more important than the hypercube layout: ●
Fifth Distributed Memory Computing Conference, Charleston, SC, April –, with proceedings published by IEEE Computer Society Press. ● Sixth Distributed Memory Computing Conference, Portland, OR, April –May , with proceedings published by IEEE Computer Society Press.
H
Hypercubes and Meshes
●
First European Distributed Memory Computing, EDMCC, held in Rennes, France. ● Second European Distributed Memory Computing, EDMCC, Munich, FRG, April –, with Proceedings by Springer, Lecture Notes in Computer Science. After these rather successful meetings of computer architects and application programmers, the conference series on hypercubes and distributed memory parallel systems lost its momentum and in , the specialized venues were unfortunately discontinued.
Related Entries Beowulf clusters
different US national labs evaluating and benchmarking these distributed memory multicomputers. The most interesting study dedicated entirely to binary hypercubes appeared in in Parallel Computing []. The visit to the trade show of “Supercomputing” during the “hypercube” years was a memorable experience, because countless new distributed memory system vendors surprised the audience with a new parallel computer architecture every year. Most of them contained some important innovation and can still be admired in the permanent collections of the computer museums in Boston (MA), Mountain View (CA), and Paderborn (Germany).
Bitonic Sort
Bibliography
Clusters
. Leighton FT () Introduction to parallel algorithms and architectures: array, trees, hypercubes. Morgan Kaufmann Publishers, San Francisco p, ISBN:--- . Stricker T () Supporting the hypercube programming model on meshes (a fast parallel sorter for iwarp). In: Proceedings of the symposium for parallel algorithms and architectures, pp –, San Diego, June . Batcher KE () Sorting networks and their applications. In: Proceedings of the american federation of information processing societies spring joint computer conference, vol . AFIPS Press, Montvale, pp – . Nassimi D, Sahni S () Bitonic sort on a mesh-connected parallel computer. In: IEEE TransComput ():– . Hinrichs S, Kosak C, O’Hallaron D, Stricker T, Take R () Optimal all-to-all personalized communication in meshes and tori. In: Proceedings of the symposium of parallel algorithms and architectures, ACM SPAA’, Cape May, Jan . Stricker T () Message routing in irregular meshes and tori. In: Proceedings of the th IEEE distributed memory computing conference, DMCC, Portland, May . Seitz CL () The cosmic cube. Commun ACM ():– . Close P () The iPSC/ node architecture. In: Proceedings of the third conference on hypercube concurrent computers and applications, Pasadena, – Jan , pp – . Nugent SF () The iPSC/ direct-connect communications technology. In: Proceedings of the third conference on hypercube concurrent computers and applications, Pasadena, – Jan , pp – . Hawkinson S () The FPS T series supercomputer, system modelling and optimization. In: Lecture notes in control and information sciences, vol /. Springer, Berlin/Heidelberg, pp – . The nCUBE Handbook, Beaverton OR, and the nCUBE processor, user manual, Beaverton . Dunigan TH () Performance of the Intel iPSC // and Ncube / hypercubes, parallel computing . North Holland, Elsevier, pp –
Connection Machine Cray TE Cray Vector Computers Cray XT and Seastar -D Torus Interconnect Distributed-Memory Multiprocessor Fast Fourier Transform (FFT) IBM RS/ SP Interconnection Networks MasPar MPI (Message Passing Interface) MPP nCUBE Networks, Direct Routing (including Deadlock Avoidance) Sorting Warp and IWarp
Bibliographic Notes and Further Reading A most detailed description of the algorithms optimally suited for hypercubes, the classes of hypercube equivalent networks, and the emulation of different topologies with respect to algorithmic models is given in an page textbook by F.T. Leighton that appeared in []. A detailed description of the network topologies and the technical data of all practical hypercube prototypes built and machines commercially sold between and can be found through Google Scholar in the numerous papers written by the architects working for the computer manufacturers or by researches at the
Hypergraph Partitioning
. Tucker LW, Robertson GG () Architecture and applications of the connection machines. IEEE Comput ():– . Becker J, Sterling D, Savarese T, Dorband JE, Ranawake UA, Packer CV () Beowulf: a parallel workstation for scientific computation. In: Proceedings of ICPP workshop on challenges for parallel processing, CRC Press, Oconomowc, August . Seitz CL, Athas W, Flaig C, Martin A, Seieovic J, Steele CS, Su WK () The architecture and programming of the ametek series multicomputer. In: Proceedings of the third conference on hypercube concurrent computers and applications, Pasadena, – Jan , pp –
Hypergraph Partitioning Ümit V. Çatalyürek , Bora Uçar , Cevdet Aykanat The Ohio State University, Columbus, OH, USA ENS Lyon, Lyon, France Bilkent University, Ankara, Turkey
Definition Hypergraphs are generalization of graphs where each edge (hyperedge) can connect more than two vertices. In simple terms, the hypergraph partitioning problem can be defined as the task of dividing a hypergraph into two or more roughly equal-sized parts such that a cost function on the hyperedges connecting vertices in different parts is minimized.
Discussion Introduction During the last decade, hypergraph-based models gained wide acceptance in the parallel computing community for modeling various problems. By providing natural way to represent multiway interactions and unsymmetric dependencies, hypergraph can be used to elegantly model complex computational structures in parallel computing. Here, some concrete applications will be presented to show how hypergraph models can be used to cast a suitable scientific problem as an hypergraph partitioning problem. Some insights and general guidelines for using hypergraph partitioning methods in some general classes of problems are also given.
H
Formal Definition of Hypergraph Partitioning A hypergraph H = (V, N ) is defined as a set of vertices (cells) V and a set of nets (hyperedges) N among those vertices. Every net n ∈ N is a subset of vertices, that is, n ⊆ V. The vertices in a net n are called its pins. The size of a net is equal to the number of its pins. The degree of a vertex is equal to the number of nets it is connected to. Graph is a special instance of hypergraph such that each net has exactly two pins. Vertices can be associated with weights, denoted with w[⋅], and nets can be associated with costs, denoted with c[⋅]. Π = {V , V , . . . , VK } is a K-way partition of H if the following conditions hold: Each part Vk is a nonempty subset of V, that is, Vk ⊆ V and Vk ≠ / for ≤ k ≤ K ● Parts are pairwise disjoint, that is, Vk ∩ V ℓ = / for all ≤k<ℓ≤K ● Union of K parts is equal to V, i.e., ⋃Kk= Vk =V ●
In a partition Π of H, a net that has at least one pin (vertex) in a part is said to connect that part. Connectivity λ n of a net n denotes the number of parts connected by n. A net n is said to be cut (external) if it connects more than one part (i.e., λn > ), and uncut (internal) otherwise (i.e., λ n = ). A partition is said to be balanced if each part Vk satisfies the balance criterion: Wk ≤ Wavg ( + ε), for k = , , . . . , K.
()
In (), weight Wk of a part Vk is defined as the sum of the weights of the vertices in that part (i.e., Wk = ∑v∈Vk w[v]), Wavg denotes the weight of each part under the perfect load balance condition (i.e., Wavg = (∑v∈V w[v])/K), and ε represents the predetermined maximum imbalance ratio allowed. The set of external nets of a partition Π is denoted as NE . There are various [] cutsize definitions for representing the cost χ(Π) of a partition Π. Two relevant definitions are: χ(Π) = ∑ c[n]
()
n∈NE
χ(Π) = ∑ c[n](λ n − ).
()
n∈NE
In (), the cutsize is equal to the sum of the costs of the cut nets. In (), each cut net n contributes c[n](λn − ) to the cutsize. The cutsize metrics given in () and ()
H
H
Hypergraph Partitioning
will be referred to here as cut-net and connectivity metrics, respectively. The hypergraph partitioning problem can be defined as the task of dividing a hypergraph into two or more parts such that the cutsize is minimized, while a given balance criterion () among part weights is maintained. A recent variant of the above problem is the multiconstraint hypergraph partitioning [, ] in which each vertex has a vector of weights associated with it. The partitioning objective is the same as above, and the partitioning constraint is to satisfy a balancing constraint associated with each weight. Here, w[v, i] denotes the C weights of a vertex v for i = , . . . , C. Hence, the balance criterion () can be rewritten as Wk,i ≤ Wavg,i ( + ε) for k = , . . . , K and i = , . . . , C , () where the ith weight Wk,i of a part Vk is defined as the sum of the ith weights of the vertices in that part (i.e., Wk,i = ∑v∈Vk w[v, i]), and Wavg,i is the average part weight for the ith weight (i.e., Wavg,i = (∑v∈V w[v, i])/K), and ε again represents the allowed imbalance ratio. Another variant is the hypergraph partitioning with fixed vertices, in which some of the vertices are fixed in some parts before partitioning. In other words, in this problem, a fixed-part function is provided as an input to the problem. A vertex is said to be free if it is allowed to be in any part in the final partition, and it is said to be fixed in part k if it is required to be in Vk in the final partition Π. Yet another variant is multi-objective hypergraph partitioning in which there are several objectives to be minimized [, ]. Specifically, a given net contributes different costs to different objectives.
Sparse Matrix Partitioning One of the most elaborated applications of hypergraph partitioning (HP) method in the parallel scientific computing domain is the parallelization of sparse matrix-vector multiply (SpMxV) operation. Repeated matrix-vector and matrix-transpose-vector multiplies that involve the same large, sparse matrix are the kernel operations in various iterative algorithms involving sparse linear systems. Such iterative algorithms include solvers for linear systems, eigenvalues, and linear programs. The pervasive use of such solvers motivates the
development of HP models and methods for efficient parallelization of SpMxV operations. Before discussing the HP models and methods for parallelizing SpMxV operations, it is favorable to discuss parallel algorithms for SpMxV. Consider the matrix-vector multiply of the form y ← A x, where the nonzeros of the sparse matrix A as well as the entries of the input and output vectors x and y are partitioned arbitrarily among the processors. Let map(⋅) denote the nonzero-to-processor and vector-entry-toprocessor assignments induced by this partitioning. A parallel algorithm would execute the following steps at each processor Pk . . Send the local input-vector entries xj , for all j with map(xj ) = Pk , to those processors that have at least one nonzero in column j. . Compute the scalar products aij xj for the local nonzeros, that is, the nonzeros for which map(aij ) = Pk and accumulate the results yki for the same row index i. . Send local nonzero partial results yki to the processor map(yi )≠Pk , for all nonzero yki . . Add the partial yiℓ results received to compute the final results yi = ∑ yiℓ for each i with map(yi )=Pk . As seen in the algorithm, it is necessary to have partitions on the matrix A and the input- and outputvectors x and y of the matrix-vector multiply operation. Finding a partition on the vectors x and y is referred to as the vector partitioning operation, and it can be performed in three different ways: by decoding the partition given on A; in a post-processing step using the partition on the matrix; or explicitly partitioning the vectors during partitioning the matrix. In any of these cases, the vector partitioning for matrix-vector operations is called symmetric if x and y have the same partition, and non-symmetric otherwise. A vector partitioning is said to be consistent, if each vector entry is assigned to a processor that has at least one nonzero in the respective row or column of the matrix. The consistency is easy to achieve for the nonsymmetric vector partitioning; xj can be assigned to any of the processors that has a nonzero in the column j, and yi can be assigned to any of the processors that has a nonzero in the row i. If a symmetric vector partitioning is sought, then special care must be taken to assign a pair of matching input- and output-vector entries, e.g., xi and
Hypergraph Partitioning
yi , to a processor having nonzeros in both row and column i. In order to have such a processor for all vector entry pairs, the sparsity pattern of the matrix A can be modified to have a zero-free diagonal. In such cases, a consistent vector partition is guaranteed to exist, because the processors that own the diagonal entries can also own the corresponding input- and output-vector entries; xi and yi can be assigned to the processor that holds the diagonal entry aii . In order to achieve an efficient parallelism, the processors should have balanced computational load and the inter-processor communication cost should have been minimized. In order to have balanced computational load, it suffices to have almost equal number of nonzeros per processor so that each processor will perform almost equal number of scalar products, for example, aij xj , in any given parallel system. The communication cost, however, has many components (the total volume of messages, the total number of messages, maximum volume/number of messages in a single processor, either in terms of sends or receives or both) each of which can be of utmost importance for a given matrix in a given parallel system. Although there are alternatives and more elaborate proposals, the most common communication cost metric addressed in hypergraph partitioning-based methods is the total volume of communication. Loosely speaking, hypergraph partitioning-based methods for efficient parallelization of SpMxV model the data of the SpMxV (i.e., matrix and vector entries) with the vertices of a hypergraph. A partition on the vertices of the hypergraph is then interpreted in such a way that the data corresponding to a set of vertices in a part are assigned to a single processor. More accurately, there are two classes of hypergraph partitioning-based methods to parallelizing SpMxV. The methods in the first class build a hypergraph model representing the data and invoke a partitioning heuristic on the so-built hypergraph. The methods in this class can be said to be models rather than being algorithms. There are currently three main hypergraph models for representing sparse matrices, and hence there are three methods in this first class. These three main models are described below in the next section. Essential property of these models is that the cutsize () of any given partition is equal to the total communication volume to be incurred under a consistent vector partitioning when the matrix
elements are distributed according to the vertex partition. The methods in the second class follow a mix-and-match approach and use the three main models, possibly along with multi-constraint and fixed-vertex variations, in an algorithmic form. There are a number of methods in this second class, and one can develop many others according to application needs and matrix characteristics. Three common methods belonging to this class are described later, after the three main models. The main property of these algorithms is that the sum of the cutsizes over all applications of hypergraph partitioning amounts to the total communication volume to be incurred under a consistent vector partitioning (currently these methods compute a vector partitioning after having found a matrix partitioning) when the matrix elements are distributed according to the vertex partitions found at the end.
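To make the four-step parallel SpMxV algorithm above concrete, the following sketch shows the work of a single processor Pk. It is only an illustration under assumed data structures: local_nonzeros, local_x, map_y, and the comm object with its send/receive helpers are hypothetical names introduced here, not part of any particular library; a real implementation would typically use MPI.

from collections import defaultdict

def spmxv_step(Pk, local_nonzeros, local_x, map_y, comm):
    # Step 1: send local x_j entries to processors that own a nonzero in column j.
    for j, xj in local_x.items():
        for proc in comm.column_consumers(j):
            if proc != Pk:
                comm.send(proc, ("x", j, xj))
    x = dict(local_x)
    x.update(comm.receive_x_entries())

    # Step 2: compute scalar products a_ij * x_j and accumulate partial results y_i^k.
    y_partial = defaultdict(float)
    for (i, j, aij) in local_nonzeros:
        y_partial[i] += aij * x[j]

    # Step 3: send partial results whose output entry is owned by another processor.
    y_local = {}
    for i, val in y_partial.items():
        owner = map_y(i)
        if owner != Pk:
            comm.send(owner, ("y", i, val))
        else:
            y_local[i] = val

    # Step 4: add received partial results to form the final y_i values owned by Pk.
    for (i, val) in comm.receive_y_partials():
        y_local[i] = y_local.get(i, 0.0) + val
    return y_local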
Three Main Models for Matrix Partitioning
In the column-net hypergraph model [] used for 1D rowwise partitioning, an M × N matrix A with Z nonzeros is represented as a unit-cost hypergraph HR = (VR, NC) with ∣VR∣ = M vertices, ∣NC∣ = N nets, and Z pins. In HR, there exists one vertex vi ∈ VR for each row i of matrix A. The weight w[vi] of a vertex vi is equal to the number of nonzeros in row i. The name of the model comes from the fact that columns are represented as nets. That is, there exists one unit-cost net nj ∈ NC for each column j of matrix A. Net nj connects the vertices corresponding to the rows that have a nonzero in column j. That is, vi ∈ nj if and only if aij ≠ 0. In the row-net hypergraph model [] used for 1D columnwise partitioning, an M × N matrix A with Z nonzeros is represented as a unit-cost hypergraph HC = (VC, NR) with ∣VC∣ = N vertices, ∣NR∣ = M nets, and Z pins. In HC, there exists one vertex vj ∈ VC for each column j of matrix A. The weight w[vj] of a vertex vj ∈ VC is equal to the number of nonzeros in column j. The name of the model comes from the fact that rows are represented as nets. That is, there exists one unit-cost net ni ∈ NR for each row i of matrix A. Net ni ⊆ VC connects the vertices corresponding to the columns that have a nonzero in row i. That is, vj ∈ ni if and only if aij ≠ 0. In the column-row-net hypergraph model, otherwise known as the fine-grain model [], used for 2D
nonzero-based fine-grain partitioning, an M × N matrix A with Z nonzeros is represented as a unit-weight and unit-cost hypergraph HZ = (VZ, NRC) with ∣VZ∣ = Z vertices, ∣NRC∣ = M + N nets, and 2Z pins. In VZ, there exists one unit-weight vertex vij for each nonzero aij of matrix A. The name of the model comes from the fact that both rows and columns are represented as nets. That is, in NRC, there exists one unit-cost row-net ri for each row i of matrix A and one unit-cost column-net cj for each column j of matrix A. The row-net ri connects the vertices corresponding to the nonzeros in row i of matrix A, and the column-net cj connects the vertices corresponding to the nonzeros in column j of matrix A. That is, vij ∈ ri and vij ∈ cj if and only if aij ≠ 0. Note that each vertex vij is in exactly two nets.
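The three constructions can be sketched in a few lines of code. The sketch below starts from a sparse matrix given as a list of (i, j) nonzero coordinates; the (vertices, nets, weights) return format is an assumption made for illustration, not a prescribed data structure.

from collections import defaultdict

def column_net_model(nonzeros, M, N):
    # Vertices are rows; net n_j connects the rows with a nonzero in column j.
    nets = defaultdict(set)
    weights = defaultdict(int)          # w[v_i] = number of nonzeros in row i
    for (i, j) in nonzeros:
        nets[j].add(i)
        weights[i] += 1
    return list(range(M)), nets, weights

def row_net_model(nonzeros, M, N):
    # Vertices are columns; net n_i connects the columns with a nonzero in row i.
    nets = defaultdict(set)
    weights = defaultdict(int)          # w[v_j] = number of nonzeros in column j
    for (i, j) in nonzeros:
        nets[i].add(j)
        weights[j] += 1
    return list(range(N)), nets, weights

def fine_grain_model(nonzeros, M, N):
    # Vertices are nonzeros; each vertex (i, j) lies in row-net r_i and column-net c_j.
    nets = defaultdict(set)
    for (i, j) in nonzeros:
        nets[("r", i)].add((i, j))
        nets[("c", j)].add((i, j))
    return list(nonzeros), nets, {v: 1 for v in nonzeros}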
Some Other Methods for Matrix Partitioning
The jagged-like partitioning method [] uses the row-net and column-net hypergraph models. It is an algorithm with two steps, in which each step models either the expand phase (step 1) or the fold phase (step 3) of the parallel SpMxV algorithm given above. Therefore, there are two alternative schemes for this partitioning method. The one that models the expands in the first step and the folds in the second step is described below.
Given an M × N matrix A and the number K of processors organized as a P × Q mesh, the jagged-like partitioning method proceeds as shown in Fig. . The algorithm has two main steps. First, A is partitioned rowwise into P parts using the column-net hypergraph model HR (lines 1 and 2 of Fig. ). Consider a P-way partition ΠR of HR. From the partition ΠR, one obtains P submatrices Ap, for p = 1, . . . , P, each having a roughly equal number of nonzeros. For each p, the rows of the submatrix Ap correspond to the vertices in Rp (lines 6 and 7 of Fig. ). The submatrix Ap is assigned to the pth row of the processor mesh. Second, each submatrix Ap for 1 ≤ p ≤ P is independently partitioned columnwise into Q parts using the row-net hypergraph Hp (lines 8 and 9 of Fig. ). The nonzeros in the ith row of A are partitioned among the Q processors in a row of the processor mesh. In particular, if vi ∈ Rp at the end of line 2 of the algorithm, then the nonzeros in the ith row of A are partitioned among the processors in the pth row of the processor mesh. After partitioning the submatrix Ap columnwise, the map array contains the partition information for the nonzeros residing in Ap. For each i, the volume of communication required to fold the vector entry yi is accurately represented as a part of "foldVolume" in the algorithm. For each j, the volume of communication regarding the vector entry xj
JAGGED-LIKE-PARTITIONING(A, K = P × Q, e1, e2)
Input: a matrix A, the number of processors K = P × Q, and the imbalance ratios e1, e2.
Output: map(aij) for all aij ≠ 0 and totalVolume.
 1: HR = (VR, NC) ← columnNet(A)
 2: ΠR = {R1, ..., RP} ← partition(HR, P, e1)        ▷ rowwise partitioning of A
 3: expandVolume ← cutsize(ΠR)
 4: foldVolume ← 0
 5: for p = 1 to P do
 6:   Rp = {ri : vi ∈ Rp}
 7:   Ap ← A(Rp, :)                                  ▷ submatrix indexed by rows Rp
 8:   Hp = (Vp, Np) ← rowNet(Ap)
 9:   ΠCp = {C1p, ..., CQp} ← partition(Hp, Q, e2)   ▷ columnwise partitioning of Ap
10:   foldVolume ← foldVolume + cutsize(ΠCp)
11:   for all aij ≠ 0 of Ap do
12:     map(aij) = Pp,q ⇔ cj ∈ Cqp
13: return totalVolume ← expandVolume + foldVolume

Hypergraph Partitioning. Fig. Jagged-like partitioning
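As a complement to the listing above, the following sketch shows how the same two-step procedure might be driven from code. The helpers column_net_model and row_net_model are the ones from the earlier sketch, and partition stands for any K-way hypergraph partitioner returning a part vector and a cutsize (for instance, a wrapper around a tool such as PaToH); these are assumptions for illustration only.

def jagged_like_partition(nonzeros, M, N, P, Q, e1, e2):
    # Step 1: rowwise split into P stripes via the column-net model.
    HR = column_net_model(nonzeros, M, N)
    row_part, expand_volume = partition(HR, P, e1)   # row_part[i] in {0, ..., P-1}

    mapping, fold_volume = {}, 0
    for p in range(P):
        # Step 2: columnwise split of each row stripe via the row-net model.
        sub = [(i, j) for (i, j) in nonzeros if row_part[i] == p]
        Hp = row_net_model(sub, M, N)
        col_part, cut = partition(Hp, Q, e2)         # col_part[j] in {0, ..., Q-1}
        fold_volume += cut
        for (i, j) in sub:
            mapping[(i, j)] = (p, col_part[j])       # nonzero assigned to processor P_{p,q}
    return mapping, expand_volume + fold_volume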
Hypergraph Partitioning. Fig. First step of four-way jagged-like partitioning of a sample matrix: (a) two-way partitioning ΠR = {R1, R2} of the column-net hypergraph representation HR of A; (b) two-way rowwise partitioning of the matrix obtained by permuting A according to the partitioning induced by ΠR. The nonzeros in the same part are shown with the same shape and color; the deviation of the minimum and maximum numbers of nonzeros of a part from the average is displayed as the interval imbal. For this partition, nnz = 47, vol = 3, and imbal = [–2.1%, 2.1%], where nnz and vol denote, respectively, the number of nonzeros and the total communication volume
is accurately represented as a part of "expandVolume" in the algorithm. Figure a illustrates the column-net representation of a sample matrix to be partitioned among the processors of a 2 × 2 mesh. For simplicity of presentation, the vertices and the nets of the hypergraphs are labeled with the letters "r" and "c" to denote the rows and columns of the matrix. The matrix is first partitioned rowwise into two parts, and each part is assigned to a row of the processor mesh, namely to processors {P1, P2} and {P3, P4}. The resulting permuted matrix is displayed in Fig. b. Figure a displays the two row-net hypergraphs corresponding to the submatrices Ap for p = 1, 2. Each hypergraph is partitioned independently; sample partitions of these hypergraphs are also presented in this figure. As seen in the final symmetric permutation in Fig. b, some columns have their nonzeros assigned to different parts, so that the processor owning the corresponding input-vector entry has to communicate with both processors of a mesh row in the expand phase. The checkerboard partitioning method [] is also a two-step method, in which each step models either the expand phase or the fold phase of the parallel SpMxV. Similar to jagged-like partitioning, there are two alternative schemes for this partitioning method. The one
which models the expands in the first step and the folds in the second step is presented below. Given an M × N matrix A and the number K of processors organized as a P × Q mesh, the checkerboard partitioning method proceeds as shown in Fig. . First, A is partitioned rowwise into P parts using the column-net model (lines 1 and 2 of Fig. ), producing ΠR = {R1, . . . , RP}. Note that this first step is exactly the same as that of the jagged-like partitioning. In the second step, the matrix A is partitioned columnwise into Q parts by using multi-constraint partitioning to obtain ΠC = {C1, . . . , CQ}. In comparison to the jagged-like method, in this second step the whole matrix A is partitioned (lines 4 and 8 of Fig. ), not the submatrices defined by ΠR. The rowwise and columnwise partitions ΠR and ΠC together define a 2D partition on the matrix A, where map(aij) = Pp,q ⇔ ri ∈ Rp and cj ∈ Cq. In order to achieve load balance among processors, a multi-constraint partitioning formulation is used (line 8 of the algorithm). Each vertex vi of HC is assigned P weights: w[i, p], for p = 1, . . . , P. Here, w[i, p] is equal to the number of nonzeros of column ci in the rows of Rp (line 7 of Fig. ). Consider a Q-way partitioning of HC with P constraints using the vertex weight definition
Hypergraph Partitioning. Fig. Second step of four-way jagged-like partitioning: (a) row-net representations of the submatrices of A and their two-way partitionings, assigning nonzeros to processors P1, P2, P3, P4; (b) final permuted matrix. The nonzeros in the same partition are shown with the same shape and color; the deviation of the minimum and maximum numbers of nonzeros of a part from the average is displayed as the interval imbal. For this partition, nnz = 47, vol = 8, and imbal = [–6.4%, 2.1%], where nnz and vol denote, respectively, the number of nonzeros and the total communication volume
CHECKERBOARD-PARTITIONING(A, K = P × Q, e1, e2)
Input: a matrix A, the number of processors K = P × Q, and the imbalance ratios e1, e2.
Output: map(aij) for all aij ≠ 0 and totalVolume.
 1: HR = (VR, NC) ← columnNet(A)
 2: ΠR = {R1, ..., RP} ← partition(HR, P, e1)      ▷ rowwise partitioning of A
 3: expandVolume ← cutsize(ΠR)
 4: HC = (VC, NR) ← rowNet(A)
 5: for j = 1 to |VC| do
 6:   for p = 1 to P do
 7:     wj,p = |{nj ∩ Rp}|
 8: ΠC = {C1, ..., CQ} ← MCPartition(HC, Q, e2)    ▷ columnwise partitioning of A
 9: foldVolume ← cutsize(ΠC)
10: for all aij ≠ 0 of A do
11:   map(aij) = Pp,q ⇔ ri ∈ Rp and cj ∈ Cq
12: totalVolume ← expandVolume + foldVolume

Hypergraph Partitioning. Fig. Checkerboard partitioning
above. Maintaining the P balance constraints corresponds to maintaining computational load balance on the processors of each row of the processor mesh. Establishing the equivalence between the total communication volume and the sum of the cutsizes of the
two partitions is fairly straightforward. The volume of communication for the fold operations corresponds exactly to the cutsize(ΠC ). The volume of communication for the expand operations corresponds exactly to the cutsize(ΠR ).
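The multi-constraint vertex weights used in the second (columnwise) step can be computed directly from the first-step row partition. The sketch below is only an illustration under assumed data structures: w[j][p] is the number of nonzeros of column j falling in row stripe Rp, and these P weights per column vertex would be handed to a multi-constraint hypergraph partitioner.

from collections import defaultdict

def multiconstraint_weights(nonzeros, row_part, P):
    """nonzeros: iterable of (i, j); row_part[i] = p gives the row stripe of row i."""
    w = defaultdict(lambda: [0] * P)
    for (i, j) in nonzeros:
        w[j][row_part[i]] += 1      # one weight per row stripe for column vertex j
    return w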
Hypergraph Partitioning. Fig. Second step of four-way checkerboard partitioning: (a) two-way multi-constraint partitioning ΠC = {C1, C2} of the row-net hypergraph representation HC of A, with part weights W1(1) = 12, W2(1) = 12, W1(2) = 12, W2(2) = 11; (b) final checkerboard partitioning of A induced by (ΠR, ΠC). The nonzeros in the same partition are shown with the same shape and color; the deviation of the minimum and maximum numbers of nonzeros of a part from the average is displayed as the interval imbal. For this partition, nnz = 47, vol = 8, and imbal = [–6.4%, 2.1%], where nnz and vol denote, respectively, the number of nonzeros and the total communication volume
Figure b displays the 2 × 2 checkerboard partition induced by (ΠR, ΠC). Here, ΠR is a rowwise two-way partition giving the same result as shown in Fig. , and ΠC is the two-way multi-constraint partition of the row-net hypergraph model HC of A shown in Fig. a. In Fig. a, w[, ] = and w[, ] = for an internal column c of row stripe R , whereas w[, ] = and w[, ] = for an external column c . Another common method of matrix partitioning is orthogonal recursive bisection (ORB) []. In this approach, the matrix is first partitioned rowwise into two submatrices using the column-net hypergraph model, and then each part is further partitioned columnwise into two parts using the row-net hypergraph model. The process is continued recursively until the desired number of parts is obtained. The algorithm is shown in Fig. . In this algorithm, dim represents either rowwise or columnwise partitioning, and −dim switches the partitioning dimension. In the ORB method shown above, the step bisect(A, dim, ε) corresponds to partitioning the given matrix along either the rows or the columns, with the column-net or the row-net hypergraph model, respectively, into two parts. The total sum of the cutsizes of each bisection
step corresponds to the total communication volume. It is possible to dynamically adjust ε at each recursive call by allowing a larger imbalance ratio for the recursive call on the submatrix A1 or A2.
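The recursion is short enough to sketch in code. In the sketch below, bisect(A, dim, eps) is assumed to return the two submatrices of a two-way hypergraph-partitioning-based split along dim together with the cutsize of that split, and A.nonzeros() to enumerate the nonzeros of a (sub)matrix; both are invented names used only for illustration.

def orb_partition(A, dim, k_min, k_max, eps, mapping, total_volume=0):
    if k_max - k_min > 0:
        mid = (k_max - k_min + 1) // 2
        (A1, A2), cut = bisect(A, dim, eps)          # hypothetical bisection step
        total_volume += cut
        # Recurse on each half along the orthogonal direction.
        total_volume = orb_partition(A1, 1 - dim, k_min, k_min + mid - 1, eps, mapping, total_volume)
        total_volume = orb_partition(A2, 1 - dim, k_min + mid, k_max, eps, mapping, total_volume)
    else:
        for nz in A.nonzeros():                      # assign all nonzeros of this block to part k_min
            mapping[nz] = k_min
    return total_volume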
Some Other Applications of Hypergraph Partitioning
As said before, the initial motivation for hypergraph models was the accurate modeling of the nonzero structure of unsymmetric and rectangular sparse matrices to minimize the communication volume for iterative solvers. There are other applications that can make use of the hypergraph partitioning formulation. Here, a brief overview of general classes of applications is given along with the names of some specific problems. Further application classes are given in the bibliographic notes. Parallel reduction or aggregation operations form a significant class of such applications, including the MapReduce model. The reduction operation consists of computing M output elements from N input elements. An output element may depend on multiple input elements, and an input element may contribute to multiple output elements. Assume that the operation on which
ORB-PARTITIONING(A, dim, Kmin, Kmax, e)
Input: a matrix A, the part numbers Kmin (at the initial call, it is equal to 1) and Kmax (at the initial call, it is equal to K, the desired number of parts), and the imbalance ratio e.
Output: map(aij) for all aij ≠ 0.
 1: if Kmax − Kmin > 0 then
 2:   mid ← (Kmax − Kmin + 1)/2
 3:   Π = ⟨A1, A2⟩ ← bisect(A, dim, e)                                  ▷ partition A along dim into two, producing two submatrices
 4:   totalVolume ← totalVolume + cutsize(Π)
 5:   map1(A1) ← ORB-PARTITIONING(A1, −dim, Kmin, Kmin + mid − 1, e)    ▷ recursively partition each submatrix
 6:   map2(A2) ← ORB-PARTITIONING(A2, −dim, Kmin + mid, Kmax, e)        ▷ along the orthogonal direction
 7:   map(A) ← map1(A1) ∪ map2(A2)
 8: else
 9:   map(A) ← Kmin

Hypergraph Partitioning. Fig. Orthogonal recursive bisection (ORB)
reduction is performed is commutative and associative. Then, the inherent computational structure can be represented with an M × N dependency matrix, where each row and column of the matrix represents an output element and an input element, respectively. For an input element xj and an output element yi, if yi depends on xj, then aij is set to 1 (otherwise it is zero). Using this representation, the problem of partitioning the workload for the reduction operation is equivalent to the problem of partitioning the dependency matrix for efficient SpMxV. In some other reduction problems, the input and output elements may be preassigned to parts. The proposed hypergraph model can be accommodated to those problems by adding K part vertices and connecting those vertices to the nets that correspond to the preassigned input and output elements. Obviously, those part vertices must be fixed to the corresponding parts during the partitioning. Since the required property is already included in the existing hypergraph partitioners [, , ], this does not add extra complexity to the partitioning methods. Iterative methods for solving linear systems usually employ preconditioning techniques. Roughly speaking, preconditioning techniques modify the given linear system to accelerate convergence. Applications of explicit preconditioners in the form of approximate inverses or factored approximate inverses are amenable to parallelization, because these techniques require SpMxV operations with the approximate inverse, or with the factors of the approximate inverse, at each step. In
other words, preconditioned iterative methods perform SpMxV operations with both the coefficient and the preconditioner matrices in a single step. Therefore, parallelizing a full step of these methods requires the coefficient and preconditioner matrices to be well partitioned, such that, for example, the processors' loads are balanced and the communication costs are low in both multiply operations. To meet this requirement, the coefficient and preconditioner matrices should be partitioned simultaneously. One can accomplish such a simultaneous partitioning by building a single hypergraph and then partitioning that hypergraph. Roughly speaking, one follows a four-step approach: (i) build a hypergraph for each matrix, (ii) determine which vertices of the two hypergraphs need to be in the same part (according to the computations forming the iterative method), (iii) amalgamate those vertices coming from different hypergraphs, (iv) if the computations represented by the two hypergraphs of the first step are separated by synchronization points, then assign multiple weights to the vertices (the weights of the vertices of the hypergraphs of the first step are kept); otherwise, assign a single weight to the vertices (the weights of the vertices of the hypergraphs of the first step are summed up for each amalgamation). The computational structure of preconditioned iterative methods is similar to that of a more general class of scientific computations including multiphase, multiphysics, and multi-mesh simulations. In multiphase simulations, there are a number of computational phases separated by global synchronization points. The existence of the global synchronizations
necessitates that each phase be load balanced individually. The multi-constraint formulation of hypergraph partitioning can be used to achieve this goal. In multi-physics simulations, a variety of materials and processes are analyzed using different physics procedures. In these types of simulations, the computational as well as the memory requirements are not uniform across the mesh. For scalability, processor loads should be balanced in terms of these two components. The multi-constraint partitioning framework also addresses these problems. In multi-mesh simulations, a number of grids with different discretization schemes and with arbitrary overlaps are used. The existence of overlapping grid points necessitates a simultaneous partitioning of the grids. Such a simultaneous partitioning scheme should balance the computational loads of the processors and minimize the communication cost due to interactions within a grid as well as the interactions among different grids. With a particular transformation (the vertex amalgamation operation, also mentioned above), hypergraphs can be used to model the interactions between different grids. With the use of the multi-constraint formulation, the partitioning problem in multi-mesh computations can also be formulated as a hypergraph partitioning problem. In obtaining partitions for two or more computation phases interleaved with synchronization points, the hypergraph models lead to the minimization of the overall sum of the total volumes of communication in all phases (assuming that a single hypergraph is built as suggested in the previous paragraphs). In some sophisticated simulations, the magnitude of the interactions in one phase may be different from that of the interactions in another. In such settings, minimizing the total volume of communication in each phase separately may be advantageous. This problem can be formulated as a multi-objective hypergraph partitioning problem on the so-built hypergraphs. There are certain limitations in applying hypergraph partitioning to multiphase, multiphysics, and multi-mesh-like computations. The dependencies must remain the same throughout the computations; otherwise, the cutsize may not represent the communication volume requirements as precisely as before. The weights assigned to the vertices, for load balancing purposes, should be static and available prior to the
partitioning; the hypergraph models cannot be used as naturally for applications whose computational requirements vary drastically in time. If, however, the computational requirements change gradually in time, then the models can be used to re-partition the load at certain time intervals (while also minimizing the redistribution or migration costs associated with the new partition). Ordering methods are quite common techniques to permute matrices into special forms in order to reduce the memory and running time requirements, as well as to achieve increased parallelism in direct methods (such as LU and Cholesky decompositions) used for solving systems of linear equations. Nested dissection is a well-known ordering method that has been used quite efficiently and successfully. In the current state-of-the-art variations of the nested-dissection approach, a matrix is symmetrically permuted with a permutation matrix P into the doubly bordered block diagonal form
$$
A_{DB} = PAP^{T} =
\begin{bmatrix}
A_{11} &        &        &        & A_{1S} \\
       & A_{22} &        &        & A_{2S} \\
       &        & \ddots &        & \vdots \\
       &        &        & A_{KK} & A_{KS} \\
A_{S1} & A_{S2} & \cdots & A_{SK} & A_{SS}
\end{bmatrix},
$$
where the nonzeros are only in the marked blocks (the blocks on the diagonal and in the row and column borders). The aim in such a permutation is to have a reduced number of rows/columns in the borders and to have equal-sized square blocks on the diagonal. One way to achieve such a permutation when A has symmetric pattern is as follows. Suppose a matrix B is given (if not, it is possible to find one) where the sparsity pattern of B^T B equals that of A (here arithmetic cancellations are ignored). Then, one can permute B nonsymmetrically into the singly bordered form
$$
B_{SB} = QBP^{T} =
\begin{bmatrix}
B_{11} &        &        &        & B_{1S} \\
       & B_{22} &        &        & B_{2S} \\
       &        & \ddots &        & \vdots \\
       &        &        & B_{KK} & B_{KS}
\end{bmatrix},
$$
so that B_{SB}^T B_{SB} = PAP^T; that is, one can use the column permutation of B resulting in B_{SB} to obtain a symmetric permutation for A which results in A_{DB}. Clearly, the column dimension of Bkk will be the size of the square matrix Akk, and the number of rows and columns in the border of A_{DB} will be equal to the number of columns in the column border of B_{SB}. One can achieve such a permutation of B by partitioning the column-net model of B while reducing the cutsize according to the cut-net metric, with unit net costs, and obtaining the permutations as follows. First, the permutation Q is defined, which in turn enables defining P. Permute all rows corresponding to the vertices in a part k before those in a part ℓ, for 1 ≤ k < ℓ ≤ K. Then, permute all columns corresponding to the nets that are internal to a part k before those that are internal to a part ℓ, for 1 ≤ k < ℓ ≤ K, yielding the diagonal blocks, and then permute all columns corresponding to the cut nets to the end, yielding the column border (the order of the columns internal to a part defines the corresponding diagonal block). Clearly, the correspondence between the size of the column border of B_{SB} and the size of the border of A_{DB} is exact, and hence the cutsize according to the cut-net metric is an exact measure. The requirement to have almost equal-sized square blocks Akk is decoded as the requirement that each part should have an almost equal number of internal nets in the partition of the column-net model of B. Although such a requirement is neither the objective nor the constraint of the hypergraph partitioning problem, common hypergraph-partitioning heuristics easily accommodate such requirements.
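The decoding of a partition into the permutations Q and P described above can be sketched as follows. The sketch assumes, purely for illustration, that B is given as a list of (i, j) nonzero coordinates and that row_part[i] is the part of row-vertex i obtained by partitioning the column-net model of B; columns whose nonzero rows fall in a single part become internal (diagonal-block) columns, and the remaining (cut) columns form the border.

def decode_singly_bordered(nonzeros, row_part, K):
    cols_touching = {}                        # parts touched by each column (net)
    rows_by_part = [[] for _ in range(K)]
    seen_rows = set()
    for (i, j) in nonzeros:
        cols_touching.setdefault(j, set()).add(row_part[i])
        if i not in seen_rows:                # group rows by their part (permutation Q)
            rows_by_part[row_part[i]].append(i)
            seen_rows.add(i)

    row_order = [i for part in rows_by_part for i in part]

    internal = [[] for _ in range(K)]
    border = []
    for j, parts in cols_touching.items():
        if len(parts) == 1:                   # internal net: column goes to its diagonal block
            internal[parts.pop()].append(j)
        else:                                 # cut net: column goes to the border
            border.append(j)
    col_order = [j for part in internal for j in part] + border   # permutation P
    return row_order, col_order               # len(border) = size of the border of A_DB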
Related Entries
Data Distribution
Graph Algorithms
Graph Partitioning
Linear Algebra, Numerical
PaToH (Partitioning Tool for Hypergraphs)
Preconditioners for Sparse Iterative Methods
Sparse Direct Methods
Bibliographic Notes and Further Reading
The first use of the hypergraph partitioning methods for efficient parallel sparse matrix-vector multiply
operations is seen in []. A more comprehensive study [] describes the use of the row-net and column-net hypergraph models in 1D sparse matrix partitioning. For different views and alternatives on vector partitioning, see [, , ]. A fair treatment of parallel sparse matrix-vector multiplication, its analysis, and investigations of certain matrix types along with the use of hypergraph partitioning is given in [, Chapter ]. Further analysis of hypergraph partitioning on some model problems is given in []. Hypergraph partitioning schemes for preconditioned iterative methods are given in [], where vertex amalgamation and multi-constraint weighting to represent different phases of computations are introduced. The application of this methodology to multiphase, multiphysics, and multi-mesh simulations is also discussed in the same paper. Some other methods for sparse matrix partitioning using hypergraphs can be found in [], including the jagged-like and checkerboard partitioning methods, and in [], which describes the orthogonal recursive bisection approach. A recipe to choose a partitioning method for a given matrix is given in []. The use of hypergraph models for permuting matrices into special forms, such as the singly bordered block-diagonal form, can be found in []. This permutation can be leveraged to develop hypergraph partitioning-based symmetric [, ] and nonsymmetric [] nested-dissection orderings. The standard hypergraph partitioning formulation and the hypergraph partitioning with fixed vertices formulation are used, respectively, for static and dynamic load balancing for some scientific applications in [, ]. Some other applications of hypergraph partitioning are briefly summarized in []. These include image-space parallel direct volume rendering, parallel mixed integer linear programming, data declustering for multi-disk databases, scheduling file-sharing tasks in heterogeneous master-slave computing environments, work-stealing scheduling, road network clustering methods for efficient query processing, pattern-based data clustering, reducing software development and maintenance costs, processing spatial join operations, and improving locality in memory or cache performance.
Bibliography
1. Ababei C, Selvakkumaran N, Bazargan K, Karypis G () Multiobjective circuit partitioning for cutsize and path-based delay minimization. In: Proceedings of ICCAD, San Jose, CA, November
2. Aykanat C, Cambazoglu BB, Uçar B () Multi-level direct k-way hypergraph partitioning with multiple constraints and fixed vertices. J Parallel Distr Comput ():–
3. Aykanat C, Pınar A, Çatalyürek UV () Permuting sparse rectangular matrices into block-diagonal form. SIAM J Sci Comput ():–
4. Bisseling RH () Parallel scientific computation: a structured approach using BSP and MPI. Oxford University Press, Oxford, UK
5. Bisseling RH, Meesen W () Communication balancing in parallel sparse matrix-vector multiplication. Electron Trans Numer Anal :–
6. Boman E, Devine K, Heaphy R, Hendrickson B, Leung V, Riesen LA, Vaughan C, Catalyurek U, Bozdag D, Mitchell W, Teresco J () Zoltan: parallel partitioning, load balancing, and data-management services; user's guide. Sandia National Laboratories, Albuquerque, NM. Technical Report SAND-W. http://www.cs.sandia.gov/Zoltan/ug_html/ug.html
7. Cambazoglu BB, Aykanat C () Hypergraph-partitioning-based remapping models for image-space-parallel direct volume rendering of unstructured grids. IEEE Trans Parallel Distr Syst ():–
8. Catalyurek U, Boman E, Devine K, Bozdag D, Heaphy R, Riesen L () A repartitioning hypergraph model for dynamic load balancing. J Parallel Distr Comput ():–
9. Çatalyürek UV () Hypergraph models for sparse matrix partitioning and reordering. Ph.D. thesis, Computer Engineering and Information Science, Bilkent University. Available at http://www.cs.bilkent.edu.tr/tech-reports//ABSTRACTS..html
10. Çatalyürek UV, Aykanat C () A hypergraph model for mapping repeated sparse matrix-vector product computations onto multicomputers. In: Proceedings of International Conference on High Performance Computing (HiPC), Goa, India, December
11. Çatalyürek UV, Aykanat C () Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Trans Parallel Distr Syst ():–
12. Çatalyürek UV, Aykanat C () PaToH: a multilevel hypergraph partitioning tool. Department of Computer Engineering, Bilkent University, Ankara, Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm
13. Çatalyürek UV, Aykanat C () A fine-grain hypergraph model for 2D decomposition of sparse matrices. In: Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), San Francisco, CA, April
14. Çatalyürek UV, Aykanat C () A hypergraph-partitioning approach for coarse-grain decomposition. In: ACM/IEEE SC, Denver, CO, November
15. Çatalyürek UV, Aykanat C, Kayaaslan E () Hypergraph partitioning-based fill-reducing ordering. Technical Report, Department of Biomedical Informatics, The Ohio State University and Computer Engineering Department, Bilkent University (Submitted)
16. Çatalyürek UV, Aykanat C, Uçar B () On two-dimensional sparse matrix partitioning: models, methods, and a recipe. SIAM J Sci Comput ():–
17. Grigori L, Boman E, Donfack S, Davis T () Hypergraph unsymmetric nested dissection ordering for sparse LU factorization. Technical Report, Sandia National Labs. Submitted to SIAM J Sci Comput
18. Karypis G, Kumar V () Multilevel algorithms for multi-constraint hypergraph partitioning. Technical Report, Department of Computer Science, University of Minnesota/Army HPC Research Center, Minneapolis, MN
19. Karypis G, Kumar V, Aggarwal R, Shekhar S () hMeTiS: a hypergraph partitioning package. Department of Computer Science, University of Minnesota/Army HPC Research Center, Minneapolis
20. Lengauer T () Combinatorial algorithms for integrated circuit layout. Wiley–Teubner, Chichester
21. Selvakkumaran N, Karypis G () Multi-objective hypergraph partitioning algorithms for cut and maximum subdomain degree minimization. In: Proceedings of ICCAD, San Jose, CA, November
22. Uçar B, Aykanat C () Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies. SIAM J Sci Comput ():–
23. Uçar B, Aykanat C () Partitioning sparse matrices for parallel preconditioned iterative methods. SIAM J Sci Comput ():–
24. Uçar B, Aykanat C () Revisiting hypergraph models for sparse matrix partitioning. SIAM Review ():–
25. Uçar B, Çatalyürek UV () On the scalability of hypergraph models for sparse matrix partitioning. In: Danelutto M, Bourgeois J, Gross T (eds) Proceedings of the Euromicro Conference on Parallel, Distributed, and Network-based Processing, IEEE Computer Society, Conference Publishing Services, pp –
26. Uçar B, Çatalyürek UV, Aykanat C () A matrix partitioning interface to PaToH in MATLAB. Parallel Computing (–):–
27. Vastenhouw B, Bisseling RH () A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. SIAM Review ():–
Hyperplane Partitioning
Tiling
HyperTransport
Federico Silla
Universidad Politécnica de Valencia, Valencia, Spain

Synonyms
HT; HT.

Definition
HyperTransport is a scalable packet-based, high-bandwidth, and low-latency point-to-point interconnect technology intended to interconnect processors and also link them to I/O peripheral devices. HyperTransport was initially devised as an efficient replacement for traditional system buses for on-board communications. Nevertheless, the latest extension to the standard, referred to as High Node Count HyperTransport, as well as the recent standardization of new cables and connectors, allow HyperTransport to efficiently extend its interconnection capabilities beyond a single motherboard and become a very efficient technology to interconnect processors and I/O devices in a cluster. HyperTransport is an open standard managed by the HyperTransport Consortium.

Discussion
A complete description of the HyperTransport technology should include both an introduction to the protocol used by communicating devices to exchange data and also a description of the electrical interface that this technology makes use of in order to achieve its tremendous performance. However, describing the electrical interface is less interesting than the protocol itself and, additionally, requires the reader to have a substantial electrical background. Therefore, the following description focuses on the protocol used by the HyperTransport technology, leaving aside the electrical details. On the other hand, the protocol used by HyperTransport is a quite complex piece of technology, thus requiring an extensive explanation, which may be out of the scope of this encyclopedia. For this reason, the following description is only a brief introduction to HyperTransport. Finally, the reader should note that AMD uses an extended version of the HyperTransport protocol in order to provide cache coherency among the processors in a system. Such an extended protocol, usually referred to as coherent HyperTransport (cHT), is proprietary to AMD, and therefore it is not described in this document. Nevertheless, the main difference between the two protocols is that the coherent one includes some additional types of packets.
HyperTransport Links
The HyperTransport technology is a point-to-point communication standard, meaning that each of the HyperTransport links in the system connects exactly two devices. Figure shows a simplified diagram of a system that deploys HyperTransport in order to interconnect the devices it is composed of. As can be seen in that figure, the main processor is connected to a PCIe device by means of a HyperTransport link. That PCIe device is, in turn, connected to a Gigabit Ethernet device, which is additionally connected to a SATA device. This device connects to a USB device. All of these devices make up a HyperTransport daisy chain. Nevertheless, devices can implement multiple HyperTransport links in order to build larger HyperTransport fabrics. Each of the links depicted in Fig. consists of two unidirectional and independent sets of signals. Each of these sets includes its own CAD signal, as well as CTL and CLK signals.
HyperTransport. Fig. System deploying HyperTransport: a CPU with attached memory connected through a HyperTransport daisy chain to PCIe, Gigabit Ethernet, SATA, and USB devices
The CAD signal (named after Command, Address, and Data) carries control and data packets. The CTL signal (named after ConTroL) is intended to differentiate control from data packets in the CAD signal. Finally, the CLK signal (named after CLocK) carries the clock for the CAD and CTL signals. Figure shows a diagram presenting all these signals. The width of the CAD signal is variable, from 2 to 32 bits. Actually, not all values are possible: only widths of 2, 4, 8, 16, or 32 bits are allowed. Nevertheless, note that the HyperTransport protocol remains the same independently of the exact link width. More precisely, the format of the packets exchanged among HyperTransport devices does not depend on the link width. However, a given packet will require more time to be transmitted on a narrow link than on a wide one. The reason for having several widths for the CAD signal is to allow system developers to tune their system for a given performance/cost design point. Narrower links will be cheaper to implement, but they will also provide lower performance. On the other hand, the width of the CAD signal in each of the unidirectional portions of the link may be different, that is, HyperTransport allows asymmetrical link widths. Therefore, a given device may implement a wide link in one direction while deploying a narrower link in the other. The usefulness of such asymmetry is based on the fact that, if such a device sends most of its data in one direction and receives a limited amount of data in the other direction, then the system designer can reduce manufacturing cost by providing a wide link in the direction that requires higher
bandwidth and a narrow link for the opposite direction, which requires much less bandwidth. In addition to the variable link width, HyperTransport also supports variable clock speeds, thus further increasing the possibilities the system designer has for tuning the bandwidth of the links. The clock speeds currently supported by the HyperTransport specification range from 200 MHz up to 3.2 GHz. Moreover, the clock frequency in the two directions of a link does not need to be the same, thus introducing an additional degree of asymmetry. On the other hand, the clock mechanism used in HyperTransport is referred to as double data rate (DDR), which means that both the rising and falling edges of the clock signal are used to latch data, thus achieving an effective clock frequency that doubles the actual clock rate. In summary, when variable link width is combined with variable clock frequency, HyperTransport links present an extraordinary scalability in terms of performance. For example, when both directions of a link are 2 bits wide working at 200 MHz, the link bandwidth is 200 MB/s. This would be the lowest-performance/lowest-cost configuration. At the opposite extreme, when 32-bit wide links are used at 3.2 GHz, the overall link performance rises up to 51.2 GB/s. This implementation represents the highest-bandwidth configuration, also presenting the highest manufacturing cost. This link configuration could be interesting for extreme performance systems, for example, which usually require as much bandwidth as possible. On the other hand, slow devices not
HyperTransport. Fig. Signals in a HyperTransport link: each direction between device 1 and device 2 carries its own CAD, CTL, and CLK signals
requiring high bandwidth could reduce cost by using narrow links. If these links are additionally clocked at low frequencies, then power consumption is further reduced. In the same way that the CAD signal can be implemented with a variable width, the CTL and CLK signals can also be implemented with several widths. However, their width does not depend on the implementer's criterion to choose a performance/cost design point, but on the width of the CAD signal. Additionally, as the CAD signal can present a different width in each direction of the link, the width of the CTL and CLK signals can also be different in each direction, depending on the corresponding CAD signal in that direction. In the case of the CTL signal, there is an individual CTL bit for each set of 8, or fewer, CAD bits. Therefore, 1, 2, or 4 CTL bits can be found in a HyperTransport link. Moreover, the CTL bits are encoded in such a way that four CTL bits are transferred every 32 CAD bits. These four bits are intended to denote different flavors of the information being transmitted in the CAD signal. These different flavors may include, for example, that a command is being transmitted, that a CRC for a command without data is in the CAD signal, that the CAD signal is being used by a data packet, etc. In the case of the CLK signal, a HyperTransport link has an individual CLK bit for every set of 8, or fewer, CAD bits. Thus, the number of CTL and CLK bits in a given link is the same. The reason for having a CLK bit for every 8 bits of the CAD signal is that link implementation is made easier. Effectively, the HyperTransport clocking scheme requires that the skew between clock and data signals be minimized in order to achieve high transmission rates. Therefore, having a CLK bit for every 8 CAD bits ensures that differences in trace lengths in the board layout are much smaller than if there were just one CLK bit for the entire set of CAD bits. In addition to the signals mentioned above, all HyperTransport devices share one PWROK and one RESET# signal for initialization and reset purposes. Moreover, if devices require power management, they should additionally include LDTSTOP# and LDTREQ# signals.
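The bandwidth figures quoted above follow directly from the DDR clocking described in the text. The small sketch below simply makes that arithmetic explicit (counting two transfers per clock cycle and both unidirectional sub-links); it is an illustration, not part of the specification.

def ht_link_bandwidth_bytes_per_s(width_bits, clock_hz, directions=2):
    transfers_per_s = 2 * clock_hz                 # DDR: data latched on both clock edges
    per_direction = transfers_per_s * width_bits / 8
    return directions * per_direction

# Example: a 2-bit link at 200 MHz gives 200 MB/s aggregate,
# while a 32-bit link at 3.2 GHz gives 51.2 GB/s aggregate.
print(ht_link_bandwidth_bytes_per_s(2, 200e6))     # 2.0e8  -> 200 MB/s
print(ht_link_bandwidth_bytes_per_s(32, 3.2e9))    # 5.12e10 -> 51.2 GB/s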
HyperTransport Packets
Having described the links that connect devices in a HyperTransport fabric, this section presents the packets
that are forwarded along those links. Packets in HyperTransport are multiples of 4 bytes long and carry the command, address, and data associated with each transaction among devices. Packets can be classified into control and data packets. Control packets are used to initiate and finalize transactions as well as to manage several HyperTransport features, and consist of 4 or 8 bytes. Control packets can be classified into information, request, and response packets. Information packets are used for several link management purposes, like link synchronization, error condition signaling, and updating flow control information. Information packets are always 4 bytes long and can only be exchanged among adjacent devices directly interconnected by a link. On the other hand, request and response control packets are used to build HyperTransport transactions. Request packets, which are 4 or 8 bytes long, are used to initiate HyperTransport transactions. Response packets, which are always 4 bytes long, are used in the response phase of transactions to reply to a previous request. Table shows the different request and response types of packets. As can be seen, there are two different types of sized writes: posted and non-posted. Although both types write data to the target device of the request packet, their semantics are different. Non-posted writes require a response packet to be sent back to the requesting device in order to confirm that the operation has completed. On the other hand, posted writes do not require such confirmation.
HyperTransport. Table Types of request and response control packets

Request packets: Sized read; Sized write (non-posted); Sized write (posted); Atomic read-modify-write; Broadcast; Flush; Fence; Address extension; Source identifier extension
Response packets: Read response; Target done
All the packet types will be further described in the next section, except the extension packets. These packets are 4-byte-long extensions that prepend some of the other packets, when required. Their purpose is to allow 64-bit addressing instead of the 40-bit addressing used by default, in the case of the Address Extension, or 16-bit source identifiers in bus-device-function format, in the case of the Source Identifier Extension. With respect to data packets, they carry the actual data transferred among devices. Data packets only include data, with no associated control fields. Therefore, data packets immediately follow the associated control packet. For example, a read response control packet, which includes no data, will be followed by the associated data packet carrying the read data. In the same way, a write request control packet will be followed by the data to be written. The data packet length depends on the command that generated that data packet. Nevertheless, the maximum length is 64 bytes.
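The packet taxonomy above can be modeled compactly in code. The sketch below is not tied to the actual on-the-wire encoding of HyperTransport; it only captures the classification into request and response control packets, optionally followed by a data packet of bounded size.

from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Union

class RequestType(Enum):
    SIZED_READ = auto()
    SIZED_WRITE_NONPOSTED = auto()
    SIZED_WRITE_POSTED = auto()
    ATOMIC_RMW = auto()
    BROADCAST = auto()
    FLUSH = auto()
    FENCE = auto()

class ResponseType(Enum):
    READ_RESPONSE = auto()
    TARGET_DONE = auto()

@dataclass
class ControlPacket:
    kind: Union[RequestType, ResponseType]
    payload: Optional[bytes] = None   # trailing data packet, if any

    def __post_init__(self):
        # Data packets carry at most 64 bytes (per the text above).
        if self.payload is not None and len(self.payload) > 64:
            raise ValueError("HyperTransport data packets carry at most 64 bytes")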
HyperTransport Transactions
HyperTransport devices transfer data by means of transactions. For example, every time a device requests some data from another device, the former initiates a read transaction targeted to the latter. Write transactions happen in a similar way. For example, when a program writes some data to disk, a write transaction is initiated between the processor, which reads the data from main memory, and the SATA device depicted in Fig. . Other transactions include broadcasting a message or explicitly taking control of the order of pending transactions. Every transaction has a request phase. Request control packets are used in this phase. The exact request control packet to be used depends on the particular transaction taking place. On the other hand, many transactions require a response stage. Response control packets are used in this case, for example, to return read data from the target of the transaction, or to confirm its completion. There are six basic transaction types:
● Sized read transaction
● Sized write transaction
● Atomic read-modify-write transaction
● Broadcast transaction
● Flush transaction
● Fence transaction
Sized Read Transaction
Sized read transactions are used by devices when they request data located in the address space of another device. For example, when a device wants to read data from main memory, it starts a read transaction targeted to the main processor. Also, when the processor requires some data from the USB device in Fig. , it will issue a read transaction destined to that device. Read transactions begin with a sized read request control packet being issued by the device requesting the data. Once this packet reaches the destination device, that device accesses the requested data and generates a response. This response is composed of a read response control packet followed by the read data, included in a data packet. Once these two packets arrive at the device that initiated the process, the transaction is completed. It is worth mentioning that during the time elapsed from when the requestor delivers the read request on the link until it receives the corresponding response, the HyperTransport chain is not idle. On the contrary, as HyperTransport is a split-transaction protocol, other transactions can be issued (and even finalized) before the requestor receives the required data.

Sized Write Transaction
Sized write transactions are similar to sized read ones, with the difference that the requestor device writes data to the target device instead of requesting data from it. A write transaction may happen when a device sends data to memory, or when the processor writes back data from memory to disk, for example. There are two different sized write transactions: posted and non-posted. Posted write transactions start when the requestor sends to the target a posted write request control packet followed by a data packet containing the data to be written. In this case, because of the posted nature of the transaction, no response packet is sent back to the requestor. On the other hand, non-posted write transactions begin with a non-posted write control packet being issued by the requestor (followed by a data packet). In this case, when both packets reach the destination and the data are written, the target issues back a target-done response control packet to the requestor. When the requestor receives this response, the transaction is finished.
Atomic Read-Modify-Write Transaction
Atomic read-modify-write transactions are intended to atomically access a memory location and modify it. This means that no other device in the system may access the same memory location during the time required to read and modify it. This is useful to avoid race conditions among devices while performing the mutual exclusion required to access a critical section, for example. Two different types of atomic operations are allowed:
● Fetch and Add
● Compare and Swap

The Fetch and Add operation is:

Fetch_and_Add(Out, Addr, In) {
    Out = Mem[Addr];
    Mem[Addr] = Mem[Addr] + In;
}

The Compare and Swap operation is:

Compare_and_Swap(Out, Addr, Compare, In) {
    Out = Mem[Addr];
    if (Mem[Addr] == Compare)
        Mem[Addr] = In;
}

The atomic transaction begins when the requestor issues an atomic read-modify-write request control packet on the link, followed by a data packet containing the argument of the atomic operation. Once both packets are received at the target and it performs the requested atomic operation, it will send back a read response control packet followed by a data packet containing the original data read from the memory location.
Broadcast Transaction
This transaction is used by the processor to communicate information to all HyperTransport devices in the system. This transaction is started with a broadcast request control packet, which can only be issued by the processor. All the other devices accept that packet and forward it to the rest of the devices in the system. Broadcast requests include halt, shutdown, and End-Of-Interrupt commands.
Flush Transaction
Posted writes do not generate any response once completed. Therefore, when a device issues one or more posted writes targeted to the main processor in order to write some data to main memory, the issuing device has no way to know that those writes have effectively completed their way to memory. Thus, if some of the posted writes have not been completely written to memory, that data will not be visible to other devices in the system. In this scenario, flush transactions allow the device that issued the posted writes to make sure that the data have reached main memory by flushing all pending posted writes to memory. Flush transactions begin when a device issues a flush control packet targeted to the processor. Once this control packet reaches the destination, all pending transactions will be flushed to memory and then the processor will generate a target-done response control packet back to the requestor. Once this packet is received, the transaction is completed.

Fence Transaction
The fence transaction is intended to provide a barrier among posted writes. When a processor (the only possible target of a fence transaction) receives a fence command, it will make sure that no posted write received after the fence command is written to memory before any of the posted writes received earlier than the fence command. The fence transaction begins when a device issues a fence control packet. No response is generated for this transaction.
Virtual Channels and Flow Control in HyperTransport
All the different types of control and data packets are multiplexed on the same link and stored in input buffers when they reach the receiving end of the link. Then, they are either accepted by that device, in case they are targeted to it, or forwarded to the next link in the chain. If packets are not carefully managed, a protocol deadlock may occur. For example, if many devices in the system issue a large number of non-posted requests, those requests may fill up all the available buffers and hinder responses from making forward progress back to the initial requestors. In this case, requestors would stop
forever because they are waiting for responses that will never arrive, because the interconnect is full of requests that prevent responses from advancing. Additionally, requests stored in the intermediate buffers will not be able to advance toward their destination, because output buffers at the targets will be filled with responses that cannot enter the link, thus preventing targets from accepting new requests from the link because they have free space neither to store them nor to store the responses they would generate. As can be seen, in this situation, no packet can advance because of the lack of available free buffers. The result is that the system freezes. In order to avoid such deadlocks, HyperTransport splits traffic into virtual channels and stores different types of packets in buffers belonging to different virtual channels. Additionally, HyperTransport does not allow packets traveling in one virtual channel to move to another virtual channel. In this way, if non-posted requests use a virtual channel different from the one used by responses, the deadlock described above can be avoided. HyperTransport defines a base set of virtual channels that must be supported by all HyperTransport devices. Moreover, some additional virtual channel sets are also defined, although support for them is optional. The base set includes three different virtual channels:
● The posted request virtual channel, which carries posted write transactions
● The non-posted request virtual channel, which includes reads, non-posted writes, and flush packets
● The response virtual channel, which is responsible for read responses and target-done control packets
In addition to separating traffic into the three virtual channels mentioned above, each device must implement separate control and data buffers for each of the virtual channels. Therefore, there are six types of buffers:
● Non-posted request control buffer
● Posted request control buffer
● Response control buffer
● Non-posted request data buffer
● Posted request data buffer
● Response data buffer
Figure shows the basic buffer configuration for a HyperTransport link. The exact number of packets that
can be stored in each buffer depends on the implementation. Nevertheless, request and response buffers must contain, at least, enough space to store the largest control packet of that type. Also, all data buffers can hold 64 bytes. Moreover, in order to improve performance, a HyperTransport device may have larger buffers, able to store multiple packets of each type. The HyperTransport protocol states that a transmitter should not issue a packet that cannot be stored by the receiver. Thus, the transmitter must know how many buffers of each type the receiver has available. To achieve this, a credit-based scheme is used between transmitters and receivers. With such a scheme, the transmitter has a counter for each type of buffer implemented at the receiver. When the transmitter sends a packet, it decrements the associated counter. When one of the counters reaches zero, the transmitter stops sending packets of that type. On the other hand, when the receiver frees a buffer, it sends back a NOP control packet (No Operation Packet) to the transmitter in order to update it about space availability.
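The credit-based scheme just described can be sketched as follows. The class and method names below are invented for illustration; real hardware tracks one credit counter per buffer type (control and data, for each of the three base virtual channels) exactly as described in the text.

class CreditedTransmitter:
    def __init__(self, initial_credits):
        # e.g., {"posted_ctrl": 4, "posted_data": 2, "nonposted_ctrl": 4, ...}
        self.credits = dict(initial_credits)
        self.pending = []

    def try_send(self, buffer_type, packet, link):
        if self.credits.get(buffer_type, 0) == 0:
            self.pending.append((buffer_type, packet))   # no free buffer: stall this packet type
            return False
        self.credits[buffer_type] -= 1                   # consume one credit for the receiver buffer
        link.deliver(packet)
        return True

    def on_nop(self, released):
        # A NOP control packet from the receiver reports freed buffers; replenish credits.
        for buffer_type, count in released.items():
            self.credits[buffer_type] = self.credits.get(buffer_type, 0) + count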
Extending the HyperTransport Fabric
The system depicted in Fig. consists of several HyperTransport devices interconnected in a daisy chain. However, more complex topologies can also be implemented by using HyperTransport bridges. Bridges in HyperTransport are devices having a primary link connecting toward the processor and one or more secondary links that allow extending the topology in the opposite direction. In this way, HyperTransport trees, like the one shown in Fig. , can be implemented. In addition to the use of bridges, HyperTransport defines two more features able to expand a HyperTransport system. These features are the AC mode and the HTX connector. The AC mode allows devices to be connected over longer distances than allowed by regular links, which make use of the DC mode. On the other hand, the HTX connector allows external expansion cards to be plugged into the HyperTransport link and be presented to the rest of the system as any other HyperTransport device.
HyperTransport. Fig. Buffer configuration for a HyperTransport link: for each link direction, the transmitter in one device and the receiver in the other implement separate control and data buffers for the posted request, non-posted request, and response virtual channels

Improving the Scalability of HyperTransport
As shown above, HyperTransport offers some degree of scalability that enables the implementation of efficient
topologies. Nevertheless, as HyperTransport was initially designed to replace traditional system buses, its benefits are mainly confined to interconnects within a single motherboard. Therefore, when HyperTransport topologies are to be scaled to larger sizes, in order to interconnect the processors and I/O subsystems in several motherboards (e.g., an entire cluster), HyperTransport is not able to do so because such large system sizes require routing capabilities that exceed current
HyperTransport ones. More specifically, HyperTransport is not able to provide device addressability beyond 32 devices. Additionally, it does not support efficient routing in scalable network topologies. As a result, high-performance computing vendors have no choice but to complement HyperTransport with other interconnect technologies, like InfiniBand in the case of general-purpose clusters, or proprietary interconnects, as in the case of Cray's XT-series supercomputers [].
HyperTransport. Fig. HyperTransport tree topology: a CPU with attached memory is connected to HyperTransport devices and to a HyperTransport bridge, whose secondary links connect further HyperTransport devices

Related Entries
Busses and Crossbars
Data Centers
Interconnection Networks
PCI-Express
PGAS (Partitioned Global Address Space) Languages
Bibliographic Notes and Further Reading The complete description of HyperTransport . can be found in the HyperTransport I/O link specification . []. Readers are also encouraged to look up more information on HyperTransport in []; this book nicely describes the HyperTransport technology. A complete description of the High Node Count HyperTransport Specification can be found in []. Finally, many white papers and additional information are publicly available on the HyperTransport Consortium web site [].
Bibliography
. Cray Inc. () Cray XT specifications. Available online at http://www.cray.com
. Duato J, Silla F, Holden B, Miranda P, Underhill J, Cavalli M, Yalamanchili S, Brüning U () Extending HyperTransport protocol for improved scalability. In: Proceedings of the first international workshop on HyperTransport research and applications, Mannheim, Germany, pp –
. Holden B, Trodden J, Anderson D () HyperTransport . interconnect technology: a comprehensive guide to the st, nd, and rd generations. MindShare Inc, Colorado Springs, CO
. HyperTransport Technology Consortium web site. http://www.hypertransport.org. Accessed
. HyperTransport Technology Consortium. HyperTransport I/O link specification revision .. Available online at http://www.hypertransport.org. Accessed
I
IBM Blue Gene Supercomputer
Alan Gara, José E. Moreira
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Synonyms Blue Gene/L; Blue Gene/P; Blue Gene/Q
Definition The IBM Blue Gene Supercomputer is a massively parallel system based on the PowerPC processor. A Blue Gene/L system at Lawrence Livermore National Laboratory held the number position in the TOP list of fastest computers in the world from November to November .
Discussion Introduction The IBM Blue Gene Supercomputer is a massively parallel system based on the PowerPC processor. It derives its computing power from scalability and energy efficiency. Each computing node of Blue Gene is optimized to achieve high computational rate per unit of power and to operate with other nodes in parallel. This approach results in a system that can scale to very large sizes and deliver substantial aggregate performance. Most large parallel systems in the – time frame followed a model of using off-the-shelf processors (typically from Intel, AMD, or IBM) and interconnecting them with either an industry standard network (e.g., Ethernet or Infiniband) or a proprietary network (e.g., as used by Cray or IBM). Blue Gene took a different approach by designing a dedicated system-on-a-chip (SoC) that included not only processors optimized for floating-point computing, but also the networking infrastructure to interconnect these building blocks into
a large system. This customized approach led to the scalability and power efficiency characteristics that differentiated Blue Gene from other machines that existed at the time of its commercial introduction in . As of , IBM has produced two commercial versions of Blue Gene, Blue Gene/L and Blue Gene/P, which were first delivered to customers in and , respectively. A third version, Blue Gene/Q, was under development. Both delivered versions follow the same design principles, the same system architecture, and the same software architecture. They differ in the specifics of the basic SoC that serves as the building block for Blue Gene. The November TOP list includes four Blue Gene/L systems and ten Blue Gene/P systems (and one prototype Blue Gene/Q system). In this article we cover mostly the common aspects of both versions of the Blue Gene supercomputer and discuss details specific to each version as appropriate.
System Architecture A Blue Gene system consists of a compute section, a file server section, and a host section (Fig. ). The compute and I/O nodes in the compute section form the computational core of Blue Gene. User jobs run in the compute nodes, while the I/O nodes connect the compute section to the file servers and front-end nodes through an Ethernet network. The file server section consists of a set of file servers. The host section consists of a service node and one or more front-end nodes. The service node controls the compute section through an Ethernet control network. The front-end nodes provide job compilation, job submission, and job debugging services.
IBM Blue Gene Supercomputer. Fig. High-level view of a Blue Gene system (figure: the compute section, containing compute nodes and I/O nodes, connects through an I/O Ethernet to the file servers and through a control Ethernet network to the host section, which contains the service node and the front-end nodes)
Compute Section The compute section of Blue Gene is what is usually called a Blue Gene machine. It consists of a three-dimensional array of compute nodes interconnected in a toroidal topology along the x, y, and z axes. I/O nodes are distinct from compute nodes and not in the toroidal interconnect, but also belong to the compute section. A collective network interconnects all I/O and compute nodes of a Blue Gene machine. Each I/O node communicates outside of the machine through an Ethernet link. Compute and I/O nodes are built out of the same Blue Gene compute ASIC (application-specific integrated circuit) and memory (DRAM) chips. The difference is in the function they perform. Whereas compute nodes connect to each other for passing application data, I/O nodes form the interface between the compute nodes and the outside world by connecting to an Ethernet network. Reflecting the difference in function, the software stacks of the compute and I/O nodes are also different, as will be discussed below. The particular characteristics of the Blue Gene compute ASIC for the two versions (Blue Gene/L and Blue Gene/P) are illustrated in Fig. and summarized in Table . The Blue Gene/L compute ASIC contains two non-memory coherent PowerPC cores, each with private L data and instruction caches ( KiB each). Associated with each core is a small ( KiB) L cache that acts as a prefetch engine. Completing the on-chip memory hierarchy is MiB of embedded DRAM (eDRAM) that is configured to operate as a shared L cache. Also on the ASIC is a memory controller
(for external DRAM) and interfaces to the five networks used to interconnect Blue Gene/L compute and I/O nodes: torus, collective, global barrier, Ethernet, and control (JTAG) network. The Blue Gene/P compute ASIC contains four memory coherent PowerPC cores, each with private L data and instruction caches ( KiB each). Associated with each core are both a small ( KiB) L cache that acts as a prefetch engine, as in Blue Gene/L, and a snoop filter to reduce coherence traffic into each core. Two banks of MiB eDRAM are configured to operate as a shared L cache. Completing the ASIC are a dual memory controller (for external DRAM), interfaces to the same five networks as Blue Gene/L, and a DMA engine to transfer data directly from the memory of one node to another. Both the PowerPC and PowerPC cores include a vector floating-point unit that extends the PowerPC instruction set architecture with instructions that assist in matrix and complex-arithmetic operations. The vector floating-point unit can perform two fused multiply-add operations per cycle for a total of four floating-point operations per cycle per core.
IBM Blue Gene Supercomputer. Fig. Blue Gene compute ASIC for Blue Gene/L (a) and Blue Gene/P (b) (figure: the Blue Gene/L ASIC contains two PPC440 cores with double FPUs, L2 prefetch caches, a shared eDRAM L3 cache, shared SRAM, a DDR controller with ECC, and torus, collective, global barrier, Gbit Ethernet, and JTAG interfaces; the Blue Gene/P ASIC contains four PPC450 cores with snoop filters, a multiplexing switch, two shared eDRAM L3 banks, a DMA engine, two DDR-2 controllers with ECC, and torus, collective, global barrier, 10 Gbit Ethernet, and JTAG interfaces)
IBM Blue Gene Supercomputer. Table Summary of differences between Blue Gene/L and Blue Gene/P nodes
Node characteristics | Blue Gene/L | Blue Gene/P
Processors | Two PowerPC | Four PowerPC
Processor frequency | MHz | MHz
Peak computing capacity | . Gflop/s ( flops/cycle/core) | . Gflop/s ( flops/cycle/core)
Main memory capacity | MiB or GiB | GiB or GiB
Main memory bandwidth | . GB/s ( byte @ MHz) | . GB/s ( × byte @ MHz)
Torus link bandwidth ( links) | MB/s per direction | MB/s per direction
Collective link bandwidth ( links) | MB/s per direction | MB/s per direction
The interconnection networks are primarily used to implement communication primitives for parallel high-performance computing applications. The main interconnection network is the torus network, which provides point-to-point and multicast communication across compute nodes. A collective network in a tree topology interconnects all compute and I/O nodes and supports efficient collective operations, such as reductions and broadcasts. Arithmetic and logical operations are implemented as part of the communication primitives to facilitate low-latency collective operations. A global barrier network supports fast synchronization and notification across compute and I/O nodes. The Ethernet network is used to connect the I/O nodes outside the machine, as previously discussed. Finally, the control network is used to control the hardware from the service node. Figure illustrates the topology of the various networks. Compute and I/O nodes are grouped into units of midplanes. A midplane contains node cards and each node card contains compute nodes and up to four (Blue Gene/L) or two (Blue Gene/P) I/O nodes. The compute nodes in a midplane are arranged as an ×× three-dimensional mesh. Midplanes are typically configured with – (Blue Gene/P) or – (Blue Gene/L) I/O nodes. Each midplane also has link chips used for inter-midplane connection to build larger systems. The link chips also implement the “closing” of the toroidal interconnect, by connecting the two ends of a dimension. Midplanes are arranged two to a rack, and racks are arranged in a two-dimensional layout of rows and columns. Because the midplane is the basic replication unit, the dimensions of the array of compute nodes must be a multiple of . The configuration of the original Blue Gene/L system at Lawrence Livermore National Laboratory (LLNL) is shown in Fig. . That machine was later upgraded to racks and is the largest Blue Gene
system delivered to date. The highest performing Blue Gene system delivered to date is a -rack Blue Gene/P system at Juelich Research Center. Several single-rack (, compute nodes) Blue Gene systems, as well as systems of intermediate size, were also delivered to various customers. A given Blue Gene machine can be partitioned along midplane boundaries. A partition is formed by a rectangular arrangement of midplanes. Each partition can run only one job at any given time. During each job, all the compute nodes of a partition stay in the same execution mode for the duration of the job. These modes of execution are described in section Overall Operating System Architecture.
File Server Section The file server section of a Blue Gene system provides the storage for the file system that runs on the Blue Gene I/O nodes. Several parallel file systems have been ported to Blue Gene, including GPFS, PVFS, and Lustre. To feed data to a Blue Gene system, multiple servers are required to achieve the required bandwidth. The original Blue Gene/L system at LLNL, for example, uses servers operating in parallel. Data is striped across those servers, and a multi-level switching Ethernet network is used to connect the I/O nodes to the servers. The servers themselves are standard rack-mounted machines, typically Intel, AMD, or POWER processor based.
IBM Blue Gene Supercomputer. Fig. Blue Gene networks. Three-dimensional torus (a), collective network (b), and control network and Ethernet (c) (figure: compute nodes and I/O nodes; the I/O nodes attach to the I/O Ethernet, and the control Ethernet reaches the nodes through an FPGA and JTAG)
Host Section The host section for a Blue Gene/L system consists of one service node and one or more front-end nodes. These nodes are standard POWER processor machines. For the LLNL machine, the service node is a -processor POWER machine, and each of the front-end nodes is a PowerPC blade. The service node is responsible for controlling and monitoring the operation of the compute section. The services it implements include: machine partitioning, partition boot, application launch, standard I/O routing, application signaling and termination, event monitoring (for events generated by the compute and I/O nodes), and environmental monitoring (for things like power supply voltages, fan speeds, and temperatures).
The front-end nodes are where users work. They provide access to compilers, debuggers, and job submission services. Console I/O from user applications is routed to the submitting front-end node.
IBM Blue Gene Supercomputer. Fig. Packaging hierarchy for the original Blue Gene/L at Lawrence Livermore National Laboratory (figure: chip, 2 processors, 5.6 GF/s, 4 MB; compute card, 2 chips (1x2x1), 11.2 GF/s, 1.0 GB; node card, 32 chips (4x4x2) with 16 compute and 0–2 I/O cards, 180 GF/s, 16 GB; rack, 32 node cards, 5.6 TF/s, 512 GB; system, 64 racks (64x32x32), 360 TF/s, 32 TB)
Blue Gene System Software The primary operating system for Blue Gene compute nodes is a lightweight operating system called the Compute Node Kernel (CNK). This simple kernel implements only a limited set of services, which are complemented by services provided by the I/O nodes. The I/O nodes, in turn, run a version of the Linux operating system. The I/O nodes act as gateways between the outside world and the compute nodes, complementing the services provided by the CNK with file and socket operations, debugging, and signaling. This split of functions between I/O and compute nodes, with the I/O nodes dedicated to system services and the compute nodes dedicated to application execution, resulted in a simplified design for both components. It also enables Blue Gene scalability and robustness and achieves a deterministic execution environment. Scientific middleware for Blue Gene includes a user-level library implementation of the MPI standard, optimized to take advantage of Blue Gene networks, and various math libraries, also in the user level. Implementing all the message passing functions in user mode simplifies the supervisor (kernel) code of Blue Gene and results in better performance by reducing the number of kernel system calls that an application performs.
Overall Operating System Architecture A key concept in the Blue Gene operating system solution is the organization of compute and I/O nodes into logical entities called processing sets or psets. A pset consists of one I/O node and a collection of compute nodes. Every system partition, in turn, is organized as a collection of psets. All psets in a partition must have the same number of compute nodes, and the psets of a partition must cover all the I/O and compute nodes of the partition. The psets of a partition never overlap. The supported pset sizes are (Blue Gene/L only), , , , and compute nodes, plus the I/O node. The psets are a purely logical concept implemented by the Blue Gene system software stack. They are built to reflect the topological proximity between I/O and compute nodes, thus improving communication performance within a pset. A Blue Gene job consists of a collection of N compute processes. Each process has its own private address space and two processes of the same job communicate only through message passing. The
primary communication model for Blue Gene is MPI. The N compute processes of a Blue Gene job correspond to tasks with ranks 0 to N − 1 in the MPI_COMM_WORLD communicator. Compute processes run only on compute nodes; conversely, compute nodes run only compute processes. In Blue Gene/L, the CNK implements two modes of operation for the compute nodes: coprocessor mode and virtual node mode. In the coprocessor mode, the single process in the node has access to the entire node memory. One processor executes user code while the other performs communication functions. In virtual node mode, the node memory is split in half between the two processes running on the two processors. Each process performs both computation and communication functions. In Blue Gene/P, the CNK provides three modes of operation for the compute nodes: SMP mode, dual mode, and quad mode. The SMP mode supports a single multithreaded (up to four threads) application process per compute node. That process has access to the entire node memory. Dual mode supports two multithreaded (up to two threads each) application processes per compute node. Quad mode supports four single-threaded application processes per compute node. In dual and quad modes, the application processes in a node split the memory of that node. Blue Gene/P supports at most one application thread per processor. It also supports an optional communication thread in each processor. Each I/O node runs one image of the Linux operating system. It can offer the entire spectrum of services expected in a Linux box, such as multiprocessing, file systems, and a TCP/IP communication stack. These services are used to extend the capabilities of the compute node kernel, providing a richer functionality to the compute processes. Due to the lack of cache coherency between the processors of a Blue Gene/L node, only one of the processors of each I/O node is used by Linux, while the other processor remains idle. Since the processors in a Blue Gene/P node are cache coherent, Blue Gene/P I/O nodes can run a true multiprocessor Linux instance.
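As an illustration of this programming model (not part of the original entry), the sketch below is a minimal MPI program in C of the kind that would run as the N compute processes of a Blue Gene job. It uses only standard MPI calls; the reduction is the type of collective operation that the collective network is described above as supporting efficiently.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal sketch: each compute process discovers its rank in
 * MPI_COMM_WORLD and communicates only through message passing. */
int main(int argc, char *argv[])
{
    int rank, size;
    double local = 1.0, sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank in 0 .. N-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* N compute processes */

    /* A global reduction of the kind that collective networks accelerate. */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("N = %d, sum = %f\n", size, sum);

    MPI_Finalize();
    return 0;
}
```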
The Compute Node Kernel CNK is a lean operating system that performs a simple sequence of operations at job start time. This sequence
of operations happens in every compute node of a partition, at the start of each job:
1. It creates the address space(s) for execution of compute process(es) in a compute node.
2. It loads code and initialized data for the executable of that (those) process(es).
3. It transfers processor control to the loaded executable, changing from supervisor to user mode.
The address spaces of the processes are flat and fixed, with no paging. The entire mapping is designed to fit statically in the TLBs of the PowerPC processors. The loading of code and data occurs in push mode. The I/O node of a pset reads the executable from the file system and forwards it to all compute nodes in the pset. The CNK in a compute node receives that executable and stores the appropriate memory values in the address space(s) of the compute process(es). Once the CNK transfers control to the user application, its primary mission is to “stay out of the way.” In normal execution, the processor control stays with the compute process until it requests an operating system service through a system call. Exceptions to this normal execution are caused by hardware interrupts: either timer alarms requested by the user code, communication interrupts caused by arriving packets, or an abnormal hardware event that requires attention by the compute node kernel. When a compute process makes a system call, three things may happen (see the sketch after this list):
1. “Simple” system calls that require little operating system functionality, such as getting the time or setting an alarm, are handled locally by the compute node kernel. Control is transferred back to the compute process at completion of the call.
2. “I/O” system calls that require infrastructure for file systems and the IP stack are shipped for execution in the I/O node associated with that compute node (i.e., the I/O node in the pset of the compute node). The compute node kernel waits for a reply from the I/O node, and then returns control back to the compute process.
3. “Unsupported” system calls that require infrastructure not present in Blue Gene are returned right away with an error condition.
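The three-way dispatch can be pictured with the following schematic C sketch. It is illustrative only, not CNK source: the system-call numbers and the helper functions (is_simple, is_io, handle_locally, ship_to_io_node, wait_for_reply) are hypothetical stand-ins for the mechanisms described above.

```c
#include <errno.h>
#include <stdio.h>

/* Illustrative sketch only: hypothetical stand-ins for local handling,
 * function shipping to the I/O node, and the reply path. */
enum { SYS_TIME = 1, SYS_READ = 2, SYS_FORK = 3 };   /* made-up numbers */

static int is_simple(int n) { return n == SYS_TIME; }
static int is_io(int n)     { return n == SYS_READ; }

static long handle_locally(int n)  { (void)n; return 42;  } /* e.g., the time     */
static long ship_to_io_node(int n) { (void)n; return 0;   } /* package + forward  */
static long wait_for_reply(void)   { return 128;          } /* e.g., bytes read   */

/* Three-way dispatch corresponding to the list above. */
static long cnk_syscall(int number)
{
    if (is_simple(number))
        return handle_locally(number);   /* handled in the kernel itself      */
    if (is_io(number)) {
        ship_to_io_node(number);         /* forwarded to the I/O node's CIOD  */
        return wait_for_reply();         /* block until the reply arrives     */
    }
    return -ENOSYS;                      /* unsupported system call           */
}

int main(void)
{
    printf("%ld %ld %ld\n",
           cnk_syscall(SYS_TIME), cnk_syscall(SYS_READ), cnk_syscall(SYS_FORK));
    return 0;
}
```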
There are two main benefits from the simple approach for a compute node operating system: robustness and scalability. Robustness comes from the fact that the compute node kernel performs few services, which greatly simplifies its design, implementation, and test. Scalability comes from the lack of interference with running compute processes.
System Software for the I/O Node The I/O node plays a dual role in Blue Gene. On one hand, it acts as an effective master of its corresponding pset. On the other hand, it services requests from compute nodes in that pset. Jobs are launched in a partition by contacting corresponding I/O nodes. Each I/O node is then responsible for loading and starting the execution of the processes in each of the compute nodes of its pset. Once the compute processes start running, the I/O nodes wait for requests from those processes. Those requests are mainly I/O operations to be performed against the file systems mounted in the I/O node. Blue Gene I/O nodes execute an embedded version of the Linux operating system. It is classified as an embedded version because it does not use any swap space, it has an in-memory root file system, it uses little memory, and it lacks the majority of daemons and services found in a server-grade configuration of Linux. It is, however, a complete port of the Linux kernel and those services can be, and in various cases have been, turned on for specific purposes. The Linux in Blue Gene I/O nodes includes a full TCP/IP stack, supporting communications to the outside world through Ethernet. It also includes file system support. Various network file systems have been ported to the Blue Gene/L I/O node, including GPFS, Lustre, NFS, and PVFS. Blue Gene/L I/O nodes never run application processes. That duty is reserved to the compute nodes. The main user-level process running on the Blue Gene/L I/O node is the control and I/O daemon (CIOD). CIOD is the process that links the compute processes of an application running on compute nodes to the outside world. To launch a user job in a partition, the service node contacts the CIOD of each I/O node of the partition and passes the parameters of the job (user ID, group ID, supplementary groups, executable name, starting working directory, command-line arguments, and environment variables). CIOD swaps itself to the user’s identity,
which includes the user ID, group ID, and supplementary groups. It then retrieves the executable from the file system and sends the code and initialized data through the collective network to each of the compute nodes in the pset. It also sends the command-line arguments and environment variables, together with a start signal. Figure illustrates how I/O system calls are handled in Blue Gene. When a compute process performs a system call requiring I/O (e.g., open, close, read, write), that call is trapped by the compute node kernel, which packages the parameters of the system call and sends that message to the CIOD in its corresponding I/O node. CIOD unpacks the message and then reissues the system call, this time under the Linux operating system of the I/O node. Once the system call completes, CIOD packages the result and sends it back to the originating compute node kernel, which, in turn, returns the result to the compute process. There is a synergistic effect between simplification and separation of responsibilities. By offloading complex system operations to the I/O node, Blue Gene keeps the compute node operating system simple. Correspondingly, by keeping application processes separate from the I/O node activity, it avoids many security and safety issues regarding execution in the I/O nodes. In particular, there is never a need for the common scrubbing daemons typically used in Linux clusters to clean up after misbehaving jobs. Just as keeping system services in the I/O nodes prevents interference with compute processes, keeping those processes in compute nodes prevents interference with system services in the I/O node. This isolation is particularly helpful during performance debugging work. The overall simplification of the operating system has enabled the scalability, reproducibility (performance results for Blue Gene applications are very close across runs), and high performance of important Blue Gene/L applications.
IBM Blue Gene Supercomputer. Fig. Function shipping between CNK and CIOD (figure: an application fscanf call enters libc on the compute node; the resulting read is trapped by the CNK and shipped as tree packets over the collective network to the ciod process on the I/O node, where Linux and NFS perform the read against the file server over Ethernet and return the data along the same path)
System Software for the Service Node The Blue Gene service node runs its control software, typically referred to as the Blue Gene control system. The control system is responsible for the operation and monitoring of all compute and I/O nodes. It is also responsible for other hardware components such as link chips, power supplies, and fans. Tight integration between the Blue Gene control system and the I/O and compute node operating systems is central to the Blue Gene software stack. It represents one more step in the specialization of services that characterizes that stack. In Blue Gene, the control system is responsible for setting up system partitions and loading the initial code and state in the nodes of a partition. The Blue Gene compute and I/O nodes are completely stateless: no hard drives and no persistent memory. When a partition is created, the control system programs the hardware to isolate that partition from others in the system. It computes the network routing for the torus, collective, and global interrupt networks, thus simplifying the compute node kernel. It loads the operating system code for all compute and I/O nodes of a partition through the dedicated control network. It also loads an initial state in each node (called the personality of the node). The personality of a node contains information specific to the node.
Blue Gene and Its Impact on Science Blue Gene was designed to deliver a level of performance for scientific computing that enables entirely new studies and new applications. One of the main areas of application of the original LLNL system is materials science. Scientists at LLNL use a variety of models, including quantum molecular dynamics, classical molecular dynamics, and dislocation dynamics, to study materials at different levels of resolution.
Typically, each model is applicable to a range of system sizes being studied. First-principles models can be used for small systems, while more phenomenological models have to be used for large systems. The original Blue Gene/L at LLNL was the first system that allowed crossing of those boundaries. Scientists can use first-principles models in systems large enough to validate the phenomenological models, which in turn can be used in even larger systems. Applications of notable significance at LLNL include ddcMD, a classical molecular dynamics code that has been used to simulate systems with approximately half a billion atoms (winning the Gordon Bell award in the process), and Qbox, a quantum molecular dynamics code that won the Gordon Bell award. Other success stories for Blue Gene in science include: () astrophysical simulations in Argonne National Laboratory (ANL) using the FLASH code; () global climate simulations in the National Center for Atmospheric Research (NCAR) using the HOMME code; () biomolecular simulations at the T.J. Watson Research Center using the Blue Matter code; and () quantum chromodynamics (QCD) at IBM T.J. Watson Research Center, LLNL, San Diego Supercomputing Center, Juelich Research Center, Massachusetts Institute of Technology, Boston University, University of Edinburgh, and KEK (Japan) using a variety of codes.
One of the most innovative uses of Blue Gene is as the central processor for the large-scale LOFAR radio telescope in the Netherlands.
Related Entries
IBM Power Architecture
LINPACK Benchmark
MPI (Message Passing Interface)
TOP
Bibliographic Notes and Further Reading Additional information about the system architecture of Blue Gene can be found in [, , , , ]. For details on the Blue Gene system software, the reader is referred to [, ]. Finally, examples of the impact of Blue Gene on science can be found in [–, , –].
Bibliography
. Almasi G, Chatterjee S, Gara A, Gunnels J, Gupta M, Henning A, Moreira JE, Walkup B () Unlocking the performance of the BlueGene/L supercomputer. In: Proceeding IEEE/ACM SC, Pittsburgh, Nov
. Almasi G, Bhanot G, Gara A, Gupta M, Sexton J, Walkup B, Bulatov VV, Cook AW, de Supinski BR, Glosli JN, Greenough JA, Gygi F, Kubota A, Louis S, Streitz FH, Williams PL, Yates RK, Archer C, Moreira J, Rendleman C () Scaling physics and material science applications on a massively parallel Blue Gene/L system. In: Proceedings of the th ACM international conference on supercomputing, Cambridge, MA, –, – June
. Bulatov V, Cai W, Fier J, Hiratani M, Hommes G, Pierce T, Tang M, Rhee M, Yates RK, Arsenlis T () Scalable line dynamics in ParaDiS. In: Proceeding IEEE/ACM SC, Pittsburgh, Nov
. Fitch BG, Rayshubskiy A, Eleftheriou M, Ward TJC, Giampapa M, Pitman MC, Germain RS () Blue matter: approaching the limits of concurrency for classical molecular dynamics. In: Proceeding IEEE/ACM SC, Tampa, Nov
. Fryxell B, Olson K, Ricker P, Timmes FX, Zingale M, Lamb DQ, MacNeice P, Rosner R, Truran JW, Tufo H () FLASH: an adaptive mesh hydrodynamics code for modeling astrophysical thermonuclear flashes. Astrophys J Suppl :
. Gara A, Blumrich MA, Chen D, Chiu GL-T, Coteus P, Giampapa ME, Haring RA, Heidelberger P, Hoenicke D, Kopcsay GV, Liebsch TA, Ohmacht M, Steinmacher-Burow BD, Takken T, Vranas P () Overview of the Blue Gene/L system architecture. IBM J Res Dev (/):–
. Gygi F, Yates RK, Lorenz J, Draeger EW, Franchetti F, Ueberhuber CW, de Supinski BR, Kral S, Gunnels JA, Sexton JC () Large-scale first-principles molecular dynamics simulations on the BlueGene/L platform using the Qbox code. In: Proceeding IEEE/ACM SC, Seattle, Nov
. IBM Blue Gene Team () Overview of the IBM Blue Gene/P project. IBM J Res Dev (/):–
. Moreira JE, Almasi G, Archer C, Bellofatto R, Bergner P, Brunheroto JR, Brutman M, Castaños JG, Crumley PG, Gupta M, Inglett T, Lieber D, Limpert D, McCarthy P, Megerian M, Mendell M, Mundy M, Reed D, Sahoo RK, Sanomiya A, Shok R, Smith B, Stewart GG () Blue Gene/L programming and operating environment. IBM J Res Dev (/):–
. Moreira JE et al () The Blue Gene/L supercomputer: a hardware and software story. Int J Parallel Program ():–
. Salapura V, Bickford R, Blumrich M, Bright AA, Chen D, Coteus P, Gara A, Giampapa M, Gschwind M, Gupta M, Hall S, Haring RA, Heidelberger P, Hoenicke D, Kopcsay GV, Ohmacht M, Rand RA, Takken T, Vranas P () Power and performance optimization at the system level. In: Proceeding ACM Computing Frontiers, Ischia, May
. van der Schaaf K () Blue Gene in the heart of a wide area sensor network. In: Proceeding QCDOC and Blue Gene: next generation of HPC architecture workshop, Edinburgh, Oct
. Streitz FH, Glosli JN, Patel MV, Chan B, Yates RK, de Supinski BR, Sexton J, Gunnels JA () + TFlop solidification simulations on BlueGene/L. In: Proceeding IEEE/ACM SC, Seattle, Nov
. Vranas P, Bhanot G, Blumrich M, Chen D, Gara A, Heidelberger P, Salapura V, Sexton JC () The BlueGene/L supercomputer and quantum chromodynamics. In: Proceeding IEEE/ACM SC, Tampa, Nov
. Wait CD () IBM PowerPC FPU with complex-arithmetic extensions. IBM J Res Dev (/):–
IBM Power IBM Power Architecture
IBM Power Architecture
Tejas S. Karkhanis, José E. Moreira
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Synonyms IBM Power; IBM PowerPC
Definition The IBM Power architecture is an instruction set architecture (ISA) implemented by a variety of processors from IBM and other vendors, including Power, IBM’s
latest server processor. The IBM Power architecture is designed to exploit parallelism at the instruction-, data-, and thread-level.
Discussion Introduction IBM's Power ISA™ is an instruction set architecture designed to expose and exploit parallelism in a wide range of applications, from embedded computing to high-end scientific computing to traditional transaction processing. Processors implementing the Power ISA have been used to create several notable parallel computing systems, including the IBM RS/ SP, the Blue Gene family of computers, the Deep Blue chess playing machine, the PERCS system, the Sony PlayStation game console, and the Watson system that competed in the popular television show Jeopardy! Power ISA covers both -bit and -bit variants and, as of its latest version (. Revision B []), is organized in a set of four “books,” as shown in Fig. . Books I and II are common to all implementations. Book I, Power ISA User Instruction Set Architecture, covers the base instruction set and related facilities available to the application programmer. Book II, Power ISA Virtual Environment Architecture, defines the storage (memory) model and related instructions and facilities available to the application programmer. In addition to the specifications of Books I and II, implementations of the Power ISA need to follow either Book III-S or Book III-E. Book III-S, Power ISA Operating Environment Architecture – Server Environment, defines the supervisor instructions and related facilities used for general purpose implementations. Book III-E, Power ISA Operating Environment Architecture – Embedded Environment, defines the supervisor instructions and related facilities used for embedded implementations. Finally, Book VLE, Power ISA Operating Environment Architecture – Variable Length Encoding Environment, defines alternative instruction encodings and definitions intended to increase instruction density for very low end implementations.
IBM Power Architecture. Fig. Books of Power ISA version . (figure: Book I, User Instruction Set Architecture; Book II, Virtual Environment Architecture; Book III, Operating Environment Architecture, split into Book III-S Server Environment and Book III-E Embedded Environment; and Book VLE, Variable Length Encoding)
Figure shows the evolution of the main line of Power architecture server processors from IBM, just one of the many families of products based on Power ISA. The figure shows, for each generation of processors, its introduction date, the silicon technology used, and the main architectural innovations delivered in that generation. Power [] is IBM's latest-generation server processor and implements the Power ISA according to Books I, II, and III-S of []. Power is used in both the PERCS and Watson systems and in a variety of servers offered by IBM. The latest machine in the Blue Gene family, Blue Gene/Q, follows a Book III-E implementation. Power ISA was designed to support high program execution performance and efficient utilization of hardware resources. To that end, Power ISA provides facilities for expressing instruction-level parallelism, data-level parallelism, and thread-level parallelism. Providing facilities for a variety of parallelism types gives the programmer flexibility in extracting the particular combination of parallelism that is optimal for his or her program.
Instruction-Level Parallelism Instruction-level parallelism (ILP) is the simultaneous processing of several instructions by a processor. ILP is important for performance because it allows instructions to overlap, thus effectively hiding the execution latency of long latency computational and memory access instructions. Achieving ILP has been so important in the processor industry that processor core designs have gone from simple multicycle designs to complex designs that implement superscalar pipelines and out-of-order execution []. Key aspects of Power
ISA that facilitate ILP are independent instruction facilities, a reduced set of instructions, fixed-length instructions, and a large register set.
IBM Power Architecture. Fig. Evolution of main line Power architecture server processors (figure, spanning roughly 1990–2010: POWER1, 1.0 µm, RISC; POWER2, 0.72 µm, wide ILP; POWER3, 0.22 µm, SMP, 64-bit; POWER4, 180 nm, dual core; POWER5, 130 nm, SMT; POWER6, 65 nm, ultra high frequency; POWER7, 45 nm, multi core)
Independent Instruction Facilities Conceptually, Power ISA views the underlying processor as composed of several engines or units, as illustrated in the floor plan for the Power processor core shown in Fig. . Book I of the Power ISA groups the instructions into “facilities,” including () the branch facility, with instructions implemented by the instruction fetch unit (IFU); () the fixed-point facility, with instructions implemented by the fixed-point unit (FXU) and load-store unit (LSU); () the floating-point facility, with instructions implemented by the vector and scalar unit (VSU); () the decimal floating-point facility, with instructions implemented by the decimal floating-point unit (DFU); and () the vector facility, with instructions implemented by the vector and scalar
unit (VSU). Also shown in Fig. is the instruction-sequencing unit (ISU), which controls the execution of instructions, and a level- cache.
IBM Power Architecture. Fig. Power processor core floor plan, showing the main units (figure: the IFU, FXU, LSU, VSU, DFU, and ISU, together with a 256-Kbyte L2 cache)
Origins of this conceptual decomposition are in the era of building processors with multiple integrated circuits (chips). With a single processor spread across multiple chips, the communication between two integrated circuits took significantly more time relative to communication internal to a chip. Consequently, either the clock frequency would have to be reduced or the number of stages in the processor pipeline would have to be increased. Both approaches, reducing clock frequency and increasing the number of pipeline stages, can degrade performance. The decomposition into multiple units allowed a clear separation of work, and each unit could be implemented on a single chip for maximum performance. Today, the conceptual decomposition provides two primary benefits. First, because of the conceptual decomposition, the interfaces between the engines are clearly defined. Clearly defined interfaces lead to hardware design that is simpler to implement and to verify. Second, the conceptual decomposition addresses the inability of scaling frequency of long on-chip wires that can be a performance limiter, just as it addressed wiring issues between two or more integrated circuits when the conceptual decomposition was introduced.
Reduced Set of Nondestructive Fixed Length Instructions Power ISA consists of a reduced set of fixed length -bit instructions. A large fraction of the set of instructions are nondestructive. That is, the result register is explicitly identified, as opposed to implicitly being one of the source registers. A reduced set of instructions simplifies the design of the processor core and also the verification of corner cases in the hardware. Ignoring the Book-VLE case, which is targeted to very low end systems, Power ISA instructions are all -bits in length, thus the beginning and end of every instruction are known before decode. The bits in an instruction word are numbered from (most significant) to (least significant), following the big-endian convention of the Power architecture. All Power ISA instructions have a major opcode that is located at instruction bits –. Some instructions also have a minor opcode, to differentiate among instructions with the same major opcode. The location and length of the minor opcode depend on the major opcode. Additionally, every instruction is word-aligned. Fixed length, word alignment, and a fixed opcode location make the instruction predecode, fetch, branch prediction, and decode logic simpler when compared to the decode logic of a variable-length ISA. Srinivasan et al. [] present a comprehensive study of the optimality of pipeline length of Power processors from a power and performance perspective. Instruction set architectures that employ destructive operations (i.e., one of the source registers is also the target) must temporarily save one of the source registers if the contents of that register are required later in the program. Temporarily saving and later restoring a register often leads to a store to and a load from a memory location, respectively. Memory operations can take longer to complete than computational operations. Nondestructive operations in Power ISA eliminate the need for extra instructions for saving and restoring one of the source registers, facilitating higher instruction-level parallelism.
Large Register Set Power ISA originally specified general-purpose (fixed-point, either - or -bit) and floating-point (-bit) registers. An additional set of vector (-bit) registers was added with the first set of vector instructions. The latest specification, Power ISA ., expands the number of vector registers to . A large number of registers means that more data, including function and subroutine parameters, can be kept in fast registers. This in turn avoids load/store operations to save and retrieve data to and from memory and supports concurrent execution of more instructions.
Load/Store Architecture Power ISA specifies a load-store architecture consisting of two distinct types of instructions: () memory access instructions, and () compute instructions. Memory
access instructions load data from memory into computational registers and store the data from the computational registers to the memory. Compute instructions perform computations on the data residing in the computational registers. This arrangement decouples the responsibilities of the memory instructions and computational instructions, providing a powerful lever to hide the memory access latency by overlapping the long latency of memory access instructions with compute instructions.
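To make the load/store split and the nondestructive three-operand form concrete, the C function below is annotated with the kind of instruction sequence a 64-bit Power compiler might emit for it. The register numbers and mnemonics in the comments are illustrative assumptions, not a definitive code listing; the exact output depends on the compiler, options, and ABI.

```c
#include <stdio.h>

/* Illustrative only: comments sketch a possible load/compute/store
 * sequence on a 64-bit Power machine. */
static void add_first(const long *a, const long *b, long *out)
{
    /*  ld   r9,  0(r3)      load a[0] from memory into a register        */
    /*  ld   r10, 0(r4)      load b[0] into another register              */
    /*  add  r11, r9, r10    compute; destination named explicitly,       */
    /*                       neither source register is overwritten       */
    /*  std  r11, 0(r5)      store the result back to memory              */
    out[0] = a[0] + b[0];
}

int main(void)
{
    long a[1] = {2}, b[1] = {3}, out[1];
    add_first(a, b, out);
    printf("%ld\n", out[0]);
    return 0;
}
```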
ILP in Power Power is an out-of-order superscalar processor that can operate at frequencies exceeding GHz. In a given clock cycle, a Power processor core can fetch up to eight instructions, decode and dispatch up to six instructions, issue and execute up to eight instructions, and commit up to six instructions. To ensure a high instruction throughput, Power can simultaneously maintain about instructions in various stages of processing. To further extract independent instructions for parallelism, Power implements register renaming – each of the architected register files is mapped to a much larger set of physical registers. Execution of the instructions is carried out by a total of execution units. Power implements the Power ISA in a way that extracts high levels of instruction-level parallelism while operating at a high clock frequency.
Data Level Parallelism Data-level parallelism (DLP) consists of simultaneously performing the same type of operations on different data values, using multiple functional units, with a single instruction. The most common approach to providing DLP in general purpose processors is the Single Instruction Multiple Data (SIMD) technique. SIMD (also called vector) instructions provide a concise and efficient way to express DLP. With SIMD instructions, fewer instructions are required to perform the same data computation, resulting in lower fetch, decode, and dispatch bandwidth, and consequently higher power efficiency. Power ISA . contains two sets of SIMD instructions. The first one is the original set of instructions implemented by the vector facility since and also known as AltiVec [] or Vector Media Extensions
(VMX) instructions. The second is a new set of SIMD instructions called Vector-Scalar Extension (VSX).
VMX Instructions VMX instructions operate on -bit wide data, which can be vectors of byte (-bit), half-word (-bit), and word (-bit) elements. The word elements can be either integer or single-precision floating-point numbers. The VMX instructions follow the load/store model, with a -entry register set (each entry is -bit wide) that is separate from the original (scalar) fixed- and floating-point registers in Power ISA ..
VSX Instructions VSX also operates on -bit wide data, which can be vectors of word (-bit) and double-word (-bit) elements. Most operations are on floating-point numbers (single and double precision) but VSX also includes integer conversion and logical operations. VSX instructions also follow the load/store model, with a -entry register set ( bits per entry) that overlaps the VMX and floating-point registers. VSX requires no operating-mode switches. Therefore, it is possible to interleave VSX instructions with floating point and integer instructions.
Power Vector and Scalar Unit (VSU) The vector and scalar unit of Power is responsible for execution of the VMX and VSX SIMD instructions. The unit contains one vector pipeline and four double-precision floating-point pipelines. A VSX floating-point instruction uses two floating-point pipelines, and two VSX instructions can be issued every cycle, to keep all floating-point pipelines busy. The four floating-point pipelines can each execute a double-precision fused multiply-add operation, leading to a performance of flops/cycle for a Power core.
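As a brief illustration (not part of the original entry), the following C fragment uses the AltiVec/VMX intrinsics to perform a fused multiply-add on four single-precision lanes with a single vector operation. It assumes a compiler with AltiVec support (for example, gcc with -maltivec).

```c
#include <altivec.h>
#include <stdio.h>

/* Multiply-add four pairs of single-precision values at once with VMX. */
int main(void)
{
    vector float a = {1.0f, 2.0f, 3.0f, 4.0f};
    vector float b = {10.0f, 20.0f, 30.0f, 40.0f};
    vector float c = {0.5f, 0.5f, 0.5f, 0.5f};

    /* vec_madd computes a*b + c element-wise on all four lanes. */
    vector float r = vec_madd(a, b, c);

    float out[4] __attribute__((aligned(16)));
    vec_st(r, 0, out);   /* store the 128-bit vector to aligned memory */

    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```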
Thread Level Parallelism Thread-level parallelism (TLP) is the simultaneous execution of multiple threads of instructions. Unlike ILP and DLP, which rely on extracting parallelism from within the same program thread, TLP relies on explicit parallelism from multiple concurrently running threads. The multiple threads can come from the decomposition of
a single program or from multiple independent programs.
Thread Level Parallelism Within a Processor Core In the first systems that exploited thread-level parallelism, different threads executed on different processor cores and shared a memory system. Today, processor core designs have evolved such that multiple threads can run on a single processor core. This increases resource utilization and, consequently, the computational throughput of the core. Effectively, TLP within a core enables hiding of the long-latency events of stalled threads with forward progress of active threads. Examples of Power ISA processors that support multithreading within a core include Power [], Power [], and Power [] processors.
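A minimal sketch of explicit thread-level parallelism is shown below, assuming an OpenMP-capable C compiler (for example, -fopenmp). The threads created here are the kind of independent instruction streams that a multithreaded core can execute concurrently; the example is illustrative and not tied to any particular processor.

```c
#include <omp.h>
#include <stdio.h>

/* Each OpenMP thread executes the parallel region concurrently; on an
 * SMT-capable core, several such threads can share one core's resources. */
int main(void)
{
    #pragma omp parallel
    {
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```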
TLP Support in Power Processor The structure of a Power processor chip is shown in Fig. . There are eight processor cores and three levels of cache in a single chip. Each processor core (which includes 32-Kbyte level 1 (L1) data and instruction caches) is paired with a 256-Kbyte level 2 (L2) cache that is private to the core. There is also a 32-Mbyte level 3 (L3) cache that is shared by all cores. The level 3 cache is organized as eight 4-Mbyte caches, each local to a core/L2 pair. Cast outs from an L2 cache can only go to its local L3 cache, but from there data can be cast out across the eight local L3s. Each core is capable of operating in three different threading modes: single-threaded (ST), dual-threaded (SMT2), or quad-threaded (SMT4). The cores can switch modes while executing, thus adapting to the needs of different applications. The ST mode delivers higher single-thread performance, since the resources of a core are dedicated to the execution of that single thread. The SMT4 mode partitions the core resources among four threads, resulting in higher total throughput at the cost of reduced performance for each thread. The SMT2 mode is an intermediate point. A single Power processor chip supports up to 32 simultaneous threads of execution (8 cores, 4 threads per core). Power scales to systems of processor chips or up to threads of execution sharing a single memory image.
IBM Power Architecture. Fig. Structure of a Power Processor Chip (figure: 8 cores, up to 32 threads, each core with 32-Kbyte L1 instruction and data caches and ST/SMT2/SMT4 modes; 8 private 256-Kbyte L2 caches; a 32-Mbyte shared L3 cache organized as 8 local 4-Mbyte L3s; cast outs flow from each L2 to its local L3 and from there toward memory)
Memory Coherence Models For programs where the concurrent threads share memory while working on a common task, the memory consistency model of the architecture plays a key role in the performance of TLP as a function of the number of threads. The memory consistency model specifies how memory references from different threads can be interleaved. Power ISA specifies a release consistency memory model. A release consistency model relaxes the ordering of memory references as seen by different threads. When a particular ordering of memory references among threads is necessary for the program, explicit synchronization operations must be used.
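The need for explicit synchronization under a relaxed model can be illustrated with the following C sketch, which uses GCC's __atomic built-ins to obtain release/acquire ordering between a producer and a consumer thread; on Power, such operations are typically implemented with lightweight synchronization barriers. This is an illustrative example added here, not text from the original entry.

```c
#include <pthread.h>
#include <stdio.h>

/* Under a relaxed memory model, the producer's two stores could be observed
 * out of order without explicit synchronization.  Release/acquire ordering
 * guarantees that a consumer that sees flag == 1 also sees payload == 42. */
static int payload;
static int flag;   /* 0 = not ready, 1 = ready */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                   /* ordinary store        */
    __atomic_store_n(&flag, 1, __ATOMIC_RELEASE);   /* publish with release  */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!__atomic_load_n(&flag, __ATOMIC_ACQUIRE))  /* pairs with release */
        ;                                              /* spin until ready   */
    printf("payload = %d\n", payload);                 /* sees 42            */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```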
Summary Since its inception in the Power processor in , the Power architecture has evolved to address the technology and applications issues of the time. The ability of the Power architecture to provide instruction-, data-, and thread-level parallelism has enabled a variety of parallel systems, including some notable supercomputers. Power ISA allows exposing and extracting ILP primarily because of the RISC principles embodied in the ISA. The reduced set of fixed length instructions enables simple hardware implementation that can be efficiently pipelined, thus increasing concurrency. The large register set provides several optimization opportunities for the compiler as well as the hardware. Power ISA provides facilities for data-level parallelism via SIMD instructions. VMX and VSX instructions increase the computational efficiency of the processor by performing the same operation on multiple data values. For some programs DLP can be extracted automatically by the compiler. For others, explicit SIMD programming is more appropriate.
Power ISA supports thread-level parallelism through a release consistency memory model. Because of the release consistency, Power ISA based systems permit aggressive software and hardware optimizations that would otherwise be restricted under a sequential consistency model. The Power processor implements the latest version of Power ISA and exploits all forms of parallelism supported by the instruction set architecture: instruction-level parallelism, data-level parallelism, and thread-level parallelism.
Related Entries
IBM Blue Gene Supercomputer
Cell Broadband Engine Processor
IBM RS/ SP
PERCS System Architecture
Bibliographic Notes Official information on the Power instruction set architecture is available on the Power.org website (www.power.org). In particular, the latest version of the Power ISA (. revision B), implemented by the Power processor, can be found in []. An early history of the Power architecture is provided by Diefendorff []. Details of the Power architecture, including the instruction specification and programming environment, are given in several reference manuals [, –]. The evolution of IBM's RISC philosophy is explained in []. More detailed information on the microarchitecture of specific Power processors can be found for Power [], Power [], Power [], and Power [] processors. The AltiVec Programming Environment Manual [] and AltiVec Programming Interface Manual [] are two thorough references for effectively employing AltiVec. Gwennap [] and Diefendorff [] have a good survey of Power AltiVec. Methods for extracting instruction-level parallelism for the Power architecture are described in []. One of the key impediments to data-level parallelism is unaligned memory accesses. To overcome these unaligned accesses, Eichenberger, Wu, and O'Brien [] present some data-level parallelism optimization techniques. Finally, Adve and Gharachorloo's tutorial on shared memory consistency models [] is a great reference for further reading on thread-level parallelism.
Bibliography
. Adve SV, Gharachorloo K () Shared memory consistency models: a tutorial. IEEE Comput :–
. Cocke J, Markstein V () The evolution of RISC technology at IBM. IBM J Res Dev :–
. Diefendorff K () History of the PowerPC architecture. Commun ACM ():–
. Diefendorff F, Dubey P, Hochsprung R, Scales H () Altivec extension to PowerPC accelerates media processing. IEEE Micro ():–
. Eichenberger AE, Wu P, O'Brien K () Vectorization for SIMD architectures with alignment constraints. Sigplan Not ():–
. Gwennap L () AltiVec vectorizes PowerPC. Microprocessor Report ():–
. Hoxey S, Karim F, Hay B, Warren H (eds) () The PowerPC compiler writer's guide. Warthman Associates, Palo Alto
. IBM Power ISA Version . Revision B. http://www.power.org/resources/downloads/PowerISA_V.B_V_PUBLIC.pdf
. Kalla R, Sinharoy B, Starke WJ, Floyd M () Power: IBM's next-generation server processor. IEEE Micro ():–
. Le HQ, Starke WJ, Fields JS, O'Connell FP, Nguyen DQ, Ronchetti BJ, Sauer WM, Schwarz EM, Vaden MT () IBM Power microarchitecture. IBM J Res Dev ():–
. May C, Silha ED, Simpson R, Warren H (eds) () The PowerPC architecture: a specification for a new family of RISC processors. Morgan Kaufmann, San Francisco
. Freescale Semiconductor () AltiVec technology programming environments manual
. Freescale Semiconductor () AltiVec technology programming interface manual
. Sinharoy B, Kalla RN, Tendler JM, Eickemeyer RJ, Joyner JB () Power system microarchitecture. IBM J Res Dev (/):–
. Smith JE, Sohi GS () The microarchitecture of superscalar processors. P IEEE :–
. Srinivasan V, Brooks D, Gschwind M, Bose P, Zyuban V, Strenski PN, Emma PG () Optimizing pipelines for power and performance. In: Proceedings of the th annual ACM/IEEE international symposium on microarchitecture, MICRO. IEEE Computer Society Press, Los Alamitos, pp –
. Tendler JM, Dodson JS, Fields JS, Le H, Sinharoy B () Power system microarchitecture. IBM J Res Dev ():–
IBM PowerPC IBM Power Architecture
IBM RS/ SP
José E. Moreira
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Synonyms IBM SP; IBM SP; IBM SP; IBM SP
Definition The IBM RS/ SP is a distributed memory, message passing parallel system based on the IBM POWER processors. Several generations of the system were developed by IBM and thousands of systems were delivered to customers. The pinnacle of the IBM RS/ SP was the ASCI White machine at Lawrence Livermore National Laboratory, which held the number spot in the TOP list from November to November .
Discussion Introduction The IBM RS/ SP (SP for short) is a general-purpose parallel system. It was one of the first parallel systems designed to address both technical computing applications (the usual domain of parallel supercomputers) and commercial applications (e.g., database servers, transaction processing, multimedia servers). The SP is a distributed memory, message passing parallel system. It consists of a set of nodes, each running its own operating system image, interconnected by a high-speed network. In the TOP classification, it falls into the cluster class of machines. IBM delivered several generations of SP machines, all based on IBM POWER processors. The initial machines were simply called the IBM SP (or SP) and were based on the original POWER processors. Later, IBM delivered the SP machines based on POWER and then finally a generation based on POWER processors (which was unofficially called the SP by some). IBM continued to deliver parallel systems based on later generations of POWER processors (POWER and beyond), but those were no longer considered IBM RS/ SP systems. The November TOP list shows POWER Systems machines (one based on
POWER processors and the rest based on POWER processors), which can be considered direct follow-ons to the RS/ SP. Notable IBM RS/ SP systems include the Argonne National Laboratory SP (installed in ) [], the Cornell Theory Center IBM SP, the Lawrence Livermore National Laboratory ASCI Blue Pacific, and the Lawrence Livermore National Laboratory ASCI White. This last system consisted of nodes with POWER processors each and held the number spot in the TOP list from November to November . The RS/ SP was designed to serve a broad range of applications, from both the technical and commercial computing domains. The designers of the system followed a set of principles [] that can be summarized as follows: maximize the use of off-the-shelf hardware and software components while developing some special-purpose components that maximize the value of the system. As a result, the RS/ SP utilizes the same processors, operating systems, and compilers as the contemporary IBM workstations. It also utilizes a special-purpose high-speed interconnect switch, a parallel operating environment and message passing libraries, and a parallel programming environment, including a High Performance Fortran (HPF) compiler. To further enable the system for commercial applications, IBM and other vendors developed parallel versions of important commercial middleware such as DB and CICS/. With this parallel middleware, customers were able to quickly port applications from the more conventional single system image commercial servers to the IBM SP.
Hardware Architecture The IBM RS/ SP consists of a cluster of nodes interconnected by a switch (Fig. ). The nodes (Fig. ) are independent computers based on hardware (processors, memory, disks, I/O adapters) developed for IBM workstations and servers. Each node has its own private memory and runs its own image of the AIX operating system. Through the evolution of the RS/ SP, different nodes were used. The initial nodes for SP and SP models were single processor nodes. Later, with the introduction of PowerPC and POWER
IBM RS/ SP. Fig. High-level hardware architecture of an IBM RS/ SP
processors, symmetric multiprocessing (SMP) nodes became available. Nodes could be configured to better serve specific purposes. For example, compute nodes could be configured with more processors and memory, whereas I/O nodes could be configured with more I/O adapters. The RS/ SP architecture supports different kinds of nodes in the same system, and it was usual to have both compute-optimized and I/O-optimized nodes in the same system. The node architecture (illustrated in Fig. ) is essentially the same as contemporary standalone workstations and servers based on POWER processors. Processor and memory modules are interconnected by a system bus that supports memory coherence within the node. An I/O bus also hangs off this system bus. This I/O bus supports off-the-shelf adapters found on standalone machines, such as Ethernet and Fibre Channel. It also supports the switch adapters that connect the node to the network. Whereas the SP nodes are built primarily out of off-the-shelf hardware (except for the switch adapter), the SP switch is a special-purpose design. At the time the SP was conceived, standard interconnection networks (Ethernet, FDDI, ATM) delivered neither the bandwidth nor the latency that a large-scale general-purpose
IBM RS/ SP. Fig. Hardware architecture of an IBM RS/ SP node. Different kinds of nodes can be and have been used in SP systems
parallel system like the SP required. Therefore, the designers decided that a special-purpose interconnect was necessary [, ]. The IBM RS/ SP switch [] is an any-to-any packet-switched multistage network. The bisection bandwidth of the switch scales linearly with the size (number of nodes) of the system. The available bandwidth between any pair of communicating nodes remains constant irrespective of where in the topology the two nodes lie. These features supported both system scalability and ease of use. The system could be viewed as a flat collection of nodes, which could be freely selected for parallel jobs irrespective of their location. Selection could focus on other features, such as processor speed and memory size. As a consequence, the IBM RS/ SP did not suffer from the fragmentation problem that was observed in other parallel systems of the time []. The SP switch is built from basic 4×4 bidirectional crossbar switching elements, which are grouped eight to a board to form a 16×16 switch board, as shown in Fig. . A switch board connects to nodes on one side and to other switch boards on the other side. Systems with up to nodes can be assembled with just one layer of switch boards, whereas larger systems require additional layers.
IBM RS/ SP. Fig. A 16×16 switch board is built by interconnecting eight 4×4 bidirectional crossbar switching elements. The switch board connects to nodes on one side and other boards on the other side
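The scaling claims above can be made concrete with a small back-of-the-envelope sketch. The 16-port board size is taken from the figure; the one-board-per-16-nodes formula and the unit link bandwidth are illustrative assumptions, not the actual SP packaging rules.

```python
# Toy model of the switch's scaling properties.  Only the 16 node ports
# per board come from the figure; everything else is an assumption.

def bisection_bandwidth(n_nodes, link_bw):
    """Bisection bandwidth grows linearly with the number of nodes."""
    return (n_nodes // 2) * link_bw

def node_level_boards(n_nodes, ports_per_board=16):
    """Boards needed in the first stage, one node port per node."""
    return -(-n_nodes // ports_per_board)   # ceiling division

for n in (8, 16, 64, 512):
    print(n, "nodes:",
          node_level_boards(n), "node-level boards,",
          "bisection =", bisection_bandwidth(n, link_bw=1.0), "links")
```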
Software Architecture Figure illustrates the software stack of the RS/ SP. That software stack is built upon off-the-shelf UNIX components and specialized services for parallel
IBM RS/ SP. Fig. Software stack for RS/ SP
processing. Each node runs a full AIX operating system instance. That operating system is complemented at the bottom layer of the software stack by high-performance services that provide connectivity to the SP switch. The availability and global services layers implement aspects of a single-system image. They are intended to be the basis for parallel applications and application middleware. The availability services of the IBM SP support heartbeat, membership, notification, and recovery coordination. Heartbeat services implement the monitoring of components to detect failures. Membership services allow processors and processes to be identified as belonging to a group. Notification services allow members of a group to be notified when new members are added or old members are removed from that group. Finally, recovery coordination services provide a mechanism for performing recovery procedures within the group in response to changes in the membership. The global services of the IBM SP provide global access to resources such as disks, files, and networks. Global access to files is provided by networked file solutions, either with a client-server model (e.g., NFS) or with a parallel file system model (e.g., GPFS). With this approach, processes in every node have access to the same file space. Global network access is implemented through TCP/IP and UDP/IP protocols over the IBM SP switch. Gateway nodes, connected to both the SP switch and an external Ethernet network, allow all nodes to access the Ethernet network. Global access to disks is implemented by virtual shared disk (VSD) functionality. VSD allows a process running on any SP
node to access any disk in the system as if it were locally attached to that node. Another global service is the system data repository (SDR). The SDR contains systemwide information about the nodes, switches, and jobs currently in the system. The job management system of the IBM SP supports both interactive and batch jobs. Batch jobs are submitted, scheduled, and controlled by LoadLeveler []. For interactive jobs, a user can login directly to any node in the SP, since the nodes all run a full version of the AIX operating system. System management for the IBM SP is built upon components used for management of RS/ AIX workstations. It also includes extensions developed specifically for the SP to facilitate performing standard management functions across the many nodes of an SP. The functions supported include system installation, system operation, user management, configuration management, file management, security management, job accounting, problem management, change management, hardware monitoring and control, and print and mail services. The system management functions can be performed via a control workstation that acts as the system console. The compilers and run-time libraries for the IBM RS/ SP are based on the standard software stack for IBM AIX augmented with certain features specific to the SP. Fortran, C and C ++ compilers, and run-time libraries for POWER-based workstations and servers can be directly used in the SP, since the nodes of the latter are based on hardware developed for the former. The software stack for the SP also includes message-passing libraries that implement both IBM proprietary models, such as MPL, and standard models such as PVM and MPI [, ]. Also available for the IBM RS/ SP is an implementation of the High Performance Fortran (HPF) programming language []. In addition to middleware like MPI libraries that cater to scientific applications, the RS/ SP software stack also includes middleware targeted at enabling parallel commercial applications. The main example is DB Parallel Edition (PE) [], an implementation of the DB relational database product that runs in parallel across the nodes of the SP. DB PE is a shared-nothing parallel database system, in which the data is partitioned across the nodes. DB PE splits SQL queries into multiple operations that are then shipped to the nodes for
execution against their local data. A final stage combines the results from each node into a single result. DB PE enables database applications to use the parallelism of the RS/ SP without changes to the application itself, since the exploitation of parallelism happens in the database middleware layer.
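As a rough illustration of the shared-nothing pattern just described, the sketch below partitions a small table across simulated nodes, ships the same aggregation to each partition, and combines the partial results in a final stage. The table, column names, and partitioning rule are invented for the example; this is not the product's actual query planner.

```python
# Illustrative shared-nothing query execution: data is partitioned
# across "nodes", the same operation runs against each local partition,
# and a final stage combines the partial results.

rows = [{"region": r, "amount": a}
        for r, a in [("east", 10), ("west", 4), ("east", 7), ("north", 2),
                     ("west", 9), ("east", 1), ("north", 5), ("west", 3)]]

n_nodes = 4
partitions = [rows[i::n_nodes] for i in range(n_nodes)]   # round-robin split

def local_query(partition):
    """Runs on each node against its local data only:
    conceptually SELECT region, SUM(amount) ... GROUP BY region."""
    totals = {}
    for row in partition:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals

# "Ship" the operation to every node, then combine the partial results.
partials = [local_query(p) for p in partitions]
combined = {}
for partial in partials:
    for region, total in partial.items():
        combined[region] = combined.get(region, 0) + total
print(combined)   # {'east': 18, 'west': 16, 'north': 7}
```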
Example Applications Thousands of IBM RS/ SP systems were delivered over the product lifetime, ranging in size from as few as two nodes all the way up to nodes and larger. The availability of a full workstation- and server-compatible software stack on the SP nodes allowed it to run off-the-shelf AIX applications with zero porting effort. The availability of standard message passing libraries (such as MPI), High Performance Fortran, and parallel commercial middleware such as DB PE also meant that existing parallel applications could be moved to the IBM RS/ SP with relative ease. Furthermore, several applications were specifically developed or optimized for the SP. At a relatively early point in the product lifetime (), the SP was already being used in many different areas, including computational chemistry, crash analysis, electronic design analysis, seismic analysis, reservoir modeling, decision support, data analysis, on-line transaction processing, local area network consolidation, and as workgroup servers. In terms of economic sectors, SP systems were being used in manufacturing, distribution, transportation, petroleum, communications, utilities, education, government, finance, insurance, and travel []. The IBM RS/ SP played an important role in the Accelerated Strategic Computing Initiative by the US Department of Energy. That program was responsible for several of the fastest computers in the world, including two SPs: the ASCI Blue-Pacific and ASCI White machines. ASCI Blue-Pacific was the largest SP in number of nodes. It consisted of , nodes, each with four PowerPC e processors. Each processor had a clock speed of MHz and a peak floating-point performance of Mflops. As indicated in the TOP list, the machine had a peak performance (Rpeak) of . Gflops and a Linpack performance (Rmax) of Gflops. It was ranked # in the November and June lists. ASCI White was the largest SP in number of processors (or cores). It consisted of nodes, each with POWER processors. Each processor had a clock speed of MHz and a peak floating-point performance of . Gflops. As indicated in the TOP list, the machine had a peak performance (Rpeak) of , Gflops and a Linpack performance (Rmax) of , Gflops. It was ranked # in the November , June and November lists.
Related Entries
IBM Power Architecture
LINPACK Benchmark
MPI (Message Passing Interface)
TOP
Bibliographic Notes and Further Reading For a thorough discussion of the system architecture of the RS/ SP, the reader is referred to []. Details of the RS/ SP interconnection network can be found in []. An overview of the system software for the RS/ SP is given in [] while details for the MPI environment and the job scheduling facilities are described in [] and [], respectively. The HPF compiler for the RS/ SP is described in []. Commercial middleware is covered in [] and user experience in a scientific computing environment is described in []. Finally, additional information on the machine fragmentation problem is available in [].
Bibliography . Agerwala T, Martin JL, Mirza JH, Sadler DC, Dias DM, Snir M () SP system architecture. IBM Syst J ():– . Baru CK, Fecteau G, Goyal A, Hsiao H, Jhingran A, Padmanabhan S, Copeland GP, Wilson WG () DB Parallel Edition. IBM Syst J ():– . Dewey S, Banas J () LoadLeveler: a solution for job management in the UNIX environment. AIXtra, May/June . Feitelson DG, Jette MA () Improved utilization and responsiveness with gang scheduling. In: Feitelson DG, Rudolph L (eds) Proceedings of job scheduling strategies for parallel processing (JSSPP ’). Lecture notes in computer science, vol . Springer, Berlin, pp – . Franke H, Wu CE, Riviere M, Pattnaik P, Snir M () MPI programming environment for IBM SP/SP. In: Proceedings of the th international conference on distributed computing systems (ICDCS ’), Vancouver, May –June , , pp – . Gropp WD, Lusk E () Experiences with the IBM SP. IBM Syst J ():–
. Gupta M, Midkiff S, Schonberg E, Seshadri V, Shields D, Wang K-Y, Ching W-M, Ngo T () An HPF compiler for the IBM SP. In: Proceedings of the ACM/IEEE conference on supercomputing, San Diego, . Snir M, Hochschild P, Frye DD, Gildea KJ () The communication software and parallel environment of the IBM SP. IBM Syst J ():– . Stunkel CB, Shea DG, Abali B, Atkins MG, Bender CA, Grice DG, Hochschild P, Joseph DJ, Nathanson BJ, Swetz RA, Stucke RF, Tsao M, Varker PR () The SP high-performance switch. IBM Syst J ():–
IBM SP IBM RS/ SP
IBM SP IBM RS/ SP
IBM SP IBM RS/ SP
IBM SP IBM RS/ SP
IBM System/ Model Michael Flynn Stanford University, Stanford, CA, USA
Definition The Model was the highest performing computer system introduced by IBM as part of its System/ series in the s. It was distinguished by a number of innovations such as out-of-order instruction execution, Tomasulo's algorithm for data flow execution, a limited form of branch speculation, pipelined execution units, and division by binomial series approximation.
Discussion Introduction When IBM introduced System/ in , it included an announcement of a high-end computer referred to as "Model ." This computer was realized as the Model with first installation in . While the Model had a core memory with ns cycle time, there was a fast memory version which used thin-film magnetic storage with a cycle time of ns. This version was labeled the Model ; two Model s were produced. Orders for the Model were closed in , but a revised version with improved technology and cache was introduced around as the Model . The Model used the same logic as the Model . In the mid s, there were some references to a Model (with ns memory cycle). This version was never realized. The Model series used a hybrid technology called ASLT (advanced solid logic technology), and had a processor cycle of ns for Models and ; the Model had a processor cycle of ns. The total production of all computers in the Model series was about two dozen.
Overview of the System and CPU
Instruction Processing The instruction set was exactly that of System/; however, instruction execution of the decimal instruction set was not supported in hardware in the Model . If such an instruction was invoked, it would trap and be interpreted by the processor [, ]. The design was predicated on the requirements of a long ( stage) pipeline. Instructions were prefetched into an entry I buffer, each entry having bytes. The I buffer enabled limited speculation on branch outcomes. If the branch target was backward and in the buffer, the branch was predicted to be successful; otherwise the branch was predicted to be untaken and execution would proceed in line. A significant feature of the execution units was a mechanism called the common data bus, developed by R. Tomasulo [] and better known as Tomasulo's algorithm. This is a data flow control algorithm, which renames registers into reservation stations.
Each register in the central register set is extended to include a tag that identifies the functional unit that produces a result to be placed in a particular register. Similarly, each of the multiple functional units has one or more reservation stations. The reservation station, however, can contain either a tag identifying another functional unit or register, or it can contain the variable needed. Each reservation station effectively defines its own functional unit; thus, two reservations for a floating point multiplier are two functional unit tags: multiplier and multiplier . If operands can go directly into the multiplier, then there is another tag: multiplier . Once a pair of operands has a designated functional unit tag, that tag remains with that operand pair until completion of the operation. Any unit (or register) that depends on that result has a copy of the functional unit tag and gates in the result that is broadcast on the common data bus. In this dataflow approach, the results to a targeted register may never actually go to that register; in fact, the computation based on the load of a particular register may be continually forwarded to various functional units, so that before the value is stored, a new value based upon a new computational sequence (a new load instruction) is able to use the targeted register. The renaming of registers enabled out-of-order instruction execution, but in doing so it did not allow precise interrupts for all cases of exceptions. Since one instruction may have completed execution before a slow-executing, earlier-issued one (such as divide) takes an exception and causes an interrupt, it is impossible to reconstruct the machine state precisely.
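The renaming and broadcast mechanism described above can be illustrated with a small simulation. The register names, latencies, station names, and the two-instruction program below are arbitrary choices for the sketch; the machine's real station counts, timings, and bus arbitration are not modeled.

```python
# Toy simulation of register renaming with reservation stations and a
# common data bus (CDB): only the renaming/broadcast idea is kept.

regs = {r: {"tag": None, "val": 0.0} for r in ("F0", "F2", "F4", "F6")}
regs["F2"]["val"], regs["F4"]["val"] = 3.0, 4.0
LAT = {"add": 2, "mul": 3}                     # made-up latencies
stations = {}

def issue(name, op, dst, src1, src2):
    def operand(r):
        # copy a value if the register is ready, otherwise copy its tag
        return ("val", regs[r]["val"]) if regs[r]["tag"] is None else ("tag", regs[r]["tag"])
    stations[name] = {"op": op, "s1": operand(src1), "s2": operand(src2), "left": LAT[op]}
    regs[dst]["tag"] = name                     # rename: dst now waits on this station

issue("mul1", "mul", "F0", "F2", "F4")          # F0 <- F2 * F4
issue("add1", "add", "F6", "F0", "F2")          # F6 <- F0 + F2 (waits on mul1)

cycle = 0
while stations:
    cycle += 1
    ready = [n for n, s in stations.items()
             if s["s1"][0] == "val" and s["s2"][0] == "val"]
    for n in ready:                             # stations with operands execute
        if stations[n]["left"] > 0:
            stations[n]["left"] -= 1
    done = [n for n in ready if stations[n]["left"] == 0]
    if done:
        n = done[0]                             # one broadcast per cycle on the CDB
        s = stations.pop(n)
        a, b = s["s1"][1], s["s2"][1]
        result = a * b if s["op"] == "mul" else a + b
        # broadcast (tag, result): waiting registers and stations gate it in
        for r in regs.values():
            if r["tag"] == n:
                r["tag"], r["val"] = None, result
        for other in stations.values():
            for k in ("s1", "s2"):
                if other[k] == ("tag", n):
                    other[k] = ("val", result)
        print(f"cycle {cycle}: {n} broadcasts {result}")

print({r: v["val"] for r, v in regs.items()})   # F0 = 12.0, F6 = 15.0
```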
Execution Functional Units The floating point units also provided significant innovation. The floating point adder executed an instruction in two cycles but it accepted a new instruction each cycle. This functional unit pipelining was novel at the time but is now widely used []. The floating point multiplier executed an instruction in three cycles. It used a Booth encoding of the multiplier, requiring the addition of signed multiples of the multiplicand (the mantissa had bits). These were added in Sum + Carry form, six at a time, using a tree of carry save adders (CSA). The partial product was fed back into the tree for assimilation with the next six multiples. Using a newly developed Earle latch, this assimilation iteration eliminated latching overhead and
required only four logic stages (two CSAs) to implement. This allowed six signed multiples to be assimilated every ns. Thus, the assimilation process took ns to form the product; another ns was required for startup. The divide process introduced the Goldschmidt algorithm based on a binomial expansion of the reciprocal. The value of the reciprocal was then multiplied by the dividend to form the quotient. In the binomial expansion, the divisor is written d = 1 − x; the reciprocal 1/(1 − x) can then be represented as the product (1 + x)(1 + x^2)(1 + x^4)(1 + x^8)⋯, since (1 − x)(1 + x)(1 + x^2)(1 + x^4)⋯ converges to 1. Since d is binary normalized, x is less than or equal to 1/2, so each additional factor doubles the number of leading zeros in the remaining error term and the product converges quadratically to the reciprocal. An initial table lookup reduces the number of multiplies; the Model took cycles for floating point divide.
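A short sketch of division by binomial-series iteration, under the assumption that the divisor has already been normalized into [1/2, 1); the iteration count is arbitrary and the table lookup mentioned above is omitted.

```python
# Division by binomial-series (Goldschmidt) iteration.

def goldschmidt_divide(n, d, steps=5):
    assert 0.5 <= d < 1.0, "d is assumed binary normalized, so x = 1 - d <= 1/2"
    x = 1.0 - d
    q, den = n, d
    for _ in range(steps):
        factor = 1.0 + x          # next term (1 + x^(2^k)) of the expansion
        q *= factor               # numerator converges to n/d
        den *= factor             # denominator converges to 1 quadratically
        x *= x                    # x^(2^(k+1))
    return q

print(goldschmidt_divide(0.7, 0.6), 0.7 / 0.6)   # both approximately 1.1666666...
```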
Memory The Model memory system used conventional (for the day) magnetic core technology. It was interleaved ways and had a . microsecond cycle time with a cycle access time. A similar memory technology supported the Model ’s cache based memory. The memory system buffered outstanding requests [].
The Technology The machine itself was implemented with ECL (emitter coupled logic) as the basic circuit technology using multi transistor chips mounted on a × cm aluminum ceramic substrate. Passive devices were implemented as thin-film printed components on the substrate. Two substrates were stacked one on top of each other forming a module which formed a cube about cm on a side. The module provided a circuit density of about two to three circuits. Approximately modules could be mounted on a daughterboard and twenty daughter boards could be plugged in to a motherboard ( × cm). Twenty motherboards formed a frame about × m × cm and four frames formed the basic CPU for the system [–]. The ECL circuit delay was about . ns but the transit time between logic gates added another . ns. All interconnections were made by terminated transmission line except for a small number of stubbed transmission lines. A great deal of care was paid in developing the signal transmission system. For example: a dual
impedance system of and ohms was used and the width of the basic ohm line was reduced to create a ohm line in the vicinity of loads so that the effective impedance of the loaded ohm line would appear as a ohm line. The processor had in total about , gates. Since each gate had an associated interconnection delay the transit time plus loading effects made the total delay per gate approximately . ns. With stages of logic as the definition for cycle time, this defined ns as the basic CPU cycle. The multiply-divide unit had a subcycle of ns for a partial product iteration. The processor used water-cooled heat exchangers between motherboards for cooling. A motor generator set powered the system, isolating the system from short power disruptions. The total power consumption was a significant fraction of a megawatt.
Bibliographic Notes and Further Reading The basic source material for the Model is the IBM J of Research and Development cited below [–]. The term “Tomasulo’s algorithm” and some similar designations are introduced in []. Thirty years after the Model was initially dedicated there was a retrospective paper on its accomplishments and problems [].
Bibliography The following eight papers are all from the special issue of the IBM Journal of Research and Development devoted to The IBM System/ Model ; vol , issue , January . Flynn MJ, Low PR Some remarks on system development, pp – . Anderson DW, Sparacio FJ, Tomasulo RM Machine philosophy and instruction handling, pp – . Tomasulo RM An efficient algorithm for exploiting multiple arithmetic units, pp – . Anderson SF, Goldschmidt RE, Earle JG, Powers DM, Floatingpoint execution unit, pp – . Boland LJ, Messina BU, Granito GD, Smith JW, Marcotte AU Storage system, pp – . Langdon JL, Van Derveer EJ Design of a high-speed transistor for the ASLT current switch, pp – . Sechler RF, Strube AR, Turnbull JR ASLT circuit design, pp – . Lloyd RF ASLT: an extension of hybrid miniaturization techniques, pp – Other papers mentioning the Model or related computers
. Flynn MJ () Very high speed computers. Proc IEEE : – . Flynn MJ () Computer engineering years after the IBM model . IEEE Comput ():–
IEEE . Ethernet
Illegal Memory Access Intel Parallel Inspector
Illiac IV Yoichi Muraoka Waseda University, Tokyo, Japan
History The Illiac IV computer was the first practical large-scale array computer, which can be classified as an SIMD (single-instruction-stream-multiple-data-streams)-type computer. As the name suggests, the project was managed at the University of Illinois Digital Computer Laboratory under contract from the Defense Advanced Research Project Agency (then called ARPA). The project started in , and the machine was delivered to the NASA Ames Research Center in . It took years to run its first successful application and was made available via the ARPANET, the predecessor of the Internet. The principal investigator was Professor Daniel Slotnick, who conceived the idea in the mid-s as the Solomon computer. The machine was built to answer large computational requirements such as ballistic missile defense analysis, climate modeling, and so on. It was said then that two Illiac-IV computers would suffice to cover all computational requirements on the planet. The design of the computer and the majority of the early software suite development were done by Illinois researchers, including many ambitious graduate students, while the computer was built by the Burroughs Corporation.
To be precise, the project was transferred from the University of Illinois to the Ames Research Center in before its completion due to the heavy anti-Vietnam war protest activity on the campus. The hardware is now on display in the Computer History Museum in Silicon Valley.
Hardware An instruction is decoded by the Control Unit (CU), and the decoded signals are sent to 64 processors, called PEs. A PE is basically a combination of an arithmetic-logic unit, registers, and a 2,048-word memory of -bit length (so in total, MB). It operates at a MHz clock. Each PE loads data to a register from its own memory. Thus, the PEs perform an identical operation over independent data simultaneously. For example, two 64-element vectors can be added by one operation. The operation of each PE can be controlled by a mode bit. If a control bit is set, then a PE participates in the operation; otherwise it does not. Thus, each PE has some degree of freedom. There is an instruction to set/reset the mode bit. A memory address to access the memory in each PE is also provided by the CU. Included in each PE is an index register, so one can modify the address and access a different memory address in different PEs. To exchange data between PEs, the PEs are arranged in the shape of a list with a wrap-around connection. All PEs move data, for example, to their left neighbor PEs simultaneously. This mechanism is called routing. The megabyte memory is far from ideal, so it is backed up by a million-word head-per-track disk with an I/O rate of MB/s. The average access time is ms. To manage the operation of the computer and to provide the I/O capability, a Burroughs B machine is used. The majority of operating system functions run on this machine. To further speed up instruction execution, the instruction decoding in the CU is overlapped with the execution in the PEs, i.e., the instruction decoding and the instruction execution are pipelined.
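A toy model of the execution style described above: one instruction stream applied to all PEs, a mode bit that masks participation, and a wrap-around routing step. The register set, memory size, and operation names are simplifications invented for the sketch.

```python
# Toy SIMD array machine with per-PE mode bits and neighbor routing.

N_PE = 64

class ArrayMachine:
    def __init__(self):
        self.acc = [0] * N_PE            # one accumulator per PE
        self.mode = [True] * N_PE        # mode bit: does this PE participate?
        self.mem = [[0] * 16 for _ in range(N_PE)]   # tiny per-PE memory (PEM)

    def load(self, addr):
        for p in range(N_PE):
            if self.mode[p]:
                self.acc[p] = self.mem[p][addr]

    def add_mem(self, addr):
        for p in range(N_PE):
            if self.mode[p]:
                self.acc[p] += self.mem[p][addr]

    def store(self, addr):
        for p in range(N_PE):
            if self.mode[p]:
                self.mem[p][addr] = self.acc[p]

    def route(self, shift):
        """All PEs pass their accumulator `shift` positions along the
        wrap-around connection simultaneously."""
        self.acc = [self.acc[(p - shift) % N_PE] for p in range(N_PE)]

m = ArrayMachine()
for p in range(N_PE):                    # vector A at address 0, B at address 1
    m.mem[p][0], m.mem[p][1] = p, 2 * p
m.mode = [p % 2 == 0 for p in range(N_PE)]    # only even-numbered PEs participate
m.load(0)
m.add_mem(1)                             # acc = A + B on the enabled PEs
m.store(2)
print(m.mem[0][2], m.mem[1][2], m.mem[2][2])  # 0, 0 (PE1 disabled), 6
m.route(1)
print(m.acc[3])                          # 6: PE2's value moved one PE over
```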
Software Besides an ordinary assembler, called ASK, originally, two compilers were planned, the TRANQUIL and the GLYPNIR.
The TRANQUIL language is an extension of ALGOL. Its main feature is the capability to specify the parallel execution of a FOR loop. For example, FOR SIM I = ( . . . ) A(I) = B(I) + B(I) adds two vectors in one step. Furthermore, it allows a user to specify how to store data in the memory. For example, suppose we have an array of columns by rows. If we store each entire column in separate PE memory, then while operation on rows (e.g., addition of corresponding elements of two rows) can be done in parallel, operation on columns must be done sequentially. To avoid this situation, we may store data as

PE      PE      PE      PE
a(,)    a(,)    a(,)    a(,)
a(,)    a(,)    a(,)    a(,)
a(,)    a(,)    a(,)    a(,)
While the simple data mapping scheme is called straight storage, this scheme is called skewed storage. TRANQUIL provides further varieties of storage schemes for an array to allow efficient parallel computation. While the TRANQUIL language aims at high-level programming, the GLYPNIR language, so to speak, provides a low-level view of the computer. In GLYPNIR, one writes a program for a PE in ALGOL-like statements. The parallel execution is implicit, i.e., basically the identical program is executed in all PEs. Special features are added to take advantage of the mode-bit control and the separate memory indexing in each PE. For example, to add two vectors, we declare variables as PE REAL X,Y by which two 64-element vectors are created, and an element X(i) is stored in the PEM of PEi. Thus, the statement X ← X + Y will add two vectors in parallel. To control the execution in each PE, there is a control statement such as IF < BE > THEN S where BE is a vector of 64 Boolean values, each of which corresponds to the mode bit of a PE. Thus, the statement
S is executed in PEs whose corresponding element of BE is true.
Illiac IV. Fig. Organization of Illiac IV
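The straight and skewed storage schemes can be expressed as simple index mappings. The concrete indices of the table above were lost in this copy, so the sketch below assumes the classic cyclic-shift mapping for skewed storage; it only demonstrates why rows and columns can both be accessed in parallel under that assumption.

```python
# Straight vs. skewed storage for an n x n array spread over n PEs.

N = 4   # PEs and matrix dimension, kept small for readability

def straight_pe(i, j):
    return j                      # column j lives entirely in PE j

def skewed_pe(i, j):
    return (i + j) % N            # row i is cyclically shifted by i (assumed mapping)

for name, place in (("straight", straight_pe), ("skewed", skewed_pe)):
    # PEs touched when accessing row 1 and column 1
    row_pes = {place(1, j) for j in range(N)}
    col_pes = {place(i, 1) for i in range(N)}
    print(f"{name:8s}: row uses {len(row_pes)} PEs, column uses {len(col_pes)} PEs")
# straight: row uses 4 PEs, column uses 1 PE   (columns processed sequentially)
# skewed  : row uses 4 PEs, column uses 4 PEs  (both accessible in parallel)
```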
Applications The Illiac IV was the first practical parallel computer to allow writing real parallel codes. Many original and great parallel application ideas were developed by the project, which contributed to solving real-world problems. Among many influential results, we just list a few of them below. Application areas attacked by the project are:
1. Computational fluid dynamics. NASA people developed an aerodynamic flow simulation program on Illiac IV which helped them to replace their wind tunnel with a computer simulation system. Other programs developed include the Navier–Stokes solver for two-dimensional unsteady transonic flow, a viscous-flow airfoil code, turbulence modeling for three-dimensional incompressible flow, and many more.
2. Image processing. Image processing may be one of the application areas that is most suited for parallel computation. Many innovative parallel algorithms were developed and implemented on the Illiac IV, which include image line detection, image skeletonizing,
shape detection, and many others. Also, the Illiac IV was successfully used to analyze LANDSAT satellite data, especially in clustering and classification.
3. Astronomy. Although very little research was done in this area, three-dimensional galaxy simulations have been run successfully on the Illiac IV. This is a typical n-body problem.
4. Seismic. This is one of the applications which were studied intensively to develop parallel algorithms. The application is characterized as a three-dimensional finite difference code with many irregular data structures.
5. Mathematical programs. Many basic numerical computations have been coded in parallel. They include dense and sparse matrix calculations, the singular value decomposition, and so on.
6. Weather/climate simulation. A dynamic atmospheric model incorporating chemistry and heat exchange was built.
Assessment The project took a decade of development and was also massively over budget. Costs escalated from the
$ million estimated in to $ million by . Only a quarter of the fully planned machine, i.e., PEs instead of PEs, was ever built. Nevertheless, the project developed a very unique and for the time, a powerful computer. Also, many ideas in software are still vital and appreciated. The project pushed research forward, leading the way for machines such as the Thinking Machines CM- and CM-. It is also true that the software team of the project developed not only a whole new set of parallel programming ideas, but also a set of experts in parallel programming.
Bibliography
. Barnes GH et al () The Illiac IV computer. IEEE Trans Comput C-():–. The paper summarizes hardware and software of the Illiac IV
. Abel N et al. TRANQUIL, a language for an array processing computer. In: Proceedings AFIPS SJCC, vol . AFIPS Press, Montvale NJ, pp –. The TRANQUIL language is introduced
. Kuck D, Sameh A () Parallel computation of eigenvalues of real matrices. In: Information Processing , vol II. North-Holland, Amsterdam, pp –. Describes parallel eigenvalue algorithm
. Hord RM () The Illiac IV. Computer Science Press, Rockville. A complete description of the project. Out of print now
ILUPACK Matthias Bollhöfer , José I. Aliaga , Alberto F. Martín , Enrique S. Quintana-Ortí Universitat Jaume I, Castellón, Spain TU Braunschweig Institute of Computational Mathematics, Braunschweig, Germany
Definition ILUPACK is the abbreviation for Incomplete LU factorization PACKage. It is a software library for the iterative solution of large sparse linear systems. It is written in FORTRAN and C and available at http://ilupack. tu-bs.de. The package implements a multilevel incomplete factorization approach (multilevel ILU) based on a special permutation strategy called “inverse-based pivoting” combined with Krylov subspace iteration methods. Its main use consists of application problems such
as linear systems arising from partial differential equations (PDEs). ILUPACK supports single and double precision arithmetic for real and complex numbers. Among the structured matrix classes that are supported by individual drivers are symmetric and/or Hermitian matrices that may or may not be positive definite and general square matrices. An interface to MATLAB (via MEX) is available. The main drivers can be called from C, C++, and FORTRAN.
Discussion Introduction Large sparse linear systems arise in many application areas such as partial differential equations, quantum physics, or problems from circuit and device simulation. They all share the same central task that consists of efficiently solving large sparse systems of equations. For a large class of application problems, sparse direct solvers have proven to be extremely efficient. However, the enormous size of the underlying applications arising in -D PDEs or the large number of devices in integrated circuits currently requires fast and efficient iterative solution techniques, and this need will be exacerbated as the dimension of these systems increases. This in turn demands for alternative approaches and, often, approximate factorization techniques, combined with iterative methods based on Krylov subspaces, reflecting as an attractive alternative for these kinds of application problems. A comprehensive overview over iterative methods can be found in []. The ILUPACK software is mainly built on incomplete factorization methods (ILUs) applied to the system matrix in conjunction with Krylov subspace methods. The ILUPACK hallmark is the so-called inverse-based approach. It was initially developed to connect the ILUs and their approximate inverse factors. These relations are important since, in order to solve linear systems, the inverse triangular factors resulting from the factorization are applied rather than the original incomplete factors themselves. Thus, information extracted from the inverse factors will in turn help to improve the robustness for the incomplete factorization process. While this idea has been successfully used to improve robustness, its downside was initially that the norm of the inverse
factors could become large such that small entries could hardly be dropped during Gaussian elimination. To overcome this shortcoming, a multilevel strategy was developed to limit the growth of the inverse factors. This has led to the inverse-based approach and hence the incomplete factorization process that has eventually been implemented in ILUPACK benefits from the information of bounded inverse factors while being efficient at the same time []. A parallel version of ILUPACK on shared-memory multiprocessors, including current multicore architectures, is under development and expected to be released in the near future. The ongoing development of the parallel code is inspired by a nested dissection hierarchy of the initial system that allows to map tasks concurrently to independent threads within each level.
The Multilevel Method To solve a linear system Ax = b, the multilevel approach of ILUPACK performs the following steps:
1. The given system A is scaled by diagonal matrices D_l and D_r and reordered by permutation matrices Π_l, Π_r as
A → D_l A D_r → Π_l^T D_l A D_r Π_r = Â.
These operations can be considered as a preprocessing prior to the numerical factorization. They typically include scaling strategies to equilibrate the system, scaling and permuting based on maximum weight matchings [], and, finally, fill-reducing orderings such as nested dissection [], (approximate) minimum degree [, ], and some more.
2. An incomplete factorization A ≈ LDU is next computed for the system Â, where L, U^T are unit lower triangular factors and D is diagonal. Since the main objective of ILUPACK (inverse-based approach) is to limit the norm of the inverse triangular factors, L^{-1} and U^{-1}, the approximate factorization process is interlaced with a pivoting strategy that cheaply estimates the norm of these inverse factors. The pivoting process decides in each step to reject a factorization step if an imposed bound κ is exceeded, or to accept a pivot and continue the approximate factorization otherwise. The set of rejected rows and columns is permuted to the lower and right end of the matrix. This process is illustrated in Fig. .
The pivoting strategy computes a partial factorization
$$P^T \hat{A} P = \begin{pmatrix} B & E \\ F & C \end{pmatrix} = \begin{pmatrix} L_B & 0 \\ L_F & I \end{pmatrix} \begin{pmatrix} D_B U_B & D_B U_E \\ 0 & S_C \end{pmatrix} + R,$$
where R is the error matrix, which collects those entries of Â that were dropped during the factorization, and S_C is the Schur complement consisting of all rows and columns associated with the rejected pivots. By construction the inverse triangular factors satisfy
$$\left\| \begin{pmatrix} L_B & 0 \\ L_F & I \end{pmatrix}^{-1} \right\|,\ \left\| \begin{pmatrix} U_B & U_E \\ 0 & I \end{pmatrix}^{-1} \right\| \lesssim \kappa.$$
3. Steps 1 and 2 are successively applied to Â := S_C until S_C is void or "sufficiently dense" to be efficiently factorized by a level-3 BLAS-based direct factorization kernel.
When the multilevel method is applied over multiple levels, a cascade of factors L_B, D_B, and U_B, as well as matrices E, F are usually obtained (cf. Fig. ). Solving linear systems via a multilevel ILU requires a hierarchy of forward and backward substitutions, interlaced with reordering and scaling stages. ILUPACK employs Krylov subspace methods to incorporate the approximate factorization into an iterative method. The computed multilevel factorization is adapted to the structure of the underlying system. For real symmetric positive definite (SPD) matrices, an incomplete Cholesky decomposition is used in conjunction with the conjugate gradient method. For the symmetric and indefinite case, symmetric maximum weight matchings [] allow one to build 1 × 1 and 2 × 2 pivots. In this case ILUPACK constructs a blocked symmetric multilevel ILU and relies on the simplified QMR [] as iterative solver. The general (unsymmetric) case typically uses GMRES [] as default driver. All drivers in ILUPACK support real/complex and single/double precision arithmetic.
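The overall pattern ILUPACK implements — an incomplete factorization applied as a preconditioner inside a Krylov iteration — can be sketched with SciPy. Note that scipy.sparse.linalg.spilu is a plain single-level ILU, not ILUPACK's multilevel inverse-based factorization, and the test matrix and tolerances below are arbitrary.

```python
# Incomplete factorization as a preconditioner for a Krylov solver.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 50                                    # 2-D Laplacian, an SPD test matrix
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsc()
b = np.ones(A.shape[0])

ilu = spla.spilu(A, drop_tol=1e-3, fill_factor=10)   # incomplete LU of A
M = spla.LinearOperator(A.shape, matvec=ilu.solve)   # preconditioner M ~ A^-1

its = []
x, info = spla.cg(A, b, M=M, callback=lambda xk: its.append(1), maxiter=1000)
print("info:", info, "(0 means converged); iterations:", len(its))
print("residual norm:", np.linalg.norm(A @ x - b))
```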
ILUPACK. Fig. ILUPACK pivoting strategy
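A dense toy version of the accept/reject loop sketched in the figure above. The rejection test below is a crude multiplier-growth proxy, not ILUPACK's actual estimate of the norms of the inverse factors, no entries are dropped (so the arithmetic is exact), and the example matrix and value of κ are made up.

```python
import numpy as np

def partial_factorization(A, kappa=5.0):
    """One level of pivot acceptance/rejection (dense toy).  A pivot is
    rejected when the multipliers it would create exceed kappa, and the
    rejected rows/columns are deferred to the Schur complement."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    accepted, rejected = [], []
    remaining = set(range(n))                 # rows/columns not eliminated yet
    for k in range(n):
        others = sorted(remaining - {k})
        piv = A[k, k]
        col, row = A[others, k], A[k, others]
        too_big = (piv == 0.0 or
                   np.max(np.abs(np.concatenate((col, row))) / abs(piv),
                          initial=0.0) > kappa)
        if too_big:
            rejected.append(k)                # defer to the next level
            continue
        # eliminate pivot k: Schur update of every remaining row/column
        A[np.ix_(others, others)] -= np.outer(col / piv, row)
        accepted.append(k)
        remaining.remove(k)
    S_C = A[np.ix_(rejected, rejected)]       # Schur complement for next level
    return accepted, rejected, S_C

B = np.array([[4.0, 1.0, 0.0, 1.0],
              [1.0, 4.0, 1.0, 0.0],
              [0.0, 1.0, 0.01, 1.0],
              [1.0, 0.0, 1.0, 4.0]])
acc, rej, S = partial_factorization(B, kappa=3.0)
print("accepted:", acc, "rejected:", rej)     # accepted: [0, 1, 3] rejected: [2]
print("Schur complement:\n", S)               # 1 x 1 block for the next level
```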
ILUPACK. Fig. ILUPACK multilevel factorization
Mathematical Background The motivation for the inverse-based approach of ILUPACK can be explained in two ways. First, when the approximate factorization A = LDU + R is computed for some error matrix R, the inverse triangular factors L^{-1} and U^{-1} have to be applied to solve a linear system Ax = b. From this point of view,
$$L^{-1} A U^{-1} = D + L^{-1} R U^{-1}$$
ensures that the entries in the error matrix R are not amplified by some large inverse factors L^{-1} and U^{-1}. The second and more important aspect links ILUPACK's multilevel ILU to algebraic multilevel methods. The inverse Â^{-1} can be approximately written as
$$\hat{A}^{-1} \approx P \left( \begin{pmatrix} (L_B D_B U_B)^{-1} & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} -U_B^{-1} U_E \\ I \end{pmatrix} S_C^{-1} \begin{pmatrix} -L_F L_B^{-1} & I \end{pmatrix} \right) P^T.$$
In this expression, all terms on the right hand side are constructed to be bounded except S_C^{-1}. Since in general Â^{-1} has a relatively large norm, so does S_C^{-1}. In principle this justifies that eigenvalues of small modulus are revealed by the approximate Schur complement S_C (a precise explanation is given in []). S_C serves as some kind of coarse grid system in the sense of discretized partial differential equations. This observation goes hand in hand with the observation that the inverse-based pivoting approach selects pivots similar to coarsening strategies in algebraic multilevel methods,
which is demonstrated for the following simple model problem: −ε u_xx(x, y) − u_yy(x, y) = f(x, y) for all (x, y) ∈ [0, 1]², u(x, y) = 0 on ∂[0, 1]². This partial differential equation can be easily discretized on a square grid Ω_h = {(kh, lh) : k, l = 0, . . . , N + 1}, where h = 1/(N + 1) is the mesh size, using standard finite difference techniques. Algebraic approaches to multilevel methods roughly treat the system as if the term −ε u_xx(x, y) were hardly present, i.e., the coarsening process treats it as if it were a sequence of one-dimensional differential equations in the y-direction. Thus, semi-coarsening in the y-direction is the usual approach to build a coarse grid. ILUPACK inverse-based pivoting algebraically picks and rejects the pivots precisely in the same way as in semi-coarsening. In Fig. , this is illustrated for a portion of the grid Ω_h using κ = . In the y-direction, pivots are rejected after – steps, while in the x-direction all pivots are kept or rejected (in blue and red, resp.). The number of rejected pivots in ILUPACK strongly depends on the underlying application problem.
As a rule of thumb, the more rows (and columns) of the coefficient matrix satisfy |a_ii| ≫ Σ_{j≠i} |a_ij|, the fewer pivots tend to be rejected.
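This rule of thumb can be checked cheaply, as in the following sketch; the threshold factor is arbitrary and the test matrix is a simple tridiagonal example.

```python
# Fraction of rows that are strongly diagonally dominant (illustrative).
import numpy as np
import scipy.sparse as sp

def dominance_ratio(A, factor=10.0):
    """Fraction of rows with |a_ii| >= factor * sum_{j != i} |a_ij|."""
    A = sp.csr_matrix(A)
    diag = np.abs(A.diagonal())
    off = np.asarray(abs(A).sum(axis=1)).ravel() - diag
    return np.mean(diag >= factor * off)

A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(100, 100))
print(dominance_ratio(A, factor=1.0))   # 1.0: every row is diagonally dominant
print(dominance_ratio(A, factor=3.0))   # 0.02: only the two boundary rows qualify
```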
The Parallelization Approach Parallelism in the computation of approximate factorizations can be exposed by means of graph-based symmetric reordering algorithms, such as graph coloring or graph partitioning techniques. Among these classes of algorithms, nested dissection orderings enhance parallelism in the approximate factorization of A by partitioning its associated adjacency graph G(A) into a hierarchy of vertex separators and independent subgraphs. For example, in Fig. , G(A) is partitioned after two levels of recursion into four independent subgraphs, G(,) , G(,) , G(,) , and G(,) , first by separator S(,) , and then by separators S(,) and S(,) . This hierarchy is constructed so that the size of vertex separators is minimized while simultaneously balancing the size of the independent subgraphs. Therefore, relabeling the nodes of G(A) according to the levels in the hierarchy leads to a reordered matrix, A → Φ T AΦ, with a structure amenable to efficient parallelization. In particular, the leading diagonal blocks of
ILUPACK. Fig. ILUPACK pivoting for partial differential equations. Blue and red dots denote, respectively, accepted and rejected pivots
ILUPACK. Fig. Nested dissection reordering
ΦT AΦ associated with the independent subgraphs can be first eliminated independently; after that, S(,) and S(,) can be eliminated in parallel, and finally separator S(,) is processed. This type of parallelism can be expressed by a binary task dependency tree, where nodes represent concurrent tasks and arcs dependencies among them. State-of-the-art reordering software packages e.g., METIS (http://glaros.dtc.umn.edu/gkhome/ views/metis) or SCOTCH (http://www.labri.fr/perso/ pelegrin/scotch), provide fast and efficient multilevel variants [] of nested dissection orderings. The dependencies in the task tree are resolved while the computational data and results are generated and passed from the leaves toward the root. The leaves are responsible for approximating the leading diagonal blocks of Φ T AΦ, while those blocks which will be later factorized by their ancestors are updated. For example, in Fig. , colors are used to illustrate the correspondence between the blocks of Φ T AΦ to be factorized by tasks (,), (,), and (,). Besides, (,) only updates those blocks that will be later factorized by tasks (,) and (,). Taking this into consideration, Φ T AΦ is decomposed into the sum of several submatrices, one local block per each leaf of the tree, as shown in Fig. . Each local submatrix is composed of the blocks to be factorized by the corresponding task, together with its local contributions to the blocks that are later factorized by its ancestors, hereafter referred as contribution blocks. This strategy significantly increases the degree of parallelism,
because the updates from descendant nodes to an ancestor node can also be performed locally/independently by its descendants. The parallel multilevel method considers the following partition of the local submatrices into 2 × 2 block matrices
$$A_{par} = \begin{pmatrix} A_X & A_V \\ A_W & A_Z \end{pmatrix},$$
where the partitioning lines separate the blocks to be factorized by the task, i.e., A_X, A_W, and A_V, and its contribution blocks, i.e., A_Z. It then performs the following steps:
1. Scaling and permutation matrices are only applied to the blocks to be factorized,
$$A_{par} \rightarrow \begin{pmatrix} \Pi_l^T & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} D_l & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} A_X & A_V \\ A_W & A_Z \end{pmatrix} \begin{pmatrix} D_r & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} \Pi_r & 0 \\ 0 & I \end{pmatrix} = \begin{pmatrix} \hat{A}_X & \hat{A}_V \\ \hat{A}_W & \hat{A}_Z \end{pmatrix} = \hat{A}_{par}.$$
ILUPACK. Fig. Matrix decomposition and local submatrix associated to a single node of the task tree
ILUPACK. Fig. Local incomplete factorization computed by a single node of the task tree
2. These blocks are next approximately factorized using inverse-based pivoting, while Â_Z is only updated. For this purpose, during the incomplete factorization of Â_par, rejected rows and columns are permuted to the lower and right end of the leading block Â_X. This is illustrated in Fig. .
ILUPACK. Fig. Parent nodes construct their local submatrix from the data generated by their children
This step computes a partial factorization
$$\begin{pmatrix} P^T & 0 \\ 0 & I \end{pmatrix} \hat{A}_{par} \begin{pmatrix} P & 0 \\ 0 & I \end{pmatrix} = \begin{pmatrix} B_X & E_X & E_V \\ F_X & C_X & C_V \\ F_W & C_W & C_Z \end{pmatrix} = \begin{pmatrix} L_{B_X} & 0 & 0 \\ L_{F_X} & I & 0 \\ L_{F_W} & 0 & I \end{pmatrix} \begin{pmatrix} D_{B_X} U_{B_X} & D_{B_X} U_{E_X} & D_{B_X} U_{E_V} \\ 0 & S_{C_X} & S_{C_V} \\ 0 & S_{C_W} & S_{C_Z} \end{pmatrix} + R,$$
where the inverse triangular factors approximately satisfy
$$\left\| \begin{pmatrix} L_{B_X} & 0 & 0 \\ L_{F_X} & I & 0 \\ L_{F_W} & 0 & I \end{pmatrix}^{-1} \right\|,\ \left\| \begin{pmatrix} U_{B_X} & U_{E_X} & U_{E_V} \\ 0 & I & 0 \\ 0 & 0 & I \end{pmatrix}^{-1} \right\| \lesssim \kappa.$$
3. Steps 1 and 2 are successively applied to the Schur complement until S_{C_X} is void or "sufficiently small", i.e., A_par and its 2 × 2 block partition are redefined as
$$A_{par} = \begin{pmatrix} A_X & A_V \\ A_W & A_Z \end{pmatrix} := \begin{pmatrix} S_{C_X} & S_{C_V} \\ S_{C_W} & S_{C_Z} \end{pmatrix}.$$
4. The task completes its local computations and the result A_par is sent to the parent node in the task dependency tree. When a parent node receives the
data from its children, it must first construct its local submatrix Apar . To achieve this, it incorporates the pivots rejected from its children and accumulates their contribution blocks, as shown in Fig. . If the parent node is the root of the task dependency tree, it applies the sequential multilevel algorithm to the new constructed submatrix Apar . Otherwise, the parallel multilevel method is restarted on this matrix at step . To compare the cascade of approximations computed by the sequential multilevel method and its parallel variant, it is helpful to consider the latter as an algorithmic variant, which enforces a certain order of elimination. Thus, the parallel variant interlaces the algebraic levels generated by the pivoting strategy of ILUPACK with the nested dissection hierarchy levels in order to expose a high degree of parallelism. The leading diagonal blocks associated with the last nested dissection hierarchy level are first factorized using inverse-based pivoting, while those corresponding to previous hierarchy levels are only updated, i.e., the nodes belonging to the separators are rejected by construction. The multilevel algorithm is restarted on the rejected nodes, and only when it has eliminated the bulk of the nodes of the last hierarchy level, it starts approximating the blocks belonging to previous hierarchy levels. Figure compares the distribution of
ILUPACK. Fig. ILUPACK pivoting strategy (left) and its parallel variant (right). Blue and red dots denote, respectively, accepted and rejected nodes
ILUPACK. Fig. Parallel speedup as a function of problem size (left) and ratio among the memory consumed by the parallel multilevel method and that of the sequential method (right) for a discretized -D elliptic PDE
accepted and rejected nodes (in blue and red, resp.) by the sequential and parallel inverse-based incomplete factorizations when they are applied to the Laplace PDE with discontinuous coefficients discretized with a standard × finite-difference grid. In both cases, the grid was reordered using nested dissection. These diagrams confirm the strong similarity between the sequential inverse-based incomplete factorization and its parallel variant for this particular example. The experimentation with this approach in the SPD case reveals that this compromise has a negligible impact on
the numerical properties of the inverse-based preconditioning approach, in the sense that the convergence rate of the preconditioned iterative method is largely independent of the number of processors involved in the parallel computation.
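The leaves-to-root traversal of the task dependency tree can be mimicked with a few lines of Python. The "factorize" and "merge" work is faked with integers, the thread pool stands in for a real dynamic-scheduling runtime, and the tree shape and worker count are arbitrary.

```python
# Toy leaves-to-root processing of a task dependency tree: independent
# subtrees run concurrently, and a parent merges its children's
# contribution blocks before "factorizing" its own separator block.
from concurrent.futures import ThreadPoolExecutor

class Task:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

def process(task, pool):
    # Subtrees are independent, so the children can run concurrently.
    futures = [pool.submit(process, child, pool) for child in task.children]
    contributions = [f.result() for f in futures]     # wait for descendants
    merged = sum(contributions)                       # accumulate contribution blocks
    print(f"factorize {task.name} after merging {len(contributions)} child blocks")
    return merged + 1                                 # this node's contribution

tree = Task("(1,1)", [Task("(2,1)", [Task("(3,1)"), Task("(3,2)")]),
                      Task("(2,2)", [Task("(3,3)"), Task("(3,4)")])])

# Four workers suffice here because blocked inner tasks still leave free
# workers for the leaves; a production scheduler instead dispatches only
# tasks whose dependencies are already resolved.
with ThreadPoolExecutor(max_workers=4) as pool:
    process(tree, pool)
```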
Numerical Example The example illustrates the parallel and memory scalability of the parallel multilevel method on a SGI Altix CC-NUMA shared-memory multiprocessor with Intel Itanium@. GHz processors
ILUPACK
sharing GBytes of RAM connected via a SGI NUMAlink network. The nine linear systems considered in this experiment are derived from the linear finite element discretization of the irregular -D elliptic PDE [−div(A grad u) = f ] in a -D domain, where A(x, y, z) is chosen with positive random coefficients. The size of the systems ranges from n = , –, , equations/unknowns. The parallel execution of the task tree on sharedmemory multiprocessors is orchestrated by a runtime which dynamically maps tasks to threads (processors) in order to improve load balance requirements during the computation of the multilevel preconditioner. Figure displays two lines for each number of processors: dashed lines are obtained for a perfect binary tree with the same number of leaves as processors, while solid lines correspond to a binary tree with (potentially) more leaves than processors (up to p leaves). The higher speedups revealed in the solid lines demonstrate the benefit of dynamic scheduling on shared-memory multiprocessors. As shown in Fig. (left), the speedup always increases with the number of processors for a fixed problem size, and the parallel efficiency rapidly grows with problem size for a given number of processors. Besides, as shown in Fig. (right), the memory overhead associated with parallelization relatively decreases with the problem size for a fixed number of processors, and it is below . for the two largest linear systems; this is very moderate taking into consideration that the amount of available physical memory typically increases linearly with the number of processors. These observations confirm the excellent scalability of the parallelization approach up to cores.
Future Research Directions ILUPACK is currently parallelized for SPD matrices only. Further classes of matrices such as symmetric and indefinite matrices or unsymmetric, but symmetrically structured matrices are subject to ongoing research. The parallelization of these cases shares many similarities with the SPD case. However, maximum weight matching, or more generally, methods to rescale and reorder the system to improve the size of the diagonal entries (resp. diagonal blocks) are hard to parallelize []. Although the current implementation of ILUPACK is designed for shared-memory
multiprocessors using OpenMP, the basic principle is currently being transferred to distributed-memory parallel architectures via MPI. Future research will be devoted to block-structured algorithms to improve cache performance. Block structures have been proven to be extremely efficient for direct methods, but their integration into incomplete factorizations remains a challenge; on the other hand, recent research indicates the high potential of block-structured algorithms also for ILUs [, ].
Related Entries
Graph Partitioning
Linear Algebra Software
Load Balancing, Distributed Memory
Nonuniform Memory Access (NUMA) Machines
Shared-Memory Multiprocessors
Bibliographic Notes and Further Reading Around , a first package also called ILUPACK was developed by H.D. Simon [], which used reorderings, incomplete factorizations, and iterative methods in one package. Nowadays, as the development of preconditioning methods has advanced significantly by novel approaches like improved reorderings, multilevel methods, maximum weight matchings, and inversebased pivoting, incomplete factorization methods have changed completely and gained wide acceptance among the scientific community. Besides ILUPACK, there exist several software packages based on incomplete factorizations and on multilevel factorizations. For example, Y. Saad (http://www-users.cs.umn.edu/∼saad/ software/) et al. developed the software packages SPARSKIT, ITSOL, and pARMS, which also inspired the development of ILUPACK. In particular, pARMS is a parallel code based on multilevel ILU interlaced with reorderings. J. Mayer (http://iamlasun.mathematik. uni-karlsruhe.de/∼ae/iluplusplus.html) developed ILU++, also for multilevel ILU. MRILU, by F.W. Wubs (http://www.math.rug.nl/∼wubs/mrilu/) et al., uses multilevel incomplete factorizations and has successfully been applied to problems arising from partial differential equations. There are many other scientific publications on multilevel ILU methods, most of them especially tailored for the use in partial differential equations.
I
I
Impass
ILUPACK has been successfully applied to several large scale application problems, in particular, to the Anderson model of localization [] or the Helmholtz equation []. Its symmetric version is integrated into the software package JADAMILU (http://homepages.ulb.ac.be/∼jadamilu/), which is a Jacobi-Davidson-based eigenvalue solver for symmetric eigenvalue problems. More details on the design aspects of the parallel multilevel preconditioner (including dynamic scheduling) and experimental data with a benchmark of irregular matrices from the UF sparse matrix collection (http://www.cise.ufl.edu/research/sparse/ matrices/) can be found in [, ]. The computed parallel multilevel factorization is incorporated to a Krylov subspace solver, and the underlying parallel structure of the former, expressed by the task dependency tree, is exploited for the application of the preconditioner as well as other major operations in iterative solvers []. The experience with these parallel techniques has revealed that the sequential computation of the nested dissection hierarchy (included in, e.g., METIS or SCOTCH) dominates the execution time of the whole solution process when the number of processors is large. Therefore, additional types of parallelism have to be exploited during this stage in order to develop scalable parallel solutions; in [], two parallel partitioning packages, ParMETIS (http://glaros.dtc. umn.edu/gkhome/metis/parmetis/overview) and PTSCOTCH (http://www.labri.fr/perso/pelegrin/scotch), are evaluated with this purpose.
Bibliography . Aliaga JI, Bollhöfer M, Martín AF, Quintana-Ortí ES () Design, tuning and evaluation of parallel multilevel ILU preconditioners. In: Palma J, Amestoy P, Dayde M, Mattoso M, Lopes JC (eds) High performance computing for computational science – VECPAR , Toulouse, France. Number in Lecture Notes in Computer Science, pp –. Springer, Berlin/Heidelberg . Aliaga JI, Bollhöfer M, Martín AF, Quintana-Ortí ES () Exploiting thread-level parallelism in the iterative solution of sparse linear systems. Technical report, Dpto. de Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (submitted for publication) . Aliaga JI, Bollhöfer M, Martín AF, Quintana-Ortí ES () Evaluation of parallel sparse matrix partitioning software for parallel multilevel ILU preconditioning on shared-memory multiprocessors. In: Chapman B et al (eds) Parallel computing: from multicores and GPUs to petascale. Advances in parallel computing, vol . IOS Press, Amsterdam, pp –
. Amestoy P, Davis TA, Duff IS () An approximate minimum degree ordering algorithm. SIAM J Matrix Anal Appl (): – . Bollhöfer M, Grote MJ, Schenk O () Algebraic multilevel preconditioner for the Helmholtz equation in heterogeneous media. SIAM J Sci Comput ():– . Bollhöfer M, Saad Y () Multilevel preconditioners constructed from inverse–based ILUs. SIAM J Sci Comput ():– . Duff IS, Koster J () The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J Matrix Anal Appl ():– . Duff IS, Pralet S () Strategies for scaling and pivoting for sparse symmetric indefinite problems. SIAM J Matrix Anal Appl ():– . Duff IS, Uçar B (Aug ) Combinatorial problems in solving linear systems. Technical Report TR/PA//, CERFACS . Freund R, Nachtigal N () Software for simplified Lanczos and QMR algorithms. Appl Numer Math ():– . George A, Liu JW () The evolution of the minimum degree ordering algorithm. SIAM Rev ():– . Gupta A, George T () Adaptive techniques for improving the performance of incomplete factorization preconditioning. SIAM J Sci Comput ():– . Hénon P, Ramet P, Roman J () On finding approximate supernodes for an efficient block-ILU(k) factorization. Parallel Comput (–):– . Karypis G, Kumar V () A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput ():– . Saad Y () Iterative methods for sparse linear systems, nd edn. SIAM Publications, Philadelphia, PA . Saad Y, Schultz M () GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J Sci Stat Comput :– . Schenk O, Bollhöfer M, Römer RA () Awarded SIGEST paper: on large scale diagonalization techniques for the Anderson model of localization. SIAM Rev :– . Simon HD (Jan ) User guide for ILUPACK: incomplete LU factorization and iterative methods. Technical Report ETA-LR, Boeing Computer Services
Impass Deadlocks
Implementations of Shared Memory in Software Software Distributed Shared Memory
Index All-to-All
InfiniBand Dhabaleswar K. Panda, Sayantan Sur The Ohio State University, Columbus, OH, USA
Synonyms Interconnection network; Network architecture
Definition The InfiniBand Architecture (IBA) describes a switched interconnect technology for inter-processor communication and I/O in a multiprocessor system. The architecture is independent of the host operating system and the processor platform. InfiniBand is based on a widely adopted open standard.
Discussion Introduction As commodity clusters started getting popular around the mid-nineties, the common interconnect used for such clusters was Fast Ethernet ( Mbps). These clusters were identified as Beowulf clusters []. Even though it was cost effective to design such clusters with Ethernet/Fast Ethernet, the communication performance was not very good because of the high overhead associated with the standard TCP/IP communication protocol stack. The high overhead was due to the TCP/IP protocol stack being executed completely on the host processor. The network interface cards (NICs, commonly known as network adapters) on those systems were also not intelligent. Thus, these adapters did not allow any overlap of computation with communication or I/O. This led to high latency, low bandwidth, and high CPU requirements for communication and I/O operations on those clusters. The above limitations led to the goal of designing a new and converged interconnect with an open standard. Seven industry leaders (Compaq, Dell, IBM, Intel, Microsoft, HP, and Sun) formed a new InfiniBand Trade Association []. The charter for this association was “to
design a scalable and high performance communication and I/O architecture by taking an integrated view of computing, networking and storage technologies.” Many other companies participated in this consortium later. The members of this consortium deliberated and defined the InfiniBand architecture specification. The first specification (Volume , Version .) was released to the public on October , . Since then, this standard has been enhanced periodically. The latest version, .., was released in January . The word “InfiniBand” is coined from the two words “Infinite Bandwidth.” The architecture is defined in such a manner that, as the speed of computing and networking technologies improves over time, the InfiniBand architecture should be able to deliver higher and higher bandwidth. At its inception in , InfiniBand delivered a link speed of . Gbps (payload data rate of Gbps due to the underlying / encoding and decoding). Now in , it is able to deliver a link speed of Gbps (payload data rate of Gbps).
Communication Model The communication model comprises the interaction of multiple components in the interconnect system. The major components and their interaction are described below.
Topology and Network Components At a high level, IBA serves as an interconnection of nodes, where each node can be a processor node, an I/O unit, a switch, or a router to another network, as illustrated in Fig. . Processor nodes or I/O units are typically referred to as “end nodes,” while switches and routers are referred to as “intermediate nodes” (or sometimes as “routing elements”). An IBA network is subdivided into subnets interconnected by routers. Overall, the IB fabric is composed of four different components: (a) channel adapters, (b) links and repeaters, (c) switches, and (d) routers. Channel adapters: A processor or I/O node connects to the IB fabric using channel adapters. These channel adapters consume and generate IB packets. Most current channel adapters are equipped with programmable direct memory access (DMA) engines with protection features implemented in hardware. Each adapter can have one or more physical ports connecting it to either a
switch/router or another adapter. Each physical port itself internally maintains two or more virtual channels (or virtual lanes) with independent buffering for each of them. The channel adapters also provide a memory translation and protection (MTP) mechanism that translates virtual addresses to physical addresses and validates access rights. Each channel adapter has a globally unique identifier (GUID) assigned by the channel adapter vendor. Additionally, each port on the channel adapter has a unique port GUID assigned by the vendor as well. Links and repeaters: Links interconnect channel adapters, switches, routers, and repeaters to form an IB network fabric. Different forms of IB links are currently available, including copper links, optical links, and printed circuit wiring on a backplane. Repeaters are transparent devices that extend the range of a link. Links and repeaters themselves are not accessible in the IB architecture. However, the status of a link (e.g., whether it is up or down) can be determined through the devices which the link connects.
HCA = InfiniBand Channel Adapter in processor node; TCA = InfiniBand Channel Adapter in I/O node
InfiniBand. Fig. Typical InfiniBand Cluster (Courtesy InfiniBand Standard): processor nodes (CPUs, memory, HCAs), switches, routers to other IB subnets, WANs, and LANs, and I/O chassis with TCA-attached I/O modules, all connected through the IB fabric
Switches: IBA switches are the fundamental routing components for intra-subnet routing. Switches do not generate or consume data packets (they generate/consume only management packets). Every destination within the subnet is configured with one or more unique local identifiers (LIDs). Packets contain a destination address that specifies the LID of the destination (DLID). The switch is typically configured out-of-band with a forwarding table that allows it to route the packet to the appropriate output port. This out-of-band configuration of the switch is handled by a separate component within IBA called the subnet manager, as discussed later in this entry. It is to be noted that such LID-based forwarding allows the subnet manager to configure multiple routes between the same two destinations. This, in turn, allows the network to maximize availability by rerouting packets around failed links through reconfiguration of the forwarding tables. Routers: IBA routers are the fundamental routing components for inter-subnet routing. Like switches, routers do not generate or consume data packets; they
simply pass them along. Routers forward packets based on the packet’s global route header and replace the packet’s local route header as the packet passes from one subnet to another. The primary difference between a router and a switch is that routers are not completely transparent to the end nodes since the source must specify the LID of the router and also provide the global identifier (GID) of the destination. IB routers use the IPv protocol to derive their forwarding tables.
Messaging Communication operations in InfiniBand are initiated by the consumer by setting up a list of instructions that the hardware executes. In IB terminology, this facility is referred to as a work queue. In general, work queues are created in pairs, referred to as queue pairs (QPs), one for send operations and one for receive operations. The consumer process submits a work queue element (WQE) to be placed in the appropriate work queue. The channel adapter executes WQEs in the order in which they were placed. After executing a WQE, the channel adapter places a completion queue entry (CQE) in a completion queue (CQ); each CQE contains or points to all the information corresponding to the completed request. This is illustrated in Fig. .
InfiniBand. Fig. Consumer Queuing Model (Courtesy InfiniBand Standard): work requests submitted by the consumer are queued as WQEs on send and receive work queues, executed by the hardware, and reported back as CQEs on a completion queue
Send operations: There are three classes of send queue operations: (a) Send, (b) Remote Direct Memory Access (RDMA), and (c) Memory Binding. For a send operation, the WQE specifies a block of data in the consumer’s memory. The IB hardware uses this information to send the associated data to the destination. The send operation requires the consumer process
to prepost a receive WQE, which the consumer network adapter uses to place the incoming data in the appropriate location. For an RDMA operation, together with the information about the local buffer and the destination end point, the WQE also specifies the address of the remote consumer’s memory. Thus, RDMA operations do not need the consumer to specify the target memory location using a receive WQE. There are four types of RDMA operations: RDMA write, RDMA write with immediate data, RDMA read, and Atomic. In an RDMA write operation, the sender specifies the target location to which the data has to be written. In this case, the consumer process does not need to post any receive WQE at all. In an RDMA write with immediate data operation, the sender again specifies the target location to write data, but the receiver still needs to post a receive WQE. The receive WQE is marked complete when all the data is written. An RDMA read operation is similar to an RDMA write operation, except that instead of writing data to the target location, data is read from the target location. Finally, there are two types of atomic operations specified: fetch-and-add and compare-and-swap. If two atomic operation requests target the same location simultaneously, the second operation is not started till the first one completes. However, such atomicity is not guaranteed when another device is operating on the same data. For example, if the CPU or another InfiniBand adapter modifies data while one InfiniBand adapter is performing an atomic operation on this data, the target location might get corrupted. Receive operations: Unlike the send operations, there is only one type of receive operation. For a send operation, the corresponding receive WQE specifies where to place the data that is received and, if requested, the receive WQE is placed in the completion queue. For an RDMA write with immediate data operation, the consumer process does not have to specify where to place the data (if specified, the hardware will ignore it). When the message arrives, the receive WQE is updated with
the memory location specified by the sender, and the WQE is placed in the completion queue, if requested.
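The WQE/CQE model described above is what verbs libraries such as OpenFabrics libibverbs expose to applications (the IBA specification itself does not mandate a software interface). The following C sketch assumes that a queue pair qp, a completion queue cq, and a registered memory region mr covering buf have already been created and connected elsewhere, and that the remote virtual address and rkey for the RDMA write were exchanged out of band; it posts one send WQE and one RDMA-write WQE and then polls for the two completions.

#include <stdint.h>
#include <infiniband/verbs.h>

int post_and_wait(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                  void *buf, uint32_t len, uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };

    /* Channel semantics: a Send WQE; the target must have preposted a receive WQE. */
    struct ibv_send_wr send_wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };

    /* Memory semantics: an RDMA Write WQE; no receive WQE is consumed at the target. */
    struct ibv_send_wr write_wr = {
        .wr_id      = 2,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = remote_rkey,
    };

    struct ibv_send_wr *bad;
    if (ibv_post_send(qp, &send_wr, &bad) || ibv_post_send(qp, &write_wr, &bad))
        return -1;

    /* Reap the two CQEs; a real consumer would also inspect wc.wr_id and errors. */
    int done = 0;
    while (done < 2) {
        struct ibv_wc wc;
        int n = ibv_poll_cq(cq, 1, &wc);
        if (n < 0) return -1;
        if (n == 1 && wc.status == IBV_WC_SUCCESS) done++;
        else if (n == 1) return -1;
    }
    return 0;
}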
Overview of Features IBA describes a multilayer network protocol stack. Each layer provides several features that are usable by applications: () link layer features, () network layer features, and () transport layer features.
Link Layer Features CRC-based data integrity: IBA provides two forms of CRC-based data integrity to achieve both early error detection and end-to-end reliability: invariant CRC and variant CRC. Invariant CRC (ICRC) covers fields that do not change on each network hop. Variant CRC (VCRC), on the other hand, covers the entire packet including the variant as well as the invariant fields. Buffering and flow control: IBA provides an absolute credit-based flow control where the receiver guarantees that it has enough space allotted to receive N blocks of data. The sender sends only N blocks of data before waiting for an acknowledgment from the receiver. The receiver occasionally updates the sender with an acknowledgment as and when the receive buffers get freed up. Note that this link-level flow control has no relation to the number of messages sent, but only to the total amount of data that has been sent. Virtual lanes, service levels, and QoS: Virtual lanes (VLs) are a mechanism that allows the emulation of multiple virtual links within a single physical link, as illustrated in Fig. . Each port provides at least and up to virtual lanes (VL to VL). VL is reserved exclusively for subnet management traffic. Congestion control: Together with flow control, IBA also defines a multistep congestion control mechanism for the network fabric. Specifically, congestion control, when used in conjunction with the flow-control mechanism, is intended to alleviate head-of-line blocking for non-congested flows in a congested environment. To achieve effective congestion control, IB switch ports need to first identify whether they are the root or the victim of a congestion. A switch port is a root of a congestion if it is sending data to a destination faster than the destination can receive it, thus using up all the flow-control credits available on the switch link. On the other hand, a port is a victim of a congestion if it is unable to send
data on a link because another node is using up all of the available flow-control credits on the link. In order to identify whether a port is the root or the victim of a congestion, IBA specifies a simple approach. When a switch port notices congestion and has no flow-control credits left, it assumes that it is a victim of congestion. On the other hand, when a switch port notices congestion and still has flow-control credits left, it assumes that it is the root of a congestion. This approach is not perfect: it is possible that even a root of a congestion may not have flow-control credits remaining at some point of the communication (for example, if the receiver process is too slow in receiving data); in this case, even the root of the congestion would assume that it is a victim of the congestion. Thus, though not required by the IBA, in practice, IB switches are configured to react with the congestion control protocol irrespective of whether they are the root or the victim of the congestion. Static rate control: IBA defines a number of different link bit rates, such as x SDR (. Gbps), x SDR ( Gbps), x SDR ( Gbps), x DDR ( Gbps), x QDR ( Gbps), and so on. It is to be noted that these rates are signaling rates; with the b/b data encoding, the actual maximum data rates would be Gbps (x SDR), Gbps (x SDR), Gbps (x SDR), Gbps (x DDR), and Gbps (x QDR), respectively. To simultaneously support multiple link speeds within a fabric, IBA defines a static rate control mechanism that prevents faster links from overrunning the capacity of ports with slower links. Using the static rate control mechanism, each destination has a timeout value based on the ratio of the destination and source link speeds.
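The absolute credit-based flow control described above can be pictured with a small toy model in C (not IBA's actual wire format): the receiver advertises an absolute number of buffer blocks it has granted, and the sender may transmit only while its consumed count stays below that limit.

/* Toy model of absolute, credit-based link-level flow control. */
struct link_credits {
    unsigned advertised;   /* total blocks the receiver has ever granted */
    unsigned consumed;     /* total blocks the sender has transmitted    */
};

/* Sender side: may transmit one block only if credits remain. */
int can_send(const struct link_credits *lc)
{
    return lc->consumed < lc->advertised;
}

void send_block(struct link_credits *lc)
{
    if (can_send(lc))
        lc->consumed++;                /* uses up one credit */
}

/* Receiver side: as receive buffers are freed, grant additional absolute credit. */
void grant_credits(struct link_credits *lc, unsigned freed_blocks)
{
    lc->advertised += freed_blocks;
}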
Network Layer Features IB routing: IB network-layer routing is similar to the link-layer switching, except for two primary differences. First, network-layer routing relies on a global routing header (GRH) that allows packets to be routed between subnets. Second, VL (the virtual lane reserved for management traffic) is not respected by IB routers since management traffic stays within the subnet and is never routed across subnets. Flow labels: Flow labels identify a single flow of packets. For example, these can identify packets belonging to a single connection that need to be delivered in order, thus allowing routers to optimize traffic routing when
using multiple paths for communication. IB routers are allowed to change flow labels as needed, for example, to distinguish two different flows which have been given the same label. However, during such relabeling, routers ensure that all packets corresponding to the same flow will have the same label.
InfiniBand. Fig. IB Transport Services (Courtesy InfiniBand Standard): the table compares the Reliable Connection, Reliable Datagram, Unreliable Connection, Unreliable Datagram, and Raw Datagram services with respect to scalability (number of QPs required per channel adapter), reliability (corrupt data detection, delivery, ordering, and loss guarantees, error recovery), support for RDMA and atomic operations, memory window binding, multicast support, message size, and connection orientation
Transport Layer Features IB transport services: IBA defines four types of transport services (reliable connection, reliable datagram, unreliable connection, and unreliable datagram) and two types of raw communication services (raw IPv datagram and raw Ethertype datagram) to allow for
encapsulation of non-IBA protocols. The transport service describes the degree of reliability and how the data is communicated. As illustrated in Fig. , each transport service provides a trade-off in the amount of resources used and the capabilities provided. For example, while the reliable connection provides the most features, it has to establish a QP for each peer process it communicates with, thus requiring a quadratically increasing number of QPs with the number of processes. Reliable and unreliable datagram services, on the other hand, utilize fewer resources since a single QP can be used to communicate with all peer processes. Automatic path migration: The Automatic Path Migration (APM) feature of InfiniBand allows a connection end point to specify a fallback path, together with a primary path, for a data connection. All data is initially sent over the primary path. However, if the primary path starts throwing excessive errors or is heavily loaded, the hardware can automatically migrate the connection to the new path, after which all data is only sent over the new path. Message-level flow control: Together with the link-level flow control, for reliable connection, IBA also defines an end-to-end message-level flow-control mechanism. This message-level flow control does not deal with the number of bytes of data being communicated, but rather with the number of messages being communicated. Specifically, a sender is only allowed to send as many messages that use up receive WQEs as there are receive WQEs posted by the consumer. That is, if the consumer has posted receive WQEs, after the sender sends out send or RDMA write with immediate data messages, the next message is not communicated till the receiver posts another receive WQE. Shared Receive Queue (SRQ): Introduced in the InfiniBand . specification, shared receive queues (SRQs) were added to help address scalability issues with InfiniBand memory usage. When using the RC transport of InfiniBand, one QP is required per communicating peer. Preposting receives on each QP, however, can have very high memory requirements for communication buffers. To give an example, consider a fully connected MPI job of K processes. Each process in the job will require K - QPs, each with n buffers of size s posted to it. Given a conservative setting of n = and s = KB, over MB of memory per process
would be required simply for communication buffers that may not be used. Given that current InfiniBand clusters now reach K processes, maximum memory usage would potentially be over GB per process in that configuration. Recognizing that such buffers could be pooled, SRQ support was added, so instead of connecting a QP to a dedicated RQ, buffers could be shared across QPs. In this method, a smaller pool can be allocated and then refilled as needed instead of pre-posting on each connection. Extended Reliable Connection (XRC): eXtended reliable connection (XRC) provides the services of the RC transport, but defines a very different connection model and method for determining data placement on the receiver in channel semantics. This mode was mostly designed to address multi-core clusters and lower memory consumption of QPs. For one process to communicate with another over RC, each side of communication must have a dedicated QP for the other. There is no distinction as to the node in terms of allowing communication. In XRC, a process no longer needs to have a QP to every process on a remote node. Instead, once one QP to a node has been set up, messages can be routed to the other processes by giving the address/number of a shared receive queue (SRQ). In this model, the number of QPs required is based on the number of nodes in the job rather than the number of processes.
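A minimal sketch of the SRQ pooling idea, again using the OpenFabrics verbs API rather than anything mandated by the IBA specification: one shared receive queue is created and then referenced by every RC QP at creation time, so receive buffers are posted once to the shared pool (with ibv_post_srq_recv) instead of per connection. The protection domain pd and completion queue cq are assumed to exist already, and the queue sizes are arbitrary illustrative values.

#include <infiniband/verbs.h>

/* Create one SRQ and an RC QP that draws its receives from it; the same
 * srq pointer would be passed to every subsequent QP created per peer. */
struct ibv_qp *create_rc_qp_with_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                                     struct ibv_srq **srq_out)
{
    struct ibv_srq_init_attr srq_attr = {
        .attr = { .max_wr = 4096, .max_sge = 1 },   /* size of the shared buffer pool */
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);
    if (!srq)
        return NULL;

    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .srq     = srq,               /* receives are taken from the shared queue */
        .qp_type = IBV_QPT_RC,
        .cap     = { .max_send_wr = 128, .max_send_sge = 1 },
    };
    *srq_out = srq;
    return ibv_create_qp(pd, &qp_attr);
}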
Management and Services Together with the regular communication capabilities, IBA also defines elaborate management semantics and several services. The basic management messaging is handled through special packets called management datagrams (MADs). The IBA management model defines several management classes, as described below. Subnet Management: The subnet management class deals with discovering, initializing, and maintaining an IB subnet. It also deals with interfacing with diagnostic frameworks for handling subnet and protocol errors. For subnet management, each IB subnet has at least one subnet manager (SM). An SM can be a collection of multiple processing elements, though they appear as a single logical management entity. These processes can internally make management decisions, for example, to manage large networks in a distributed fashion. Each
channel adapter, switch, and router, also maintains a subnet management agent (SMA) that interacts with the SM and handles management of the specific device on which it resides as directed by the SM. An SMA can be viewed as a worker process, though in practice it could be implemented as hardware logic. Subnet Administration: Subnet administration (SA) deals with providing consumers with access to information related to the subnet through the subnet management interface (SMI). Most of this information is collected by the SM, and hence the SA works very closely with the SM. In fact, in most implementations, a single logical SM processing unit performs the tasks of both the SM and the SA. The SA typically provides information that cannot be locally computed to consumer processes (such as data paths, SL-to-VL mappings, and partitioning information). Further, it also deals with inter-SM management such as handling standby SMs. Communication Management: Communication management deals with the protocols and mechanisms used to establish, maintain, and release channels for RC, UC, and RD transport services. At creation, QPs are not ready for communication. The CM infrastructure is responsible for preparing the QPs for communication by exchanging appropriate information. Performance Management: The performance management class allows performance management entities to retrieve performance and error statistics from IB components. There are two classes of statistics: (a) mandatory statistics that have to be supported by all IB devices and (b) optional statistics that are specified to allow for future standardization. There are several mandatory statistics supported by current IB devices including the amount of bytes/packets sent and received, transmit queue depth at various intervals, number of ticks during which the port had data to send but had no flow-control credits, etc. Device Management: Device management is an optional class that mainly focuses on devices that do not directly connect to the IB fabric, such as I/O devices and I/O controllers. An I/O unit (IOU) containing one or more IOCs is attached to the fabric using a channel adapter. The channel adapter is responsible for receiving packets from the fabric and delivering them to the relevant devices and vice versa. The device management class does not deal with directly managing the end
I/O devices, but rather focuses on the communication with the channel adapter. Any other communication between the adapter and the I/O devices is unspecified by the IB standard and depends on the device vendor.
InfiniBand Today To achieve high performance, adapters are continually updated with the latest in speeds (SDR, DDR, and QDR) and I/O interface technology. Earlier InfiniBand adapters supported only the PCI-X I/O interface technology, but have now moved to PCI-Express (x and x) and HyperTransport. More recently, QDR cards have moved to the PCI-Express . standard to further increase I/O bandwidth. Although InfiniBand adapters have traditionally been add-on components in expansion slots, they have recently started moving onto the motherboard. Many boards, including some designed by Tyan and SuperMicro, include adapters directly on the motherboard. In , Mellanox Technologies released the fourth generation of their adapter, called ConnectX. This was the first InfiniBand adapter to support QDR speed. This adapter, in addition to increasing speeds and lowering latency, included support for Gigabit Ethernet. Depending on the cable connected to the adapter port, it performs either as an Ethernet adapter or as an InfiniBand adapter. The current version of the ConnectX adapter, ConnectX-, supports advanced offloading of arbitrary lists of send, receive, and wait tasks to the network interface. This enables network-offloaded collective operations, which are critical to scaling messaging libraries to very large systems. The IBA specification does not specify the exact software interface through which hardware is accessed. OpenFabrics is an industry consortium that has specified an interface through which IB hardware can be accessed. It also includes support for other RDMA-capable interconnects, such as iWARP. A few current usage environments for IB are mentioned below: Message passing interface: MVAPICH [] and MVAPICH [] are implementations of MPI over InfiniBand. They are based on MPICH [] and MPICH [], respectively. As of January , these stacks are used by more than , organizations in countries around
the world. There have been more than , downloads from the Ohio State University site directly. These stacks are empowering many powerful computers, including “Ranger,” which is the sixth-ranked cluster according to the November TOP rankings. It has , cores and is located at the Texas Advanced Computing Center (TACC). MVAPICH/MVAPICH work with many existing IB software layers, including the OpenFabrics Gen interface. These software stacks are also available in an integrated manner with the software stacks of many other InfiniBand vendors, server vendors, and Linux distributors (such as RedHat and SuSE). The Open MPI project [] is yet another implementation of MPI. Open MPI also works on InfiniBand using the OpenFabrics interface and is widely used. In addition to the open-source implementations of MPI, there are several commercial implementations of MPI, such as Intel MPI and Platform MPI. File systems: Modern parallel applications use terabytes and petabytes of data. Thus, parallel file systems are significant components in designing large-scale clusters. Multiple features of IB (low latency, high bandwidth, and low CPU utilization) are quite attractive for designing such parallel file systems. There are implementations of widely used parallel file systems such as Lustre [] and the Parallel Virtual File System [] over InfiniBand using the OpenFabrics interface. Data centers: Together with scientific computing domains such as MPI and parallel file systems, IB is also making significant inroads into enterprise computing environments such as web data centers and financial markets. There are two ways in which sockets-based applications can execute on top of IB. In the first approach, IB provides a simple driver (known as IPoIB) that emulates the functionality of Ethernet on top of IB. This allows the kernel-based TCP/IP stack to communicate over IB. However, this approach does not utilize any of the advanced features of IB and thus is restricted in the performance it can achieve. The second approach utilizes the Sockets Direct Protocol (SDP) []. SDP is a byte-stream transport protocol that closely mimics the stream semantics of TCP sockets. It is an industry-standard specification for IB that utilizes advanced capabilities provided by the network stack (such as the hardware-offloaded protocol stack and RDMA) to achieve high
performance without requiring modifications to existing sockets-based applications. Recently, companies such as RNA Networks are providing memory virtualization using IB. These products are based on the core concepts of RDMA and OS-bypass protocols.
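MPI libraries such as MVAPICH or Open MPI hide all of these interconnect details from the application; a minimal program like the sketch below runs unchanged over InfiniBand, with the library mapping the point-to-point calls onto IB send/receive or RDMA operations internally.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* carried over IB transparently */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}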
Future Directions The strong features and mechanisms of IB, together with continuous advancements in IB hardware and software products, are leading to its adoption in many high-end clusters and parallel file systems. In less than years since its introduction, the open standard IB already claims more than % share of the TOP systems. This market share is expected to grow significantly during the next few years. Multiple parallel file systems, storage systems, database systems, and visualization systems are also using InfiniBand. As indicated in this entry, InfiniBand is also gradually moving into campus backbones and WANs. This is opening up multiple new applications and environments to utilize InfiniBand. IB x QDR products are currently available in the market as commodity components. These are being used in modern clusters with multi-core nodes (– cores per node). The IB roadmap also indicates the availability of FDR (Fourteen Data Rate) products during the time frame. The FDR technology promises to use / bit encoding and deliver a Gbps data rate. The next-generation IB products with EDR (Enhanced Data Rate) technology, supporting a Gbps data rate, are planned for . The future IB technologies promise to enable high-performance, scalable, and balanced systems with next-generation multi-core nodes (– cores per node) and accelerate the field of high-end computing.
Bibliographic Notes and Further Reading Open-standard high-performance networking technologies, such as InfiniBand, have spurred a lot of interest among researchers. This is an active area of research. There are several sub-domains: scalability and performance, routing, fault tolerance, design of file systems, etc. The reader is encouraged to peruse [–] for more details on scalability and performance. Routing and
management are covered in [–]. Fault-tolerance issues are covered in [, ].
Bibliography . Becker DJ et al () Beowulf: a parallel workstation for scientific computation. In: International parallel processing symposium, Honolulu, . InfiniBand Trade Association. InfiniBand Architecture Specification, vol , Release .. http://www.infinibandta.com . Network-Based Computing Laboratory. MVAPICH: MPI for InfiniBand, GigE/iWARP and RoCEE. http://mvapich.cse.ohio-state.edu/ . Liu J, Jiang W, Wyckoff P, Panda DK, Ashton D, Buntinas D, Gropp W, Toonen B () Design and implementation of MPICH over InfiniBand with RDMA support. In: Proceedings of international parallel and distributed processing symposium (IPDPS ’), Santa Fe, April . Gropp W, Lusk E, Doss N, Skjellum A () A high-performance, portable implementation of the MPI message passing interface standard. Technical report, Argonne National Laboratory and Mississippi State University . Liu J, Jiang W, Wyckoff P, Panda DK, Ashton D, Buntinas D, Gropp W, Toonen B () High performance implementation of MPICH over InfiniBand with RDMA support. In: IPDPS, Santa Fe, . Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, Sahay V, Kambadur P, Barrett B, Lumsdaine A, Castain RH, Daniel DJ, Graham RL, Woodall TS () Open MPI: goals, concept, and design of a next generation MPI implementation. In: Proceedings of the th European PVM/MPI users’ group meeting, Budapest, . Cluster File System Inc. Lustre: scalable clustered object storage. http://www.lustre.org/ . PVFS Developer Team. Parallel Virtual File System, Version . http://www.pvfs.org/pvfs/, November . SDP Specification. http://www.rdmaconsortium.org/home . Sur S, Chai L, Jin H-W, Panda DK () Shared receive queue based scalable MPI design for InfiniBand clusters. In: International parallel and distributed processing symposium (IPDPS), Rhodes Island . Sur S, Chai L, Jin H-W, Panda DK () RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits. In: International symposium on principles and practice of parallel programming (PPoPP), New York . Koop M, Kumar R, Panda DK () Can software reliability outperform hardware reliability on high performance interconnects? A case study with MPI over InfiniBand. In: nd ACM international conference on supercomputing (ICS), June . Koop M, Sridhar J, Panda DK () Scalable MPI design over InfiniBand using eXtended reliable connection. In: IEEE international conference on cluster computing (Cluster ), Tsukuba, September
. Koop M, Jones T, Panda DK () Reducing connection memory requirements of MPI for InfiniBand clusters: a message coalescing approach. In: th IEEE international symposium on cluster computing and the grid (CCGrid), Rio de Janeiro, May . Liu J, Mamidala A, Panda DK () Fast and scalable collective operations using remote memory operations on VIA-based clusters. In: International parallel and distributed processing symposium (IPDPS ’), Nice, April . Skeie T, Lysne O, Theiss I () Layered shortest path (LASH) routing in irregular system area networks. In: Proceedings of the th international parallel and distributed processing symposium, IPDPS ’, Washington, DC, . IEEE Computer Society, pp . Vishnu A, Koop M, Moody A, Mamidala A, Narravula S, Panda DK () Hot-spot avoidance with multi-pathing over InfiniBand: an MPI perspective. In: th IEEE international symposium on cluster computing and the grid (CCGrid), Rio de Janeiro, May . Vishnu A, Mamidala AR, Panda DK () Performance modeling of subnet management on fat tree InfiniBand networks using OpenSM. In: Workshop on system management tools on large scale parallel systems, held in conjunction with IPDPS, Denver . Hoefler T, Schneider T, Lumsdaine A () Optimized routing for large-scale InfiniBand networks. In: th Annual IEEE symposium on high performance interconnects (HOTI ), New York, August . Gao Q, Yu W, Huang W, Panda DK () Application-transparent checkpoint/restart for MPI programs over InfiniBand. In: Proceedings of the international conference on parallel processing (ICPP), Columbus, August . Ouyang X, Gopalakrishnan K, Panda DK () Accelerating checkpoint operation by node-level write aggregation on multicore systems. In: Submitted to international symposium on cluster computing and the grid (CCGrid), May
Instant Replay Debugging
Instruction-Level Parallelism Instruction-level parallelism is the possibility of scheduling multiple machine instructions from a single program so that their executions overlap in time. Instruction-level parallelism was exploited by some early machines such as the CDC and IBM /.
Today, superscalar, EPIC, and VLIW machines take advantage of this form of parallelism.
Related Entries
Branch Predictors
Control Data 
EPIC Processors
HPS Microarchitecture
IBM Power Architecture
IBM System/ Model 
Intel Core Microarchitecture, x Processor Family
Modulo Scheduling and Loop Pipelining
Multiflow Computer
Superscalar Processors
VLIW Processors
Instruction Systolic Arrays
Systolic Arrays
Intel Celeron Intel Core Microarchitecture, x Processor Family
Intel Core Microarchitecture, x Processor Family Josef Weidendorfer Technische Universität München, München, Germany
Synonyms Core-Duo / Core-Quad Processors; Pentium; Intel Celeron
Definition The Intel Core Microarchitecture is the name of the internal architecture of multicore processor implementations from Intel Corporation introduced in . The processing cores of the Core Microarchitecture use sophisticated ways to get the highest level of Instruction Level Parallelism (ILP) from sequential instruction streams, using out-of-order, superscalar pipelining. The on-chip memory hierarchy employs multiple levels of inclusive caches, uses different kinds of hardware prefetching techniques, and is partly shared among the cores on the chip. These processors belong to the family of x-compatible processors, originating in the processor from Intel in . Their Instruction Set Architecture (ISA) – i.e., the programming interface – is classified as CISC (Complex Instruction Set Computing), which means that the assembly code is compact but quite complex and irregular. The -bit and -bit extensions of the original x ISA are called IA- and Intel by Intel, respectively. Having started their success in the Personal Computer (PC) from IBM, processors with these ISAs dominate today in desktop and laptop machines, but are also used heavily in the server market as well as in HPC systems.
Discussion
This entry describes Intel's Core Microarchitecture as an example of a machine with Instruction Level Parallelism (ILP); that is, it focuses on how parallelism is exploited in a processor core of this micro-architecture, executing a single sequential instruction stream. For a detailed description of the techniques used, as well as more about these processors when embedded into shared-memory multiprocessors, see the related entries mentioned in the corresponding section at the end of this entry. Before the description of the micro-architecture, a short overview of microprocessor components, ILP techniques, and the x ISA is given as a basis. Note that the term “processor architecture” itself is used in different ways, one being the pure programming interface (i.e., the ISA), and the other describing the inner workings of the processor, i.e., the micro-architecture.
Introduction Microprocessor components Processor cores of Intel's x family, like almost any processor core nowadays, work according to the von-Neumann architecture: the main components are a control unit, an arithmetic logic unit (ALU), memory, and
an I/O subsystem as well as connections between all components. This can be seen in Fig. .
Intel Core Microarchitecture, x Processor Family. Fig. Components of a von-Neumann architecture: control unit, arithmetic logic unit, memory, and I/O
The first two make up a Central Processing Unit (CPU). The memory consists of a one-dimensional array of equal-sized memory cells, which hold both data used in computation done by the ALU and program code to be executed by the control unit. This makes it possible to treat program code as data, thus vastly simplifying mechanisms needed, for example, to load/store data and code from/to I/O devices. It also allows programs to generate new code, or modify their code. However, as memory regularly needs to be accessed both from the control unit as well as the ALU, this becomes a potential bottleneck. In fact, memory performance (access latency and bandwidth) is a limiting factor for the performance of modern processors (see entry Memory Wall; this is also called the von-Neumann bottleneck). The reason is that memory on the one hand, and control unit/ALU on the other hand, are manufactured as separate chips, such that memory accesses have to cross chip boundaries. This is done for more flexibility, for example, enabling the use of different manufacturing technologies. In the von-Neumann architecture, a program is executed by using an instruction pointer. The instruction at the memory location pointed to by the instruction pointer is fetched from memory, and afterward interpreted by the control unit. For every instruction type, the control unit has to know the exact sequence of actions to perform, be it the execution of a calculation via the ALU, possibly involving loading/storing data from memory, or directly influencing the instruction pointer, which otherwise is just incremented to point at the next instruction. As can be seen from the above description, fast program execution requires fast memory access. This can be improved by not forcing every memory access to reach out to the off-chip memory module, which
is relatively far away. Instead, buffers are introduced which are able to hold copies of memory cells. The fastest buffers are called the register file, and are accessed explicitly by mentioning register numbers in the instructions, with additional instructions to move data between memory and registers. Further, larger on-chip buffers are added, so-called caches, which are able to transparently hold copies of memory cells. As the access time of such a buffer grows with its size, it is beneficial to have a hierarchy of caches. Together with register files on the one end and main memory on the other end, this is called the memory hierarchy. As memory accesses for fetching instructions are different from other accesses, it is beneficial to provide separate buffers for these uses, that is, to have separate instruction and data caches on the first level. On the second, larger level, a unified cache is used. One way to make a processor run faster is increasing the clock frequency; another one is to do multiple things simultaneously. Regarding the latter, from the point of view of a programmer, the worst way of doing things simultaneously is the requirement to explicitly specify what can be done in parallel and what cannot. Therefore, processor architects first try to exploit parallelism which is implicitly available in a single sequential instruction stream. An easy way is to increase bit-level parallelism: as long as it is useful for programs without wasting too many resources, it is best to work with data paths as wide as possible. This is reflected by the x ISA starting as a -bit architecture with the , going to -bit with the , and recently to -bit. The respective bit-width is used as the register width and the width of signal paths for data and memory addresses. A similar form is data-level parallelism (DLP): if multiple data items can be worked on in the same way, they can be packed into vector registers, with multiple calculations done synchronously in one clock cycle in enhanced ALUs, following the SIMD (Single Instruction Multiple Data) concept. The corresponding x ISA enhancements are called MMX and SSE (see Vector Extensions, Instruction-Set Architecture (ISA)). Another type of parallelism possible with one instruction stream is called Instruction Level Parallelism (ILP). It exploits the fact that sub-actions of one or multiple instructions (if there are no data dependencies) can
be done at the same time. These techniques are described in the next section in more detail, as this is needed to understand the micro-architecture of modern processors. The last type of parallelism is called Thread Level Parallelism (TLP), where multiple instruction streams work in parallel to increase the performance of a program. This needs special support at different levels: hardware must be added to allow multiple processors/processor cores to be put together; for example, special care is needed to keep the multiple copies found in different caches consistent. Instructions for synchronization and communication among the processors have to be added. Also, the software, starting with the Operating System, needs to be able to control multiple streams of execution. Finally, correctness of a parallel program is far more difficult to check. Therefore, on smaller systems with only one processor chip, for a long time, only execution of a single instruction stream at a time was supported. However, after exploiting DLP/ILP to the maximum, only increasing the clock frequency remained available for improving performance. From the beginning of microprocessor technology, a constant development of miniaturization could be observed. This observation manifested in the so-called Moore's Law: every months, the transistor budget available for use on one chip doubles. Astonishingly, this “law” holds true up to now and into the foreseeable future. Processor manufacturers were able to extract ever-increasing computing power from the exponentially growing transistor budget, by enlarging caches and increasing bit- and data-level parallelism as well as ILP to the maximum (the latter making the control units extremely complex). At the same time, miniaturization allowed the clock frequency to be increased. However, this came to a halt in recent years. The high clock rates reached, together with the small structure widths, result in a lot of energy lost because of leakage current in the chip, and thus require expensive cooling. To be able to keep selling processors, the only way for processor manufacturers was to add TLP inside one chip while keeping the clock rate constant. This was the beginning of the multicore era, with the need to feed current processor chips with parallel programs in the hope of still getting exponential performance increases for programs out of Moore's law. Thus, modern processors consist of multiple control/execution units working in parallel. Yet,
the cores can share resources, the most obvious being caches.
Instruction Level Parallelism As written before, Instruction Level Parallelism (ILP) exploits the fact that sub-actions of one or multiple consecutive instructions in one sequential instruction stream can be executed simultaneously. The most important principle is that instructions can be subdivided into sub-actions, also called stages. Each instruction stage is executed in one clock cycle by different hardware. While one instruction “flows” through the different hardware stages (the so-called pipeline), multiple instructions can be in execution at the same time, as shown in Fig. . Typical stages are IF (instruction fetch), ID (instruction decode), MA (memory access), EX (execute calculation), and WB (write back). For a given chip manufacturing technology (typically identified by a structure width metric such as nm, nm, or nm), the time needed for a signal to reliably bypass a gate level is fixed. Smaller stages need fewer successive gate levels inside; therefore, the clock rate can be higher. Yet, the pipeline design makes sure that at every clock cycle one instruction will finish execution. Thus, by increasing the number of stages (with every stage responsible for simpler sub-actions), the performance is increased. However, the number of stages is limited by a few factors: simplifications cannot be done indefinitely, but more importantly, different conflict cases can appear: ●
Resource conflicts happen when hardware resources shared by multiple stages need to be accessed at the same time, such as access paths to registers/caches. Then, the later instructions have to wait. This can be overcome by duplicating resources, for example, by adding multiple access ports to register files.
● Data conflicts happen when there is a data dependence between instructions, forcing the later instruction to stall in the pipeline as long as a required result is not yet available. To avoid this, a compiler should schedule instructions with data dependencies as far away from each other as possible. Anti-dependencies can be avoided by using temporary (shadow) registers. Another possibility is to allow instructions to be reordered, that is, instructions to overtake each other in the pipeline (so-called out-of-order execution; see the code sketch below). Aside from the need for complex dependence-checking hardware, a final stage needs to be added to get instruction retirement in correct order again.
● Control conflicts arise when the next instruction to be executed is not known. This happens with conditional branches, where the condition is yet to be calculated. Thus, the pipeline cannot be kept filled. To overcome this, the compiler should schedule the calculation of branch conditions well before the branch. Another solution is to predict the condition, and to speculatively execute instructions following the branch. If the prediction is wrong, the results of the speculatively executed instructions have to be thrown away. To minimize this problem, sophisticated branch prediction algorithms are used, which store the history of previous branch decisions in a so-called branch target buffer (BTB).
Intel Core Microarchitecture, x Processor Family. Fig. Pipelined execution of instructions: consecutive instructions overlap in the IF, ID, MA, EX, and WB stages
I
there are not enough ALUs of the needed type available for the instructions in flight. With this knowledge, the components of the Core pipeline in the later section are easy to explain. For more details on pipelining, see the corresponding entry.
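As a small illustration of how data conflicts limit the ILP that a superscalar, out-of-order core can extract (a generic C sketch, not tied to any particular compiler or issue width): in the first loop every addition depends on the previous one, forming a single dependence chain, while the second loop keeps four independent accumulators whose additions can be issued to different ALUs in the same cycle.

/* Single dependence chain: each add must wait for the previous result. */
double sum_serial(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: the out-of-order core can issue these adds
 * to different ALUs in the same cycle (n assumed to be a multiple of 4). */
double sum_ilp(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}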
CISC Versus RISC The programming interface of a processor is called its Instruction Set Architecture (ISA). The ISA of x-compatible processors is classified as CISC (Complex Instruction Set Computing), which means that the assembly code is compact but quite complex and irregular. When the first x processor was developed by Intel, this was the standard way: CISC is found in the Motorola , the Zilog Z, and in IBM mainframe CPUs of that time. A compact ISA was a benefit, as memory was expensive. As programmers often used assembly language, instructions for specific use cases and complex addressing modes made sense. Later, in the s, compilers did not make use of such complex assembly instructions, and thus, processors with simpler ISAs were designed, reducing the resource consumption of the decoding step. Such designs belong to the RISC (Reduced Instruction Set Computing) idea. They use regular instructions of the same size, have load/store instructions as the only means to access memory, and work with large register files. Examples are the PowerPC, MIPS, SPARC, and ARM processor designs. Traditionally, each CISC instruction was executed by running a microcode program. Starting with the Intel Pentium (see below), instructions are decoded into a sequence of microoperations, similar to RISC instructions. Therefore, aside from the added cost of a decoding step, there is not much difference between CISC and RISC implementations. However, the compactness of CISC code reduces memory pressure, and allows for smaller L instruction caches.
History of x Processors In , Intel introduced its processor, a -bit architecture meant to compete with the Z processor famous at that time. Clocking at MHz, it consisted of around , transistors, and was able to address MB of external memory by adding -bit offsets to special segment registers providing the upper bits. The had seven general-purpose registers plus the stack
pointer, a flag register, and the segment registers. Using an extended version called , IBM sold its first PC in with an Intel processor. While the processor was only able to do integer calculations, it was possible to add the floating point coprocessor . A significant enhancement to the x family happened in with the introduction of the i processor, which was clocked at MHz, added a -bit execution mode, enlarged registers, and, most importantly, introduced virtual memory with -bit flat address spaces. The successor i integrated the floating point unit. Intel's x processor success strongly relates to the early success of the component-based design of the IBM PC, resulting in a mass market for x-compatible processors, and thus also competition. To make it more difficult for competitors to place their products, Intel registered the trademark “Pentium” as the name of the next processor update in . As the first in the x family, this processor (“P”) added kB instruction and kB data caches on-chip, and allowed for an off-chip L cache. Further, a superscalar pipeline design was introduced, with two instruction decoders and two ALUs. To reduce pipeline stalls, it introduced a branch prediction unit. A later version, the “Pentium MMX,” extended the x ISA with data-level parallelism, using vector registers and operations working on bytes at once. This parallelism has to be found by the compiler, and a correct data layout needs to be used. Yet, this technique proved successful for a set of applications, and is still being extended (Intel recently announced x processors supporting vector operations with -byte width, marketed as LNI: Larrabee New Instructions). Similar instructions can, for example, be found in the PowerPC (called AltiVec) or SPARC (VIS).
with modules for kB up to MB of L caches. In the following years, the P architecture got successive improvements - partly triggered by competition from the company AMD: the “Pentium-III” implementation had improved, -byte wide vector support (SSE), onchip L caches, and almost reached GHz clock rate in . At this time, the clock rate was a selling point for x compatible processors. Competitor AMD had a small advantage in this regard. Therefore, the next micro-architecture from Intel, Netburst, was designed to reach extremely high clock rates. While forcing L cache sizes back to of kB again, more importantly, pipeline stages had to be simplified, resulting in stage numbers of up to in later implementations. However, performance often does not increase accordingly because of much higher probability of pipeline conflicts, and thus pipeline stalls. Netburst has integer and floating point ALUs. However, together with the extremely deep pipeline, one instruction stream rarely can exploit these execution units. As important enhancement, Intel integrated, for the first time, a form of thread-level parallelism (TLP) into its processors. Simultaneous Multithreading (SMT), by Intel called Hyperthreading, keeps two instruction stream contexts with own register sets for one superscalar pipeline. The pipeline is filled with instructions from both streams, effectively reducing pipeline conflicts and enhancing utilization of the ALUs. Intel promised a performance increase up to %. However, with a big drawback: programs need to be parallelized. An interesting enhancement of Netburst is the Trace Cache, an instruction cache containing up to k decoded microoperations. Further, the L size was improved up to MB. An important addition to the ISA in the later implementations of Netburst was actually only in reaction to competitor AMD: Intel had to introduce a -bit extension, compatible to the -bit extension of AMD processors. The problem was that Intel originally intended to reserve -bit for its new Itanium architecture for the high-end server and HPC market. With x processors supporting -bit, customers could choose the cheaper platform for their needs. To emphasize intentions, the Itanium ISA was called IA- by Intel, and the -bit x-ISA was renamed IA- at that time. After first naming the -bit x extension EMT, then IA-e, it is now simply called “Intel ” to avoid confusion.
In some ways, Hyperthreading allowed Intel to estimate the acceptance of TLP on a chip, and it led to increased interest in multithreaded programming and in language extensions such as OpenMP. On the other hand, the race for the highest clock rates showed that limits had been reached: together with the small structure sizes of the manufacturing technology, leakage effects raised power consumption and heat dissipation, requiring expensive cooling techniques. The solution is to cut the clock rate and add more processor cores on a chip instead. The last implementation of Netburst, the "Pentium D," was built as a dual-core, combining two single-core dies in one package. The multicore era for x86-compatible processors had started.
The Core Microarchitecture After the architectural problems of Netburst, the following micro-architecture from Intel, just called "Core," is explicitly designed as the basis for multicore processors. For a performance increase in contrast to the previous single-core processors, the clock rate can be reduced when multiple cores are used. This allows for more complex pipeline stages than found in Netburst, and thus a reduced pipeline depth: the Core architecture goes back to 14 stages, resulting in reduced potential for pipeline stalls and increased performance compared to Netburst at the same clock rate. Intel used its successful P6 architecture as the base for Core. The first implementations, called "Core 2," were introduced in 2006, first as dual-core, later as quad-core designs. The quad-core versions using the Core Microarchitecture can always be seen as a coupling of two dual-core designs on one chip. Later processor implementations for servers (Xeon) contain a cache level shared by all four cores.
Intel Core Microarchitecture, x86 Processor Family. Fig. Block diagram of the Core Microarchitecture
Pipeline Structure A simplified version of the pipeline of the Core Microarchitecture with its stages is shown in the figure. The instruction fetch stage is able to fetch 16 bytes per cycle, filled from a 32 kB instruction cache, and assisted by branch prediction. The prediction uses a branch target buffer (BTB), which holds the history of recent branch decisions indexed by instruction address.
Further, to accurately predict function returns, it has a Return Stack Buffer (RSB). The prefetched bytes are put into the predecoding stage, where the lengths of instructions are determined and any prefix bytes are decoded (the need for two-phase decoding shows the level of complexity the x86 ISA has reached). Predecoded instructions are put into an instruction queue. From there, four decoders can decode up to five original x86 instructions per cycle. This is possible because some often-used two-instruction sequences are recognized at once (called Macro-Fusion). One of the decoders is able to handle complex instructions by looking up corresponding sequences of operations in a microcode table. The generated sequence of microoperations (μOps) is checked for the possibility of so-called μOp fusion: some primitive sequences can be joined to form more complex microoperations, which saves bandwidth in the pipeline. An example would be a comparison operation directly followed by a branch. After decoding into microoperations, ISA registers are mapped to physical registers (register renaming/allocation). This reduces data dependencies (such as write-after-write hazards) and allows more ILP to be exploited by out-of-order execution. The following stage is the reorder buffer, where dependencies are checked; here, operations are allowed to overtake each other. The next stage, the scheduler, is able to dispatch up to six microoperations per cycle. It feeds three ALUs and another three execution units responsible for memory access. All ALUs can handle 128-bit vector operations at a throughput of one result per clock cycle. At most, this allows eight floating-point operations to be executed in one cycle (via SSE instructions). The results of the ALUs lead to notifications back to the reorder buffer, allowing waiting operations to be dispatched by the scheduler according to the previously calculated data dependencies. All loads and stores to memory first pass a store buffer before the memory access operations are retired. The store buffer allows loads to directly use the results of store instructions, and it allows multiple stores to nearby memory cells to be combined, thus increasing the effective bandwidth to the L1 data cache. Finally, the retirement stage is able to complete the execution of up to four microoperations per cycle by retiring them in the program order of the original instructions.
Cache Subsystem The figure below shows the memory hierarchies used in implementations of the Core Microarchitecture, including the cache subsystem. The Core Microarchitecture has separate L1 instruction and data caches, each with a size of 32 kB. Separating the L1 caches by use is important to reduce access conflicts, and it also allows the caches to be tuned for the different access behaviors. The unified L2 cache is between 2 and 6 MB in size, serving the L1 caches of two cores. With a quad-core design, this means that there are two L2 caches, as can be seen in the figure. For best exploitation, each L1 as well as L2 cache has its own prefetch units, with all load and store operations between cores and L1 caches done at 128-bit width. Both the L1 data cache and the L2 cache use a write-back scheme; that is, written data is kept in the fastest cache as long as possible, reducing store transactions. The L1 data caches each employ two kinds of prefetchers: one to detect streaming accesses, the other to detect strided accesses by single instructions, that is, access behavior is detected based on the instruction address. Similarly, the L2 caches have stream prefetchers. In addition, the x86 ISA (as part of the instructions added with SSE) allows for software prefetching, that is, load instructions whose only effect is to bring data into the caches. Together, hardware and software prefetching can prove quite effective at reducing cache misses. The memory system of multiple processors with the Core architecture uses a bus to access memory (the so-called Frontside Bus, which is Intel proprietary). A chipset attached to the bus contains memory controllers talking to the memory modules. Thus, all cores in the system share the bus. In server systems, more expensive chipsets allow point-to-point connections to each processor chip; this way, multiple processors can access memory at the same time. To assure cache coherence, that is, to make sure that processor cores do not use outdated data because of modified data in other caches, a cache coherency protocol has to be used when there are multiple caches at the same level. This is done on the one hand by observing bus transactions done by other processors ("snooping" the passing traffic), but, because of write-back, also by asking other cores at load time whether they have modified copies. On the level of a multiprocessor system with one Frontside Bus, the so-called MESI protocol is used.
Intel Core Microarchitecture, x86 Processor Family. Fig. Memory hierarchies used in implementations of the Core Microarchitecture (dual-core: two cores, each with 32 kB L1 instruction and data caches, sharing one 2-6 MB L2 cache and a bus interface; quad-core: two such pairs, each with its own L2 cache and bus interface, attached to the Frontside Bus)
The four letters of MESI stand for the four possible states of a cache line; the protocol belongs to the class of snooping protocols. However, in servers with point-to-point connections to the chipset, all bus transactions usually need to be broadcast to allow snooping, which reduces the benefit of those connections. Such systems employ so-called snoop filters, which reduce this coherency traffic in certain situations.
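To make the software prefetching mentioned above concrete, the following sketch (illustrative only; the node layout and prefetch hint are assumptions) issues an SSE prefetch for the next element while traversing a linked list, a pattern that hardware prefetchers cannot predict well.

#include <xmmintrin.h>   // _mm_prefetch

struct Node {
    Node* next;
    double value;
};

double sum_list(const Node* n) {
    double sum = 0.0;
    while (n) {
        if (n->next)
            // Ask the hardware to start loading the next node into the caches
            // while the current node is still being processed.
            _mm_prefetch(reinterpret_cast<const char*>(n->next), _MM_HINT_T0);
        sum += n->value;
        n = n->next;
    }
    return sum;
}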
New Developments The next micro-architecture from Intel is the Nehalem architecture. Its main changes are a large L3 cache shared among all four or eight cores on a processor chip, and a unified L2 cache that is now private to each core and reduced in size to 256 kB. Further, all cores (with their L1/L2 caches) can now be driven at different clock rates and put to sleep independently. The so-called Turbo mode even allows the clock rates of single cores to be raised when the thermal budget allows. In effect, this leads to faster program execution when only a few cores of a chip are actually used. Another significant change is the move of the memory controllers onto the chip, allowing main memory to be attached directly. Chips are coupled via a high-speed, point-to-point interconnect called QPI (QuickPath Interconnect). A consequence for multi-chip/socket systems is that the distance to memory differs, depending on which core accesses which memory. Up to
the Core Microarchitecture, Intel multiprocessor systems used a Frontside Bus, resulting in uniform memory access (UMA). With the introduction of Nehalem, multiprocessor systems now use nonuniform memory access (NUMA), which AMD systems had employed for quite some time. While local memory accesses are faster, remote accesses can become an obstacle to program performance, thus requiring new optimization techniques in programs and in the operating system. Also, the cache coherence protocol has to be adapted. Intel enhanced MESI with a fifth state (F), which allows cache lines to be forwarded directly to other chips without going through memory (this is similar to the fifth state of the MOESI protocol used by AMD). For some time, graphics processors have improved quite significantly in computation power, driven by 3D games. Recently, this power has been made available to programmers via GPGPU interfaces (General-Purpose computing on Graphics Processing Units). For their purposes, graphics processors contain huge arrays of ALUs, driven in clusters by control units. They can be seen as using an extreme version of the vector extensions of standard processors. Intel recently announced a processor with a large number of simplified cores, but with very wide vector units (64 bytes), named Larrabee. Its advantage is that each core is x86-compatible. However, as such a core is probably a lot slower than a current Nehalem core, the chip is built as a graphics processor. With more
transistor budget available, it can be expected that such simple but specialized cores with wider vector units will be mixed with more complex ones, resulting in heterogeneous multicore processors. This would also mean that graphics units move onto the same chip as the main processor cores, similar to the x87 floating-point coprocessor in i386 times.
Related Entries
Cache Coherence
Instruction-Level Parallelism
Memory Wall
Pipelining
Power Wall
Shared-Memory Multiprocessors
Superscalar Processors
Vector Extensions, Instruction-Set Architecture (ISA)
Bibliographic Notes and Further Reading
A good introductory book on computer architecture is []. A detailed, step-by-step introduction to computer architecture and microprocessor design, using the MIPS ISA, is given in []. From the same authors, [] provides more in-depth reading with practical evaluations of concepts. Good coverage of how to build large parallel machines, again with evaluation of concepts on practical applications, is provided in []. For coverage of current microprocessor trends, see []. The best reference on the Intel 64 and IA-32 instruction set architectures, including overviews of the micro-architectures, is the Software Developer's Manual together with the Optimization Reference Manual from Intel Corporation [, ], both available for free download from Intel's web site. For a more introductory text on microprocessors, especially also those from Intel, see []. On the Internet, a lot of technical information about x86 processors can be found at [], and articles about recent chips, for example, at [].
Bibliography
. Tanenbaum AS () Structured computer organization. Prentice Hall, Upper Saddle River, NJ, USA
. Hennessy JL, Patterson DA () Computer architecture: a quantitative approach. Morgan Kaufmann, San Francisco, CA, USA
. Patterson DA, Hennessy JL () Computer organization and design: the hardware/software interface. Morgan Kaufmann, Burlington, MA, USA
. Culler D, Singh JP, Gupta A () Parallel computer architecture: a hardware/software approach. Morgan Kaufmann, San Francisco, CA, USA
. Microprocessor report. In-Stat, Reed Business Information, Reed Elsevier. http://www.mdronline.com
. Intel 64 and IA-32 architectures software developer's manual, October
. Intel 64 and IA-32 architectures optimization reference manual, October
. Stokes J () Inside the machine: an illustrated introduction to microprocessors and computer architecture. No Starch Press, San Francisco, CA, USA
. http://www.sandpile.org
. http://arstechnica.com
Intel® Parallel Inspector Paul Petersen Intel Corporation, Champaign, IL, USA
Synonyms Data race detection; Deadlock detection; Illegal memory access
Definition
Intel Parallel Inspector is a product that integrates with Microsoft Visual Studio to provide support to software developers in analyzing parallel software for problems relating to the correct execution of an application.
Discussion Introduction Intel Parallel Inspector is designed as a component of the Intel Parallel Studio, a suite of software development tools to aid in the creation of parallel applications. Software development involves many different aspects from the creation of specifications, to writing the implementation code, testing, debugging, and tuning the resulting application. All of these aspects are true when writing any kind of software, with additional challenges arising when writing software which exploits concurrency via threading or parallelism. Parallel Inspector serves as an intelligent assistant to notify the developer about defects in the parallel application that could
cause incorrect behavior. In particular, Parallel Inspector focuses on a few core problems. These are misuse of synchronization which can result in data races or deadlocks, and incorrect memory accesses resulting in illegal or invalid memory references in parallel applications.
Requirements for Parallel Inspector
Intel Parallel Inspector works with the Microsoft Visual Studio development environment on the Windows XP, Vista, or Windows 7 operating systems. It can either be installed by itself or as part of the Parallel Studio bundle. Intel Parallel Inspector operates by analyzing the dynamic behavior of a native-code application via dynamic instrumentation. The typical application is written using the C or C++ language. It is preferable to analyze applications that are built with debugging symbols, to map the dynamic behavior to the source code. It is also helpful to tailor the inputs driving the test cases to be small and focused on the portions of the application that the developer wishes to analyze. Using the smallest inputs possible avoids redundant work during the analysis, where similar memory access patterns would otherwise be verified repeatedly.
Defects in Parallel Software The defining characteristic of threaded software is the potential for concurrent access to memory. To be correct, these memory accesses need to follow certain rules:
1. When reading or writing, the memory address must be valid.
2. When reading, the contents of the memory address must be initialized.
3. The memory access must be properly synchronized.
Each of these rules is examined in turn to see how Parallel Inspector can assist the software developer.
Invalid Memory Accesses First, consider whether a memory address is valid. In the memory address space of an application, different parts of the address space are allocated for various purposes. For example, part is allocated for the instructions comprising the code of the application, parts are allocated as the stacks for the threads, parts are allocated for the heaps used by the application, and parts are reserved by the operating system for its own usage. Typically, other regions of the application's address space are unallocated, and memory references to the unallocated regions of the address space are considered to be illegal. Similarly, requests to write into read-only pages are also illegal (such as attempting to write to the code pages).
Uninitialized Memory Accesses Second, consider whether it is legal to access a memory address. Writes to any valid memory address are legal even if the memory is uninitialized. However, reads from uninitialized memory addresses are illegal, as the value read may be nondeterministic and cause the behavior of the application to be undefined. A specific challenge for uninitialized memory access detection is reading from uninitialized memory a value that is ultimately not used. For example, a word containing 4 bytes might have only its lowest byte initialized. A common programming pattern is to load all 4 bytes via a word load instruction and then mask off all but the lowest byte. By throwing away the high bytes in the mask operation, the uninitialized memory cannot affect the behavior of the application, even though it was physically read. The same concept can apply to larger structures. Consider a structure containing a "char" variable followed by a "double" variable:
struct Ex {
    char c;
    double d;
};
Alignment rules may require that field "d" be aligned on an 8-byte boundary. This forces 7 padding bytes to be inserted after "c", yielding a structure of 16 bytes. The difficulty arises from structure assignment:
struct Ex x, y;
y.c = 'a';
y.d = 1.0;
...
x = y;
...
Often structure assignment is implemented by copying all 16 bytes, of which 7 are uninitialized padding bytes. Heuristics can usually filter these memory copy operations and avoid issuing false-positive reports.
Synchronization Usage The last category of problems relates to synchronization of memory accesses. The best case is when memory is not shared between threads, and thus no synchronization is needed. However, often this is not possible, as shared memory locations are needed for communication of data between threads. Three cases arise when considering synchronization:
1. Incorrect lock usage can cause deadlock.
2. Synchronization is not used, or not used consistently.
3. Synchronization is used, but at an incorrect granularity.
Deadlock Incorrect lock usage can expose the potential for a deadlock at runtime. A deadlock occurs when two or more threads each own a lock that another thread is attempting to acquire, and will not release the lock they hold until they acquire their next requested lock. This situation forms a cycle in the lock ownership/acquisition graph, as illustrated in the following example:
Thread 1:
AcquireLock( A )
AcquireLock( B )
...
Thread 2:
AcquireLock( B )
AcquireLock( A )
...
The lock acquisition graph for Thread 1 is A → B. For Thread 2 it is B → A. Merging these lock graphs results in the cycle A → B → A, which indicates the potential for deadlock in the application.
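The scenario can be reproduced with standard C++ threads, as in the following sketch (illustrative only; the mutex and thread names are arbitrary). A checker of this kind would likely report the opposite acquisition orders as a potential deadlock; acquiring both locks everywhere in one fixed order (or via std::lock) removes the cycle.

#include <mutex>
#include <thread>

std::mutex A, B;

void thread1() {
    std::lock_guard<std::mutex> a(A);   // acquires A, then B
    std::lock_guard<std::mutex> b(B);
    // ... work ...
}

void thread2() {
    std::lock_guard<std::mutex> b(B);   // acquires B, then A: opposite order
    std::lock_guard<std::mutex> a(A);
    // ... work ...
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
}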
Data Races The second case is that synchronization is not used, or that it is used inconsistently. An example of this situation is:
Thread 1:
...
AcquireLock( A )
X = X + F()
ReleaseLock( A )
Thread 2:
...
X = X + F()
Thread 3:
...
AcquireLock( B )
X = X + F()
ReleaseLock( B )
In this example, the first update to X is protected by the lock A, but the second update to X is not protected by any lock. Therefore, it is possible that the operations of Thread 1 and Thread 2 interleave in a way that alters the application's behavior. Similarly, Thread 3 has the problem that it is using a different lock than Thread 1, which means that mutual exclusion is not enforced between any of Threads 1, 2, or 3.
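A consistently synchronized version might protect every update of X with the same lock, as in the following sketch (X is assumed to be a plain shared integer and F() to have no side effects on other shared state):

#include <mutex>

int X = 0;
std::mutex X_lock;          // one lock guards every access to X

int F();                    // assumed side-effect free

void update_X() {           // called from any thread
    std::lock_guard<std::mutex> guard(X_lock);
    X = X + F();
}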
Atomicity Violations The third case is when synchronization is used at the wrong granularity. In a serial program, all operations between I/O statements appear atomic to an outside observer, since the I/O statements are the only mechanism by which to interact with or observe the execution of a serial program. In a multi-threaded program this is not true. The multiple threads that may exist can potentially interact with or observe the state of other threads in the application at any point during the execution. Because of this, developers need to be aware of the concept of atomicity. Consider a statement such as:
X = F(X)
Let X be an arbitrary object, and F be an arbitrary function applied to X. During this computation, various parts of X may change to intermediate states that are then used to compute the final value of X. If no other entity can observe or modify any of these intermediate states, then F() is an atomic function, and the atomicity of the transformation X = F(X) has been preserved. Consider the following example:
struct Node {
    struct Node *next;
    double data;
};
Node *head;
int length;
void Initialize() {
    head = NULL;
    length = 0;
}
void Add( Node *p ) {
    p->next = head;
    head = p;
    length = length + 1;
}
Node *Remove() {
    Node *p = head;
    length = length - 1;
    head = p->next;
    return p;
}
double GetAverage() {
    Node *p = head;
    double answer = 0;
    for (int i = 0; i < length; ++i) {
        answer += p->data;
        p = p->next;
    }
    return (answer / length);
}
Assume that the application requires the invariant that "length" is exactly the number of elements in the linked list. The GetAverage() function depends on this to compute the correct answer. Both the Add() and Remove() functions change the value of "length." If either function is called in parallel with the GetAverage() function, then an inconsistent state of the data structure may be observed. To fix this, the example must enforce atomicity to maintain the invariant that the value of length is exactly the same as the number of nodes in the list. Intel Parallel Inspector is not designed to detect atomicity violations in an otherwise well-synchronized parallel program. Instead, Parallel Inspector detects the lack of synchronization. To reason about atomicity with Parallel Inspector, the recommended approach is to remove synchronization suspected to be at the wrong granularity, find the set of data races this
creates, and then reason about your algorithm to determine whether it would be correct if the atomicity region were large enough to enclose all affected memory accesses.
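As an illustration of enforcing atomicity at a coarser granularity, the list operations above could be guarded by a single lock, as in the following sketch (the mutex and function names are assumptions, and the sketch reuses the Node, head, and length declarations from the example; the original text does not prescribe a particular fix):

#include <mutex>

std::mutex list_lock;   // guards head, length, and the nodes they reach

void AddSafe( Node *p ) {
    std::lock_guard<std::mutex> guard(list_lock);
    p->next = head;
    head = p;
    length = length + 1;
}

double GetAverageSafe() {
    // The whole traversal is one atomic region, so the invariant
    // "length == number of nodes" holds while it runs.
    std::lock_guard<std::mutex> guard(list_lock);
    Node *p = head;
    double answer = 0;
    for (int i = 0; i < length; ++i) {
        answer += p->data;
        p = p->next;
    }
    return answer / length;
}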
Intel® Parallel Inspector Defect Model
When Parallel Inspector detects a defect in the parallel software, it records this fact. The record contains the static and dynamic location of the occurrence of the defect as well as supporting information that may be necessary to help the developer understand how the defect occurred or what may be necessary to fix the defect. Many detected defects may be related to the same root cause in the application. For example, if the developer forgot to synchronize the update to a particular object, then that update may cause a data race with all other uses of that object. Similarly, if the same object is referenced by several threads, then this set of data races must be fixed by acquiring the same synchronization object to cover all usages of that object. To aid the user in organizing the defects, Parallel Inspector defines three organizational concepts: an Observation, a Problem, and a Problem Set. A defect in the application is composed of Observations. An Observation is defined as a fact about an action performed by the application. This may be a memory Read action, a memory Write action, a Lock Acquire action, a Memory allocation action, or possibly a Memory deallocation action. Each Observation has a representative call stack, and specific location (instruction) in the application which triggered this action. Some Observations are fundamental to a defect. For example, in a data-race defect, the memory accesses (either Read or Write) trigger the existence of this defect. Other Observations are incidental. For example, it may be useful to know where the application allocated the object involved in the data race, but typically it is not the allocation which triggered the detection of the defect. A Problem contains one or more Observations related to the same defect. A Problem also merges all of the dynamic instances of a defect into a single instance of the defect. For example, a utility function may access a shared variable that is unsynchronized, which causes
a data race. This function may be called from many different locations in the application, representing many different dynamic occurrences. This first level of defect organization focuses the user on the source code of the application and collapses all of these dynamic instances into a single Problem. The second level of organization performed by Parallel Inspector is to organize Problems into Problem Sets. Two Problems are grouped into the same Problem Set when they share a significant Observation. For example, consider Problem 1, which writes to an object at location S1 and reads the object from a different thread at location S2. If another Problem 2 exists that also writes the object at location S1, but reads the object from some other thread at location S3, then Problems 1 and 2 are placed into the same Problem Set. What this indicates is that the pair-wise relationship of a data race between two threads should be extended to a group-wise relationship between a set of threads and a set of memory locations. This implies that, since S1 is related to S2 and also related to S3, it is quite likely that S2 is also related to S3 and needs to be considered when developing the solution to fix the data race. These three levels of organization of application defects in Parallel Inspector allow the developer to interact with the reported information in appropriate ways. For example, a "Detail" view is provided that organizes the defects according to the Observations. This allows quick access to any Problem or Problem Set based on characteristics of any Observation contained in a Problem Set. For example, the developer may want to know which data races are connected with an object allocated at a specific location in the program. Or, the developer may be concerned about which Observations are reported to be involved with a specific binary module (.DLL or .EXE file) or with a specific function or file. If the developer recently edited these functions or files, or recompiled this module, the defect may relate to a modification of the application recently performed by the developer.
Implementation Intel Parallel Inspector is designed to observe an application as it executes. This has the benefit that the tools have accurate information about the state of
the application as it runs, but has the disadvantage of being limited to the subset of the application behavior covered by the workload used during testing. Since Parallel Inspector relies on dynamic program behavior to detect defects, it is primarily a diagnostic tool that helps pinpoint the root cause of a defect. Parallel Inspector uses the Pin Dynamic Instrumentation system []. This system is a set of APIs which translate and instrument the native assembly-language instructions. Pin operates on a trace of instructions, translating them with a native-code to native-code JIT compiler and inserting instrumentation which drives an online analysis engine. This instrumentation observes modifications to memory (load or store actions), as well as memory allocation and deallocation, threading, and synchronization actions. Two different analysis engines are included in Parallel Inspector. The Memory Inspection analysis engine maintains a bitmap describing the valid and initialized state of the address space. Allocation actions create valid address regions; write actions create initialized address regions. Subsequent memory accesses are checked against this bitmap to determine if they are invalid or uninitialized. The memory allocation and deallocation operations are matched to help determine which memory regions have never been deallocated, and further analysis is done to determine which memory regions are inaccessible (i.e., a memory leak). The Thread Inspection analysis engine employs a "happens-before" [] methodology for determining where a data race may exist in the application. The "happens-before" relationship is computed by observing the synchronization performed by the application. Two threads are ordered by a "happens-before" arc when the second thread's execution is conditional on the first thread's execution. For example, the first thread releases a lock, which the second thread subsequently acquires. In this execution of the program, any memory reference that occurs before the first thread releases the lock must "happen before" any memory reference that occurs after the second thread acquires the lock. Using this methodology, a data race occurs whenever the same memory location is accessed (with at least one write) by two threads and there does not exist a "happens-before" ordering relationship between these two memory accesses.
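A minimal sketch of the release/acquire ordering just described, written with standard C++ primitives purely for illustration (the tool itself observes whatever threading API the application actually uses):

#include <mutex>

int shared = 0;
std::mutex m;

void writer() {
    m.lock();
    shared = 42;   // reference before the release of m
    m.unlock();    // release: one end of a happens-before arc
}

void reader() {
    m.lock();      // acquire: if it occurs after writer()'s release,
                   // the write above happens-before this read,
                   // so no data race is reported
    int local = shared;
    m.unlock();
    (void)local;
}

// Without the lock/unlock pairs there is no happens-before arc between the
// conflicting accesses to 'shared', and the two accesses form a data race.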
A special class of memory accesses is flagged by the Thread Inspector as deserving additional attention. This class of memory accesses is known as "cross-thread stack accesses." When a thread accesses memory belonging to the stack of another thread, it typically does this by dereferencing a pointer published by the thread that owns the stack. The challenge is not only enforcing mutual exclusion on stack locations when they are modified, but also that the existence of the stack location needs to be maintained while other threads access it. In a strict parent-child relationship, where the parent publishes one of its stack locations to its children and then waits for the children to terminate, it is guaranteed that the lifetime of the stack location encloses the lifetime of the remote thread which has obtained a reference to that stack. In any other situation, the developer must be vigilant in understanding the memory reference pattern to determine if it is a legal parallel programming pattern. Additionally, the Thread Inspector determines the existence of actual and potential deadlocks caused by the synchronization executed by the application. A lock-hierarchy acquisition graph is inferred from the lock acquire and release actions observed in an application's execution. The analysis engine checks for cycles in this graph to detect if a deadlock exists.
Related Entries
Intel Parallel Studio
Intel Thread Profiler
Intel Threading Building Blocks (TBB)
Bibliographic Notes and Further Reading
The Parallel Inspector is a product in the Intel Parallel Studio. This product is the evolution of ongoing work, which was initially released by Kuck & Associates, Inc. (KAI) as the product Assure, and then subsequently refined and released as the Intel Thread Checker after the acquisition of KAI by Intel in 2000. The main evolution of this technology has been the move from source-to-source instrumentation, to static binary instrumentation, and finally to the Pin-based dynamic binary instrumentation that is used in the Parallel Inspector.
Bibliography
. Gabb H, Kakulavarapu P (eds) () Developing multithreaded applications: a platform consistent approach
. Wang L, Xu X () Parallel software development with Intel threading analysis tools. Intel Technology Journal
. Banerjee U, Bliss B, Ma Z, Petersen P () Unraveling data race detection in the Intel thread checker. Presented at the first workshop on Software Tools for Multi-core Systems (STMCS), in conjunction with the IEEE/ACM international symposium on code generation and optimization (CGO), Manhattan, New York, March
. Banerjee U, Bliss B, Ma Z, Petersen P () A theory of race detection. In: International symposium on software testing and analysis, Proceedings of the workshop on parallel and distributed systems: testing and debugging, Portland
. Intel Parallel Inspector Help () http://software.intel.com/sites/products/documentation/studio/inspector/en-us//ug_docs/index.htm
. Hazelwood K, Lueck G, Cohn R () Scalable support for multithreaded applications on dynamic binary instrumentation systems. In: Proceedings of the international symposium on memory management (ISMM), Dublin, Ireland. ACM, New York
Intel® Parallel Studio Paul Petersen Intel Corporation, Champaign, IL, USA
Definition
Intel Parallel Studio is a product that integrates with the Microsoft∗ Visual Studio IDE and extends the IDE with facilities to assist in the creation of software for multi-core computers.
Introduction
Intel Parallel Studio is a product that encompasses the life cycle of parallel software development. If you are starting from a serial application, the tools provided can generate high performance executables, inform you of where the hotspots exist in your algorithms, and find memory leaks and invalid memory access that may be corrupting your application’s execution. Continuing in the parallel software life cycle, tools exist that assist the user in determining the most appropriate places to introduce parallelism. Once the application
is enabled for parallel execution, the remaining tools in Parallel Studio come into play. Problems unique to parallel programming such as deadlocks, data-races, synchronization overhead, or insufficient concurrency are easily exposed through the parallel analysis tools.
Requirements for Parallel Studio
Intel Parallel Studio works with the Microsoft Visual Studio development environment on the Windows XP, Vista, or Windows 7 operating systems.
Background Intel Parallel Studio is a software development product created by Intel that uses the Microsoft Visual Studio Integrated Development Environment. Its purpose is to enable the development of programs for parallel computing. Parallel programming enables software programs to take advantage of multi-core processors from Intel and other processor vendors. Parallel Studio consists of four component parts, each of which has a number of features:
● Parallel Composer consists of a high-performance C++ compiler, a number of performance libraries (Integrated Performance Primitives), Threading Building Blocks, and an extension to the Visual Studio debugger for parallel programs.
● Parallel Advisor assists the developer in the analysis and modeling of where to best add parallelism to an application.
● Parallel Inspector serves as an intelligent assistant to notify the developer about defects in the parallel application that could cause incorrect behavior. In particular, Parallel Inspector focuses on a few core problems: misuse of synchronization that can result in data races or deadlocks, and incorrect memory accesses resulting in illegal or invalid memory references in parallel applications.
● Parallel Amplifier helps the developer to understand the performance implications of their application, providing profiling support to determine the hotspots, bottlenecks due to synchronization, and an analysis of the concurrency present in the application.
Parallel Studio is focused on native code development for C/C++ programming.
Workflow Parallel software development builds on top of sequential software development in a number of ways. First you must understand the opportunities for
parallelism. Second, you must express the parallelism in the application. Third, you must verify that your expression of parallelism does not have unexpected defects. Finally, you must determine whether you are getting the expected benefit from the use of parallelism. Intel Parallel Studio (see the figure) provides the capabilities you need to tackle all four of these areas. Starting with a high-performance C/C++ compiler, adding parallelism-specific analysis for opportunities, correctness, and performance gives the developer a head start.
Intel® Parallel Studio. Fig. Intel Parallel Studio product components: Parallel Advisor (model and transform: assists with modeling of where to best add parallelism to an application), Parallel Composer (compile, build, and debug: compiler, libraries of scalable threaded code, and debugger for efficient, effective parallelism), Parallel Inspector (verify: detects hard-to-find memory and threading errors and performs root-cause analysis for defects), and Parallel Amplifier (tune: parallelism performance tuning for application optimization)
Related Entries
Intel Parallel Inspector Intel Thread Profiler
Intel Threading Building Blocks (TBB)
Bibliographic Notes and Further Reading
The Intel Parallel Studio brings together products to aid the developer in producing parallel programs for Microsoft∗ Windows.
Bibliography
. Intel Parallel Studio Help () http://software.intel.com/en-us/articles/intel-parallel-studio-documentation/
Intel® Thread Profiler Mark Dewing, Douglas Armstrong Intel Corporation, Champaign, IL, USA
Definition
The Intel Thread Profiler is a tool to assist in optimizing multithreaded applications. It shows developers where various performance bottlenecks and impediments to scalability occur in the program.
Discussion Introduction Intel Thread Profiler collects data relevant for multithreaded performance and helps the user understand the cause of various performance problems. It can collect data on programs written using OpenMP∗ or using native threading APIs (Windows∗ API or POSIX∗ threads). "∗ Other names and brands may be claimed as the property of others".
Threading Methodology Threading for performance is a process with several steps. This process starts with discovering possible parallelism in an existing codebase or algorithm and expressing that parallelism through some threading library or framework (e.g., OpenMP, Intel Threading Building Blocks, native threading APIs). Next, the threaded code must be debugged and analyzed for correctness with testing and possibly with tool support (e.g., Intel Parallel Inspector). Then the performance of the code is analyzed and the code is tuned. This process is iterative: changes suggested by performance tuning need to be tested for correctness as well. Intel Thread Profiler can provide assistance in the performance-tuning part of the process. A typical usage model is to run the target application under Intel Thread Profiler and start by looking at regions where the concurrency is lower than expected. Next, particular synchronization objects are examined to see which ones are responsible for the most waiting. Finally, the source locations where those objects are used can be examined to understand the root cause of the performance problem.
Implementation Details
To collect data on OpenMP programs, Intel Thread Profiler uses a version of the OpenMP runtime library that collects statistics and writes them to a file. For native threads, binary instrumentation is used to intercept various threading and blocking APIs. A graphical interface integrated into the VTune™ Performance Environment reads the data file and displays it for interactive analysis (VTune is a trademark of Intel Corporation in the USA and other countries).
Profiling OpenMP The OpenMP mode of Intel Thread Profiler displays the amount of time spent in parallel regions and serial regions. It also shows the amount of imbalance in parallel regions, and the amount of overhead due to synchronization and locks. This can be displayed per region, as well as an aggregated summary (Fig. ).
Intel® Thread Profiler. Fig. OpenMP region view in Thread Profiler
The tabs show a breakdown of the time in categories:
● Summary – time for the whole program
● Regions – time per serial or parallel region
● Threads – time per thread
● Data – grid view of performance data
The screen shot shows the regions view with two parallel regions – one with a large amount of imbalance and one with almost no imbalance.
Profiling and Tracing Native Threads When analyzing applications using native threads, different data is collected and displayed to the user, as shown in Fig. .
Timeline Pane The lower pane in Fig. is the timeline view, where the threads are on the vertical axis and time is on the horizontal. The activity of each thread is shown (either waiting or in a ready-to-run state). Transition lines show connections between threads – where one thread releases a lock and another thread acquires that lock.
Intel® Thread Profiler. Fig. Intel Thread Profiler display for native threads
There are several thread states:
● Active – the thread is either running on a core, or is ready to run and is waiting in the run queue.
● Wait – the thread is blocked waiting for some resource to become available (synchronization or I/O).
● Spin – the thread is actively waiting in a synchronization construct. This thread state is only detected in libraries or user code that has added API calls to inform Thread Profiler of spin waits.
The bars in the profile view and the line segments in the timeline view can also be colored by concurrency and/or behavior. The concurrency types are:
● No Thread Active – none of the application's threads are active.
● Serial – only a single thread is active.
● Underutilized – the number of active threads is less than the number of cores in the system.
● Fully Utilized – the number of active threads is equal to the number of cores on the machine. This is the ideal case for best utilization of CPU resources.
● Overutilized – there are more active threads than cores on the machine. This can lead to some performance loss due to excess context switches and thrashing of the cache.
The behavior categories apply to time segments along the critical path:
● Critical Path – The thread on the critical path is active, and not holding a resource needed by the next thread on the critical path (compare with "impact time").
● Blocked on critical path – The thread on the critical path is waiting for an external event, typically a system event (a timer or file I/O) or a user event (mouse or keyboard). (If it were waiting on another thread, that thread would have been on the critical path instead.)
● Impact on critical path – The current thread is active, but the next thread on the critical path is waiting. Typically, this means the current thread is holding a lock or other resource while the next thread waits.
Profile Pane The upper pane in Fig. is the profile view, which displays wait time and time on the critical path, grouped in various ways: by concurrency, thread, object, object type, or source location. The groupings in more detail are:
● None – shows the overall times for the region visible in the timeline view
● Concurrency – time a given number of threads are active in the system
● Thread – list of threads created in the process and the amount of time each thread spends active or waiting
● Object Type – waiting caused by objects, grouped by type (all the mutexes in one group, all the critical sections in another, etc.)
● Object – waiting caused by individual objects (individual mutexes, critical sections, etc.)
● Source Location – time associated with the source location where the wait occurs and the source location where the object is released
Source View The user can navigate to the source code where the wait occurred. The source code where the object is released is also shown. Critical Path The critical path links transitions between threads into a continuous path from the beginning to the end of the program. The critical path is visible in the timeline view, and data aggregated over the critical path is shown in the profile view. If the performance is improved anywhere along the critical path, the program’s run time will be reduced. Improving the performance elsewhere is less likely to reduce the program run time (it may affect load or CPU usage, however).
Tuning Threaded Code Imbalance in Parallel Work Imbalance occurs when one or more threads in a parallel region are idle waiting for other threads in the parallel region to finish working. In the OpenMP
profiler, there is a specific time category for imbalance. In the native threads mode of Intel Thread Profiler, imbalance can be seen visually in timeline view, or it manifests as underutilized time in the profile view. The two regions in Fig. are duplicates of the same code and differ only in how the work is scheduled on different cores. The amount of work varies per iteration, and in the first loop, work is assigned statically to different cores. That is, each core executes the same number of loop iterations. The second loop assigns the work dynamically to each core, and results in a much better load balance.
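The two scheduling variants described above can be sketched with OpenMP as follows (illustrative code, not the program behind the figure; the work function, iteration count, and chunk size are placeholders):

#include <omp.h>

void work(int i);   // cost varies strongly with i

void run_static(int n) {
    // Each thread gets a fixed, equal-sized block of iterations;
    // if the cost per iteration varies, some threads finish early and idle.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        work(i);
}

void run_dynamic(int n) {
    // Threads grab small chunks of iterations on demand (chunk size 16
    // chosen arbitrarily), which typically yields a much better load balance.
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; ++i)
        work(i);
}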
"Hidden" Synchronization Often the developer is unaware of the parallel behavior of various subsystems and what locks they may hold. Intel Thread Profiler reveals where these subsystems may be serializing or reducing the concurrency of the higher-level code. Excessive Synchronization If a lock is held and released frequently, the threads may spend an excessive amount of time in locking overhead, reducing the amount of time available for actual work. Excess Hold Times and Lock Convoys Threads may hold locks longer than necessary, and this will manifest as underutilized time. This may also lead to a lock convoy, which is easily spotted in the timeline view (Fig. ).
Intel® Thread Profiler. Fig. Lock convoy in timeline view
Hot Locks
A hot lock will show up in the object grouping in the profile view – the hot lock will accrue a large amount of waiting time. Further grouping by source location can pinpoint where and how the lock is being used.
Release History Intel Thread Profiler was first released in conjunction with Intel Thread Checker for Windows. That first version only included support for OpenMP. A later release added support for native threads, and a subsequent release featured an enhanced user interface. Intel VTune™ Amplifier XE later replaced the functionality of Intel Thread Profiler with a tighter integration of Intel Thread Profiler and Intel VTune™ Analyzer capabilities.
Bibliography . Gabb H, Kakulavarapu P (eds) () Developing multithreaded applications: a platform consistent approach, v .. Intel Corporation, Santa Clara, CA . Wang L, Xu X () Parallel software development with Intel Threading Analysis tools. Intel Technical Journal ():– . Intel Thread Profiler Help, v . () Intel Corporation, Santa Clara, CA
Intel® Threading Building Blocks (TBB) Arch D. Robison Intel Corporation, Champaign, IL, USA
Synonyms Intel TBB
Definition
Intel Threading Building Blocks (Intel TBB) is a C++ library for shared-memory parallel programming.
Discussion Introduction Intel Threading Building Blocks (Intel TBB) is an open-source C++ library for parallel programming of
multi-core processors. It is notable as a commercially supported library for parallel programming targeting mainstream developers. It has won two "Jolt Productivity Awards." The library has evolved since its initial release in 2006. This entry describes Intel TBB version 3.0, released in 2010. The library supports both high-level and low-level specifications of parallel control flow. At the high level are parallel "algorithms," similar in spirit to ISO C++'s standard library algorithms; at the low level is a task scheduler.
Because Intel TBB is a library rather than a language extension, it does not limit what compiler is used. C++0x lambda functions simplify the syntactic issues of the library approach.
Terminology The notation [a,b) denotes a half-open interval that includes a and excludes b. A C++ functor is an object that defines operator() as a member. Some examples in this entry create functors via C++0x lambda expressions. The lambda expressions used in this entry start with [&], [=], or [].
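For instance, a function template can accept any functor, including a lambda that captures local state; the following is an illustrative sketch (the template, functor, and variable names are arbitrary, not prescribed by the library):

template<typename Func>
void repeat(int n, Func f) {      // applies functor f to 0..n-1
    for (int i = 0; i < n; ++i)
        f(i);
}

void example(int n) {
    int sum = 0;
    // The lambda starting with [&] captures sum by reference.
    repeat(n, [&](int i) { sum += i; });
}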
Parallel Algorithms Most of the parallel algorithms in Intel TBB are C++ templates that act as parallel control structures. These structures compose functors in a particular parallel pattern. The simplest pattern is parallel_invoke(f,g), which executes functors f and g, possibly in parallel if resources permit. Extended variants allow more functor arguments. Parallel loops (See Loops, Parallel) are supported by several templates. The basic form parallel_for(range, func) applies functor func over an iteration space defined by range. For example, the following expression computes a[i]+=b[i] for i in the half-open interval [0,n):
parallel_for(blocked_range<int>(0,n),
    [&](const blocked_range<int>& r) {
        for(int i=r.begin(); i!=r.end(); ++i)
            a[i] += b[i];
    });
The range argument may be any type that models a recursively divisible range; the requirements on such a type R are (pseudo-signature and semantics):
R(const R&);       Copy constructor
~R();              Destructor
R.empty();         True if range is empty; false otherwise
R.is_divisible();  True if range can be split into two parts
R(R&, split);      Splitting constructor
(The declaration auto x = y; declares x to be of the same type as y.)
Intel TBB provides blocked_range, blocked_range2d, and blocked_range3d, respectively, for one-, two-, and three-dimensional ranges. Programmers may define their own ranges. An interesting implementation technique that demonstrates the power of recursive ranges is the internal implementation of template parallel_sort(first,last,compare). It uses parallel_for over a special range that represents the key sequence. The parallel_for recursively subdivides the key sequence until is_divisible() returns false. Each subdivision invokes the range's splitting constructor, which is defined to do a partitioning step of quicksort. The functor argument to the parallel_for deals with the base case of the recursion: it serially sorts key sequences that are not subdivided. Hence parallel_for sometimes goes far beyond iteration by expressing a parallel divide-and-conquer pattern. Intel TBB has a shorthand form for the common case of a one-dimensional space of integral type. The call parallel_for(lower,upper,step,func) is a parallel equivalent of:
for(auto i=lower; i<upper; i+=step) func(i);
Another template, parallel_do, processes a work list of items in parallel; the functor passed to it may add new items to the work list through a feeder argument, as in this fragment from an example that increments every node of a binary tree:
    if(n.right) f.add(*n.right);
    n.data += 1;
});
The initial work list consists of Root. The functor takes a node and adds its children to the work list. The parallel_do terminates when all work items have been processed.
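Written out in full, such a traversal might look like the following sketch (an illustration only; the Node type and the use of pointers as work items are assumptions):

#include <tbb/parallel_do.h>

struct Node {
    Node* left;
    Node* right;
    double data;
};

void IncrementAll(Node& root) {
    Node* initial[1] = { &root };           // initial work list: just the root
    tbb::parallel_do(initial, initial + 1,
        [](Node* n, tbb::parallel_do_feeder<Node*>& f) {
            if (n->left)  f.add(n->left);   // children become new work items
            if (n->right) f.add(n->right);
            n->data += 1;
        });
}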
Reduction and Scan The template parallel_reduce performs parallel reduction (See Reductions). It has several forms. The form discussed here is parallel_reduce(range, identity, func, reduction). The range argument is a recursively divisible range. The identity argument is the identity element of the reduction operation. The functor func has signature range × value → value: it should do a reduction over iteration space range, using value as the initial reduction value. The functor reduction has signature value × value → value: it should reduce its two value arguments to a single value. The reduction operation must be associative, but does not have to be commutative. The following example shows parallel_reduce used to reduce (sum) the first n elements of an array x of type T. It assumes that T() constructs the identity element for +.
parallel_reduce(blocked_range<int>(0,n), T(),
    [&](const blocked_range<int>& r, T init) -> T {
        for(int i=r.begin(); i!=r.end(); ++i)
            init += x[i];
        return init;
    },
    [](T a, T b) { return a+b; });
The template parallel_scan computes a parallel prefix. It uses a hybrid algorithm that does parallel evaluation only if idle threads are available [].
Partitioning Hints An optional partitioner argument can be specified as a hint for optimizing execution. The default strategy is auto_partitioner, which applies a heuristic for picking the sizes of the subranges, based on the number of hardware threads and the distribution of the workload []. The other two partitioners are simple_partitioner and affinity_partitioner. Specifying simple_partitioner causes the range to be subdivided as finely as the range argument permits. The following example is similar to the previous one, except that it specifies an explicit grain size G for the blocked_range and a simple_partitioner, which causes the parallel_for to subdivide the range into subranges no bigger than G:
parallel_for(blocked_range<int>(0,n,G),
    [&](const blocked_range<int>& r) {
        for(int i=r.begin(); i!=r.end(); ++i)
            a[i] += b[i];
    },
    simple_partitioner());
Parallel Pipeline The parallel_pipeline template executes a linear pipeline of filters. Each filter is a functor. The first filter generates values. Each subsequent filter maps input values to output values. The last filter consumes its input values. The pipeline terminates when all items have been processed. The filters can be parallel, serial_out_of_order, or serial_in_order. The pipeline invokes filters concurrently subject to the following constraints:
● A given serial filter can process only one item at a time.
● Any two serial_in_order filters process items in the same respective order.
Parallelism is achieved in two ways. First, parallel filters can process any number of items in parallel. Second, different filters can run concurrently. For instance, a three-stage pipeline might have a serial_in_order stage for reading chunks of input, a parallel stage for processing each chunk independently, and a serial_in_order stage for writing the output. The chunks will be written in the same order that they are read. For another example, a three-stage pipeline consisting of a parallel, serial_in_order, and a parallel stage is isomorphic to a Doacross loop (See Loops, Parallel), where the intervening serial_in_order stage handles the cross-iteration dependency.
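As an illustrative sketch, a three-stage pipeline of the kind just described, which reads numbers, squares them in parallel, and writes the results in input order, might be written as follows (the token count, I/O format, and filter bodies are arbitrary choices, not taken from the entry):

#include <cstdio>
#include <tbb/pipeline.h>

void run_pipeline(FILE* in, FILE* out, size_t ntoken) {
    tbb::parallel_pipeline(ntoken,            // max items in flight
        tbb::make_filter<void,double>(
            tbb::filter::serial_in_order,     // read chunks of input in order
            [&](tbb::flow_control& fc) -> double {
                double x;
                if (fscanf(in, "%lf", &x) != 1) {
                    fc.stop();                // no more input: end the pipeline
                    return 0.0;
                }
                return x;
            }) &
        tbb::make_filter<double,double>(
            tbb::filter::parallel,            // independent work, runs concurrently
            [](double x) { return x * x; }) &
        tbb::make_filter<double,void>(
            tbb::filter::serial_in_order,     // write results in input order
            [&](double y) { fprintf(out, "%g\n", y); }));
}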
Task Scheduler The parallel algorithms build upon a task scheduler. The scheduler is non-preemptive and based on work stealing. Tasks may be spawned or enqueued. Enqueued tasks are used when first-come first-serve behavior is desired, typical of parts of programs that deal with reacting to external events such as user input. Recursively spawned tasks are used to achieve scalability of computations that run to completion. Typically, a parent task spawns child tasks and then continues when the child tasks finish. The task interface supports two approaches for the waiting. In the blocking-style idiom, the parent explicitly waits for the children to complete. In the continuation-passing idiom, the parent specifies a continuation task to be executed when the children complete. The continuation-passing idiom is generally more tedious to write, but enables
the same space and time bounds as Cilk (See Cilk), because in continuation-passing style, even though tasks may wait on other tasks, no thread waits on another thread. The following code demonstrates the blocking-style idiom. It derives a class RecursiveTask from the Intel TBB base class tbb::task, and overrides method execute() to specify the actions of the task. The task applies function Work across the values in a half-open interval [l,u) via recursive decomposition.
class RecursiveTask: public tbb::task {
    int lower, upper;
    tbb::task* execute() { // Override virtual method task::execute()
        if(upper-lower==1) {
            Work(lower);
        } else {
            int midpoint = lower+(upper-lower)/2;
            // 3 = two children + one for the wait.
            set_ref_count(3);
            // Create and spawn two child tasks
            spawn(*new(allocate_child()) RecursiveTask(lower, midpoint));
            spawn(*new(allocate_child()) RecursiveTask(midpoint, upper));
            // Wait for my children to complete
            wait_for_all();
        }
        return NULL; // Effect of non-null value explained later.
    }
public:
    // Construct task that evaluates Work(i) for i∈[l,u)
    RecursiveTask(int l, int u): lower(l), upper(u) {}
};
The wait_for_all is written separately to simplify exposition. Often the spawning of the last child and the waiting are written using the shorthand spawn_and_wait_for_all, which optimizes away some scheduler overhead.
The following code shows how the root task for the parallel recursion is spawned. The net effect of the code is to evaluate Work(i) over [0,n).
void TreeSpawn(int n) {
    if(n>0)
        tbb::task::spawn_root_and_wait(
            *new(task::allocate_root()) RecursiveTask(0,n));
}
Root tasks are always waited on in a blocking manner, so that algorithms can present an ordinary call/return interface to users that does not return until work is completed. However, for efficiency Intel TBB algorithms often employ continuation-passing style internally, below the root task. The following code shows a continuation-passing variant of RecursiveTask.
class RecursiveTask: public task {
    int lower, upper;
    task* execute() { // override virtual method task::execute()
        if(upper-lower==1) {
            Work(lower);
            return NULL;
        } else {
            int midpoint = lower+(upper-lower)/2;
            // Allocate continuation. In this example, the continuation exists
            // only for synchronization, and thus can be a tbb::empty_task.
            task* c = new(allocate_continuation()) tbb::empty_task;
            // Continuation has two children.
            c->set_ref_count(2);
            // Create right child of c and spawn it.
            spawn(*new(c->allocate_child()) RecursiveTask(midpoint, upper));
            // Reuse *this as left child of c.
            recycle_as_child_of(*c);
            upper = midpoint;
            // Bypass scheduler - specify *this as next task to run.
            return this;
        }
    }
public:
    RecursiveTask(int lower_, int upper_): lower(lower_), upper(upper_) {}
};

The example shows two common optimizations for recursive spawning. First, the parent recycles itself as its left child, which avoids the overhead of creating another task and copying state from parent to child. Second, the parent does not spawn its left child, but instead returns a pointer to it. By convention, the scheduler executes that task next, skipping the overhead of pushing/popping the deque to be described shortly.
To improve cache reuse, it is sometimes desirable to have a task y run on the same thread on which a previous related task x ran. The task scheduler interface supports this preference via two methods. One method records where x ran, as an abstract affinity_id. The other method lets the programmer specify that the scheduler should try to run y in the place with the same affinity_id, unless load balancing requires otherwise. This approach of “replay affinity” does not require knowledge of the machine topology.
The current implementation of the task scheduler uses work stealing. Each thread has both a deque (double-ended queue) of spawned tasks and a mailbox of advertisements. The advertisements are for tasks in other deques that have an affinity_id for the owner of the mailbox. A thread spawns a task by pushing it onto its deque. If the task has an affinity_id, then the thread also sends an advertisement to the mailbox for the thread with that affinity_id. The threads share a queue-like structure that holds enqueued tasks. A thread chooses the next task to run by the first rule below that applies:
1. Execute the task returned by the last executed task.
2. Pop a task from the thread’s deque, choosing the most recently pushed task.
3. Dequeue an advertisement from the thread’s mailbox and steal the advertised task.
4. Pop an enqueued task from a global queue-like structure, in approximately first-in first-out order.
5. Randomly choose another thread and pop the least recently pushed task from it.
Each rule corresponds to a particular concern. Rule 1 enables tail recursion by tasks. Rule 2 encourages locality. Rule 3 implements “replay affinity.” Rules 4 and 5 balance load. Rules 2 and 5 are particularly important for nested parallelism. Consider traversing a tree. If done fully parallel, the traversal can become a parallel breadth-first traversal, taking space proportional to the size of the tree. It is better to use just enough of the potential parallelism to keep the hardware threads busy, and let each busy hardware thread do a sequential depth-first traversal of some subtree. Rule 2 corresponds to sequential evaluation order. Rule 5 expands actual parallelism just enough to keep an otherwise idle processor busy.
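The “replay affinity” methods mentioned above can be sketched as follows. This is an illustrative addition, not code from the entry: the ChunkTask class, ProcessChunk function, and the user-managed recorded array are all hypothetical, and only the tbb::task affinity members (affinity_id, set_affinity, note_affinity) are assumed from the library.

// A minimal sketch of the replay-affinity idiom under the assumptions above.
class ChunkTask: public tbb::task {
    int chunk;
    tbb::task::affinity_id* recorded;   // hypothetical per-chunk record kept across passes
    tbb::task* execute() {
        ProcessChunk(chunk);            // hypothetical unit of application work
        return NULL;
    }
    // The scheduler may call this to report the worker on which the task is run.
    void note_affinity(tbb::task::affinity_id id) { recorded[chunk] = id; }
public:
    ChunkTask(int c, tbb::task::affinity_id* r): chunk(c), recorded(r) {
        // On later passes, ask the scheduler to run this chunk where it ran before.
        if(recorded[chunk]) set_affinity(recorded[chunk]);
    }
};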
Exceptions and Cancellation
When a parallel algorithm throws an exception, related work should be canceled and the exception propagated to the thread that initiated the algorithm. Intel TBB exception handling performs these actions based on task_group_context nodes. Each task belongs to a task_group_context node. Typically all tasks of a parallel algorithm instance belong to the same node. The nodes form a tree. The tree of task_group_context nodes provides a structured summary of the nesting of parallel algorithms. An algorithm typically creates a task_group_context, and then invokes spawn_root_and_wait() on a root task associated with that task_group_context. When a running task throws an exception, the following happens:
1. The library temporarily captures the exception.
2. The library cancels the task’s task_group_context node and its descendant nodes in the tree.
3. The library cancels all tasks in each canceled node. A canceled task skips invoking its execute() method unless it is already running, in which case it runs to completion. A task’s destructor always runs, even if it is canceled, thus permitting proper cleanup.
4. After all canceled tasks are destroyed, the library rethrows the exception to the caller that was waiting on the root task.
Figure shows propagation of an exception x. The small circles denote tasks. A colored circle denotes a
Intel® Threading Building Blocks (TBB). Fig. Propagation of an exception x up through a task tree
task that is running. Each large dashed circle denotes a task_group_context node associated with the tasks inside it. The sequence shows:
1. Task T2’s method execute() throws exception x, which is temporarily caught by the library.
2. Tasks are canceled in the task_group_context containing T2 and its descendant task_group_context.
3. Task T1 waits for root task R2 to complete.
4. Root task R1 waits for task T1 to complete.
5. The library rethrows x to the caller of spawn_root_and_wait(R1).
The waiting in steps 3 and 4 is necessary because the only implicit cancellation point is just before task execution begins. To improve response time to cancellation, a task can explicitly poll its task_group_context and return early. A task_group_context can also be used to explicitly cancel a parallel algorithm, from either inside or outside the algorithm. Thus, the task_group_context tree provides a structured way to process cancellation and exception handling, even though the underlying task graphs may have much less structure and, in the case of continuation-passing style, are decoupled from the thread stacks.
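The explicit polling just described can be sketched as follows; this is an added illustration, with ComputeChunk and the chunk count standing in for application work, and only tbb::task::is_cancelled() and task_group_context::cancel_group_execution() assumed from the library.

// A hedged sketch of a task that polls for cancellation so it can stop early.
class PollingTask: public tbb::task {
    int n_chunks;
    tbb::task* execute() {
        for(int i=0; i<n_chunks; ++i) {
            if(is_cancelled()) break;   // the enclosing task_group_context was canceled
            ComputeChunk(i);            // hypothetical unit of work
        }
        return NULL;
    }
public:
    PollingTask(int n): n_chunks(n) {}
};

From outside the algorithm, the whole group could be canceled by calling cancel_group_execution() on the task_group_context associated with the root task.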
Task Groups
Intel TBB provides a higher-level task interface task_group that simplifies the writing of blocking-style tasking, particularly in conjunction with C++0x lambda expressions. The following example uses task_group to execute Work(i) for i∈[lower,upper).

void Recurse(int lower, int upper) {
    if(upper<=lower) {
        // Do nothing
    } else if(upper-lower==1) {
        Work(lower);
    } else {
        int midpoint = lower+(upper-lower)/2;
        tbb::task_group g;
        g.run([&] {Recurse(midpoint, upper);});
        g.run_and_wait([&] {Recurse(lower, midpoint);});
    }
}
Containers
Sometimes threads must inspect and modify a common container of data. The containers provided by the C++ standard library generally do not permit concurrent modification, and thus actions on them must be serialized. This serialization can become an impediment to scalability. Intel TBB provides five containers that permit concurrent inspection and modification. The containers are independent of the Intel TBB task scheduler. They may be used in conjunction with any threading package.
A concurrent_queue is an unbounded queue that permits multiple producers and consumers to operate on the same queue. There is no support for blocking pop operations. Consumers use try_pop, which returns a failure status if the queue is empty. A concurrent_bounded_queue adds support for blocking and putting an upper bound on the queue size. The blocking push operation waits if the queue is full. The blocking pop operation waits if it is empty.
A concurrent_vector is a vector that supports concurrent access and growing. For example, multiple threads can execute v.push_back(item) on the same concurrent_vector v without corrupting it. The growth operations do not move existing items, so unlike for std::vector, it is safe to access existing elements while the vector is growing. Besides push_back, there are two other growth operations:
● grow_by(n) appends n consecutive elements.
● grow_to_at_least(n) appends zero or more consecutive elements to make the vector at least size n.
All growth operations return an iterator pointing to the first element appended. In the case where grow_to_at_least(n) appends no elements, it returns an iterator pointing to just beyond the nth element. A concurrent_unordered_map is a hash table of (key,value) pairs that supports concurrent insert and find operations, along with safe concurrent traversal during these operations. The interface closely resembles the C++0x unordered_map interface. A concurrent_hash_map is a thread-safe hash table of (key,value) pairs that supports concurrent insert, find, and erase operations, but not concurrent traversal. A fundamental problem of concurrent find and erase operations on the same key is that they may race in
a way such that the find operation returns a reference to an item that is promptly destroyed by the erase operation. Class concurrent_hash_map addresses the problem by combining lookup with locking. The find and insert operations return the found (key,value) pair via an accessor object. The accessor acts as a pointer to the item with an implicit lock on it. An accessor implies a writer lock. A const_accessor implies a reader lock. Concurrent erasure of an item waits until the item has no accessors pointing to it.
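The accessor-based interface can be illustrated with a small added sketch; the table, its key type, and the update performed are illustrative choices, not taken from the entry.

#include "tbb/concurrent_hash_map.h"
#include <string>

typedef tbb::concurrent_hash_map<std::string,int> CountTable;
CountTable counts;

void AddOccurrence(const std::string& word) {
    CountTable::accessor a;     // acts as a pointer to the pair plus a writer lock
    counts.insert(a, word);     // inserts (word,0) if absent and locks the pair
    a->second += 1;             // safe update while the accessor is held
}                               // lock released when the accessor is destroyed

bool LookUp(const std::string& word, int& value) {
    CountTable::const_accessor a;   // reader lock
    if(!counts.find(a, word)) return false;
    value = a->second;
    return true;
}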
Thread Local Storage
Intel TBB abstracts thread-local storage via container-like objects that have a separate element per thread. Method local() returns a reference to a thread’s element, which is created lazily upon first access by its associated thread. Copying such a container copies all of its elements, retaining the element-to-thread mapping. There are two such container classes. The more general class is enumerable_thread_specific. The example below shows how it can be used for reduction.

typedef enumerable_thread_specific<double> ets_type;
ets_type local_sums;
parallel_for(0, n, [&](int i) {
    local_sums.local() += a[i];
});
double global_sum = 0;
for(ets_type::const_iterator i=local_sums.begin(); i!=local_sums.end(); ++i)
    global_sum += *i;

Alternatively, the final reduction can be written more concisely using method combine:

double global_sum = local_sums.combine(std::plus<double>());

Class combinable is a simpler class intended specifically for holding per-thread partial results and combining them with such a reduction.
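For comparison, the same reduction written with combinable might look like the following added sketch; the array a and its length n are assumed to exist as in the example above.

#include "tbb/combinable.h"
#include "tbb/parallel_for.h"
#include <functional>

double SumWithCombinable(const double a[], int n) {
    tbb::combinable<double> partial_sums;                 // one partial sum per thread
    tbb::parallel_for(0, n, [&](int i) { partial_sums.local() += a[i]; });
    return partial_sums.combine(std::plus<double>());     // reduce the per-thread values
}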
Memory Allocator Because malloc scales poorly on some systems, Intel TBB provides a scalable memory allocator. It also provides an allocator for cache-aligned allocations. All the allocators have interfaces similar to the
C++ std::allocator. The library also provides a C level interface that is analogous to the traditional malloc interface. Intel TBB provides support for automatically replacing calls to the system-defined allocators with calls to the scalable allocator. The implementation of the scalable allocator uses techniques similar to Hoard []. Each thread maintains its own size-segregated lists of free memory blocks. If a thread frees a block that was allocated by another thread, the freeing thread sends the block back to the originating thread.
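The two usage styles just mentioned, the STL-style allocator template and the C-level interface, can be sketched briefly; the function and the values stored are illustrative only.

#include "tbb/scalable_allocator.h"
#include <vector>

void AllocatorDemo() {
    // STL-style: a standard container backed by the scalable allocator.
    std::vector<int, tbb::scalable_allocator<int> > values;
    values.push_back(42);

    // C-level interface, analogous to malloc/free.
    void* raw = scalable_malloc(1024);
    scalable_free(raw);
}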
Mutexes
Intel TBB provides a variety of mutex classes, which all implement a common set of signatures for ordinary mutual exclusion. The common set of signatures permits code to be parameterized with respect to the mutex type. The different types are tailored to different semantic or performance needs; for example, efficiency under light contention versus fairness under heavy contention. Most of the mutex classes have corresponding reader-writer variants that provide for shared/exclusive locking. All of the mutex classes support a locking pattern where lock acquisition is represented by construction of a scoped_lock object as shown below:

std::list<std::string> MyList;
tbb::mutex MyMutex;

void PrependElement(std::string x) {
    tbb::mutex::scoped_lock lock(MyMutex);
    MyList.push_front(x); // protected by lock on mutex.
}

Destruction of the scoped_lock object causes the lock to be released. The pattern has the advantage that the programmer cannot forget to release the lock, and the lock is released even if push_front throws an exception. A subset of the mutex classes support an unscoped locking interface that follows C++0x conventions [].
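The reader-writer variants mentioned above follow the same scoped_lock pattern; the following added sketch (table and functions are illustrative) uses spin_rw_mutex with shared locks for readers and an exclusive lock for writers.

#include "tbb/spin_rw_mutex.h"
#include <map>
#include <string>

std::map<std::string,int> Table;
tbb::spin_rw_mutex TableMutex;

int Read(const std::string& key) {
    tbb::spin_rw_mutex::scoped_lock lock(TableMutex, /*write=*/false);  // shared lock
    std::map<std::string,int>::const_iterator it = Table.find(key);
    return it == Table.end() ? 0 : it->second;
}

void Write(const std::string& key, int value) {
    tbb::spin_rw_mutex::scoped_lock lock(TableMutex, /*write=*/true);   // exclusive lock
    Table[key] = value;
}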
Atomic Operations
Intel TBB provides template class atomic<T> for atomic operations on integral and pointer types, including atomic load and store, compare-and-swap, fetch-and-store, and fetch-and-add. For example, the following code atomically decrements RefCount and executes Foo() if it became zero.

extern tbb::atomic<int> RefCount;

if(--RefCount==0) Foo();
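A further added sketch of the fetch-and-add and compare-and-swap operations; the counter and its uses are illustrative.

#include "tbb/atomic.h"

tbb::atomic<int> Counter;

int NextTicket() {
    return Counter.fetch_and_add(1);          // returns the previous value atomically
}

bool ClaimIfZero() {
    // compare_and_swap(new_value, comparand) returns the value observed before the operation.
    return Counter.compare_and_swap(1, 0) == 0;
}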
Future Directions
The immediate future direction of Intel TBB is driven by the C++0x draft. Intel TBB works with standard C++ compilers, without special support, except that it currently depends upon vendor extensions for atomic operations and memory consistency rules because C++ lacks a memory model for multi-threading. The C++0x draft addresses this issue. C++ prevents propagation of an exception across threads. Intel TBB works around this limitation by propagating an approximation of the original exception. Intel TBB implementations built on top of C++0x can propagate the original exception across threads. C++0x also introduces lambda expressions, which were not present when Intel TBB was initially designed. Interfaces in Intel TBB have been evolving to work better with lambda expressions. For example, a recent TBB release introduced overloads of parallel_reduce that specifically target ease of use with lambda expressions. Intel TBB has been evolving to support a subset of interfaces that are compatible with Microsoft’s Parallel Patterns Library (PPL). Examples of existing PPL-compatible interfaces are parallel_invoke, task_group, as well as the short forms of parallel_for.
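The lambda-friendly, PPL-compatible short forms mentioned above can be illustrated with a brief added sketch; Work, TaskA, and TaskB are hypothetical application functions.

#include "tbb/parallel_for.h"
#include "tbb/parallel_invoke.h"

void ShortFormExamples(int n) {
    tbb::parallel_for(0, n, [](int i) { Work(i); });        // compact index-range form
    tbb::parallel_invoke([]{ TaskA(); }, []{ TaskB(); });   // run two functors in parallel
}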
Related Entries
Cilk
Loops, Parallel
OpenMP
Bibliographic Notes and Further Reading
Intel TBB continually evolves. The most up-to-date description is its reference manual []. The corresponding tutorial [] describes use cases for various features. The work-stealing approach to task scheduling is adapted from Cilk []. (See Cilk). There are important differences. In Cilk, the spawning thread immediately executes the child, and the parent is stolen. In Intel TBB, a thread that spawns a child continues execution, while the spawned child may be stolen. The Cilk semantics are simulated via continuation-passing style. Where blocking style is used, Cilk guarantees on space and time do not apply. The algorithm templates in TBB generally use continuation-passing style internally, with a single blocking-style task at the root. The mechanism for auto_partitioner does introspection on stealing []. The mechanism for affinity_partitioner [] is derived from a locality-guided work-stealing technique described by Acar, Blelloch, and Blumofe [].
Bibliography
1. Acar U, Blelloch G, Blumofe R () The data locality of work stealing. In: Proceedings of the annual ACM symposium on parallel algorithms and architectures, SPAA. Bar Harbor, Maine. ACM, New York, pp –
2. Berger E, McKinley K, Blumofe R, Wilson P () Hoard: a scalable memory allocator for multithreaded applications. In: Proceedings of the international conference on architectural support for programming languages and operating systems, ASPLOS. Cambridge, Massachusetts. ACM, New York, pp –
3. Frigo M, Leiserson C, Randall K () The implementation of the Cilk-5 multithreaded language. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation, PLDI. Montreal, Quebec. ACM, New York, pp –
4. Intel Corporation. Intel Threading Building Blocks tutorial. http://www.threadingbuildingblocks.org/documentation.php
5. Intel Corporation. Intel Threading Building Blocks reference manual. http://www.threadingbuildingblocks.org/documentation.php
6. ISO/IEC JTC1/SC22/WG21 C++ Standards Committee, Working draft, standard for programming language C++. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/
7. MacDonald S, Szafron D, Schaeffer J () Rethinking the pipeline as object-oriented states with transformations. In: International workshop on high-level parallel programming models and supportive environments, IPDPS. Santa Fe, New Mexico. IEEE, pp –
8. Reinders J () Intel Threading Building Blocks. O’Reilly Media, Sebastopol, California
9. Robison A, Voss M, Kukanov A () Optimization via reflection on work stealing in TBB. In: Proceedings of the IEEE symposium on parallel and distributed processing. New York
Interactive Parallelization
FORGE
Interconnection Network
Buses and Crossbars
Cray XT and Seastar 3-D Torus Interconnect
Ethernet
Hypercubes and Meshes
InfiniBand
Interconnection Networks
Myrinet
Quadrics
Interconnection Networks Sudhakar Yalamanchili Georgia Institute of Technology, Atlanta, GA, USA
Synonyms Multicore networks; Multiprocessor networks
Definition High-performance computing architectures utilize interconnection networks as the communication substrate between the processor cores, memory modules, and I/O devices.
Discussion Introduction In their most general form, interconnection networks are a central component of all computing and communication systems from the internal interconnects of chip-scale embedded architectures to geographic-scale systems such as wide area
networks and the Internet. This section focuses on interconnection networks as they are used in multiprocessor and multicore systems. Specifically, the section addresses basic systems concepts and principles for interconnecting processors, cores, and memory modules. These concepts and principles are not exclusive to this application domain and in fact have also found application in networks used in storage systems, system area networks, and local area networks, albeit with distinct engineering manifestations tailored to their particular domains. Interconnection networks have their roots in telephone switching networks and packet switching data networks and, in fact, the earliest multiprocessor networks were adaptations of those networks. They have since evolved to incorporate new principles and concepts that accommodate the distinct engineering and performance constraints of multiprocessor and multicore architectures. In particular, the lack of standards such as that found in the wide-area and local-area communities created opportunities for meeting performance goals via customized solutions, which in turn has been largely responsible for the diversity of implementations. However, the shared goal of high-performance multiprocessor communication has led to a common set of principles and concepts organized around a network stack that can be viewed as being comprised of the following layers: routing layer, switching layer, flow control layer, and physical layer. These layers collectively operate to realize network performance and correctness properties, for example, deadlock freedom. The physical layer comprises signaling techniques for synchronized transfer of data across links. Switching techniques determine when and how messages are forwarded through the network. Specifically, these techniques determine the granularity and timing with which resources such as buffers and switch ports within a router are requested and released. For example, in packet switching a packet is forwarded only after it has been received in its entirety at a router whereas in virtual cut through switching bytes of a packet can be forwarded once routing information is available and the corresponding output port is free. Forwarding and reception are overlapped. The design of switching techniques is closely coupled to the behavior of flow control protocols, which ensure the availability of buffering resources as packets traverse the network.
Finally, routing algorithms determine the path taken by a message through the network. This section describes basic interconnection network concepts covering topology, routing, correctness, and common metrics. The discussion includes the routing layer and periodically exposes dependencies on the lower level layers particularly with respect to deadlock freedom.
Topology Routing of communication between nodes is the most basic property of an interconnection network and is intimately tied to the pattern of interconnections between nodes where a node may be a processor, a core, a memory module, or an integrated processor-memory module. Such an interconnection pattern is referred to as the topology of the network and defines the set of available paths between communicating nodes. A node may be a source or destination of a message that is encapsulated in one or more packets each of which is comprised of a header that contains routing and control information and an optional body that contains data. A crossbar switch provides a direct connection between any pair of network nodes. It is also the most expensive interconnect in terms of hardware and power. Thus, as system size grows we encounter a variety of topologies that represent various performance and cost trade-offs. The most common topologies can be broadly classified into shared medium, direct, indirect, and hybrid.
Shared Medium Networks The bus is the most common shared medium network and the first widely used interconnect in multiprocessor architectures. A bus can have several hundred signals organized as address, data, and control. Only one node is allowed to transmit at a time as determined by the arbitration policy that arbitrates between concurrent requests for the bus. Such shared medium buses do have several advantages for low node count systems. They provide an efficient broadcast medium in support of one-to-all or one-to-many communication patterns. The shared bus also serves to serialize all traffic between nodes. When used as a processor-memory interconnect, this property is exploited in cache-based multiprocessors to support coherence protocols for shared memory multiprocessors []. Routing on the
bus is straightforward since all devices snoop the address lines. However, if the bus is used to connect devices of differing speeds, such as processors to memory, bus utilizations can degrade rapidly if a transaction (e.g., reading from memory) holds the bus for the duration of the transaction. As processor and memory speed disparities grew, low utilizations led to split phase transactions where requests and responses are split into two separate transactions. As a result, multiple memory transactions can be interleaved on the bus increasing bus utilizations. Physical constraints limit bus lengths and consequently system sizes. This has also become true for chip multiprocessors, and thus the use of buses can be expected to be limited to smaller sized systems at future technology nodes.
Direct Networks
To achieve scalability beyond that which is feasible with shared medium networks, nodes are interconnected by point-to-point links. A node includes a router whose input and output links are connected to routers at some set of neighboring nodes, producing a direct network. An example of a processor node in a 2D mesh network is illustrated in Fig. where one of the router links connects within the local node. Each unidirectional link realizes the injection and ejection channels.
The topology refers to the overall interconnection pattern between nodes (actually between the routers to which the nodes are connected). A common topology, particularly for on-chip networks, is a 2D mesh as shown in Fig. . Formally, an n-dimensional mesh has k_0 × k_1 × ... × k_(n-2) × k_(n-1) nodes, where k_i is the number of nodes along dimension i and is referred to as the radix for dimension i, and where k_i ≥ 2 and 0 ≤ i ≤ n−1. Each node X is identified by n coordinates, (x_(n-1), x_(n-2), ..., x_1, x_0), where 0 ≤ x_i ≤ k_i − 1 for 0 ≤ i ≤ n − 1. Two nodes are connected if their coordinates differ by 1 in only one dimension. Except for nodes on the boundary, all nodes are directly connected to 2n other nodes. Thus, each interior node of a two-dimensional mesh is connected to four other nodes, and each interior node of a three-dimensional mesh is connected to six other nodes. Now, messages between a pair of nonadjacent nodes are routed through the intermediate routers. A symmetric generalization of the mesh is the k-ary n-cube []. Unlike the mesh, nodes at the edge have a link that “wraps around” the dimensions as shown in Fig. for the 2D torus. Now all nodes have the same number of neighbors. Formally, a k-ary n-cube has k_0 × k_1 × ... × k_(n-2) × k_(n-1) nodes, with k_i nodes along each dimension i, and two nodes X and Y are neighbors if and only if y_i = x_i for all i, 0 ≤ i ≤ n − 1, except for a single dimension j, where y_j = (x_j ± 1) mod k. Every node has 2n neighbors.
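The neighbor relation just defined can be made concrete with a short added sketch (not part of the original entry); coords holds the coordinates x_i and radix holds the radices k_i.

#include <vector>

// Enumerate the neighbors of a node in a k-ary n-cube.
std::vector<std::vector<int>> Neighbors(const std::vector<int>& coords,
                                        const std::vector<int>& radix) {
    std::vector<std::vector<int>> result;
    for(size_t j = 0; j < coords.size(); ++j) {
        for(int step : {+1, -1}) {
            std::vector<int> y = coords;                      // copy all coordinates
            y[j] = (coords[j] + step + radix[j]) % radix[j];  // differ in dimension j only
            result.push_back(y);
        }
    }
    return result;
}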
Interconnection Networks. Fig. An example of network nodes (CPU, cache hierarchy, and memory controller per node)
Interconnection Networks. Fig. Example direct network topologies: binary hypercube, 2D torus, unbalanced and balanced binary trees, flattened butterfly, irregular network, cube-connected cycles, de Bruijn network
When n = 1, the k-ary n-cube collapses to a bidirectional ring with k nodes. When the radices in all dimensions are equal to k, we have k^n nodes. When k = 2, the topology becomes the binary hypercube or binary n-cube, as shown in Fig. . A generalization of the k-ary n-cube is the generalized hypercube where, rather than being connected to the immediate neighbors in each dimension, each node is connected to every node in each dimension. In graph-theoretic terms, as we proceed from meshes through k-ary n-cubes to generalized hypercubes, additional links are added, creating a denser interconnect. From a practical standpoint, as additional links are created, feasible channel widths change since the number of I/Os at an individual router has some maximum value. Finally, in all of the cases above, we note that the radix in each dimension can be distinct. These networks are often referred to as orthogonal networks
recognizing the independence of the routing behavior in each dimension (see Routing (Including Deadlock Avoidance)). The landscape of direct interconnection network topologies is only limited by the imagination, and many variants have been proposed and will no doubt continue to be proposed to match engineering constraints as technologies evolve. The preceding are the most common classes of networks used in current generation systems. Figure illustrates some other common examples from the current time frame. A 2D flattened butterfly topology used in on-chip networks is equivalent to a 2D generalized hypercube. Various forms of balanced and unbalanced trees have been proposed, motivated by the logarithmic distance properties between nodes. Others such as de Bruijn networks [], star networks [], and cube-connected cycles [] present different trade-offs between network diameter (longest path between nodes) and wiring
density. We can anticipate other direct networks to evolve in the future. A final observation is that all of the topologies in the preceding discussion exhibit some form of regularity in structure, the edge routers in the mesh notwithstanding. A final class of networks that can be distinguished are irregular networks. Rather than a mathematical description, such networks are typically represented as graphs of interconnections and can be constructed to form any type of topology. An example is shown in Fig. . These networks are usually motivated by packaging constraints or customization, for example a specific topology in embedded systems where the topology is customized to the specific communication interactions between nodes. As a result of their irregularity these networks rely on distinct classes of routing techniques that are described in Routing.
Indirect Networks
In general, direct topologies support interconnections where some routers may simply serve as connections to other routers and are not attached to any node such as a processor, core, or memory. These are the class of indirect networks. Like direct topologies, these may be organized into regular or irregular configurations. A large class of regular indirect networks are multistage interconnection networks (MINs), an example of which is illustrated in Fig. . Nodes are connected to the inputs and outputs of the network. Each 2 × 2 router – typically referred to as a switch in these topologies – can route a packet from one of its input ports to one of its output ports. There are N = 2^n input ports, N = 2^n output ports, and log_2 N stages of switches with N/2 switches in each stage. The interconnections between stages of switches are important – they ensure that a single path exists between every input port and every output port. In particular, consider the interconnection between stages 0 and 1 in Fig. . The connections can be described by the function

    σ(x_(n-1) x_(n-2) ... x_1 x_0) = x_(n-2) ... x_1 x_0 x_(n-1)

For example, output i of a stage is connected to input σ(i) of the next stage of switches (as shown in Fig. ). This connection can be thought of as analogous to perfectly shuffling a deck of N cards, where the deck is cut into two halves and shuffled perfectly, that is, the inputs are shuffled to the outputs. This function is referred to as the shuffle function, and the network in the figure is the well-known Omega network. The preceding is a compact way of expressing the connections between stages. Note that every output (of a stage) is connected to exactly one input (of the next stage of switches), and therefore these connections can be represented as a permutation of the inputs. Consequently, such networks are referred to as permutation networks. In general, such networks can be constructed with t × t switches and correspondingly n = log_t N stages for connecting t^n nodes. Not all stages need to have the same interconnect pattern. For example, consider the following interconnect function:

    β_i(x_(n-1) ... x_(i+1) x_i x_(i-1) ... x_1 x_0) = x_(n-1) ... x_(i+1) x_0 x_(i-1) ... x_1 x_i

The preceding function is the butterfly function, forming the corresponding network of the same name illustrated in Fig. a. Note that each stage uses a distinct butterfly, ensuring every input has a path to every output. Many such MINs have been defined based on different permutation functions (see [, ] for a sampling). Perhaps the oldest and best known among these is the Clos network []. The Clos network is derived from a recursive decomposition of the crossbar, with the first-level decomposition shown in Fig. a. A crossbar is decomposed into a 3-stage network. Consider a network with N = r × n inputs. The first stage comprises r switches of size n × m, the center stage comprises m switches of size r × r, and the last stage comprises r switches of size m × n. Each switch at the first stage has a connection to every switch in the middle stage. Each switch in the center stage has a connection to every switch at the last stage. Thus, there are m paths between every input and output. Each of the individual switches can be further and recursively decomposed with n = m = 2 and r = N/2 until the network is reduced to 2(log_2 N) − 1 stages of 2 × 2 switches. This network is known as the Benes network. The MINs are characterized based on their ability to simultaneously establish connections between inputs and outputs. If a path through the MIN prevents other paths between free inputs and outputs from being set up, they are referred to as blocking networks.
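As an added illustration (assuming N = 2^n ports addressed by n-bit integers), the shuffle and butterfly permutations defined above reduce to small bit manipulations:

// sigma: rotate the n-bit port address left by one position.
unsigned shuffle(unsigned x, unsigned n) {
    unsigned msb = (x >> (n - 1)) & 1u;
    return ((x << 1) | msb) & ((1u << n) - 1u);
}

// beta_i: exchange bit i with bit 0 of the port address.
unsigned butterfly(unsigned x, unsigned i) {
    unsigned bi = (x >> i) & 1u, b0 = x & 1u;
    if(bi == b0) return x;              // bits equal: address unchanged
    return x ^ ((1u << i) | 1u);        // bits differ: flip both
}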
Interconnection Networks. Fig. A MIN with a shuffle interconnection between stages
Interconnection Networks. Fig. Some Example MINs. (a) The butterfly network. (b) A bidirectional MIN network
MINs for which any free input can be connected to any free output are referred to as non-blocking, while MINs where any free input can be connected to any free output by rearranging (rerouting) existing paths are referred to as rearrangeable. The preceding MINs with log_2 N stages are blocking. Note that adding extra stages
introduces multiple paths between inputs and outputs until the network becomes rearrangeable and eventually non-blocking. Clos networks are non-blocking when m ≥ 2n − 1 and rearrangeable when m ≥ n. Finally, the preceding discussion was based on MINs with unidirectional links routing packets from input to
Interconnection Networks. Fig. Some additional MINs. (a) Crossbar decomposition of the clos network. (b) An example fat tree
output. However, MINs can operate with bidirectional links, and packets can be routed to “any” output port of an intermediate switch, leading to routing paths shown in Fig. b. Such MINs are referred to as bidirectional MINs. A careful look at the structure of bidirectional MINs will reveal that there exist multiple paths between any source–destination pair, an example of which is shown in the figure. A fat tree is an indirect network that is topologically a balanced binary tree with the modification that the bandwidth of the links increases as one gets closer to the root (hence the label “fat”) []. This is typically achieved by adding more links in parallel closer to the root. Routing is straightforward in a fat tree – a message is routed up to the nearest common ancestor and then down to the destination. Thus, local communication is restricted to sub-trees. Fat trees are universal in the sense that they are the most efficient networks for a given amount of hardware (see [] for definitions of efficiency and universality). A butterfly BMIN with turnaround routing can be viewed as a fat tree – consider the dashed boxes of Fig. b as single internal switches in a fat tree. In this case, it can be viewed as having four switches in the first stage, two in the second stage, and a single switch in the last stage (the root of the tree). Modern fat tree topologies have evolved into a parametric family
as a function of (a) the number of levels in the tree, (b) the number of switches at each level, and (c) the switch sizes. As with direct networks, the preceding MINs are considered regular, and it is possible to consider a range of irregular indirect networks of arbitrary topologies with nodes connected to an arbitrary subset of switches. As before, this has a clear impact on the routing protocols.
Hybrid Networks As the name suggests, hybrid networks are customized networks that couple various topological concepts described so far. For example, we might have a mesh of clusters where nodes within a cluster are interconnected by a high-speed bus. Or, we may have a hierarchy of ring interconnections. Hybrid networks are often motivated by the target packaging environment and incorporation of new technologies. For example, continued progress in optical communication, especially silicon nanophotonics, will lead to networks where interconnects and topologies are optimized for a combination of optical and electrical signaling.
Routing The routing algorithm establishes a path through the network and can be distinguished by the number
of candidate paths that can be produced between a source and destination. Deterministic routing protocols provide a single path between every source–destination pair. Oblivious routing algorithms produce routes independent of the state of the network, for example independent of local congestion. While deterministic routing algorithms are oblivious, the converse may not be true. For example, the source may cycle through alternative (deterministic) paths to the destination for successive packets in an attempt to balance network traffic across all links while not being influenced by the current state of any router. Another example is Valiant’s two-phase routing protocol [] where a packet is first transmitted to a randomly chosen node and then routed deterministically to the destination. Adaptive routing algorithms operate in response to network conditions in making routing decisions at each router by selecting from a set of alternative output ports at intermediate routers and thereby exposing multiple alternative paths to a destination. All of the preceding routing algorithms may cause a packet to follow a minimal path to the destination or permit non-minimal paths. Implementations of routing algorithms admit to several alternatives. In distributed routing, each router implements a routing function that provides an output port(s) for each packet based on some combination of destination and/or input channel and/or source. Several implementations of distributed routing are feasible including table look-up or finite state machine implementations. In source routing, the source node prepends an encoding of the entire path to the destination, for example, as a sequence of output ports at each intermediate router. Each router along the path simply queries this header to determine the local output port along which to forward the packet. Source routing is invariably deterministic. Many refined characterizations of routing algorithms have also emerged over the years. In multiphase routing algorithms, packets employ different routing algorithms during different phases of transmission. In optimistic routing algorithms, packets can freely choose any path to the destination since deadlock freedom is ensured via recovery rather than avoidance (see Deadlocks). Alternatively, in conservative routing algorithms, packets advance only when progress to the destination along the selected path is
assured. Principal features of routing algorithms and an accompanying taxonomy can be found in [].
Deadlock and Livelock
For any given topology, the routing algorithms must realize deadlock-free and livelock-free operation. Deadlock refers to the network state where some set of messages are indefinitely blocked waiting on resources held by each other. Livelock refers to a packet state, where a packet is routed through the network indefinitely and is never delivered. The presence of faulty components can lead to both deadlock and livelock behavior if mechanisms to prevent them are not implemented. An example of a deadlocked configuration of messages is shown in Fig. where messages are being transmitted between sources Si and destinations Di in a 2D mesh using deterministic shortest path routing and input buffered switches (nodes attached to each router are not shown). Every message is waiting on a buffer used by another message, producing a cycle of dependencies where each message is transitively waiting on itself. When routing is adaptive, more complex configurations of deadlocked messages can occur where each message is blocked waiting on multiple buffers occupied by multiple messages and the set of messages cannot make progress since each message is transitively waiting on itself along all feasible output ports at the intermediate router. Deadlock avoidance mechanisms are designed such that packets request and wait for resources (e.g., buffers) in such a way as to avoid the formation of deadlocked configurations of messages or otherwise guarantee progress. Routing algorithms for a topology are designed in conjunction with switching techniques to avoid deadlock. The most commonly employed principle, which is also common to deadlock avoidance in operating systems, is to provide a global ordering of resources and for all packets to request resources in this order. Specific applications of this principle are topology specific and are described in Routing (Including Deadlock Avoidance). In contrast, deadlock recovery mechanisms promote optimistic routing of messages in the absence of any deadlock avoidance rules []. Deadlocked configurations of messages are detected and recovery mechanisms are invoked that reestablish forward
Interconnection Networks. Fig. An instance of deadlock
progress of packets. The application of deadlock recovery is motivated in network configurations where the probability of deadlock occurrence is very low. This is often the case in adaptively routed networks with large numbers of virtual channels. All deadlock freedom proofs rely on the consumption assumption – messages reaching a destination are eventually consumed. However, the end points can introduce dependencies between incoming packet channels and outgoing packet channels since there may be dependencies between message types, for example processing of a message may lead to the injection of new messages. Consequently, it is possible to introduce a dependency between injection and ejection buffers external to the networks and which is not addressed by routing restrictions in the network. This can potentially introduce cyclic dependencies and therefore deadlock referred to as protocol or message-dependent deadlock []. For example, in shared memory systems
message traffic consists of request packets (for cache lines) that produce response traffic (cache lines). Traffic is often separated into two virtual networks – the request network and the response network. Thus, protocol handlers at the end points can ensure that each network satisfies the consumption assumption and that the end point is never indefinitely blocked waiting on injection (full) or ejection (empty) channels.
Deadlock-Free Routing The literature on packet data networks is rich with techniques for deadlock-free routing (avoidance). This section will be focused on principles and common techniques that are utilized in multiprocessor interconnection networks that in one form or another enforce ordering constraints to ensure deadlock freedom.
The simplest example of deadlock-free routing is in 2D meshes using X–Y routing, where a packet is routed in the X dimension eliminating the X-offset (find the right column) before routing in the Y dimension (eliminate the Y-offset) – also known as dimension order routing. This avoids deadlocked configurations since no message traversing the Y dimension can request a channel in the X dimension and create a cycle of dependencies, as illustrated in Figure . This deadlock-free routing algorithm is naturally extended to n dimensions where all packets follow the same order of dimensions in eliminating offsets in each dimension. The Turn Model generalizes dimension order restrictions on routing by observing that packets must be prevented from moving from one class of channels to another, that is, performing a turn []. In dimension order routing in 2D meshes, packets are prevented from turning from the Y dimension to the X dimension. Many turn-based deterministic and adaptive routing algorithms have been developed in the literature. One of the principal attractions of the Turn Model is that it can be applied to networks without virtual channels. Figure a shows all possible turns for a 2D mesh. Figure b shows those turns (dashed arrows) that are restricted for X–Y routing. Note that two turns are restricted in each cycle. However, restricting the use of a single turn in each cycle is sufficient to avoid deadlock. An example is illustrated in Figure c. Deadlock-free routing in irregular networks can be achieved with similar ordering constraints. For example, a distinguished node can be identified in the network (e.g., the root of a spanning tree). A labeling algorithm labels all channels as either being up and directed toward the root or down and being directed away from the root. Packet routing is now constrained to use all up channels followed by all down channels. A packet cannot turn from a down channel to an up channel, avoiding the creation of cyclic dependencies [].
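The per-hop X–Y routing decision can be illustrated with a short added sketch (port names are illustrative): the X offset is eliminated before any hop is taken in Y, so no packet ever turns from Y back to X.

enum Port { EAST, WEST, NORTH, SOUTH, LOCAL };

Port RouteXY(int cur_x, int cur_y, int dst_x, int dst_y) {
    if(dst_x > cur_x) return EAST;    // first eliminate the X offset
    if(dst_x < cur_x) return WEST;
    if(dst_y > cur_y) return NORTH;   // only then route in Y
    if(dst_y < cur_y) return SOUTH;
    return LOCAL;                     // arrived: eject to the attached node
}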
The wrap-around links in multidimensional tori introduce hardware cycles within a dimension, and therefore dimension order routing is not deadlock free (consider the case where each node in a row is simultaneously transmitting to a destination h hops away in the same row). For deterministic routing, the work of Dally and Seitz [] showed that the necessary and sufficient condition for deadlock-free routing is an acyclic channel dependency graph – intuitively, if a packet can be routed from channel A to channel B, there is a dependency from A to B. Acyclic dependency graphs are achieved in tori with the presence of two virtual channels per physical channel – say VC0 and VC1 – and placing restrictions on how these channels may be used. A single link in each dimension is chosen (sometimes called the dateline link). The routing algorithm constrains the packets being routed across this link to use one virtual channel class. Coupled with the routing rules across other links, cycles in the channel dependency graph and hence deadlock are avoided. The requirement of acyclic channel dependency graphs can be implemented in many ways by creatively restricting the use of virtual channels in regular and irregular networks. The theory of deadlock-free adaptive routing protocols was developed by Duato [, ]. Intuitively, cycles may exist in the channel dependency graph as long as there exists an acyclic subgraph that enables all messages to be routed between any source–destination pair – sometimes referred to as an escape path or escape channel at a router. Virtual channels provide a natural medium for implementing adaptive routing. In the case of tori, each physical link supports three virtual channels – VC0, VC1, and VC2. The first two provide the acyclic subgraph necessary for deadlock-free routing. VC2 is the adaptive channel that can be used by a packet to cross any physical link to the destination. The theory has been developed for wormhole, packet, and virtual cut-through implementations [–]. In general,
Interconnection Networks. Fig. Possible and restricted turns in a 2D mesh. (a) Possible turns in a 2D mesh; (b) Restricted turns for X–Y routing; (c) Minimum turn restrictions for deadlock freedom
by suitably specifying routing restrictions, adaptive routing protocols can be developed for regular and irregular networks.
Metrics
The construction of an interconnection network is a process of trade-offs between various network design parameters such as topology, channel widths, and switch degree, and the physical constraints of the target implementation environment such as on-chip vs. in-rack vs. inter-rack. The consequences of the choice of network parameters are captured in several common metrics.
A practical constraint encountered by interconnection networks is the wiring density. Since networks must be implemented in three dimensions, for an N node network the available wiring space can be seen as growing as O(N^(2/3)), which corresponds to the area of the surface of a 3D cube containing the N nodes. Network traffic can grow as O(N). As system sizes are scaled, wiring limits are eventually encountered. A common metric to capture wiring limits is the bisection width of a network, defined as the minimum number of wires that must be cut when the network is divided into two equal sets of nodes. The intuition is that in the worst case, every node on one side of the minimum cut will be communicating with every node on the other side of the cut, maximally stressing the communication bandwidth of the network. For regular topologies such as k-ary n-cubes, the minimum bisection width is orthogonal to the dimension with the largest radix. For irregular topologies, the min-cut defines the bisection width. Bisection bandwidth is usually computed as the bandwidth across these wires, implicitly considering only data signals. The bisection widths of several topologies, where each link between adjacent nodes is W bits wide, are listed in Table .

Interconnection Networks. Table Examples of bisection width and node size of some common networks (t × t switches are assumed in the MIN)
Network              Bisection width   Node size
k-ary n-cube         W k^(n-1)         Wn
Binary n-cube        NW                nW
n-dimensional mesh   W k^(n-1)         Wn
MIN                  NW                tW

A second physical constraint is the number of data signals on a router, referred to as the node degree, node size, or pin-out. The importance of node degree stems from the fact that it is a practical limit on routers. Further, the topology, bisection width, and node degree are closely related. Consider the implementation of a k-ary n-cube with the bisection width constrained to be N bits. The maximum channel width determined by this bisection width is W bits and is given by

    Bisection width = W k^(n-1) = N  ⇒  W = N / k^(n-1)

The channel width directly impacts message latency and hence performance. The specific relationships between channel widths, node degree, and bisection bandwidth vary significantly depending on whether the network is implemented on-chip, intra-rack, or inter-rack. These metrics also provide a uniform basis for comparison between various topologies.

Related Entries
Networks, Direct
Networks, Multistage
Cray XT and Seastar 3-D Torus Interconnect

Bibliographic Notes and Further Reading
Each of the major topics discussed in this section has been subjected to a rich and diverse development. Entries can be found in the encyclopedia for all levels of the interconnection network stack. At the lowest level, the entry on Flow Control elaborates the issues and solutions for managing buffer resources as packets progress through a network. One level up, switching techniques are closely coupled with flow control and interact with higher-level routing protocols that are covered in the entry on Routing (Including Deadlock Avoidance) to determine no-load latency properties of an interconnection network. However, workloads can place highly nonuniform demands on the network. The entry on Congestion Management introduces a variety of techniques for managing such nonuniform demands and enhancing performance. A very important special case of traffic patterns is collective communication, and various approaches toward support for such communication patterns can be found in the entry on Network Support for Collective Communication. High-performance network designs are especially challenging when the network must be resilient to
faults. Basic issues and approaches can be found in the entry on Fault Tolerant Networks. Finally, the injection and ejection interfaces to the network must be designed to deliver network performance to the end point hosts. Basic architectural and operational principles are covered in Network Interface. A combined coverage of fundamental architectural, theoretical, and system concepts and a distillation of key concepts can also be found in two texts [, ]. Advanced treatments can be found in papers in most major systems and computer architecture conferences with the texts contributing references to many seminal papers in the field.
Acknowledgments Feedback and comments from Mitchelle Rasquinha and Jeffrey Young are gratefully acknowledged.
Bibliography
1. Anjan KV, Pinkston TM () An efficient fully adaptive deadlock recovery scheme: DISHA. In: Proceedings of the international symposium on computer architecture. IEEE Computer Society, Silver Spring, MD, pp –
2. Akers SB, Krishnamurthy B () A group-theoretic model for symmetric interconnection networks. IEEE Trans Comput C-():–
3. Clos C () A study of non-blocking switching networks. Bell Syst Tech J ():–
4. Dally WJ () Performance analysis of k-ary n-cube interconnection networks. IEEE Trans Comput C-():–
5. Dally WJ, Towles B () Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco, CA
6. Dally WJ, Seitz CL () Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans Comput C-():–
7. Duato J () A new theory of deadlock-free adaptive routing in wormhole networks. IEEE Trans Parallel Distrib Syst ():–
8. Duato J () A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks. IEEE Trans Parallel Distrib Syst ():–
9. Duato J () A theory of deadlock-free adaptive multicast routing in wormhole networks. IEEE Trans Parallel Distrib Syst ():–
10. Duato J () A necessary and sufficient condition for deadlock-free routing in cut-through and store-and-forward networks. IEEE Trans Parallel Distrib Syst ():–
11. Duato J, Yalamanchili S, Ni L () Interconnection networks. Morgan Kaufmann, San Francisco, CA
12. Glass CJ, Ni LM () The turn model for adaptive routing. In: Proceedings of the international symposium on computer architecture, Queensland, Australia, pp –
13. Patterson D, Hennessy J () Computer architecture: a quantitative approach. Morgan Kaufmann, Palo Alto, CA
14. Leiserson CE () Fat-trees: universal networks for hardware-efficient supercomputing. IEEE Trans Comput C-():–
15. Preparata FP, Vuillemin J () The cube-connected cycles: a versatile network for parallel computation. Commun ACM ():–
16. Samatham MR, Pradhan DK () The de Bruijn multiprocessor network: a versatile parallel processing and sorting network for VLSI. IEEE Trans Comput C-():–
17. Schroeder MD, Birrell AD, Burrows M, Murray H, Needham RM, Rodeheffer TL () Autonet: a high-speed, self-configuring local area network using point-to-point links. IEEE J Select Areas Commun ():–
18. Song YH, Pinkston TM () A progressive approach to handling message-dependent deadlock in parallel computer systems. IEEE Trans Parallel Distrib Syst ():–
19. Wu CL, Feng TY () On a class of multistage interconnection networks. IEEE Trans Comput C-:–
20. Valiant LG () A scheme for fast parallel communication. SIAM J Comput :–
Internet Data Centers
Data Centers
Inter-Process Communication
Collective Communication
Collective Communication, Network Support for
I/O Xiaosong Ma Oak Ridge National Laboratory, Raleigh, NC, USA North Carolina State University, Raleigh, NC, USA
Synonyms High-performance I/O
Definition Parallel I/O refers to input/output operations where multiple processes working together in executing a parallel application concurrently access secondary storage devices to exchange data between the main memory and files. The discussion in this entry will
be focused on parallel I/O in the context of HPC (High Performance Computing), especially scientific computing on large-scale parallel machines.
subsystem design crucial to delivering a balanced overall performance.
Hardware Environments Introduction I/O is an integral part of computing, scientific and commercial both alike. For parallel scientific applications, the purpose of their execution is to generate data through simulation, or to analyze data collected from observation instruments or computer simulations (which usually will in turn produce results stored in data files). With today’s supercomputers, ,s of processes may work in parallel in executing a single parallel program and accessing the secondary storage system. Unlike other environments with high I/O concurrency, such as commercial settings where a large number of client requests are processed by web server clusters or data centers, the HPC parallel I/O workloads are generated from much more tightly coupled/synchronized computation and often demand high data bandwidth in addition to low access latency. Moreover, parallel I/O plays an important role in HPC fault tolerance – by providing efficient checkpointing and restart operations. Parallel I/O faces tremendous performance and scalability challenges in the current era of Peta-scale computing. First, the well-known performance gap between the computing components (CPU and memory) and the secondary storage devices (hard disks) keeps widening. Although disk speed has also been continuously growing in the past decades, hard disks appear slower and slower. Second, such performance gap is worsened by the limitation in parallelism. On state-of-the-art supercomputers, there are typically orders-of-magnitude more I/O client nodes compared to I/O server nodes, creating complex resource sharing and interference problems when data travel from individual compute nodes’ memory to shared interconnections, I/O servers, and storage devices. In particular, this challenge is intensified by the trend of having more CPU cores within each compute node, when the clock rate improvement predicted by the Moore’s law cannot sustain due to semiconductor and heat management limitations. Such increased computation parallelism aggravates the I/O intensity and contention levels, making careful I/O
Parallel I/O, in a more general sense, can be carried out in most distributed storage systems, such as in typical commercial/R&D settings, when storage and I/O hardware are shared by multiple users or client applications. Examples include the NAS (Network Attached Storage) and SAN (Storage-Area Networks) environments. In the context of HPC, where the term “parallel I/O” is mostly used, there are several common hardware and networking features in addition to distributed accesses to shared storage devices. HPC I/O environments, designed for efficient file accesses performed by closely synchronized parallel applications (especially parallel simulations), often possess highly homogeneous hardware at both the client and server sides, as well as strictly hierarchical and symmetrical network structure. Figure demonstrates a representative hardware and interconnection architecture designed for high-performance parallel I/O. Such an I/O environment can be viewed as a multilevel client-server architecture. On today’s supercomputers (see configurations of state-of-the-art machines on the Top list []), the client side consists of ,s or more compute nodes, where users run their parallel programs, typically in batch mode. These programs issue I/O requests through the client layer of the parallel file system, which also resides on the compute nodes. Each node is equipped with its own main memory, shared by multiple cores. Smaller clusters may have node-attached local secondary storage (hard disks), while such a configuration has become rare on supercomputers these days. The nodes are connected with each other through one or more interconnection networks (such as InfiniBand or Myrinet), to communicate with each other. Message passing has been the dominant inter-process communication scheme in the past decades and currently remains the main stream. The majority of parallel applications today adopts the MPI (Message Passing Interface) [, ] programming model. Also connected to the compute nodes are I/O nodes, which are I/O servers handling client I/O requests and run the server
I/O
I
Compute nodes
Storage Interconnection Network I/O nodes
I/O controllers
I
RAID disk arrays
I/O. Fig. Sample High Performance Computing (HPC) I/O hardware architecture
layer of the parallel file system. With modern supercomputers, it is not rare that there are redundant interconnection networks among nodes (typically with different types of topology), for better performance and fault tolerance. One of these networks can potentially be assigned as the main I/O interconnection. On supercomputers these days, the ratio of compute to I/O nodes often varies between dozens to hundreds. Unlike in the case of compute nodes, the existence and operation of I/O nodes are typically transparent to application programmers. However, the combination of parallel I/O libraries (such as MPI-IO) and parallel file systems, both to be discussed in the next section, allows user programs to perform parallel I/O and access shared files via the I/O nodes implicitly through MPI interfaces. While small or medium clusters often have disks directly attached to the I/O nodes, on large-scale supercomputers they are connected with another interconnection network, to form a SAN. A group of controllers can be placed between the I/O node layer and the disk layer, connecting hundreds or more I/O nodes to ,+ hard disks. For performance and fault tolerance, these disks are usually organized into RAID [] groups. Data striping is performed at the file system level, and further at the storage device level.
Parallel I/O Software Overview of Parallel Application Execution and I/O Scenario On today’s typical HPC platforms, parallel I/O is carried out in multiple stages of parallel jobs. Such jobs are executed mostly in batch mode, submitted through job scripts and managed by batch job schedulers/managers such as LSF, PBS, MOAB, and LoadLeveler. Jobs submitted wait in the job queue until dispatched by the scheduler, to execute on the requested number of processors/cores. Usually such a partition of computation hardware (compute nodes, including CPUs and memory, and in some cases local secondary storage resources such as hard disks and nonvolatile memory devices) is occupied by the job until it finishes or its execution time has exceeded the maximum specified in the job script. On the other hand, concurrent jobs on a parallel computer need to compete with each other in using shared resources, such as the interconnection networks and the share/parallel file system. With parallel simulations, the dominant type of applications running on s to ,s processor cores, the execution of a job typically goes through many computation timesteps in an iterative manner. Often there is certain correspondence between a
I
I/O
computation timestep with a physical timestep, therefore the simulation computes the development of the object or phenomenon being studied along the time dimension. With most simulation codes, the execution alternates between computation phases and I/O phases, by having one I/O timestep every n computation timesteps. Application developers or users can configure the frequency of such periodic I/O operations by adjusting n. There are two major types of data written during a parallel simulation run: checkpointing data that save the current status of the computation, so that the computation can be restarted after a crash, a failure, or at a given intermediate point of simulation, and result data that save an intermediate “snapshot” of the simulation, for future analysis and visualization. Concatenating multiple snapshots generated by an application run allows scientists to view a movie demonstrating the process being simulated. In contrast, often no more than two consecutive checkpoints are needed for fault-tolerance purposes. Obviously, in choosing a desired periodic I/O frequency, users have to consider the trade-off between I/O cost and resolution. More frequent I/O steps generate more detailed view of the simulation and allow for less redundant recomputation in case of hardware/software failures, but will cost more time spent on I/O during the job run and result in larger result datasets that need to be stored, migrated, and processed. In most cases, application users tune the I/O frequency so that the visible time spent on I/O operations within a run is under a certain budget, e.g., –% of the total parallel execution time (wall time). In addition to periodic checkpointing and result data output, applications may write out miscellaneous data to assist book-keeping, debugging, and performance diagnosis/analysis. Such data include application-level logs and library-level profiling data and traces. Meanwhile, for most simulation codes the demand for input is relatively small: simulation jobs typically only read input data (from files describing the initial simulation state or a certain checkpoint) during the start or restart process, to initialize data structures before the first timestep. Therefore, simulation codes are often considered write-intensive, with the exception of out-of-core applications, where the aggregate memory is not large enough to accommodate the entire data
footprint, requiring applications to repeatedly swap data between the main memory and the secondary storage. Non-simulation applications (such as analytics and parallel data base search codes), on the other hand, may have larger portions of input in their I/O workloads. Besides the I/O needs during the execution of a job, staging operations are needed to move input data to and move output data from a supercomputer center. The high-performance file systems at these centers are “scratch space” for jobs’ transient use instead of long-term storage. With parallel data transfer tools such as GridFTP, input staging and output offloading can also be carried out with multiple streams accessing files simultaneously.
HPC I/O Access Patterns HPC applications have traditionally been using several modes for performing I/O. One brute force approach is to have every process communicate their data to/from a root process (typically process ), which is in charge of all file I/O. This approach is simple and can easily coordinate data to/from multiple processes while accessing files with traditional, single-process I/O interfaces. The problem is also obvious: parallel I/O is not used and the I/O process is serialized through the root process, which quickly becomes the I/O bottleneck even in a midsize cluster. Another rather straight-forward alternative to the above approach is to have each process perform their file I/O independently, to/from separate files. This approach is also easy to implement, and in many cases results in good performance in terms of the aggregate I/O throughput. By accessing nonoverlapping sets of files, many parallel processes can go ahead with I/O without coordinating with each other to decide their data’s global offsets or synchronizing through file locks. Within each file, accesses can also be largely sequential. However, this approach brings the need of creating, managing, and using a large number of small files. Such a management overhead is prohibiting for long-running, large-scale applications, which easily generate millions of files if independent file I/O is used. Also, this approach makes it more challenging to restart an application with a different number of processes. Considering the limitations of the above two methods, many large-scale applications with significant
I/O
periodic I/O requirement choose to go with the “true parallel I/O” mode, where many processes collaboratively access one or more shared files during each I/O phase. This is often done via parallel I/O libraries (see section “Parallel I/O Libraries” below). This approach combines the benefits of the first two methods: a small number of “global” files plus concurrent file I/O accesses. Such shared-file accesses create more interesting I/O patterns, as many processes need to map their in-memory data partitions to on-disk file range partitions, to store the global data object in layouts that facilities future restart or analysis. Therefore, the access pattern heavily depends on the domain decomposition method adopted by the parallel computation model. For regular applications that work on multidimensional arrays, commonly seen patterns are similar to HPF (High Performance Fortran) [], such as BLOCK and CYCLIC. Fine-granule interleaving pattern maybe created by corresponding strided data distribution, or a mismatch between in-memory and on-disk data layouts. For example, a -D array distributed to p processes in column-major blocks may produce such an interleaving pattern when the shared array needs to be written out to disk in row-major order. Such finely interleaved accesses can be aggregated and reorganized by the parallel I/O libraries, though, to alleviate the I/O performance concern.
Parallel I/O Software Stack Between a parallel application and the storage media, parallel I/O requests usually have to go through a multilayer I/O software stack, across the user and the kernel space. Figure illustrates such a deep I/O stack and the software layers will be discussed in the following sections. Please note that performing parallel I/O does not require all layers to be included. For example, off-the-shelf clusters often do not have parallel file systems, and applications may choose not to use high-level scientific data libraries or parallel I/O libraries.
Parallel File Systems Parallel file systems aim to bring to parallel computer users a file system view that is similar to traditional serial file systems on a single computer: a unified, logical storage space. It hides from programmers
I
Application
Scientific I/O library
Parallel I/O library
Parallel file system
Hardware
I/O. Fig. Sample HPC I/O stack
and users the fact that physical storage spaces are attached to distributed I/O servers, which are in turn interconnected to the compute nodes where the application programs run. Currently, parallel file systems are installed on supercomputers and medium/large clusters, while small clusters often use distributed file systems such as NFS []. In addition to the routine features of distributed file systems, such as file and directory management, request routing, and disk space management, parallel file systems have features and strategies that are designed to fit parallel I/O demands. One important feature is file striping, which partitions file into chunks and distribute them across I/O servers and further across disks. With intelligent striping strategies, a parallel file system can achieve reasonable I/O load balance for both large and small file accesses, as well as system extensibility when new servers and disks are added. In addition, parallel file systems take into account the HPC I/O workload (multiple parallel jobs running concurrently) in making design choices regarding file block size, metadata server configuration, distributed lock management, caching/prefetching policies, fault-tolerance mechanisms, etc. As of now, the most commonly used parallel file systems include the proprietary GPFS [] (available on IBM systems), and two open-source systems: Lustre [], and PVFS [].
POSIX I/O File systems used in HPC environments usually support I/O accesses via the well-known POSIX (IEEE
I
I
I/O
Portable Operating System Interface for Computing Environments) interfaces. POSIX interfaces can be used at different levels in a parallel program’s execution: they can be directly called by the programmer to carry out I/O operations, and they are used eventually by higher level I/O middleware (such as parallel I/O libraries) to access disk files. With the former approach, these familiar interfaces (such as open, close, read, and write) allow programs extended from sequential codes to easily run on parallel file systems. By having multiple processes carry out I/O simultaneously, “parallel I/O” can be achieved simply through POSIX interfaces. However, they do not assist parallel program developers in achieving high-performance I/O, especially to perform coordinated accesses to shared files with high I/O concurrency []. Also, strict enforcement of POSIX consistency may prevent applications from applying performance optimizations.
Parallel I/O Libraries Parallel I/O libraries act as middleware between the underlying parallel file system (accessed via POSIX/POSIX-like I/O interfaces) and the parallel application. Most importantly, parallel I/O libraries provide applications with convenient interfaces to easily access shared files concurrently, as well as performance optimizations that often dramatically improve applications’ parallel I/O efficiency. For ease of use, parallel I/O libraries supply users with file I/O interfaces to simultaneously open/close shared files and share common file pointers, as well as to conduct both individual and collective I/O. In particular, collective I/O [, , , , ] allows processes to collaborate in reading/writing their portions of data from/to shared files, using data distribution descriptions that map naturally to the domain decomposition methods used widely in parallel simulation codes. For example, a parallel program that distribute a large -D double array across processes in × × grids can conveniently write out their sub-arrays into the global array, stored in row-major order in a shared file. With parallel I/O libraries, this can be done with a single collective I/O call (which also forces implicit synchronization in a way similar to a barrier call). The complicated data reorganization and correspondent inter-process communication are
therefore hidden from the programmer. There are also interfaces for programmers to define more sophisticated or irregular data distribution, such as with indirection arrays. Several collective I/O architectures have been proposed in the past, including DDIO (Disk Directed I/O) [] and SDIO (Server Directed I/O) []. The current prevalent architecture is PIO (Two Phase I/O) [], where each I/O operation is separated into communication phases and I/O phases, for performing inter-process data exchange/reorganization and file I/O, respectively. Mechanisms such as collective I/O are also important performance optimizations, for example, aggregating small, noncontiguous I/O requests into large, contiguous ones. The resulting long, sequential accesses are favored by hard disks, the dominant storage media used in secondary storage. Also, avoiding finely interleaved accesses alleviates the false sharing problem, as well as the consequent expensive locking/synchronization costs. In addition, parallel I/O libraries employ other performance optimizations, such as I/O aggregation (which reduces the number of processes that make I/O calls, which helps to reduce I/O contention in the face of large-scale runs where thousands or more processes are used to execute a parallel application), data sieving (which avoids small, non-contiguous I/O by in-memory data manipulations, either “picking out” useful data segments or masking holes in output through read-modify-write operations), and I/O hints (which pass information from the application to help the parallel I/O library in configuring its internal settings, such as available buffer size, aggregation parameters, and file consistency requirements). To this date, the most popular parallel I/O libraries follow the MPI-IO specification, as a part of the MPI- Standard []. MPI-IO specifies the semantics of MPI-IO interfaces, which carry a unified style with familiar MPI calls for inter-process communication. While MPI-IO has been implemented with several MPI distributions, the most widely used and studied implementation is ROMIO [], developed at the Argonne National Laboratory, as a part of MPICH, a popular open-source MPI library. ROMIO adopts the PIO design and adopts many collective I/O optimizations mentioned above.
I/O
Scientific Data Format Libraries Parallel I/O libraries such as ROMIO provide application programmers with the convenience and performance for coordinated, efficient multi-process I/O operations. However, many scientists running large-scale simulations and performing subsequent analysis/visualization of the computation results do not read/write data in plain binary formats. Instead, they often use high-level I/O libraries to create and access data in special scientific data formats. These file formats bring several important features. First, they produce self-explanatory and self-contained files that encompass both raw datasets (most commonly multidimensional arrays) and their accompanying metadata items. Such metadata describe attributes associated with the raw datasets, such as variable name, unit, type, array dimension and size, and other information needed to interpret and analyze the data. Second, they facilitate binary portability through automatic file conversion when the files are moved between architectures using different floating point presentation. This is important as scientific data files and programs need to be designed to survive generations of diverse HPC platforms. Finally, such data formats often support good file extensibility, allowing files to grow easily when datasets are expanded or more datasets are stored within a file. Different scientific communities have different preferences over format choices, but more and more scientists and their applications converge to two most popular formats these days, HDF [] and netCDF []. Both formats possess the aforementioned advantages, at the cost of performance: file access latency and bandwidth are often significantly inferior compared to those with binary files. In recent years, both formats have added support for parallel accesses, by developing parallel access interfaces on top of the MPI-IO parallel I/O library.
Parallel I/O Today Due to the unsolved and incoming challenges of delivering scalable, high-speed parallel I/O on large-scale supercomputers, there continue to be research and development efforts on parallel I/O system design and optimizations. In particular, the recent trend is going toward more integrated explorations and development, where application developers/users, I/O
I
middleware designers, supercomputer center managers and system administrators, file system designers, and finally supercomputer vendors gather together to participate in next-generation HPC I/O subsystem design. One example of such coordinated design is the recent ADIOS parallel I/O middleware [], which was developed jointly between Oak Ridge National Laboratory and Georgia Tech through tight interaction with state-of-the-art parallel scientific simulations. ADIOS is able to facilitate convenient result/checkpoint data output through intuitive group-based interfaces, with the help of an auxiliary XML configuration file. Also, ADIOS can perform aggressive asynchronous I/O as well as data staging operations []. This allows analytics tasks typically performed as post-processing steps to be carried out efficiently on the fly along the data output pathway, avoiding the need for multiple rounds of input and output. Another example is the close collaboration among designers across the many layers in the deep HPC I/O stack, such as the Lustre Center of Excellence that connects vendors and file system designers directly to supercomputing centers and application users. Recently researchers and developers are also working on streamlining parallel I/O operations during parallel job runs with other pre-job and post-job data activities, such as data staging/offloading to/from supercomputing centers, long-area data migration, and data analytics. The traditional “in-job” parallel I/O becomes one link in the end-to-end scientific data workflow, while other components of the workflow have also been increasingly exploiting parallelism of various forms. Ongoing work is examining mechanisms and tools to allow part of these data processing/analysis tasks into the “in-job” data pathway. For example, long-desired features such as in-situ visualization not only completes important data post-processing, but also enables valuable runtime job steering capabilities. Parallel I/O beyond the environment of supercomputing centers can potentially get more help from ongoing efforts such as the creation of pNFS (parallel NFS). There are also efforts underway to improve software reuse and data sharing, for example, through efforts to make popular scientific data formats such as netCDF and HDF more compatible with each other.
I
I
I/O
Future Directions Parallel I/O faces new challenges as well as new opportunities in the coming years, when supercomputing is moving from Peta-scale to Exa-scale and beyond. With next-generation supercomputers, the computation parallelism will continue to rise and intensify the pressure on the I/O subsystem. The unprecedented high degree of I/O parallelism will be reflected in the number of processes making I/O calls and sharing files, the number of metadata operations to be processed concurrently, the number of disks and disk controllers, and the number/volume of data files. On larger systems, parallel I/O challenges will be highlighted in several aspects. First, the increasing number of cores per node demands more scalable interconnection and storage device layers. Also, it is not clear whether the memory capacity and bandwidth would become a severe problem again, when shared among dozens or more cores within the same node. If true, this will make it harder to apply optimizations developed in the past decade to exploit unused memory to perform asynchronous I/O or intelligent client-side caching/prefetching. Second, with millions of disks to manage, software and hardware reliability will continue to be a problem in maintaining the overall system availability and usability. In particular, it is predicted that with extreme-scale systems, disk recovery through RAID coding will become a constant event and requires careful handling to avoid overall parallel I/O speed degradation. More elaborate metadata server and I/O node fall-over schemes may be needed too. Third, the large amount of data produced on extreme-scale systems need to be managed and processed. In particular, timely data movement and analysis tools will be crucial in the end-to-end scientific computing cycle. Finally, with the increasing system scale and complexity, it becomes ever more challenging to monitor and analyze parallel I/O performance. Effective benchmarking, tracing, and profiling mechanisms are needed for researchers to provision resources for large centers’ storage subsystem, and to understand, configure, and optimize its performance in everyday operations. Performance visualization and automatic analysis/debugging tools will also facilitate the identification and solving of I/O
problems, particularly in the direction of cross-layer trace correlation and causal study. One side problem of the increasing scale of supercomputers is that it becomes harder and harder for academic researchers to deploy and evaluate their parallel I/O design or improvements on production-scale platforms. In particular, server-side and kernel-level modifications often have to be constrained to experiments on small or medium research clusters, where the configuration and bottlenecks are drastically different from large-scale machines. The advances in cloud computing and virtual machine technologies, though, may enable these researchers to emulate supercomputer environments and perform large-scale experiments. Meanwhile, there are also emerging opportunities for new practice or even paradigm shifts in parallel I/O systems and software in the future years and decades. One ongoing exploration is examining the feasibility and methodology for incorporating new storage devices into the HPC I/O hierarchy, as destination storage devices or intermediate caching layers. Examples of such devices of interest include Solid State Drives (SSDs) and phase-changing memory. These nonvolatile memory devices provide a compromise between capacity and speed between the main memory and secondary storage. Also, free of moving parts, these devices offer more reliable and energy-efficient I/O solutions compared to hard disks. However, there are unique challenges associated with deploying these new devices as the current I/O stack is designed based on hard disks’ operation characteristics. Also, for HPC I/O, the write durability of devices such as SSDs may make their deployment costly, considering the write-intensive workloads on supercomputers. Scientific data centers, in this sense, may be better candidates for employing such read-friendly storage media for fast parallel I/O.
Related Entries MPI (Message Passing Interface) MPI-IO
Bibliographic Notes and Further Reading For interested readers, John May authored a comprehensive book on parallel I/O for HPC [], published in . It gives a detailed overview of
I/O
HPC I/O hardware and software layers, as well as detailed discussions on standard parallel I/O interfaces (MPI-IO) and high-level scientific data libraries (netCDF and HDF). A more recent overview of an HPC storage system instance is given by researchers at Oak Ridge National Laboratory [], which describe the hardware, network, and software settings of the Spider parallel file system that is deployed at their computing center. Spider connects the storage hardware on several computing platforms centered around Jaguar, the Cray system that ranks as the world’s largest supercomputer as of early . There have been several studies that investigate HPC I/O characteristics. In particular, a series of investigations on this topic were published in the s [, , , ]. A recent study on Peta-scale I/O workloads was conducted by Argonne National Laboratory researchers []. A large amount of work has been conducted in parallel I/O performance optimization, especially on improving the performance of collective I/O. These optimizations include automatic I/O library parameter tuning [], buffering/caching techniques [, ], file domain partitioning methods and handling of irregular/noncontiguous access patterns [, , ], exploiting overlap between computation and I/O [, ], and application-specific optimizations [, ]. The John May book mentioned above also discussed two specific uses of parallel I/O beside job input and result output: I/O in parallel out-of-core applications (to swap data between memory and disk due to limited main memory space), and checkpointing (to write out data periodically for fault tolerance and future restart). Several studies focused on parallel I/O for out-of-core codes (e.g., [, ]). Due to the increasing system size and per-node memory capacity, out-of-core applications are less common these days. For checkpointing, application-level parallel checkpointing has mostly shared characteristics with result output, though recent studies have explored new ways of performing checkpointing (such as in-memory checkpointing through file system interfaces [] and checkpointing-specific storage design []).
Bibliography . l-Kiswany S, Ripeanu M, Vazhkudai SS, Gharaibeh A () stdchk: a checkpoint storage system for desktop grid computing. In:
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
I
th IEEE International Conference on Distributed Computing Systems (ICDCS), Beijing, , pp – Bent J, Gibson GA, Grider G, McClelland B, Nowoczynski P, Nunez J, Polte M, Wingate Meghan () Plfs: a checkpoint filesystem for parallel applications. In: SC , Portland, Bordawekar R, Rosario J, Choudhary A () Design and evaluation of primitives for parallel I/O. In: Proceedings of Supercomputing ’, Portland, Broom B, Fowler R, Kennedy K () KelpIO: a telescope-ready domain-specific I/O library for irregular block-structured applications. In: Proceedings of the IEEE International Symposium on Cluster Computing and the Grid, Brisbane, Carns P, Ligon W III, Ross R, Thakur R () PVFS: a parallel file system for Linux clusters. In: Proceedings of the Fourth Annual Linux Showcase and Conference, Atlanta, Carns P, Latham R, Ross R, Iskra K, Lang S, Riley K () / characterization of petascale I/O workloads. In: Proceedings of Workshop on Interfaces and Architectures for Scientific Data storage, New Orleans, September Chen Y, Winslett M () Automated tuning of parallel I/O systems: an approach to portable I/O performance for scientific applications. IEEE Trans Softw Eng ():– Cluster File Systems, Inc. () Lustre: a scalable, high-performance file system. http://www.lustre.org/docs/whitepaper.pdf, Crandall P, Aydt R, Chien A, Reed D () Input/output characteristics of scalable parallel applications. In: Proceedings of Supercomputing ’, San Diego, Galbreath N, Gropp W, Levine D () Applications-driven parallel I/O. In: Proceedings of Supercomputing ’, Portland, High Peformance Fortran Forum () High performance fortran language specification. Center for Research on Parallel Computation, Rice University, Houston, Nov HDF – A New Generation of HDF. http://hdf.ncsa.uiuc.edu/ HDF/doc/ Kotz D () Disk-directed I/O for MIMD multiprocessors. In: Proceedings of the Symposium on Operating Systems Design and Implementation, New Orleans, November Kotz D, Nieuwejaar N () Dynamic file-access characteristics of a production parallel scientific workload. In: Proceedings of Supercomputing ’, Washington, Lang S, Latham R, Kimpe D, Ross R () Interfaces for coordinated access in the file system. In: Proceedings of the Workshop on Interfaces and Architectures for Scientific Data Storage, New Orleans, Liao WK, Ching A, Coloma K, Choudhary AN, Ward L () An implementation and evaluation of client-side file caching for mpi-io. In: th International Parallel and Distributed Processing Symposium (IPDPS), Long Beach, , pp – Lofstead JF, Klasky S, Schwan K, Podhorszki N, Jin C () Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS). In: Sixth International Workshop on Challenges of Large Applications in Distributed Environments (CLADE), Boston, , pp –
I
I
iPSC
. Ma X, Lee J, Winslett M () High-level buffering for hiding periodic output cost in scientific simulations. IEEE Trans Parallel Distrib Syst ():– . Ma X, Winslett M, Lee J, Yu S () Improving MPI-IO output performance with active buffering plus threads. In: Proceedings of the International Parallel and Distributed Processing Symposium, Nice, . May J () Parallel I/O for high performance computing. Morgan Kaufmann Publishers, San Francisco . Message Passing Interface Forum () MPI: message-passing interface standard, June . Message Passing Interface Forum () MPI-: extensions to the message-passing standard, July . More S, Choudhary A, Foster I, Xu MQ () MTIO: a multithreaded parallel I/O system. In: Proceedings of the Eleventh International Parallel Processing Symposium, Geneva, . NetCDF Documentation. http://www.unidata.ucar.edu/packages/ netcdf/docs.html . Nieplocha J, Foster I () Disk resident arrays: An arrayoriented I/O library for out-of-core computation. In: Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, Annapolis, Oct , pp – . Nieplocha J, Foster I, Kendall R () ChemIO: highperformance parallel I/O for computational chemistry applications. Int J Supercomput Appl High Perform Comput ():–, Fall . Nieuwejaar N, Kotz D () The galley parallel file system. Parallel Comput ():–, . No J, Park S, Carretero J, Choudhary A () Design and implementation of a parallel I/O runtime system for irregular applications. J Parallel Distrib Comput ():– . Nowicki B () NFS: network file system protocol specification. Network Working Group RFC, . Oral S, Wang F, Dillow D, Shipman G, Miller R () Efficient object storage journaling in a distributed parallel file system. In: Eighth USENIX Conference on File and Storage Technologies, San Jose, . Patterson D, Gibson G, Katz R () A case for redundant arrays of inexpensive disks (RAID). In: Proceedings of the ACM SIGMOD Conference, Chicago, . Ross R, Nurmi D, Cheng A, Zingale M () A case study in application I/O on Linux clusters. In: Proceedings of Supercomputing ’, Denver, . Schmuck F, Haskin R () GPFS: a shared-disk file system for large computing clusters. In: Proceedings of the First Conference on File and Storage Technologies, Monterey,
. Seamons KE, Chen Y, Jones P, Jozwiak J, Winslett M () Serverdirected collective I/O in Panda. In: Proceedings of Supercomputing ’, San Diego, . . Smirni E, Aydt R, Chien A, Reed D () I/O requirements of scientific applications: an evolutionary view. In: Proceedings of the th IEEE International Symposium on High Performance Distributed Computing, Syracuse, . Thakur R, Choudhary A () An extended two-phase method for accessing sections of out-of-core arrays. Sci Programming ():–, . Thakur R, Gropp W, Lusk E () Data sieving and collective I/O in ROMIO. In: Proceedings of the th Symposium on the Frontiers of Massively Parallel Computation, Annapolis, Feb . Thakur R, Gropp W, Lusk E () On implementing MPI-IO portably and with high performance. In: Proceedings of the Sixth Workshop on I/O in Parallel and Distributed Systems, Atlanta, May . Top supercomputer sites. http://www.top.org/ . Zheng F, Abbasi H, Docan C, Lofstead J, Liu Q, Klasky S, Parashar M, Podhorszki N, Schwan K, Wolf M () Predata – preparatory data analytics on peta-scale machines. In: th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Atlanta
iPSC The iPSC machines were hypercube-connected multicomputers designed and built by Intel Corporation in the s. Four models were developed: iPSC/, iPSC/, and iPSC/
Isoefficiency Metrics
J JANUS FPGA-Based Machine Raffaele Tripiccione Università di Ferrara and INFN Sezione di Ferrara, Ferrara, Italy
Definition JANUS is a massively parallel application-driven machine, developed in the years – to support Monte Carlo simulations of spin-glass systems, a compute-intensive application in statistical physics. JANUS – fully based on FPGA technology – is a large array of processing elements that work under the supervision of a traditional host-machine; they are firmwareconfigured on-the-fly to run an application code developed in a hardware description language. Each processing element, when configured for spin-glass simulations, implements a massively-parallel architecture, midway between a graphic processing unit and a many-core CPU, in which several hundreds small processing cores concurrently run a SIMD application. For the specific application for which it was developed, JANUS offered approximately a , × performance gain over systems available at the time it was brought on line.
Discussion Overview Application-driven computing systems have been used in many cases in computational physics in the last years. These systems have been extremely successful in diverse compute-intensive areas, such as condensed matter simulations [], Lattice QCD [–], and the simulation of gravitational systems []. The main reason for these successes has invariably been that the architectural requirements for the algorithms relevant for those problems were – at the time these machines were developed – at strong variance with those targeted by offthe-shelf CPUs or systems, so tailoring the architecture
to the algorithm yielded huge performance gains over commercial state-of-the art solutions. JANUS belongs to this class of machines: it was developed in the years –, in order to simulate the behavior of glassy materials in condensed matter physics with Monte Carlo techniques; it has delivered – for that specific problem – performances more than three orders of magnitudes higher than possible with traditional systems available at the time it was developed. JANUS was developed by a collaboration of Italian and Spanish Universities (Università di Roma “La Sapienza,” Università di Ferrara, Universidad de Zaragozza, Universitad Complutense de Madrid, Universidad de Extremadura, and Instituto de Biocomputacion y Fisica de Sistemas Complejos (BIFI) in Zaragozza). The team had a strong and complementary background in the areas of physics of complex systems and in dedicated computer systems for lattice QCD. A major challenge in condensed matter physics is the understanding of glassy behavior []. Glasses are materials of the greatest industrial relevance (aviation, pharmaceuticals, automotive, etc.) that do not reach thermal equilibrium in human lifetimes. This sluggish dynamics is a major problem for the experimental and theoretical investigation of glassy behavior, placing numerical simulations at the center of the stage. Glasses pose a formidable challenge to state-of-the-art computers: studying the surprisingly complex behavior governed by deceivingly simple dynamics equations still requires inordinately long execution times. Spin glasses (SG) are widely used theoretical models of disordered magnetic alloys widely regarded as prototypical glassy systems [, ]; the investigation of these systems today is largely based on Monte Carlo simulations (see e.g., [] for a comprehensive review). The associated computational challenge is huge; in many SG models, the dynamical variables – so-called spins – are discrete and sit at the nodes of discrete d-dimensional lattices. In order to make contact with experiments, it is necessary to follow the evolution of a
David Padua (ed.), Encyclopedia of Parallel Computing, DOI ./----, © Springer Science+Business Media, LLC
J
JANUS FPGA-Based Machine
large enough D lattice, say sites, for time periods of the order of s. Since one Monte Carlo step – the update of all the spins in the lattice – roughly corresponds to − s, some such steps are needed, that is, spin updates. Statistics on several (≃ ) copies of the system has to be collected, adding up to ≃ Monte Carlo spin updates. Performing this simulation program in a reasonable time frame (say, less than year) requires a computer system able to update on average one spin in approximately ps. At the time JANUS was proposed, clever use of traditional systems allowed to reach simulation performances several hundred times lower. In order to bridge this huge gap, a large amount of parallelism is available and easily identified in the associated algorithms (see later for some details); the JANUS architecture has been designed to carefully exploit at large fraction thereof, basically handling at the same time spin variables sitting at non-neighboring sites of the lattice. Each JANUS processor is able to process almost , spins in parallel; this immediately causes a major memory access bottleneck that has been circumvented by appropriate memory organization and access strategies and – at the technological level – by the use of embedded memory. Altogether, an intra-node degree of parallelism of order , × is compounded by a further factor (×) by trivially processing in parallel a corresponding number of copies of the system, so the required system-level performance has been reached. Implementing this large degree of parallelism has been technologically possible because the data-path associated to the handling of each spin has limited logical complexity, being based on logic operations and integer arithmetics (as opposed to floating-point arithmetics) and because control can be readily shared across collaborating data-paths. JANUS is fully based on Field Programmable Gate Array (FPGA) technology. FPGAs are inherently slow and trade flexibility with complexity (under general terms, any system of n gates can be emulated by an FPGA of complexity n log n, and log n is a “large” number when n ≃ − ). These disadvantages are offset by more dramatic speedup factors allowed by architectural flexibility. Also, development time of an FPGA-based system is short. Early attempts in this
direction had been made several years before by a subset of the JANUS team within the SUE project []. An ASIC approach, that would obviously further boost performance, was not considered by the JANUS developers as the longer development times and larger development costs made it a nonviable option.
Spin Glasses Basic ingredients of spin glasses (SG) are frustration and randomness (see Fig. ). The dynamical variables, the spins, represent atomic magnetic moments (the elementary magnets whose combined effects make up the magnetism of the sample as a whole). Spins sit at the nodes of a crystal lattice (Fig. – left), and take only two values, σi = ±, corresponding to two possible orientations for the atomic magnet. Depending on material details that wildly change from site to site, a pair of neighboring spins may lower their energy if they take the same value (and have therefore a tendency to do so). In this case, the coupling constant Jij , a number assigned to the lattice link that joins spins σi and σj , is positive. However, it could happen with roughly the same probability that the two neighboring spins prefer to anti-align (in this case, Jij < ). A lattice link is satisfied if the two associated neighboring spins are in the energetically favored configuration. In spin glasses, positive and negative coupling constants occur with the same frequency, as the spatial distribution of positive or negative Jij in the lattice links is random; this causes frustration. Frustration means that it is impossible to find an assignment for the spin values, σi , such that all links are satisfied (the concept is sketched in Fig. – right, and explained in more details in the caption). For any closed lattice circuit such that the product of its links is negative, it is impossible to find an assignment that satisfies every link. In spin glasses, the probability that a given circuit is frustrated is %. In a typical SG model, the Edwards–Anderson model [], the energy of a configuration (i.e., a particular assignment of all spin-values) is given by the Hamiltonian function H = − ∑ σi Jij σj ,
()
⟨ij⟩
where summation is taken on all pairs of nearest neighbor sites on the lattice. The coupling constants Jij are
JANUS FPGA-Based Machine
s5
J
s6 s1 = +1
J12 < 0
s2 = −1
s2
s1
J23 > 0
J14 > 0 J14 s7
s8 s4
J43
s3
J43 > 0 s4 = +1
s3 = ?
JANUS FPGA-Based Machine. Fig. Left: The spin-glass lattice is obtained by periodically repeating the unit cell. Dynamical variables, the spins, take values σi = ±. Each pair of neighboring spins have either a tendency to take the same value or the opposite one. This is controlled by a coupling constant (Jij ) attached to the lattice link joining the two spins. J < implies that the link is satisfied if σ ⋅σ < while J > demands that σ ⋅σ > . Right (frustration example): Consider the front plaquette of the cell at left, for the given values of the couplings, and try to find an assignment of the four spins such that all four links are satisfied. If one starts with σ = + and goes around the plaquette clockwise, J < requires that σ = − and then J > implies σ = −. On the other hand, going anticlockwise, σ = + (because J > ) and also σ = +, since J > ! Nothing changes if the initial guess is σ = −: also in this case conflicting assignments for σ are reached
chosen randomly to be ± with % probability, and are kept fixed. A given assignment of the {Jij } is called a sample. Some of the physical properties (such as internal energy density, magnetic susceptibility, etc.) do not depend on the particular choice for {Jij } in the limit of large lattices. However, for the relatively small simulated systems it is useful to average the results obtained on several samples, that is, on different assignments of the {Jij }. It turns out that frustration makes it hard to answer even the simplest questions about the model. In three dimensions, experiments and simulations provide evidence that a spin-glass ordered phase is reached below a critical temperature Tc . In the cold phase (T < Tc ) the spins freeze in some disordered pattern, presumably related to the configuration of minimal energy. For temperatures (not necessarily much) smaller than Tc spin dynamics becomes exceedingly slow. In a typical experiment one quickly cools a spin glass below Tc , then waits to observe the system evolution. As time goes on, the size of the domains where the spins coherently order in the (unknown to us) spin-glass pattern, grows slowly. Domain growth is sluggish: even after h of this process, for a typical spin-glass material at a
temperature T = .Tc , the domain size is only around lattice spacings. The smallness of the spin-glass ordered domains precludes the experimental study of equilibrium properties in spin glasses, as equilibration would require a domain size of the order of ∼ lattice spacings. The good news is that an opportunity window opens for numerical simulations. In fact, in order to understand experimental systems, it is sufficient to simulate lattices not much larger than the typical domain size. This is the physics motivation behind the development of JANUS.
Monte Carlo Simulations of Spin Glasses Spin glasses are studied numerically with Monte Carlo techniques that ensure that the system configurations C are sampled according to the Boltzmann probability distribution P(C) ∝ e−
H(C) T
,
()
describing the equilibrium distribution of configurations for a system at constant temperature T. One typical Monte Carlo algorithm is the Heat Bath algorithm, which assumes that at any time any spin has to be in thermal equilibrium with its surrounding
J
J
JANUS FPGA-Based Machine
environment, meaning that the probability for a spin to take value + or − is determined only by its nearest neighbors, follows the Boltzmann distribution; the energy of a spin at site k of a D lattice of linear size L is E(σk ) = −σk ∑ Jkm σm ,
.
()
⟨km⟩
where the sum runs over the six nearest neighbors of site k. One can derive that the probability follows the Boltzmann distribution: in particular, the probability for the spin to be + is P(σk = +) = =
.
.
−E(σ k =+)/T
e
e−E(σk =+)/T
+ e−E(σ k =−)/T
eϕ k /T , eϕ k /T + e−ϕ k /T
() ()
.
where ϕ k is usually referred to as the local field acting on site k. At the computational level, the corresponding algorithm takes the following steps:
.
ϕk = ∑ Jkm σm , ⟨km⟩
. Pick one site k at random. . Compute the local field ϕk (Eq. ). . Assign to σk the value + with probability P(σk = +) as in Eq. ; this can be done by generating a random number R, uniformly distributed in [, ], and setting σk = if R < P(σk = ), as given by Eq. , and σk = − otherwise. . Go back to step . A full Monte Carlo step consists in the iteration of the above scheme for L times. Many Monte Carlo steps evolve the system toward statistical equilibrium. Physical spin variables can be mapped into bits by the following transformations σk → Sk = ( + σk )/, Jkm → ˆJkm = ( + Jkm )/, ϕ k → Fk = ∑ ˆJkm ⊕ Sm = ( − ϕ k )/,
between lattice sites, as well as temporal correlations in the sequence of generated spin configurations. The local field ϕk can take only the seven even integer values in the range [−, ], so probabilities P(σk = +) = f (ϕ k ) may be stored in a small look-up table. The kernel of the program is the computation of the local field ϕk , involving just a few arithmetic operations on discrete variable data. Under physically reasonable assumptions, the simulation retains the desired properties even if Monte Carlo steps are implemented by visiting each lattice site exactly once, in any deterministic order, provided no two interacting spins are handled at the same time. Several sets of couplings {Jkm } (i.e., several different samples of the system) need to be generated; an independent simulation has to be performed for every sample, in order to generate properly averaged results. One usually studies the property of a spin-glass system by comparing two or more statistically independent simulations of the same sample, starting from uncorrelated initial spin configurations (the two copies of a sample are usually referred to as replicas).
The last three points identify the parallelism available in the computation, which can be trivially exposed for points and above. For point a more accurate analysis is required. In fact, one labels all sites of the lattice as black or white in a checkerboard scheme: all black sites have their neighbors in the white site set, and vice versa: one can then in principle perform the three steps of the algorithm on all white or black sites in parallel. These remarks identify the available parallelism in the algorithm. The next section describes the architecture that JANUS adopts in order to exploit as much as possible of this opportunity.
()
⟨km⟩
(⊕ is the exclusive-or operation) so several (not all) steps in the algorithm involve logic (as opposed to arithmetic) operations. The procedure is easily programmed for a computer; some comments are in order: . High-quality random numbers are necessary to avoid dangerous spurious spatial correlations
JANUS: The Architecture JANUS is a heterogeneous massively parallel system built on a set of FPGA-based reconfigurable computing cores connected to a host-system of standard processors. Each JANUS-core contains so-called simulation processors (SP) and input/output processor (IOP). The SPs are logically arranged at the vertexes of a D mesh and have direct low-latency high-bandwidth
JANUS FPGA-Based Machine
J
J JANUS FPGA-Based Machine. Fig. Left: The JANUS system installed at BIFI, Zaragozza (Spain) has JANUS-cores and JANUS-hosts. Right: Simple block diagram of a minimal JANUS system, with one JANUS-host and one JANUS-core
communication links with nearest neighbors in a toroidal topology. Each SP is also directly connected with the IOP, while the latter communicates with an element of the host-system, as shown in the right side of Fig. . SPs and IOPs are based on Virtex-LX FPGAs manufactured by Xilinx, that – at the time the system was developed – were the densest FPGA devices available from industry. The largest available JANUS machine (installed at BIFI – Zaragozza, since early ) has eight hosts (with a disk-system of TBytes) and JANUS-cores (Fig. , left side). More details on this machine are in [–]. The IOP handles the JANUS-host I/O interface, providing functionalities needed to configure the SPs for any specific computational task, to move data sets across the system and to control SP operations. The IOP processor exchanges data with the JANUS-host on two Gigabit-Ethernet channels, running a standard raw-Ethernet communication protocol. This means that JANUS can only be used in a very loosely-coupled mode with its host, as the JANUS-host bandwidth is rather low and the associated latency is large. The SP is the computational element of the JANUS system. It contains uncommitted logic that can be configured via the IOP as needed to run
computing-intensive kernels. The SPs of each JANUScore can be configured to run in parallel independent programs, or to run single program partitioned among the FPGAs, exchanging data as appropriate across the nearest-neighbor toroidal network. JANUS is programmed developing two different programs: a standard C program running on the host system that typically builds on a low-level communication library to exchange data to and from the SPs, and a code (written in a hardware description language, such as VHDL) defining the application running on the SPs. In this approach, the FPGAs must be carefully configured and optimized for application requirements. Tailoring any specific application to JANUS is a lengthy and complex procedure, hopefully rewarded by huge performance gains. Obviously, high level programming frameworks able to automatically split an application between standard and reconfigurable processors and to generate VHDL code for the configurable segment would be welcome for this machine, but the tools currently available typically do not deliver the level of optimization needed for the high performance application run by JANUS; all application codes executed by JANUS so far have been handcrafted in VHDL.
J
JANUS FPGA-Based Machine
It is critical for any successful JANUS implementation that as much as possible internal resources of the FPGA are exploited in a coordinated way (see [] for a detailed description). It turned out that the underlying architecture of the Virtex FPGAs is almost optimal for the mix of computational operations involved in SG simulations, while other applications that were tried on the machine, especially in the area of optimization, e.g. graph coloring, did not have a comparable impact. So, JANUS, that had been advertised in earlier publications as a potentially general–purpose reconfigurable architecture, has been extensively used only for SG simulations. Here just a few highlights of the architecture configured inside each SP node are given. The first issue that had to be addressed is memory organization, a potential performance bottleneck. LX FPGA devices come with many small embedded RAM blocks, which may be combined and stacked to naturally reproduce a D array of bits representing the D spin lattice. Memory banks embedded in the FPGA device are large enough to store the full simulation data structure, that – in turn – makes huge memory bandwidth possible. If one considers an SG simulation on a lattice of the size discussed earlier ( sites) one standard JANUS implementation configures embedded memory blocks with bit width to represent the linear size of the spin lattice: such memories with bit addressing are enough to store the whole lattice. Addressing all memories with the same given address z allows to fetch or write a × portion of an entire lattice plane corresponding to spins with Cartesian coordinates ( < x < , < y < , z). The next portion of the same plane is addressed by z + , referring to spins with coordinates ( < x < , < y < , z), and so on. It is useful for performance to simulate two real replicas, so a good strategy allocates one such structure for black spins of replica and white ones of replica . A further structure is allocated for white and black spins of replicas and respectively. A similar storage strategy applies to the read-only storage blocks associated to the Jij . After initialization, this firmware-enabled machinery fetches all neighbor spins (addresses z, z + , z − , z + , z − ) of an entire portion of plane of spins (identified by a single memory address z) and feeds them to the update logic. The latter returns the processed (updated) portion of spins to be uploaded to memory
at the same given address. Altogether, spins are updated simultaneously, so the update logic is made up by identical update cells. Each cell receives the nearest neighbor bits, coupling bits, and one -bit random number; it then computes the local field, which is an address to a probability look-up table; the random number is then compared to the extracted probability value and the updated spin is output. Look-up tables are small -bit wide memories instantiated as distributed RAM. There is one such table for each update cell. Random number generators – one to each update cell – are implemented as -bit Parisi–Rapuano generators [], requiring one add and one bit-wise xor-operation for each random value. The typical number of updates per clock cycle is a trade-off between the constraints of allowed RAMblocks configurations and available logic resources for the update machinery. This internal architecture is very similar in structure to modern graphics processing units (GPUs) or many-core CPUs: there is a large number of independent computational data paths that are fed by memory available on board. The specific application can be arranged as just one large SIMD thread, so control is shared by all data paths. This saves precious logic that is allocated to more data path instantiations; furthermore the logical complexity of each data path is significantly smaller than typical CPUs: these two points are the key to the huge performances of this system. Total resource occupation for the design is % of RAM-blocks (total available RAM is ≃ KBytes for each FPGA) and % of logic. The system runs at a conservative clock frequency of . MHz. At this frequency, power consumption for each SP is ≃ W. It is interesting to note that the FPGA would operate correctly at a much higher clock rate, and the current upper limit in frequency is set by the performance of the forced-air cooling systems, a reminder that power management is becoming more and more a key performance bottleneck in high performance computing.
JANUS Performance
JANUS performance can be assessed in at least two different ways:
● The number of effective operations per second
● The speed-up factor (mostly in wall clock time, but also on other relevant metrics) with respect to processor clusters of "arbitrary" size (i.e., assuming that, for the given simulation, the optimal number of processors is actually deployed)
Let us first consider the number of effective operations performed per second. At each clock cycle (clock frequency is . MHz), each SP processor updates spins. On a traditional architecture, this would imply at least the following instructions for each spin:
– Six loads
– sum ( bits)
– xor ( bits)
– sum ( bits)
– xor ( bits)
– load LUT
– comp ( bits)
– Six updates of address pointers ( mult, sum)
– One store
– Jump condition
If one wants to count only algorithm-relevant operations, all load/store and address instructions must not be counted; also, the xor and sum operations on short operands may be considered equivalent to one standard arithmetic operation. All in all, there are seven equivalent instructions for each spin update (shown in italic in the list above). This translates into a processing power of × × . × operations per second, which is . Giga-ops. The largest JANUS machine operates processors in parallel, so it performs at the level of . Tera-ops. It is also interesting to compare JANUS performance with that obtained on commercial CPUs, where spin-glass simulations had been performed earlier. The main reason why SG simulations are not very fast on traditional CPUs is that the natural physical variables for the problems are bits, while CPU architectures are based on long (e.g., – bit) words. A widely used trick to boost performance (usually called multispin coding) uses the bits of the CPU word to code the same spin of a matching number of systems, treated in parallel by the algorithm. Using this technique, carefully optimized codes running on high-end CPUs available in – were able to handle on average one spin
in slightly less than ns. By contrast, the JANUS implementation performs at . ns/spin. This means that Monte Carlo steps for a system of sites take approximately days of wall clock time, while a similar simulation on PCs would take of the order of years: in conclusion, JANUS has made it possible for the first time to stretch SG simulations to experimentally relevant time scales []. JANUS also has excellent performance from the point of view of power requirements: each node uses ≈ W of power, which translates into ≈ Giga-ops/W.
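The multispin-coding trick mentioned above can be sketched in C. In the fragment below, bit k of a 64-bit word holds the spin of one site in the k-th replica, so a single xor processes 64 systems at once; counting the unsatisfied bonds per bit position, which the flip decision needs, is done with bit-sliced arithmetic. The data layout and names are illustrative assumptions, not code from any of the systems discussed in the text.

    #include <stdint.h>

    typedef uint64_t msc_t;   /* bit k belongs to replica k */

    /* For one site, build the six "bond unsatisfied" masks and accumulate
     * them, bit-sliced, into a 3-bit counter (planes c0, c1, c2), i.e.,
     * 64 independent counters in parallel. */
    static void local_field(msc_t spin_i, const msc_t neigh[6], const msc_t J[6],
                            msc_t *c0, msc_t *c1, msc_t *c2)
    {
        *c0 = *c1 = *c2 = 0;
        for (int n = 0; n < 6; n++) {
            msc_t u = spin_i ^ neigh[n] ^ J[n]; /* one xor serves all 64 replicas */
            msc_t carry0 = *c0 & u;      *c0 ^= u;
            msc_t carry1 = *c1 & carry0; *c1 ^= carry0;
            *c2 ^= carry1;               /* counts 0..6 fit in 3 bits */
        }
        /* The Monte Carlo flip decision compares these bit-sliced counters
         * with temperature-dependent thresholds and per-bit random numbers;
         * that part needs a few more bit-sliced operations and is omitted. */
    }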
Future Directions
JANUS has marked a new standard in the simulation of complex discrete systems by allowing studies that would take years on traditional computers. The system has a very large sustained performance, an outstanding price–performance ratio, and an equally good power–performance ratio. Physics results obtained by analyzing the simulations delivered by JANUS have been published in several journals, and the system is still online, with no plans to phase it out in the near future. From the point of view of a computer system, JANUS has shown the potential performance impact of reconfigurable computing and of many-core architectures.
● From the former point of view, JANUS strongly leveraged the very regular structure of the target algorithm: a highly optimized code carefully matching the low-level underlying FPGA architecture was developed in less than three months with reasonable human effort; on the other hand, attempts with automatic VHDL generators (starting, e.g., from C programs) yielded no more than a few percent of the performance of the handcrafted code, even in JANUS's almost ideal conditions: clearly, reconfigurable computing will not have a significant impact until this gap is substantially narrowed.
● As a prototypical many-core system, JANUS can be seen as an extreme version of the architecture adopted in very recent years by General Purpose GPUs (GP-GPUs); in both cases a very large number of computational threads is handled by a large set of computing cores, clustered in SIMD-controlled partitions. JANUS pushes to the extreme this trend
by making the cores very simple and by arranging just one SIMD partition. Huge performance is expected from this architecture, but memory access rapidly becomes the critical bottleneck. JANUS has solved this problem using a large number of embedded memory banks and carefully arranging data structures on them. As a technological trend, GP-GPUs in the near future will probably still boost floating-point performance. JANUS is still on the leading edge for the specific mix of logic/integer-arithmetic operations associated with spin-glass systems.
Bibliographic Notes and Further Reading
A comprehensive review on the class of physics problems at the basis of the development of JANUS can be found in [], while [] describes the relevant algorithms. A discussion of application-driven systems developed over the years for theoretical physics applications is available in the papers [, , ]. A recent special issue of Computing in Science and Engineering has covered this area in detail [].
Bibliography . Condon JH, Ogielski AT () Rev Sci Instrum :–; Ogielski AT () Phys Rev B :– . Boyle PA et al () IBM J Res Dev (/):– . Belletti F et al () Comput Sci Eng ():– . Goldrian G et al () Comput Sci Eng ():– . Makino J et al () A . Tflops Simulation of Black Holes in a Galactic Center on GRAPE-. In: Proceedings of the ACM/IEEE Conference on Supercomputing, Dallas, Article no. . See for instance Angell CA () Science ():–; Debenedetti PG () Metastable liquids. Princeton University Press, Princeton; Debenedetti PG, Stillinger FH () Nature :– . Mydosh JA () Spin glasses: an experimental introduction. Taylor and Francis, London . Young AP (ed) () Spin glasses and random fields. World Scientific, Singapore . Amit DJ, Martin-Mayor V () Field theory, the renormalization group and critical phenomena, rd edn. World Scientific, Singapore . Pech J et al () Comput Phys Commun (–):–; Cruz A et al () Comput Phys Commun (–):– . Edwards SF, Anderson PW () J Phys F: Metal Phys :–; ():–
. Belletti F et al () IANUS: Scientific Computing on an FPGA-based Architecture. In: Proceedings of ParCo, Parallel Computing: Architectures, Algorithms and Applications, NIC Series Vol. , pp – . Belletti F et al () Comput Sci Eng ():– . Belletti F et al () Comput Sci Eng ():– . Belletti F et al () Comput Sci Eng ():– . Parisi V, cited in Parisi G, Rapuano F () Phys Lett B ():– . Belletti F et al () Phys Rev Lett : . Gottlieb S (guest editor) () Comput Sci Eng ()
Java Deterministic Parallel Java
JavaParty Michael Philippsen University of Erlangen-Nuremberg, Erlangen, Germany
Definition While Java offers both a means for programming threads on a shared-memory parallel machine and a means for programming remote method invocations in a wide-area distributed memory environment, it lacks support for NUMA-architectures or clusters, that is, architectures where locality matters because of nonuniform access times. Java also lacks support for a more data-parallel programming style. JavaParty fills this gap by providing language extensions that add both remote objects and collectively replicated objects. The former allow the programmer to express knowledge about the locality of both the application’s objects and threads in an abstract way, that is, without requiring code for low-level object placement, for explicit replication and (cache) coherency, or for low-level details of explicit message passing. The latter provide a means for working elegantly on all objects of an irregular data structure in parallel. The beauty of JavaParty is twofold. First, the flavor of all the added language features fits nicely into Java. Second, since a preprocessor transforms the language extensions back into regular Java code, JavaParty programs run anywhere, that is, for running JavaParty
programs it is sufficient to start a regular JVM (Java Virtual Machine) for example on each (multi-core) node of a cluster.
Discussion Remote Objects The Concept JavaParty class definitions may have a new label remote that is implemented either as a new modifier or as an annotation. An instance of a class with label remote is a remote object. It appears like regular Java objects, but they may live in remote address spaces, that is, by making a class remote the programmer explicitly expresses that it is acceptable if access to its instance variables or invocation of its methods may be slower than accesses to regular (local) Java objects. A remote thread is tied to an object in Java. The thread will be spawned on the remote node that hosts the thread object. JavaParty extends Java’s object model to a distributed object model. Conceptually, a remote object is implemented by a local proxy object (stub) that forwards all accesses to a remote JVM that hosts the remote object. The proxy objects are generated automatically by the preprocessor, see below; they are transparent to the JavaParty programmer. In JavaParty, garbage collection of remote objects works like that for regular (local) Java objects. Remote objects are also transparent with respect to synchronization, that is, a thread that holds a lock on a remote object may enter another synchronized method of that object even if the flow of execution has moved between different address spaces and JVMs. Static fields of a regular (local) Java class exist only once for all objects of that class. The same holds for static fields of a remote JavaParty class: all accesses to its static fields appear like in regular Java although the static fields can reside on a cluster node that may be different from the node that allocates the instance fields of a remote object. Conceptually, the static fields of a remote object are reached through another (automatically generated) proxy object that is also transparent for the programmer. Transformation From the two remote JavaParty classes shown in Fig. , the JavaParty preprocessor generates several pure Java
remote class A {
  static Object f;
  int b;
  static int bar() {...}
  void foo() {...}
}
remote class B extends A {...}
JavaParty. Fig. Two remote JavaParty classes before transformation
classes. The left column of Fig. depicts the classes that implement static/class fields and static/class methods, that is, you will find implementations for f and bar in A_static. All the instance fields and methods are implemented in class A_instance in the right column. On both sides, variables are replaced by setter and getter methods. At runtime, there might be several instances of class A_instance, one for each object of type A that the application creates. Since all objects might live on remote nodes, JavaParty also generates a stub and a skeleton (interface) for each of the two classes, so that it fits to Java’s RMI implementation of remote methods. To create an instance of A_instance on a remote node, we need some way of calling a constructor on that node. Since Java’s RMI does not allow this, JavaParty generates an extra method _new (for each of A’s constructors) that is called instead and that returns a reference to a remote object (or to its proxy to be more specific). Hence, at runtime there is one object of type A_static on each node, just to provide the constructors for creating all the necessary A_instance objects. But only one of these class objects actually implements the static fields and methods of A. In addition to the classes that implement the instance parts and those that implement the static parts, a remote JavaParty class is replaced by a socalled handle class. The handle classes for the above example are shown in the middle column of Fig. . These classes provide the original constructors (that will call the _new methods), they bundle the references to the two remote objects of types A_static and A_instance, they deal with exceptions that might be caused by remote method invocations, they hide migrating remote objects, they shortcut methods calls if a potentially remote object happens to reside on
JavaParty. Fig. Simplified result of JavaParty’s remote class transformation
the same node, they are used for instanceOf type comparisons, etc. For details and other aspects of the transformation, for example, exception handling, some additional methods, name mangling, the transformations within non-remote classes that access remote classes, etc., please check the bibliographic notes.
Copy Semantics for Local Object Arguments in Remote Method Invocations The main difference between regular/local Java objects and remote objects is the way arguments are passed.
For method invocations on local objects, Java copies references to local objects that are passed as arguments. In contrast, for a method invocation of a remote JavaParty object, the complete local object referred to and all the regular/local Java objects that can be reached transitively from that object are copied and shipped. This is of course necessary if the remote object resides on a different node of a cluster on which the addressed local objects are not present. For consistency, object cloning is performed even if a remote object happens to reside locally in the same JVM. There is no semantic difference for primitive type arguments. References to remote
objects are treated as such, that is, remote objects are not copied if references to them are passed along in method invocations.
Object Distribution The JavaParty runtime system uses heuristics to decide on which actual node of a cluster to allocate a newly created remote object and on which cluster node to execute a newly spawned remote thread. These heuristics may be guided by the results of a previous static analysis. The runtime system can automatically and transparently migrate remote objects to improve access times. If needed, the programmer can use library calls to select specific cluster nodes, to pin remote objects, or to tie a set of remote objects together so that the automatic (or manual) migration affects all of them at once. The flow of execution automatically follows the data which are being manipulated, that is, a thread executes an object’s method on the node that hosts the object. Explicit or manual migration of thread objects can be used to enhance locality, since thread objects that reside at the place of the accessed objects will be faster.
Collectively Replicated Objects JavaParty adds replicated as another new class modifier/annotation for replicated objects. Whereas a remote object conceptually resides on a single remote node, a copy of a replicated object resides on every remote node (unless the degree of replication is restricted). Whereas a remote object is accessed remotely through a proxy object, a replicated object can only be accessed locally since it is implemented by one regular (local) Java object on each node. There is no proxy object for remotely accessing a replicated object. When an object of a replicated class is created, as its initial distribution, a copy of it with all its instance variables is created on every node of the underlying parallel computing system. Threads access this copy, called replica, locally on their node as they would access regular (local) Java objects, that is, there is no runtime overhead in accessing a replica. As mentioned above, JavaParty transitively clones the whole graph of referred to objects when a local argument object is shipped in a remote method invocation. Similarly, all the objects transitively referred-to from a replicated object are cloned and also exist on every node. Consider a replicated object that holds references
to a graph of local objects. Then conceptually, a copy of this object graph exists on every node. The JavaParty runtime system can choose to use a partial replication and to cut out those parts of the object graph that are not used on a particular node. This can be done automatically based on some static analysis. Alternatively, the JavaParty programmer can again use library calls to indicate cutoff objects. As an example consider the situation depicted in Fig. . Here, three threads p – p work on a replicated object graph, one thread per cluster node. Although conceptually a copy of the whole object graph is stored on all three nodes, each of the worker threads updates only a few objects (the black, white, or gray circles, respectively) based on some information in neighboring objects. A full replication that stores the whole object graph per JVM is not required. A partial replication is sufficient and more efficient with respect to memory consumption. It suffices that each JVM only keeps copies of those objects that are shown in the shaded areas. For instance, on the first JVM, where thread p is executed, it is sufficient to keep the part of the wire frame shown in Fig. . As will be discussed below, JavaParty automatically keeps replicas in a consistent state, that is, a deep comparison of all the local objects that can be reached through the replicated object’s fields must show identical values in the fields of primitive types. References to local objects may, however, be set to null on some nodes due to the decision to only partially replicate the data, while on other nodes the graph traversal will reach regular objects. For the example object graph, the JVM of p could store only a part of the replicated
JavaParty. Fig. Replicated data structure
JavaParty. Fig. Replica of node p in case of partial replication
object graph. Figure illustrates the null references at the cutoff object z.
Memory Model
Similar to Java where a thread is only guaranteed to see the changes made by different threads if both threads have been subject to an explicit mutual synchronization, nodes can work on their replica in isolation, that is, without an explicit synchronization no other node is ever guaranteed to see a change. With replicated objects, a single type of synchronization that guarantees exclusive access to an object is no longer sufficient since the main reason for replication is to allow concurrent access to locally available replicas (with as little overhead as possible). Hence, JavaParty adds a concurrent read, exclusive write (CREW) locking for replicated objects:

shared synchronized(p) {    // reader
  ...                       // read access to replica
}
...
exclusive synchronized(p) { // writer
  ...                       // update replica and notify
  ...                       // about change
}

This language extension allows concurrent threads in a shared block as long as there is no thread in an exclusive block at the same time. Synchronized methods can also be flagged as shared or exclusive. As usual, at some point in time, after a modifying thread leaves the exclusively locked block and before another thread enters some synchronized block on the same object, all changes must be made available to the second thread.
At the end of the exclusive block, changes are broadcast to all nodes that hold copies of the replicated object, so that other threads see the updated values, that is, a consistent state, as soon as they enter another synchronized block. The JavaParty compiler generates code for the necessary consistency messages. This code extends Java's regular memory model to a distributed environment. In regular Java, changes made by one thread in a synchronized block are only guaranteed to be visible to another thread if the latter enters a synchronized block itself. Java's wait and notify are also transformed into communication operations so that they work on replicated objects as expected even though the signaling needs to span node boundaries.
Bulk Synchronous Collective Synchronization JavaParty’s replicated objects are called collectively replicated objects because there is a third flavor of synchronization, called collective synchronization. It is indicated by means of the modifier collective. In contrast to shared and exclusive synchronization, on every node that carries a replica exactly one thread must enter a collectively synchronized block of code. Together these threads form a collective that cooperatively work on the replicated state. In the above example, three threads will enter a collectively synchronized block of code to work on the replicated object graph, each on its portion of the replicated state, to collectively perform an updating step. After acquiring a collective lock, each thread is allowed to modify its portion of the replicated object’s state, while concurrently, other threads modify their portions. Within the synchronized block, that is, while working on its portion of the replicated data, a thread does not see the updates performed by other threads. Only at the synchronization barrier of lock release (but not earlier), all the modifications are merged and are then made available on all the nodes. Conceptually, the end of a collective block is a synchronization barrier combined with an all-to-all broadcast that brings replicated objects back into a consistent state. In the above example, both threads p and p can access the white object z – each thread on its own node. If either thread changes some fields of z within a collectively synchronized block, the other threads will not
see these changes immediately. Due to the bulk synchronous paradigm, the modifications are only shipped when all nodes leave the collective block. If both p and p have modified different instance fields of z concurrently, only the modified fields are exchanged. It is in general considered to be a program error if the modified portions of the replicated state overlap, that is, if both p and p alter the same instance field of z to different values. In this case it is unspecified which of the two values survives. JavaParty also provides a means of specifying merging routines that handle such concurrent updates of a single instance variable. A naive implementation of a collective synchronization could broadcast all-to-all, diff and merge the replicated data structure. However, an efficient implementation, only ships the modified instance fields, and only ships them to those nodes that actually store the objects in case of partial replication.
Related Entries BSP (Bulk Synchronous Parallelism) PGAS (Partitioned Global Address Space) Languages Software Distributed Shared Memory
Bibliographic Notes and Further Reading Many parallel object-oriented languages have been proposed in the past, see for example [, ]. JavaParty extends Java with a remote object model for distributed environments with nonuniform memory access and adds collective replication for data-parallel computation on irregular data structures in the style of bulk synchronous parallelism (BSP []). JavaParty runs on top of a set of regular Java virtual machines (JVMs). Alternatively, a JavaParty implementation could target a VM that spans a cluster and combines to local memories into a virtual huge address space, see for example []. The most cited paper on JavaParty [] covers the technical details of the transformation of remote classes into regular Java classes plus proxies and RMI invocations. There are several papers on implementation aspects that are essential to make JavaParty run fast. The most important ones are the following: The fast remote method invocation and object serialization, that is, the Karlsruhe RMI drop-in replacement for Java’s remote method invocation KaRMI, is presented in []. Locality optimization and automatic placement
of threads and objects to reduce communication costs is addressed in []. JavaParty’s distributed garbage collector is covered in []. How collective replication, especially the collective update operation at the end of collective synchronization blocks, can be implemented efficiently is covered in [].
Bibliography . Bal H, Haines M () Approaches for integrating task and data parallelism. IEEE Concurrency ():– (July-September) . Haumacher B, Philippsen M () Locality optimization in JavaParty by means of static type analysis. Concurrency: Practice Experience ():– (July) . Haumacher B, Philippsen M, Tichy WF () Irregular dataparallelism in a parallel object-oriented language by means of collective replication. Tech. Rep. CS-–, University of Erlangen-Nuremberg, Dept. of Computer Science (February) . Philippsen M () Cooperating distributed garbage collectors for cluster and beyond. Concurrency: Practice Experience ():– (May) . Philippsen M () A survey of concurrent object-oriented languages. Concurrency: Practice Experience ():– (August) . Philippsen M, Haumacher B, Nester C () More efficient serialization and RMI for Java. Concurrency: Practice Experience ():– (May) . Philippsen M, Zenger M () JavaParty - transparent remote objects in Java. Concurrency: Practice Experience ():– (November) . Valiant LG () A bridging model for parallel computation. Commun ACM ():– (August) . Veldema R, Philippsen M () Supporting huge address spaces in a virtual machine for Java on a cluster. In: Proc LCPC’, the th Intl. Workshop on Languages and Compilers for Parallel Computing (Urbana, IL, October –, ). Lecture Notes in Computer Science, vol , pp –. Springer, Berlin
Job Scheduling Rolf Riesen , Arthur B. Maccabe IBM Research, Dublin, Ireland Oak Ridge National Laboratory, Oak Ridge, TN, USA
Synonyms Node allocation; Processor allocation
Definition A parallel job scheduler allocates nodes for parallel jobs and coordinates the order in which jobs are run. With
enough resources available, a system can execute multiple parallel jobs simultaneously, while other jobs are enqueued and wait for nodes to become available. The job scheduler manages the queues of waiting jobs and oversees node allocation. The goals of a scheduler are to optimize throughput of a system (number of jobs completed per time unit), provide response time guarantees (finish a job by a deadline), and keep utilization of compute resources high.
Discussion Introduction Users of a parallel system submit their jobs by specifying which application they would like to run and how many nodes they need. It is then the task of the job scheduler to find and allocate the appropriate number of nodes. This is different from a sequential system where the Operating System (OS) is responsible for scheduling processes and assignment of them to processor cores. The difference is that sequential processes are pursuing independent goals and are not working together, in parallel, to complete a common task. There is some cost associated with moving a process to another processor core, but in general the OS can assign processes to the least busy core without much concern for other processes. Parallel jobs have different requirements, and they often run on multicomputers where each node, consisting of one or more processors and memory, runs its own copy of the OS. Parallel jobs cannot make progress until all processes of the job are assigned to CPUs. Only when these processes execute concurrently, are the resources used efficiently and the job finishes as quickly as possible. For example, if one process in a parallel job is waiting for data from another process, it cannot continue until the other process has run and computed the result that is needed by the first process. For that reason, it is important that a parallel scheduling system allocates and schedules all processors that are part of the same job, at the same time. Although a parallel job scheduler has to coordinate with the OSs running on each node, it is usually an independent software component that is not built into the OS kernel. A node in a parallel system consists of one or more processors, each with one or more CPU cores. These cores share a network interface to communicate with
processes running on cores of other nodes. Some job schedulers allocate individual cores, but it is more common to allocate all cores on a node as a unit. A job scheduling system, in coordination with the local OSs, ensures that a node and its processor cores are not overburdened, by limiting the number of processes assigned to each node at any given time. Therefore, a parallel job sometimes has to wait until enough nodes are available to run it. The job scheduler maintains queues to keep track of waiting jobs. When a job finishes, the nodes it was using are released and the scheduler selects the next job to run. Since each job needs a different number of nodes, assigning the available nodes to the waiting jobs is not trivial. It is sometimes necessary to leave nodes unallocated and combine them with nodes that become available later, so a larger job can run. Managing the running and waiting jobs is done with several goals in mind. The highest priority goal is often throughput. The number of jobs a system can process in a unit of time is its throughput. On some systems, at least for some of the jobs, response time is more important. A simulation that predicts tomorrow’s weather needs to finish in time for today’s news. If its execution is delayed for too long, it will not meet its deadline. Another aspect of scheduling is resource utilization. Nodes should not be idle while there are still jobs to complete. There are two main aspects to parallel job scheduling. First, the job scheduler needs to decide when a particular job needs to run. In addition to ensuring that enough nodes are available, possibly by reserving them, the scheduler may also take into consideration the priority of a job and how long it has been in the queue. The second part of scheduling is node allocation. When there are more nodes available than needed by the next job, the scheduler needs to decide which nodes should be allocated. In heterogeneous systems, node allocation may need to take into account available processor speeds and type, as well as memory sizes and other specific resources, and match them with the requirements of a job. Even in homogeneous systems, where all nodes have the same capabilities, the allocation strategy is important because it impacts performance of the system. Nodes that are located physically close to each other have lower communication latencies, and it is less likely that other messages and I/O traffic traversing the
system will delay messages between these nodes. The following sections discuss scheduling and allocation in more detail.
Scheduling When nodes become available, the scheduler needs to pick which job to run next. The easiest way to do that is to simply pick the next job in the queue. If that job needs more nodes than are currently available, the scheduler waits until more of the currently running jobs finish and enough nodes become available. This is called First-Come First-Serve (FCFS) scheduling. FCFS is inefficient because it leaves nodes idle, while smaller jobs further back in the queue could be run. A method called backfill solves that problem. When nodes are available, but the next job in the queue is too big to fit, the scheduler searches further back into the queue to find a job that can be run on the available nodes. This can lead to starvation of bigger jobs, if only a few nodes are available at a time. Smaller jobs in the queue must be prevented from always bypassing the larger jobs. Methods to prevent starvation usually assign a priority to each job, or use multiple queues with different priorities. While jobs are waiting to run, their priority is increased periodically or they are moved to higher priority queues. Another method to prevent starvation is reservation. When larger jobs are enqueued, they are allowed to reserve nodes. When these nodes become available, they remain idle or are used for short running jobs only. When all reserved nodes are ready, the larger job runs. Both backfill and reservation work better, if the job scheduler knows how long an application will use its nodes. That knowledge makes it possible to run some shorter and smaller jobs, while the system is waiting for the remaining nodes to become available to run the larger job. This makes scheduling decisions much more complex, but utilizes the nodes of a system better. Calculating an optimal solution for large numbers of nodes and jobs is extremely time consuming. Therefore, heuristics are used to compute good approximations in an acceptable amount of time. Scheduling systems that depend on knowing how long an application needs its nodes terminate applications that do not finish within their allocated time. The scheduler sends a signal to the application a few seconds before it removes the job from the system. This
gives applications a chance to save their current state and intermediate results to disk. They can be restarted later at which time they read their saved state and resume from where they were interrupted. Predicting how long an application will run is difficult. Most job schedulers rely on the user to supply that information. Even for experienced users, this is not always an easy estimate since the running time of most applications is highly input dependent. Users therefore have a tendency to overestimate the time their applications need on the system to ensure that they finish before the scheduler terminates them. It turns out that overestimating is not necessarily bad. Since nodes become available sooner than the scheduler anticipated, it can often run some short jobs until the next scheduled job needs these nodes. Researchers investigating this phenomena have reported that accurate estimates help users because their jobs are serviced more predictably and faster by the job scheduler, but overestimating does not hurt the system, since it can backfill short-duration availabilities. For some applications and workloads, it is possible to automatically calculate the expected running time. For other applications, it is possible to look at historical information and estimate the running time. Several sophisticated methods have been proposed, but most commercial systems rely on user estimates.
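The following C sketch shows the difference between strict FCFS and a simple backfill pass over the waiting queue; it deliberately ignores runtime estimates, priorities, and reservations, which, as explained above, real schedulers add to prevent starvation. The job structure and queue are hypothetical.

    #include <stddef.h>

    /* Hypothetical job descriptor. */
    struct job {
        int nodes_needed;
        int started;
    };

    /* One scheduling pass over a FIFO queue of waiting jobs.
     * free_nodes is the number of currently idle nodes. */
    int schedule_pass(struct job *queue, size_t n_jobs, int free_nodes)
    {
        for (size_t i = 0; i < n_jobs; i++) {
            struct job *j = &queue[i];
            if (j->started)
                continue;
            if (j->nodes_needed <= free_nodes) {
                /* The first waiting job is served first (FCFS); jobs behind
                 * it that also fit are backfilled into the remaining nodes. */
                j->started = 1;
                free_nodes -= j->nodes_needed;
            }
            /* Strict FCFS would stop at the first job that does not fit;
             * scanning on is what enables backfill and, without runtime
             * estimates or priority aging, what risks starving large jobs. */
        }
        return free_nodes;
    }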
Node Allocation Scheduling, discussed in the previous section, is used to decide which job to run when. Once a job has been selected to run, node allocation decides on which nodes this job should run. In a multicomputer, the allocation granularity is usually a node, not individual processor. If a user requests processors and there are four processors per node, then the allocator would assign nodes to that job. The network topology connecting the nodes arranges them in a geometric pattern such as -D or -D meshes, hypercubes, or trees. When allocating nodes, it is important to place processes that communicate frequently with each other on nearby nodes. If frequently communicating processes are placed far apart from each other in the network topology, then their communication costs increase. Furthermore, since the links in the network paths between the nodes are shared with
other communication flows, further slowdowns occur due to contention in the network. While placing the processes of a parallel job in a contiguous region of the network topology is desirable, it is not always possible. The coming and going of differently sized jobs causes fragmentation that leads to the situation where enough nodes are available to run a particular job, but they are scattered throughout the system. Using the nodes anyway leads to poor communication performance, while waiting for more nodes to become available leads to poor system utilization and throughput. Another complicating factor is that the allocator does not know which processes of an application will communicate with each other once the job starts executing. Each process of a message-passing application receives a rank number at the start of execution. The allocator assumes that processes with similar rank numbers will be the ones communicating with each other most frequently. This assumption is based on the observation that simulations of physical systems often map processes in a -D or -D layout. In many simulations, physical properties of one region are computed based on the conditions in the surrounding regions. That means information has to flow among processors assigned to neighboring regions. Unfortunately, when application programmers write their code, they cannot know how their processes will be assigned to nodes, especially if their programs will run on different network topologies and machine sizes. Therefore, programmers often make the assumption that nearby ranks are also near each other in the network. If a program logically arranges its processes in a × grid, the allocator only knows that this application requires processors to run. It is tempting to assume that allocating nodes in a square is best, but that strategy fails for nonsquare numbers of nodes, cannot always be done given the available nodes, and may be a mismatch with the logical ordering the application uses. In many systems, the scheduler and allocator are separate components. The scheduler receives information about the number of available nodes and uses that to dispatch jobs. The allocator must find nodes for that job, even if that leads to higher fragmentation and decreased application performance. It is beneficial to combine these two functions and allowing the allocator to delay a job until a more favorable configuration
of nodes becomes available. That requires a more complex scheduler, and ways to assure fairness and prevent starvation. In large-scale, high-performance systems, migration of running processes is considered to be too complex and to have too much overhead to be used to achieve better node allocations. Complicating allocation further is that in some systems the nodes are not homogeneous. Matching job requirements with nodes that have different memory sizes and CPUs further complicates node allocation. Often, nodes of the same type are grouped and served by separate queues in the scheduler.
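The sketch below illustrates one simple allocation policy along the lines discussed here: nodes are kept in a one-dimensional order chosen so that neighbors in the order are also close in the network (for example along a space-filling curve, as in the strategies cited in the bibliographic notes), the processor request is rounded up to whole nodes, and a contiguous run of free nodes is preferred over a scattered set. All data structures are hypothetical.

    /* free_map[i] != 0 means node i is idle; nodes are ordered so that
     * neighbors in the array are close in the network topology. */
    int allocate_nodes(const int *free_map, int n_nodes,
                       int procs_requested, int cores_per_node,
                       int *chosen /* out: node ids */)
    {
        int need = (procs_requested + cores_per_node - 1) / cores_per_node;

        /* First try: a contiguous run, which keeps communicating ranks close. */
        int run = 0;
        for (int i = 0; i < n_nodes; i++) {
            run = free_map[i] ? run + 1 : 0;
            if (run == need) {
                for (int k = 0; k < need; k++)
                    chosen[k] = i - need + 1 + k;
                return need;
            }
        }

        /* Fallback: scattered allocation (worse locality, better utilization). */
        int got = 0;
        for (int i = 0; i < n_nodes && got < need; i++)
            if (free_map[i])
                chosen[got++] = i;

        return (got == need) ? need : 0;   /* 0: not enough nodes, job waits */
    }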
Space and Time-Sharing
Node allocation discussed so far implicitly assumed space-sharing, where each processor is allocated to a single job. There might be multiple jobs currently running on a machine, but each job has its own set of processors allocated to it. On nonparallel systems, time-sharing, where multiple processes run for short slices of time on a single processor, is more common. Space-sharing improves the performance of parallel applications. When the individual processes of a parallel application communicate with each other, they often depend on a quick response time for data requests, so they can continue with their computations. If process A is waiting for information from process B, but B is not currently running, then A will be blocked and it may lose its time slice before the response from B arrives. When the data finally does arrive, A may not be running anymore and will not be able to process the data and respond right away. This in turn may delay B, while it is waiting for a response from A. This problem gets worse the more processes depend on each other's responses. The solution in high-performance systems is space-sharing. Once allocated to their nodes, the processes of a parallel application will not be preempted by other parallel jobs. This improves responsiveness among the processes of a parallel job and lets it finish more quickly. A method called gang-scheduling can help with synchronizing processes of the same job. The idea is to coordinate processors and make sure that context switches from one job to another occur at exactly the same time on all processors. That way processors can be time-shared among several jobs, but all the processes of one job run during the same time slices. This prevents
blocking when processes of the same job communicate with each other. Gang-scheduling is not very common because it requires hardware support or very coarse time slices. IBM’s Blue Gene machines have a built-in synchronization network that makes gang-scheduling feasible. Gang-scheduling can reduce OS noise: All OS housekeeping tasks are done at the same time on all processors and no application process gets delayed waiting for another process whose OS is busy.
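A gang-scheduled system can be pictured as all nodes agreeing, for every global time slice, on which job's processes run; the toy C function below captures just that agreement (the job table and the globally synchronized slice counter are assumptions for illustration).

    /* Gang scheduling sketch: every node evaluates this with the same
     * globally synchronized slice number, so all processes of one job
     * run during the same time slices on all processors. */
    int job_for_slice(const int *job_ids, int n_jobs, long global_slice)
    {
        return job_ids[global_slice % n_jobs];
    }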
Future Directions Desktop computers and even laptops are now parallel systems. With the number of cores increasing for the foreseeable future, the parallelism available in these systems will soon rival many clusters in use today. While scheduling in a shared-memory system is easier, for example, process migration is less difficult, the amount of parallelism and the need for data locality will pose challenges for which traditional OSs have not been designed, but which have been explored in the parallel job scheduling community for many years. One of the many new challenges in this area to be considered is the following. In the future, the cores inside a multicore CPU will be connected by a Network on Chip (NoC). Each core in use generates heat and raises the temperature in adjacent cores. It is therefore beneficial to distribute the cores that are currently active in a way such that the overall temperature of the chip is minimized. This adds another dimension to the already complex problem of allocation.
Related Entries Affinity Scheduling Checkpointing Scheduling Algorithms
Bibliographic Notes and Further Reading The International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP) has been held annually for more than years and is a rich source of information on the topic of parallel scheduling and processor allocation []. Over the years the field has evolved. In the introduction to two of the JSSPP workshops, Feitelson et al. document some of the
changes that have occurred [, ], while [] looks at some of the challenges ahead. Knowing the expected running time of an application is important for job scheduling. Whether user estimates are accurate has been explored in [], and using historical information to automate runtime prediction has been looked at in []. Some common batch scheduling systems in use are: Platform Computing's Load Sharing Facility (LSF) [], the Portable Batch System (PBS) [], and the Simple Linux Utility for Resource Management (SLURM) [, ]. Topology-aware processor allocation and the effect of contiguous or noncontiguous allocation have been most recently studied in [], while communication awareness is discussed in []. Innovative use of (one-dimensional) space-filling curves to allocate processors to improve communication efficiency is explored in [] and []. Making the allocator temperature-aware has been suggested in []. The tradeoffs of space- and time-sharing as well as combining several of these methods to improve utilization are discussed in []. An excellent description of the architectural and software components of a large-scale parallel system, and how they interact, is contained in []. That paper provides a sense of what is required to integrate node scheduling and allocation into a complex parallel system.
Bibliography . Bender MA, Bunde DP, Demaine ED, Fekete SP, Leung VJ, Meijer H, Phillips CA () Communication-aware processor allocation for supercomputers: finding point sets of small average distance. Algorithmica ():– . Feitelson D (Nov ) Workshops on job scheduling strategies for parallel processing. http://www.cs.huji.ac.il/~feit/parsched/ . Feitelson DG, Rudolph L, Schwiegelshohn U, Sevcik KC, Wong P () Theory and practice in parallel job scheduling. In: IPPS ’: Proceedings of the job scheduling strategies for parallel processing, Geneva. Springer, London, pp – . Feitelson DG, Rudolph L, Schwiegelshohn U () Parallel job scheduling – a status report. In: Feitelson DG, Rudolph L, Schwiegelshohn U (eds) Job scheduling strategies for parallel processing. Lecture Notes in Computer Science, vol . Springer, Berlin, pp – . Frachtenberg E, Schwiegelshohn U () New challenges of parallel job scheduling. In: Frachtenberg E, Schwiegelshohn U (eds) Job scheduling strategies for parallel processing. Lecture Notes in Computer Science, vol . Springer, Berlin, pp –
. Henderson RL () Job scheduling under the portable batch system. In: IPPS ': Proceedings of the workshop on job scheduling strategies for parallel processing, Santa Barbara. Springer, London, pp –
. Lawrence Livermore National Laboratory (Nov ) SLURM: a highly scalable resource manager. https://computing.llnl.gov/linux/slurm/
. Lee CB, Schwartzman Y, Hardy J, Snavely A () Are user runtime estimates inherently inaccurate? In: Feitelson DG, Rudolph L, Schwiegelshohn U (eds) Job scheduling strategies for parallel processing. Lecture Notes in Computer Science, vol . Springer, Berlin, pp –
. Leung VJ, Arkin EM, Bender MA, Bunde D, Johnston J, Lal A, Mitchell JSB, Phillips C, Seiden SS () Processor allocation on Cplant: achieving general processor locality using one-dimensional allocation strategies. In: CLUSTER ': proceedings of the IEEE international conference on cluster computing, Chicago. IEEE Computer Society, Washington, DC, pp
. Liao X, Jigang W, Srikanthan T () A temperature-aware virtual submesh allocation scheme for NoC-based manycore chips. In: SPAA ': proceedings of the twentieth annual symposium on parallelism in algorithms and architectures, Munich. ACM, New York, pp –
. Moreira JE, Salapura V, Almasi G, Archer C, Bellofatto R, Bergner P, Bickford R, Blumrich M, Brunheroto JR, Bright AA, Brutman M, Castaños JG, Chen D, Coteus P, Crumley P, Ellis S, Engelsiepen T, Gara A, Giampapa M, Gooding T, Hall S, Haring RA, Haskin R, Heidelberger P, Hoenicke D, Inglett T, Kopcsay GV, Lieber D, Limpert D, McCarthy P, Megerian M, Mundy M, Ohmacht M, Parker J, Rand RA, Reed D, Sahoo R, Sanomiya A, Shok R, Smith B, Stewart GG, Takken T, Vranas P, Wallenfelt B, Michael B, Ratterman J () The Blue Gene/L supercomputer: a hardware and software story. Int J Parallel Program :–
. Pascual JA, Navaridas J, Miguel-Alonso J () Effects of topology aware allocation policies on scheduling performance. In: Frachtenberg E, Schwiegelshohn U (eds) Job scheduling strategies for parallel processing. Lecture Notes in Computer Science, vol . Springer, Berlin, pp –
. Smith W, Foster I, Taylor V () Predicting application run times using historical information. In: Feitelson DG, Rudolph L (eds) Job scheduling strategies for parallel processing. Lecture Notes in Computer Science, vol . Springer, Berlin, pp –
. Yoo AB, Jette MA, Grondona M () SLURM: simple Linux utility for resource management. In: Feitelson DG, Rudolph L, Schwiegelshohn U (eds) Job scheduling strategies for parallel processing. Lecture Notes in Computer Science, vol . Springer, Berlin, pp –
. Zhang Y, Franke H, Moreira J, Sivasubramaniam A () An integrated approach to parallel scheduling using gang-scheduling, backfilling, and migration. IEEE T Parall Distr ():–
. Zhou S, Zheng X, Wang J, Delisle P () Utopia: a load sharing facility for large, heterogeneous distributed computer systems. Softw Pract Exp ():–
K

k-ary n-cube
Hypercubes and Meshes
Networks, Direct

k-ary n-fly
Networks, Multistage

k-ary n-tree
Networks, Multistage

Knowledge Discovery
Data Mining

KSR
Cache-Only Memory Architecture
L LANai Myrinet
Languages Programming Languages
LAPACK Jack Dongarra, Piotr Luszczek University of Tennessee, Knoxville, TN, USA
Definition LAPACK is a library of Fortran subroutines for solving most commonly occurring problems in dense matrix computations. It has been designed to be efficient on a wide range of modern high-performance computers. The name LAPACK is an acronym for Linear Algebra PACKage. LAPACK can solve systems of linear equations, linear least squares problems, eigenvalue problems, and singular value problems. LAPACK can also handle many associated computations such as matrix factorizations or estimating condition numbers.
Discussion LAPACK contains driver routines for solving standard types of problems, computational routines to perform a distinct computational task, and auxiliary routines to perform a certain subtask or common low-level computation. Each driver routine typically calls a sequence of computational routines. Taken as a whole, the computational routines can perform a wider range of tasks than are covered by the driver routines. Many of the auxiliary routines may be of use to numerical analysts or software
developers, so the Fortran source for these routines has been documented with the same level of detail used for the LAPACK routines and driver routines. LAPACK is designed to give high efficiency performance on vector processors, high-performance “super-scalar” workstations, and shared memory multiprocessors. It can also be used satisfactorily on all types of scalar machines (PCs, workstations, and mainframes). A distributed memory version of LAPACK, ScaLAPACK [], has been developed for other types of parallel architectures (e.g., massively parallel SIMD machines, or distributed memory machines). LAPACK has been designed to supersede LINPACK [] and EISPACK [, ] principally by restructuring the software to achieve much greater efficiency, where possible, on modern high-performance computers, and also by adding extra functionality, by using some new or improved algorithms, and by integrating the two sets of algorithms into a unified package. LAPACK routines are written so that as much as possible of the computation is performed by calls to the Basic Linear Algebra Subprograms (BLAS) [–]. Highly efficient machine-specific implementations of the BLAS are available for many modern high-performance computers. The BLAS enable LAPACK routines to achieve high performance with portable code. The complete LAPACK package or individual routines from LAPACK are freely available on Netlib [] and can be obtained via the World Wide Web or anonymous ftp. The LAPACK homepage can be accessed via the following URL address: http://www.netlib.org/ lapack/. LAPACK is a Fortran interface to the Fortran LAPACK library. LAPACK++ is an object-oriented C++ extension to the LAPACK library. The CLAPACK library was built using a Fortran to C conversion utility called fc []. The JLAPACK project provides the LAPACK and BLAS numerical subroutines translated
Large-Scale Analytics
from their Fortran source into class files, executable by the Java Virtual Machine (JVM) and suitable for use by Java programmers. The ScaLAPACK (or Scalable LAPACK) [] library includes a subset of LAPACK routines redesigned for distributed memory messagepassing MIMD computers and networks of workstations supporting PVM and/or MPI. Two types of driver routines are provided for solving systems of linear equations: simple and expert. Different driver routines are provided to take advantage of special properties or storage schemes of the system matrix. Supported matrix types and storage schemes include: general, general band, general tridiagonal, symmetric/ Hermitian, positive definite, symmetric/Hermitian, positive definite (packed storage), positive definite band, positive definite tridiagonal, indefinite, complex symmetric, and indefinite (packed storage).
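For readers who want to see a driver routine in action, here is a minimal example using the C interface LAPACKE (assuming LAPACKE and a BLAS library are installed); it calls the simple driver dgesv to solve a small general linear system. The matrix values are arbitrary illustration data.

    #include <stdio.h>
    #include <lapacke.h>

    int main(void)
    {
        /* Solve A*x = b for a 3x3 general matrix (row-major storage). */
        double A[9] = { 2.0, 1.0, 1.0,
                        1.0, 3.0, 2.0,
                        1.0, 0.0, 0.0 };
        double b[3] = { 4.0, 5.0, 6.0 };
        lapack_int ipiv[3];

        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
        if (info != 0) {   /* info > 0: singular factor; info < 0: bad argument */
            fprintf(stderr, "dgesv failed: %d\n", (int)info);
            return 1;
        }
        printf("x = %f %f %f\n", b[0], b[1], b[2]);
        return 0;
    }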
Related Entries
LINPACK Benchmark
ScaLAPACK

Bibliography
. Blackford LS, Choi J, Cleary A, D'Azevedo E, Demmel J, Dhillon I, Dongarra J, Hammarling S, Henry G, Petitet A, Stanley K, Walker D, Whaley RC () ScaLAPACK users' guide. SIAM
. Dongarra JJ, Bunch JR, Moler CB, Stewart GW () LINPACK users' guide. Society for Industrial and Applied Mathematics, Philadelphia
. Smith BT, Boyle JM, Dongarra JJ, Garbow BS, Ikebe Y, Klema VC, Moler CB () Matrix eigensystem routines – EISPACK guide, vol , Lecture notes in computer science. Springer, Berlin
. Garbow BS, Boyle JM, Dongarra JJ, Moler CB () Matrix eigensystem routines – EISPACK guide extension, vol , Lecture notes in computer science. Springer, Berlin
. Lawson CL, Hanson RJ, Kincaid D, Krogh FT () Basic Linear Algebra Subprograms for Fortran usage. ACM Trans Math Soft :–
. Dongarra JJ, Du Croz J, Hammarling S, Hanson RJ () An extended set of FORTRAN Basic Linear Algebra Subroutines. ACM Trans Math Soft ():–
. Dongarra JJ, Du Croz J, Duff IS, Hammarling S () A set of Level Basic Linear Algebra Subprograms. ACM Trans Math Soft ():–
. Dongarra JJ, Grosse E () Distribution of mathematical software via electronic mail. Commun ACM ():–
. Feldman SI, Gay DM, Maimone MW, Schryer NL () A Fortran-to-C converter. AT&T Bell Laboratories, Murray Hill. Computing Science Technical Report

Large-Scale Analytics
Massive-Scale Analytics

Latency Hiding
Latency hiding improves machine utilization by enabling the execution of useful operations by a device while it is waiting for a communication operation or memory access to complete. Prefetching, context switching, and instruction-level parallelism are mechanisms for latency hiding.

Related Entries
Denelcor HEP
Multi-Threaded Processors
Superscalar Processors
Law of Diminishing Returns Amdahl’s Law
Laws Amdahl’s Law Gustafson’s Law Little’s Law Moore’s Law
Layout, Array Paul Feautrier Ecole Normale Supérieure de Lyon, Lyon, France
Definitions
A high-performance architecture needs a fast processor, but a fast processor is useless if the memory subsystem does not provide data at the rate of several words per clock cycle. Run-of-the-mill memory chips in today's technology have a latency of the order of ten to a hundred processor cycles, far too long for the required performance. The usual method for increasing the memory
bandwidth as seen by the processor is to implement a cache, i.e., a small but fast memory which is geared to hold frequently used data. Caches work best when used by programs with almost random but nonuniform addressing patterns. However, high-performance applications, like linear algebra or signal processing, have a tendency to use very regular addressing patterns, which degrade cache performance. In linear algebra codes, and also in stream processing, one finds long sequences of accesses to regularly increasing addresses. In image processing, a template moves regularly across a pixel array. To take advantage of these regularities, in fine-grain parallel architectures, like SIMD or vector processors, the memory is divided into several independent banks. Addresses are evenly scattered among the banks. If B is the number of banks, then word x is located in bank x mod B at displacement x ÷ B. The low-order bits of x (the byte displacement) do not participate in the computation. In this way, words in consecutive addresses are located in distinct banks, and can, with the proper interface hardware, be accessed in parallel. A problem arises when the requested words are not at consecutive addresses. Array layouts were invented to allow efficient access to various templates.
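In code, the interleaved mapping just described is one division and one remainder per word address; the following C fragment is a direct transcription (the byte offset inside a word is assumed to have been stripped already).

    /* Interleaved memory: word address -> (bank, displacement within bank).
     * B is the number of banks. */
    typedef struct { unsigned bank, disp; } bank_addr;

    static bank_addr map_word(unsigned x, unsigned B)
    {
        bank_addr a = { x % B, x / B };
        return a;
    }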
Discussion Parallel Memory Access There are many ways of taking advantage of such a memory architecture. One possibility is for the processors to access in one memory cycle the B words which are necessary for a piece of computation, and then to cooperate in computing the result. Consider, for instance, the problem of summing the elements of a vector. One reads B words, adds them to an accumulator, and proceeds to the next B words. Another organization is possible if the program operations can be executed in parallel, and if the data for each operation are regularly located in memory. Consider, for instance, the problem of computing the weighted sum of three rows in a TV image (downsampling). One possibility is to read three words in the same column and do the computation. But since it is likely that B is much larger than three, the memory subsystem will be under-utilized. The other possibility is to read B words in the first row, B words in the second row, and so on, do the summation, and store B words of the
L
result. This method is more efficient, but places more constraints on the target application.
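As a concrete illustration of the second organization, the C sketch below processes the three rows of the downsampling example B words at a time, so that each memory cycle delivers B consecutive, conflict-free words from one row; the weights and array names are illustrative assumptions.

    /* Weighted sum of three image rows, processed B words at a time so that
     * each memory cycle can deliver B consecutive words from one row. */
    void downsample_rows(const float *r0, const float *r1, const float *r2,
                         float *out, int width, int B,
                         float w0, float w1, float w2)
    {
        for (int j = 0; j < width; j += B) {
            int n = (width - j < B) ? width - j : B;
            for (int k = 0; k < n; k++)   /* these n <= B operations can run in parallel */
                out[j + k] = w0 * r0[j + k] + w1 * r1[j + k] + w2 * r2[j + k];
        }
    }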
Templates and Template Size
In both situations, the memory is addressed through templates, i.e., finite sets of cells which can move across an array. Consider for instance a vector: float A[100].
The template (, , , ) represents four consecutive words. If located (or anchored) at position i, it covers elements A[i], A[i+1], A[i+2] and A[i+3] of vector A. But the most interesting case is that of a two-dimensional array, which may be a matrix or an image. The template ((, ), (, ), (, ), (, )) represents a two-by-two square block. If anchored at position (i, j) in a matrix M, it covers elements M[i][j], M[i][j+1], M[i+1][j], and M[i+1][j+1]. The first template is one dimensional, and the second one is two dimensional. One may consider templates of arbitrary dimensions, but these two are the most important cases. The basic problem is to distribute arrays among the available memory banks in such a way that access to all elements of the selected template(s) takes only one memory cycle. It follows that the size of a template is at most equal to the number of banks, B. As has been seen before, memory usage is optimized when templates have exactly B cells; this is the only choice that is discussed here. Observe also that, as the previous example has shown, there may be many template selections for a given program. The problem of selecting the best one will not be discussed here.
Layouts

A layout is an assignment of bank numbers to array cells. A template can be viewed as a stencil that moves across an array and shows bank numbers through its openings. A layout is valid for a template position if all shown bank numbers are distinct. In the first figure below, the layout is valid for all positions of the solid template; in the second, the layout is valid for all positions of the solid template and valid for no position of the dashed template. The reader may have noticed that the information given by such array layouts is not complete: one must also know at which address each cell is located in its bank. This information can always be retrieved from the layout. Simply order the array cells and allocate each cell to the first free word in its bank. For the layout of the first figure, the bank address is found equal to the row number. This scheme imposes two constraints on a layout:

● Array cells must be evenly distributed among banks.
● The correspondence between cells and bank addresses cannot be too complex, since each processor must know and use it.

Layout, Array. Fig. A vector layout

Layout, Array. Fig. A matrix layout

Depending on the application, one may want a layout to be valid at some positions in the array, or at every possible position. For instance, in linear algebra, it is enough to move the template in such a way that each entry is covered once and only once. One says that the template tesselates or tiles the array domain. Once a tiling has been found, one simply numbers each square of the template from 0 to B − 1, and reproduces this assignment (perhaps after a permutation) at all positions of the tile. This ensures that the array cells are evenly distributed among the banks. In other applications, like image processing, the template must be moved to every position in the array. Imagine for instance that a smoothing operator must be applied at all pixels of an image. In fact, the two situations are equivalent: if a template tiles an array and if bank numbers are assigned as explained above, then the layout is valid for every position. For a proof, see [].

Skewing Schemes: Vector Processing

Consider the following example:

for (i = 0; i < n; i = i + 1) X[i] = Y[i] + Z[i],
to be run on a SIMD architecture with B processing elements. If the layout of Y is such that Y[i] is stored in bank i mod B at address i ÷ B, B elements of Y can be fetched in one memory cycle. After another fetch of B words of Z, the SIMD processor can execute B additions in parallel. Another memory cycle is needed to store the B results. Note that this scheme has the added advantage that all groups of B words are at the same address in each bank, thus fitting nicely in the SIMD paradigm. Suppose now that the above loop is altered to read:

for (i = 0; i < n; i = i + 1) X[d*i] = Y[d*i] + Z[d*i],
where d is a constant (the step). The bank number d.i mod B is equal to d.i − k.B for some k, and hence is a multiple of the greatest common divisor (gcd) g of d and B. It follows that only B/g words can be fetched in one cycle, and the performance is divided by g. Since B is usually a power of 2 in order to simplify the computation of x mod B and x ÷ B, a loss of performance will be incurred whenever d is even. One can always invent layouts that support non-unit step accesses. For instance, if d = 2, one may assign cell i to bank (i ÷ 2) mod B. However, in many algorithms that use non-unit steps, like the Fast Fourier Transform, d is a variable. It is not possible to accommodate all its possible values without costly remapping.
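A small sketch can make the gcd effect visible: it counts how many distinct banks a stride-d sweep touches under the mapping word d·i → bank (d·i) mod B. The bank count and array bounds below are assumed example values.

/* Sketch (assumed parameters): count the distinct banks touched by a
   stride-d access pattern.  The count equals B / gcd(d, B), which is why
   even strides hurt when B is a power of two. */
#include <stdio.h>

static unsigned gcd(unsigned a, unsigned b) {
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

static unsigned banks_touched(unsigned d, unsigned nbanks) {
    unsigned used[64] = {0}, count = 0;      /* assumes nbanks <= 64 */
    for (unsigned i = 0; i < nbanks; i++) {
        unsigned bk = (d * i) % nbanks;
        if (!used[bk]) { used[bk] = 1; count++; }
    }
    return count;
}

int main(void) {
    unsigned nbanks = 8;                     /* assumed bank count */
    for (unsigned d = 1; d <= 8; d++)
        printf("stride %u: %u banks (B/gcd = %u)\n",
               d, banks_touched(d, nbanks), nbanks / gcd(d, nbanks));
    return 0;
}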
Skewing Schemes: Matrix Processing

In C, the canonical method for storing a matrix is to concatenate its rows in the order of their subscripts. If the matrix is of size M × N, the displacement of element (i, j) is N.i + j. In Fortran, the convention is reversed, but the conclusions are the same, mutatis mutandis. One may assume that the row length N is a multiple of B, by padding if necessary; hence, all matrices may be assumed to have size M × B. As a consequence, element (i, j) is allocated to bank j mod B.
Layout, Array. Fig. The effect of skewing: (a) B = 4, S = 0: four cycles are needed to access column 0; (b) B = 4, S = 1: column 0 can be accessed in one cycle
If the algorithm accesses a matrix by rows, as in a loop for(j=0; j< n; j=j+1) that sweeps the consecutive elements M[i][0], M[i][1], . . . of row i, performance will be maximal. However, when accessing columns, consecutive items will be separated by a full row of B words and will all be located in the same bank. All parallelism will be lost. A better solution is skewing (Fig. ). The cell (i, j) is now allocated to bank (S.i + j) mod B, where S is the skewing factor. Here, two consecutive cells in a row are still at a distance of 1; hence, a row can still be accessed in one cycle. But two consecutive cells in a column, (i, j) and (i + 1, j), are at a distance of S. If S is selected to be relatively prime to B (for instance, any odd S if B is even; Fig. (b) uses S = 1), access to a column will also take one cycle. This raises the question of whether other patterns can be accessed in parallel. Consider the problem of accessing the main diagonal of a matrix. In the above skewing scheme, the distance from cell (i, i) to (i+1, i+1) is S+1, and S and S+1 cannot both be odd. Hence, accessing both columns and diagonals (and a fortiori rows, columns, and diagonals together) in parallel with the same skewing scheme is impossible when B is even.

Pipelined Processors and GPU

Consider a loop for(i=0; i< n; i=i+1) that adds the successive elements A[i] of a vector to an accumulator, to be run on a pipelined processor which executes one addition per cycle. If the memory latency is equal to B processor cycles, the memory banks can be activated in succession, in such a way that, after a prelude of B cycles, the processor receives an element of A at each cycle. As for parallel processors, if the loop has a step of d, the performance will be reduced by a factor of gcd(d, B). While today vector processors are confined to niche applications, the same problem is reemerging for Graphical Processing Units (GPU), which can access B words in parallel under the name “Global Memory Access Coalescing.”

Skewing Schemes: More Templates

It is clear that a more general theory is needed when the subject algorithm needs to use several templates at the same time. As seen above, a template has a valid layout if and only if it tiles the two-dimensional plane. This result seems natural if the template is required to cover once and only once each cell of the underlying array. If bank numbers are assigned in the same order in each tile [] or are regularly permuted from tile to tile [], then the layout will obviously be valid for each tile. The striking fact is that the layout will also be valid when the template is offset from a tile position. Observe that the previous results on matrix layouts correspond to the many ways of tiling the plane with B-sized linear templates. A tiling is defined by a set of vectors v1, . . . , vn (translations). If a template instance has been anchored at position (i, j), then there are other instances at positions (i, j) + v1, . . . , (i, j) + vn. The translations v1, . . . , vn must be selected in such a way that the translates do not overlap. If an algorithm needs several templates, the first condition is that each of them tiles the plane. But each tiling defines a layout, and these layouts must be compatible. One can show that the compatibility condition is that the translation vectors are the same for all templates. Since a template may tile the plane in several different ways, finding the right tiling may prove a difficult combinatorial problem.

Reordering

With nonstandard layouts, the results of a parallel access may not be returned in the natural order. Consider, for instance, the problem of having parallel access both for rows and for the main diagonal of a 4 × 4 matrix on a 4-bank memory. The skewing factor S must be such that S + 1 is relatively prime to 4: S = 2 is a valid choice. The corresponding layout is shown in the figure below.

Layout, Array. Fig. The need for reordering

The integer in row i and column j is the number of the bank which holds element (i, j) of the matrix. One can see that bank 0 holds element (0, 0), bank 3 holds element (1, 1), and so on. If the banks are connected to processors in
the natural order, the elements of the main diagonal are returned in the order (0, 0), (3, 3), (2, 2), (1, 1). This might be unimportant in some cases, for instance, if the elements of the main diagonal are to be summed (the trace computation), since addition is associative and commutative. In other cases, returned values must be reordered. This can be done by inserting a B-word fast memory between the main memory subsystem and the processors. One can also use a permutation network which may route a datum from any bank to any processor.
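The following sketch, with assumed values of B and S, prints the bank numbers hit by a row, a column, and the main diagonal under the skewed mapping bank(i, j) = (S·i + j) mod B; it shows both when an access pattern is conflict-free and why a conflict-free pattern can come back out of its natural order.

/* Hedged sketch (B, S, and N are assumed example values). */
#include <stdio.h>

enum { B = 4, S = 2, N = 4 };

static int bank(int i, int j) { return (S * i + j) % B; }

int main(void) {
    printf("row 0 banks:     ");
    for (int j = 0; j < N; j++) printf("%d ", bank(0, j));   /* 0 1 2 3 */
    printf("\ncolumn 0 banks:  ");
    for (int i = 0; i < N; i++) printf("%d ", bank(i, 0));   /* 0 2 0 2 */
    printf("\ndiagonal banks:  ");
    for (int i = 0; i < N; i++) printf("%d ", bank(i, i));   /* 0 3 2 1 */
    printf("\n");
    /* Rows and the diagonal are conflict-free, but the diagonal elements
       reach the processors in a permuted order; columns conflict because
       S is not relatively prime to B. */
    return 0;
}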
LBNL Climate Computer
Green Flash: Climate Machine (LBNL)
libflame
Field G. Van Zee, Robert A. van de Geijn
The University of Texas at Austin, Austin, TX, USA
Ernie Chan
NVIDIA Corporation, Santa Clara, CA, USA
Synonyms
FLAME
Definition

libflame is a software library for dense matrix computations.
Discussion

Introduction
The Number of Banks

Remember that in vector processing, when the step d is not equal to one, performance is degraded by a factor gcd(d, B). It is tempting to use a prime number for B, since then the gcd will always be one except when d is a multiple of B. However, having B a power of two considerably simplifies the addressing hardware, since computing x mod B and x ÷ B is done just by routing wires. This has led to the search for prime numbers with easy division, which are of one of the forms 2^k ± 1. See [] for a discussion of this point.
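When B is a power of two, the mod and the division indeed reduce to a mask and a shift, as in this small sketch; the value of B is an assumption made for illustration.

/* Sketch: with B = 2^K (assumed K = 3), x mod B and x / B become
   a bit mask and a shift -- effectively "just routing wires". */
#include <stdio.h>

enum { K = 3, B = 1 << K };

int main(void) {
    unsigned x = 1234;
    unsigned bank = x & (B - 1);   /* x mod B */
    unsigned disp = x >> K;        /* x div B */
    printf("x = %u -> bank %u, displacement %u\n", x, bank, disp);
    return 0;
}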
The libflame library is a modern, high-performance dense linear algebra library that is extensible, easy to use, and available under an open source license. It is both a framework for developing dense linear algebra solutions and a ready-made library that strives to be user-friendly while offering competitive (and in many cases superior) real-world performance when compared to the more traditional Basic Linear Algebra Subprograms (BLAS) and Linear Algebra PACKage (LAPACK) libraries. It was developed as part of the FLAME project.
The FLAME Project
Bibliography
. Budnik P, Kuck DJ () The organisation and use of parallel memories. IEEE Trans Comput C-:–
. de Dinechin BD () An ultra fast Euclidean division algorithm for prime memory systems. In: Supercomputing ’, Albuquerque, pp –
. Jalby W, Frailong JM, Lenfant J () Diamond schemes: an organization of parallel memories for efficient array processing. Technical Report RR-, INRIA
. Shapiro HD () Theoretical limitations on the efficient use of parallel memories. IEEE Trans Comput C-:–
A solution based on fundamental computer science. The FLAME project advocates a more stylized notation for expressing loop-based linear algebra algorithms. This notation, illustrated in Fig. , closely resembles how matrix algorithms are illustrated with pictures. The notation facilitates formal derivation of algorithms so that the algorithms are developed hand-in-hand with their proof of correctness. For a given operation this yields a family of algorithms so that the best option for a given situation (e.g., problem size or architecture) can be chosen.
Algorithm: A ← Chol_blk(A)

Partition A → [ ATL  ATR ]
              [ ABL  ABR ]    where ATL is 0 × 0

while m(ATL) < m(A) do
  Determine block size b
  Repartition
    [ ATL  ATR ]     [ A00  A01  A02 ]
    [ ABL  ABR ]  →  [ A10  A11  A12 ]    where A11 is b × b
                     [ A20  A21  A22 ]

    A11 := A11 − A10 A10^T      (syrk)
    A21 := A21 − A20 A10^T      (gemm)
    A11 := Chol(A11)            (chol)
    A21 := A21 A11^{−T}         (trsm)

  Continue with
    [ ATL  ATR ]     [ A00  A01  A02 ]
    [ ABL  ABR ]  ←  [ A10  A11  A12 ]
                     [ A20  A21  A22 ]
endwhile

libflame. Fig. Blocked Cholesky factorization expressed with FLAME notation. Subproblems annotated as SYRK, GEMM, and TRSM correspond to level-3 BLAS operations while CHOL performs a Cholesky factorization of the submatrix A11
The libflame Library

The techniques developed as part of the FLAME project have been instantiated in the libflame library.

A complete dense linear algebra framework. The libflame library provides ready-made implementations of common linear algebra operations. However, libflame differs from traditional libraries in two important ways. First, it provides families of algorithms for each operation so that the best can be chosen for a given circumstance. Second, it provides a framework for building complete custom linear algebra codes. This makes it a flexible and useful environment for prototyping new linear algebra solutions.

Object-based abstractions and API. The libflame library is built around opaque structures that hide implementation details of matrices, such as data layout, and exports object-based programming interfaces to operate upon these structures. FLAME algorithms are expressed (and coded) in terms of smaller operations on sub-partitions of the matrix operands. This abstraction facilitates programming without array or loop indices, which allows the user to avoid index-related programming errors altogether, as illustrated in Fig. .

Educational value. The clean abstractions afforded by the notation and API, coupled with the systematic methodology for deriving algorithms, make FLAME well suited for instruction of high-performance linear algebra courses at the undergraduate and graduate level.

High performance. The libflame library dispels the myth that user- and programmer-friendly linear algebra codes cannot yield high performance. Implementations of operations such as Cholesky factorization and triangular matrix inversion often outperform the corresponding implementations currently available in traditional libraries. For sequential codes, this is often simply the result of implementing the operations with algorithmic variants that cast the bulk of their computation in terms of higher-performing matrix operations such as general matrix rank-k updates.

Support for nontraditional storage formats. The libflame library supports both column-major and row-major order (“flat”) storage of matrices. Indeed, it allows for general row and column strides to be specified. In addition, the library supports storing matrices hierarchically, by blocks, which can yield performance gains through improved spatial locality. Previous efforts [] at implementing storage-by-blocks focused on solutions that required intricate indexing. libflame simplifies the implementation by allowing the elements within matrix objects to be references to smaller submatrices. Furthermore, the library provides user-level APIs which allow application programmers to read and modify individual elements and submatrices while simultaneously abstracting away the details of the storage scheme.

Supported operations. As of this writing, the following operations have been implemented within libflame to operate on both “flat” and hierarchical matrix objects:

● Factorizations. Cholesky; LU with partial pivoting; QR and LQ.
● Inversions. Triangular; Symmetric/Hermitian Positive Definite.
void FLA_Chol_blk( FLA_Obj A, int nb_alg )
{
  FLA_Obj ATL, ATR,    A00, A01, A02,
          ABL, ABR,    A10, A11, A12,
                       A20, A21, A22;
  int b;

  FLA_Part_2x2( A,   &ATL, &ATR,
                     &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */  /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /* ----------------------------------------------------------- */
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A10, FLA_ONE, A11 );
    FLA_Gemm( FLA_NO_TRANSPOSE, FLA_TRANSPOSE,
              FLA_MINUS_ONE, A20, A10, FLA_ONE, A21 );
    FLA_Chol_unb( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    /* ----------------------------------------------------------- */
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                           /* ************** */  /* ****************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }
}
libflame. Fig. The algorithm shown in Fig. implemented with the FLAME/C API
● Solvers. Linear systems via any of the factorizations above; Solution of the Triangular Sylvester equation.
● Level-3 BLAS. General, Triangular, Symmetric, and Hermitian matrix-matrix multiplies; Symmetric and Hermitian rank-k and rank-2k updates; Triangular solve with multiple right-hand sides.
● Miscellaneous. Triangular-transpose matrix multiply; Matrix-matrix axpy and copy; and several other supporting operations.

Cross-platform support. The libflame distribution
includes build systems for both GNU/Linux and
Microsoft Windows. The GNU/Linux build system provides a configuration script via GNU autoconf and employs GNU make for compilation and installation, and thus adheres to the well-established “configure; make; make install” sequence of build commands.

Compatibility with LAPACK. A set of compatibility routines is provided that maps conventional LAPACK invocations to their corresponding implementations within libflame. This compatibility layer is optional and may be disabled at configure-time in order to allow the application programmer to continue to
use legacy LAPACK implementations alongside native libflame functions.
Support for Parallel Computation

Multithreaded BLAS. A simple technique for exploiting thread-level parallelism is to link the sequential algorithms for flat matrices in libflame to multithreaded BLAS libraries where each call to a BLAS routine is parallelized individually. Inherent synchronization points occur between successive calls to BLAS routines and thus restrict opportunities for parallelization between subproblems. Routines are available within libflame to adjust the block size of these sequential implementations in order to attain the best performance from the multithreaded BLAS libraries.
SuperMatrix. The fundamental bottleneck to exploiting parallelism directly within many linear algebra codes is the web of data dependencies that frequently exists between subproblems. As part of libflame, the SuperMatrix runtime system has been developed to detect and analyze dependencies found within algorithms-by-blocks (algorithms whose subproblems operate only on block operands). Once data dependencies are known, the system schedules sub-operations to independent threads of execution. This system is completely abstracted from the algorithm that is being parallelized and requires virtually no change to the algorithm code while exposing abundant high-level parallelism. The directed acyclic graph for the Cholesky factorization, given a 4 × 4 matrix of blocks (a hierarchical
libflame. Fig. The directed acyclic graph for the Cholesky factorization on a 4 × 4 matrix of blocks
matrix stored with one level of blocking), is shown in Fig. . The nodes of the graph represent tasks that operate over different submatrix blocks, and the edges represent data dependencies between tasks. The subscripts in the names of each task denote the order in which the tasks would typically be executed in a sequential algorithm-by-blocks. Many opportunities for parallelism exist within the directed acyclic graph, such as trsm1, trsm2, and trsm3 being able to be executed in parallel.

PLAPACK. FLAME and libflame have their roots
in the distributed-memory parallel PLAPACK library. Integrating PLAPACK into libflame is a current effort.
. Gunnels JA, Gustavson FG, Henry GM, van de Geijn RA (Dec ) FLAME: formal linear algebra methods environment. ACM Trans Math Softw ():– . Gunnels JA, van de Geijn RA () Formal methods for highperformance linear algebra libraries. In: Proceedings of the IFIP TC/WG. working conference on the architecture of scientific software, Ottawa, Canada, pp – . Quintana-Ortí G, Quintana-Ortí ES, van de Geijn RA, Van Zee FG, Chan E (July ) Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Trans Math Softw ()::–: . van de Geijn RA, Quintana-Ortí ES () The science of programming matrix computations. http://www.lulu.com/content// . Van Zee FG, Chan E, van de Geijn R, Quintana-Ortí ES, QuintanaOrtí G () Introducing: The libflame library for dense matrix computations. IEEE Computing in Science and Engineering ():– . Van Zee FG () libflame: the complete reference. http://www. lulu.com/content//
Related Entries
BLAS (Basic Linear Algebra Subprograms)
LAPACK
PLAPACK
Libraries, Numerical
Numerical Libraries
Bibliographic Notes and Further Reading The FLAME notation and APIs have their root in the Parallel Linear Algebra Package (PLAPACK). Eventually the insights gained for implementing distributedmemory libraries found their way back to the sequential library libflame. The first papers that outlined the vision that became the FLAME project were published in [, ]. A book has been published that describes in detail the formal derivation of dense matrix algorithms using the FLAME methodology []. How scheduling of tasks is supported was first published in [] and discussed in more detail in []. An overview of libflame can be found in [] and an extensive reference guide has been published [].
Bibliography . Chan E, Quintana-Ortí ES, Quintana-Ortí G, van de Geijn R (June ) SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In: SPAA ’: Proceedings of the nineteenth ACM symposium on parallelism in algorithms and architectures, San Diego, CA, USA, pp – . Elmroth E, Gustavson F, Jonsson I, Kagstrom B () Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Rev ():–
Linda
Robert Bjornson
Yale University, New Haven, CT, USA
Definition

Linda [] is a coordination language for parallel and distributed computing, invented by David Gelernter and first implemented by Nicholas Carriero. Work on Linda was initially done at SUNY Stony Brook and continued at Yale University. Linda uses tuple spaces as the mechanism for communication, coordination, and process control. The goal is to provide a tool that makes it simple to develop parallel programs that are easy to understand, portable, and efficient.
Discussion

Introduction

Linda is a coordination language concerned with issues related to parallelism and distributed computing, such as communication of data, coordination of activities, and creation and control of new threads/processes.
Linda is not concerned with other programming language issues such as control structures (testing, looping), file I/O, variable declaration, etc. Thus, it is orthogonal to issues of traditional programming languages, and is normally embedded in a sequential language such as C, Fortran, Java, etc., resulting in a Linda dialect of that language (C-Linda, Fortran-Linda, Java-Linda, etc). It is important to distinguish between Linda as a conceptual model for coordination and a concrete X-Linda implementation. The latter necessarily imposes restrictions and limitations on the former.
Tuples and Tuple Spaces

Tuples (too-pulls) are the fundamental data objects in Linda. A tuple is an ordered, typed collection of any number of elements. These elements can be whatever values or objects are supported by the host language. Elements can also be typed place holders, or formals, which will act as wild cards. Here are some example tuples:

(“counter”, )
(“table”, , , .)
(“table”, i, j, ?a)

Tuples exist in tuple spaces. Processes communicate and synchronize with one another by creating and consuming tuples from a common tuple space. Tuple spaces exist, at least abstractly, apart from the processes using them. Tuple space is sometimes referred to as an associative memory. A tuple establishes associations between the values of its constituent elements. These associations are exploited by specifying some subset of the values to reference the tuple as a whole.

Tuple Space Operations

There are two operations that create tuples: out and eval. Out simply evaluates the expressions that yield its elements in an unspecified order and then places the resulting tuple in tuple space:

out(“counter”, )

A tuple space is a multiset: if the above operation is repeated N times (and no other operations are done), then the tuple space will contain N identical (“counter”, ) tuples.

Eval is identical to out, except that the tuple is placed into tuple space without completing the evaluation of its elements. Instead, new execution entities are created to evaluate the elements of the tuple (depending on the implementation, these entities might be threads or processes). Once all elements have been evaluated, the tuple is indistinguishable from one created via out. Eval is typically used with tuples that contain function calls, and is an idiomatic and portable way to create parallelism in Linda. For example, a process might create a new worker by doing:

eval(“new worker”, worker(i))

Eval will immediately return control to the calling program, but in the background a new execution entity will begin computing worker(i). When that subroutine returns, its value will be the second element of a “new worker” tuple. There are two operations that can extract tuples from tuple space: in and rd (traditionally read was spelled rd to avoid collisions with read from the C library; the exact verbs used have varied somewhat, with store and write common variants of out, and take a common variant of in). These operations take a template and perform a match of its arguments against the tuples currently in tuple space. A template is similar to a tuple, consisting of some number of actual and formal fields. If a match is found, the tuple is returned to the process performing the in or rd. In the case of in, the tuple is deleted from tuple space after matching, whereas rd leaves the tuple in tuple space. If multiple matches are available, one is chosen arbitrarily. If no matches are available, the requesting process blocks until a match is available. Tuples are always created and consumed atomically. In order to match, the tuple and template must:

1. Agree on the number of fields
2. Agree on the types of corresponding fields
3. Agree in the values of corresponding actual fields
4. Have no corresponding formals (this is unusual)

‘?’ indicates that the argument is a place-holder, or formal argument. In this case, only the types must match, and as a side effect, the formal variable in the template will be assigned the value from the tuple. Formals normally appear only in templates, but in some limited circumstances can appear in tuples as well (it is
not uncommon for a particular implementation to omit support for formals in tuples). Some work was also proposed to allow a “ticking” type modifier to enable specification of elements still under evaluation. Absent this, the matching above implicitly enforces matches on tuples whose elements have all completed evaluation. For example, the template in(“counter”, ) would match the above tuple, as would in(“counter”, ?i), which also binds i to the tuple’s value, and in(“counter”, i), assuming i was already bound to that value. But these templates would not match: in(“counter”, , ), which has the wrong number of fields (rule 1); in(“counter”, .), which has a field of the wrong type (rule 2); and in(“counter”, ), which has a different value (rule 3). In addition to in and rd, there are non-blocking variants inp and rdp, which return FALSE rather than blocking. These variants can be useful in certain circumstances, but their use has generally been discouraged since it introduces timing dependencies. In the above examples, note that the first field is always a string that serves to name the tuple. This is not required, but is a common useful idiom. It also provides a useful hint to the Linda implementation that can improve performance, as discussed below.

Linda Examples

Linda programmers are encouraged to think of coordination in terms of data structure manipulations rather than message passing, and a number of useful idioms have been developed along these lines. Indeed, Linda can be viewed as a means to generalize Wirth’s “Algorithms + Data Structures = Programs”: “Algorithms + Distributed Data Structures = Parallel Programs” []. That is, tuples can be used to create rich, complex distributed data structures in tuple space. Distributed data structures can be shared and manipulated by cooperating processes that transform them from an initial configuration to a final result. In this sense, the development of a parallel program using Linda remains conceptually close to the development of a sequential (ordinary) program. Two very simple, yet common, examples are a FIFO queue and a master/worker pattern.

FIFO Queue

In the FIFO queue (Fig. ), each queue is represented by a number of tuples: one for each element currently in the queue, plus two counter tuples that indicate the current position of the head and tail of the queue. Pushing a value onto the tail of the queue is accomplished by assigning it the current value of the tail counter, and then incrementing the tail, and similarly for popping. Important to note is that no additional synchronization is needed to allow multiple processes to push and pop from the queue “simultaneously.” The head and tail tuples act as semaphores, negotiating access to each position of the queue automatically.

init(qname) {
    out(qname, ‘head’, 0)
    out(qname, ‘tail’, 0)
}
push(qname, elem) {
    in(qname, ‘tail’, ?p)
    out(qname, ‘elem’, p, elem)
    out(qname, ‘tail’, p+1)
}
pop(qname) {
    in(qname, ‘head’, ?p)
    in(qname, ‘elem’, p, ?elem)
    out(qname, ‘head’, p+1)
    return elem
}

Linda. Fig. A FIFO queue written in Linda

Master/Worker Pattern

In the master/worker example (Fig. ), the master initially creates some number of worker processes via eval. It then creates task tuples, one per task, and waits for the results. Once all the results are received, a special “poison” task is created that will cause all the workers to shut down. The workers simply sit in a loop, waiting to get a tuple that describes a task. When they do, if it is a normal task, they process it into a result tuple that will be consumed by the master. In the case of a poison task, they immediately use its elements to create a new task tuple to replace the one they consumed, poisoning another worker, and then finish.

Linda. Fig. The master/worker pattern written in Linda
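A minimal sketch of this pattern, in the same C-Linda style as the FIFO-queue figure, might look as follows; NWORKERS, NTASKS, make_task(), compute(), and the use of a task id of -1 as the poison task are illustrative choices rather than the authors' original code.

master() {
    for (w = 0; w < NWORKERS; w++)
        eval("worker", worker())           /* start the workers */
    for (i = 0; i < NTASKS; i++)
        out("task", i, make_task(i))       /* one tuple per task */
    for (i = 0; i < NTASKS; i++)
        in("result", ?id, ?result)         /* collect every result */
    out("task", -1, 0)                     /* poison: shut workers down */
}

worker() {
    while (1) {
        in("task", ?id, ?task)
        if (id == -1) {                    /* poison task */
            out("task", -1, 0)             /* pass it on, then finish */
            return 0
        }
        out("result", id, compute(task))
    }
}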
This is the simplest sort of master/worker pattern one could imagine, yet it still presents a number of interesting features. Process startup and shutdown is represented. Load balancing occurs automatically, since slow workers or large tasks will simply take longer, and as long as there are significantly more tasks than workers the processing time will tend to balance. The code as shown can be modified in a number of simple but interesting ways:

1. If there are a very large number of tasks, rather than having the master create all the task tuples at once (and thereby possibly filling memory) it can create an initial pool. Then, each time it receives a result it can create another task, thus maintaining a reasonable number of tasks at all times.
2. If the master needs to process the results in task order, rather than as arbitrarily received, the in statement that retrieves the result can be modified to make taskid an actual, i.e., in(“result”, i, ?result).
3. The program can be made fault-tolerant by wrapping the worker’s inner loop in a transaction (for Linda implementations that support transactions) []. If a worker dies during the processing of a task, the original task tuple will automatically be returned when the transaction fails, and another worker will pick up the task.
4. The program can be easily generalized to have new task tuples generated by the workers (in a sense, the propagation of the poison task is an example of this). This could, for example, be used to implement a dynamic divide-and-conquer decomposition.
Implementations

Linda has been implemented on virtually every kind of parallel hardware ever created, including shared-memory multiprocessors, distributed-memory multiprocessors, non-uniform memory access (NUMA) multiprocessors, clusters, and custom hardware. Linda is an extremely minimalist language design, and naive, inefficient implementations are deceptively simple to build. Most take the approach of building a runtime library that simply implements tuple space as a list of tuples and performs tuple matching by exhaustively searching the list. These sorts of implementations led to a suspicion that Linda is not suitable for serious computation. Many critics admired the language’s elegance, but insisted that it could not be implemented efficiently enough to compete with other approaches, such as shared memory with locking, or native message passing. In response, considerable work was done at Yale and elsewhere during the 1980s and 1990s to create efficient implementations that could compete with lower-level approaches. The efforts focused on two main directions:

1. Compile-time optimizations. Carriero [] recognized that tuple matching overhead could be greatly reduced by analyzing tuple usage at compile time and translating it to simpler operations. Tuple operations could be segregated into distinct, noninteracting partitions based on type information available at compile time. Each partition was then classified into one of a number of usage patterns and reduced to an efficient data structure such as a counter, hash table or queue, which could then be searched and updated very efficiently. This work demonstrated that Linda on shared-memory multiprocessors could be as efficient as handwritten codes using low level locks. An important part of the demonstration showed for a variety of
representative problems that on average fewer than two tuples would need to be searched before a match was found.

2. Run-time optimizations. For distributed memory implementations, optimization of tuple storage was key to reducing the overall number of messages required by tuple space operations. Each tuple had a designated node on which it would be stored, usually chosen via hashing. Naively, each out/in pair would require as many as three distinct messages, since out would require a single message sending the tuple to its storage location, and in would require two messages, a request to the storage location and a reply containing the matched tuple. Bjornson [] built upon Carriero’s compile-time optimizations by analyzing tuple traffic at runtime, recognizing common communication patterns such as broadcast (one-to-many), point-to-point, and many-to-one, and rearranging tuple space storage to minimize the number of messages required. This work demonstrated that most communication in Linda required only one or two sends, just as with message passing.

Overall, this work demonstrated that while in the abstract Linda programs could make unrestrained use of expensive operations such as content addressability and extensive search, in practice any given program was likely to confine itself to well-understood patterns that could be reduced through analysis to much simpler and more efficient low-level operations. Implementations of Linda closely followed hardware developments in parallel computing. The first implementation was on a custom, bus-based architecture called the S/Net [], followed by implementations on shared memory multiprocessors (Sequent Balance, Encore Multimax), distributed memory multiprocessors (Intel iPSC hypercubes, Ncube, IBM SP), collections of workstations, and true cluster architectures. In terms of host languages, Linda has been embedded in many of the most commonly used computer languages. Initial embedding was done in C, but important embeddings were built in Fortran, Lisp, and Prolog. Later, Java became an important host language for Linda, including the Jini, JavaSpaces [], and Gigaspaces [] implementations. More recently, Scientific Computing Associates (SCAI) created a variant of Linda
called NetWorkSpaces that is designed for interactive languages such as Perl, Python, R (see below).
Open Linda

Initial compile-time Linda systems can be termed “closed,” in the sense that the Linda compiler needed to see all Linda operations that would ever occur in the program, in order to perform the analysis and optimization. In addition, only a single global tuple space was provided, which was created when the program started and destroyed when it completed. Linda programs tended to be collections of closely related processes that stemmed from the same code, and started, ran for a finite period, and stopped together. During the 1990s, interest in distributed computing increased dramatically, particularly computing on collections of workstations. These applications consisted of heterogeneous collections of processes that derived from unrelated code, started and stopped independently, and might run indefinitely. The components could be written by different users and using different languages. SCAI developed an Open Linda (Open referred to the fact that the universe of tuples remained flexible and open at runtime, rather than fixed and closed at compile time), which was later trademarked as “Paradise.” With Open Linda, tuple spaces became first-class objects. They could be created and destroyed as desired, or persist beyond the lifetime of their creator. Other Open Lindas that were created include JavaSpaces [], TSpaces [], and Gigaspaces []. Later, SCAI created a simplified Open Linda called NetWorkSpaces (NWS) [], designed specifically for high-productivity languages such as Perl, R, Python, Matlab, etc. In NetWorkSpaces, rather than communicating via tuples in tuple spaces, processes use shared variables stored in network spaces. Rather than use associative lookup via matching, in NetWorkSpaces shared data is represented as a name/value binding. Processes can create, consult, and remove bindings in shared network spaces. The figure below contains an excerpt from NWS code for performing a maximization search, using the Python API. In NWS, store, find, and fetch correspond to Linda’s out, rd, and in, respectively. In the code, a single shared variable “max” is used to hold the current best value found by any worker.
def init():
    ws = NetWorkSpace('my space')
    ws.store('max', START)

def worker():
    ws = NetWorkSpace('my space')
    for x in MyCandidateList:
        currentMax = ws.find('max')
        y = f(x)
        if y > currentMax:
            currentMax = ws.fetch('max')
            if y > currentMax:
                currentMax = y
            ws.store('max', currentMax)
Linda. Fig. Excerpt of NetWorkSpaces code for performing a maximization search
The call NetWorkSpace(“my space”) connects to the space of that name, creating it if necessary, and returning a handle to it.
Key Characteristics of Linda, and Comparison to Other Approaches Linda presents a number of interesting characteristics that contrast with other approaches to parallel computing: Portability: A Linda program can run on any parallel computer, regardless of the underlying parallel infrastructure and interconnection, whether it be distributed- or shared-memory or something in between. Most competing parallel approaches are strongly biased toward either shared- or distributedmemory. Generative Communication: In Linda, when processes communicate, they create tuples, which exist apart from the process that created them. This means that the communication can exist long after the creating process has disappeared, and can be used by processes unaware of that process’ existence. This aids debugging, since tuples are objects that can be inspected using tools like TupleScope []. Distribution in space and time: Processes collaborating via tuple spaces can be widely separated over both these dimensions. Because Linda forms a virtual shared memory, processes need not share any physical memory. Because tuples exist apart from processes, those processes can communicate with one another even if they don’t exist simultaneously. Content Addressability: Processes can select information based on its content, rather than just a tag or
the identity of the process that sent it. For example, one can build a distributed hash table using tuples. Anonymity: Processes can communicate with one another without knowing one another’s location or identity. This allows for a very flexible style of parallel program. For example, in a system called Piranha [], idle compute cycles are harvested by creating and destroying workers on the fly during a single parallel run. The set of processes participating at any given time changes constantly. Linda allows these dynamic workers to communicate with the master, and even with one another, because processes need not know details about their communication partners, so long as they agree on the tuples they use to communicate. Simplicity: Linda consists of only four operations, in, out, rd, eval, and two variants, inp and rdp. No additional synchronization functions are needed, nor any administrative functions. When Linda was invented in the early s, there were two main techniques used for parallel programming: shared memory with low-level locking, and message passing. Each parallel machine vendor chose one of these two styles and built their own, idiosyncratic API. For example, the first distributed-memory parallel computers (Intel iPSC, nCube, and IBM SP) each had their own proprietary message passing library, which was incompatible with all others. Similarly, the early shared-memory parallel computers (Sequent Balance, Encore Multimax) each had their own way of allocating shared pages and creating synchronization locks. Linda presented a higher level approach to parallelism that combined the strengths of both. Like shared memory, Linda permits direct access to shared data (no need to map such data onto an “owner” process, as is the case for message passing). Like message passing, Linda self-synchronizes (no need to worry about accessing uninitialized shared data, as is the case for hardware shared memory). Once the corresponding Linda implementation had been ported to a given parallel computer, any Linda program could be simply recompiled and run correctly on that computer. At the time of this writing, it is fair to say that message passing, as embodied by Message Passing Interface (MPI), has become the dominant paradigm for writing parallel programs. MPI addressed the portability
problem in message passing by creating a uniform library interface for sending and receiving messages. Because of the importance of MPI to parallel programming, it is worth contrasting Linda and MPI. Compiler Support: Linda’s compiler support performs both type and syntax checking. Since the Linda compiler understands the types of the data being passed through tuple space, marshalling can be done automatically. The compiler can also detect outs that have no matching in, or vice versa, and issue a warning. In contrast, MPI is a runtime library, and the burdens of type checking and data marshalling fall largely to the programmer. Decoupled/anonymous communication: In Linda, communicating processes are largely decoupled. In MPI, processes must know one another’s location and exist simultaneously. In addition, receivers must anticipate sends. If, on the hand, the coordination needed by an application lends itself to a message passing style solution, this is still easy to express in Linda using associative matching. In this case, the compile and runtime optimization techniques discussed will often result in wire traffic similar, if not identical, to that that would have obtained if message passing was used. Simplicity: MPI is a large, complex API. Debugging usually requires the use of a sophisticated parallel debugger. Even simple commands such as send() exist in a number of subtly differently variants. In the spirit of Alan Perlis’s observation “If you have a procedure with parameters, you probably missed some,” the MPI standard grew substantially from MPI to MPI . Acknowledging the lack of a simple way to share data, MPI introduced one-sided communication. This was not entirely successful []. In contrast, Linda consists of four simple commands.
extremely efficient parallel codes in order to scale to tens of thousands of cpus, implying expert assistance by highly skilled programmers, careful coding, optimization and verification of programs, and are well suited to large-scale development efforts in languages such as C and tools such as MPI. In contrast, nearly everyone with a compute-bound program will have access to a non-trivial parallel computer containing tens or hundreds of cpus, and will need a way to make effective use of those cpus. An interesting characteristic of many of these users is that, rather than using traditional high performance languages such as C, Fortran, or even Java, they use scripting languages such as Perl, Python or Ruby, or high-productivity environments such as Matlab or R. These languages dramatically improve their productivity in terms of writing and debugging code, and therefore in terms of their overall goals of exploring data or testing theories. That is, they have explicitly opted for simplicity and expressivity in their tool of choice for sequential programming. However, these environments pose major challenges in terms of performance and memory usage. Parallelism on a relatively modest scale can permit these users to compute effectively while remaining comfortably within an environment they find highly productive. But, to be consistent with the choice made when selecting these sequential computing environments, the parallelism tool must also emphasize simplicity and expressivity. High level coordination languages like Linda can provide them with simple tools for accessing that parallelism.
Influences Some of the work influencing Linda includes: Sussman and Steele’s language, Scheme, particularly continuations [], Dijkstra’s semaphores [], Halstead’s futures [], C.A.R. Hoare’s CSP [], and Shapiro’s Concurrent Prolog [], particularly the concept of unification.
Future

At the time of this writing, parallel computing stands at an interesting crossroads. The largest supercomputer in the world contains hundreds of thousands of CPU cores []. Meanwhile, multicore systems are ubiquitous; modern high-performance computers typically contain a handful to a few dozen CPU cores, with hundreds envisioned in the near future. These two use cases represent two very different extrema on the spectrum of parallel programming. The largest supercomputers require
History

Linda was initially conceived by David Gelernter while a graduate student at SUNY Stony Brook in the early 1980s. After taking a faculty position at Yale University, Gelernter, along with a number of graduate students, most notably Carriero, built a number of Linda implementations as well as high-level applications that used Linda.
Scientific Computing Associates (SCAI), founded by Yale professors Stan Eisenstat and Martin Schultz, began commercial development of Linda. Linda systems created by SCAI were sold in a variety of industries, most notably finance and energy. Large-scale systems, running on hundreds of computers, were built and used in production environments for years at a time, performing tasks such as portfolio pricing, risk analysis, and seismic image processing. Later, interest grew in more flexible languages and approaches. Java became the rage, particularly in financial organizations, and a number of commercial Java Linda systems were built. Although they varied in detail, all shared many common features: an object-oriented design, open tuple spaces, and support for transactions.
Related Entries
CSP (Communicating Sequential Processes)
Logic Languages
Multilisp
TOP500
Bibliographic Notes and Further Reading The following papers are good general introductions to Linda: [, ]. Carriero and Gelernter used Linda in their introductory parallel programming book []. There have been conferences organized around the ideas embodied in Linda [, ].
Bibliography . Gelernter D () Generative communication in Linda. ACM Trans Program Lang Syst ():– . Wirth N () Algorithms [plus] data structures. Prentice-Hall, Englewood Cliffs . Jeong K, Shasha D () PLinda .: A transactional/checkpointing approach to fault tolerant Linda. In: th symposium on reliable distributed systems, Dana Point, pp – . Carriero N () Implementation of tuple space machines. Technical Report. Department of Computer Science, Yale University . Bjornson RD () Linda on distributed memory multiprocessors. PhD dissertation, Deparment of Computer Science, Yale University . Freeman E, Arnold K, Hupfer S () JavaSpaces principles, patterns, and practice. Addison-Wesley Longman, Reading . Gigaspaces Technologies Ltd. [cited; Available from: www. gigaspaces.com] . TSpaces. [cited; Available from: http://www.almaden.ibm.com/ cs/TSpaces]
. Bjornson R et al () NetWorkSpace: a coordination system for high-productivity environments. Int J Parallel Program ():– . Bercovitz P, Carriero N () TupleScope: a graphical monitor and debugger for Linda-based parallel programs. Technical Report. Department of Computer Science, Yale University . Carriero N et al () Adaptive parallelism and Piranha. IEEE Comput ():– . Bonachea D, Duell J () Problems with using MPI . and . as compilation targets for parallel language implementations. Int J High Perform Comput Network (//):– . Top Supercomputing sites. [cited; Available from: www. top.org] . Steele GL, Sussman GJ () Scheme: An interpreter for extended lambda calculus.Massachusetts Institute of Technology, Artificial Intelligence Laboratory. http://repository.readscheme. org/ftp/papers/ai-lab-pubs/AIM-.pdf . Dijkstra EW () Cooperating sequential processes. In: Per Brinch Hansen (ed) The origin of concurrent programming: from semaphores to remote procedure calls. Springer, New York, pp – . Halstead Jr RH () Multilisp: A language for concurrent symbolic computation. TOPLAS ():– . Hoare CAR () Communicating sequential processes. Commun ACM ():– . Shapiro E () Concurrent prolog: a progress report. In: Concurrent prolog. MIT Press, Cambridge, pp – . Gelernter D, Carriero N () Coordination languages and their significance. Commun ACM ():– . Bjornson R, Carriero N, Gelernter D () From weaving threads to untangling the web: a view of coordination from Linda’s perspective. In: Proceedings of the second international conference on coordination languages and models, Springer, New York . Carriero N, Gelernter DH () How to write parallel programs: a first course. MIT Press, Cambridge, Massachusetts; London, pp . Garlan D, Le Metayer D () Coordination: languages and models. Springer, Berlin . Ciancarini P, Hankin C () Coordination languages and models. Springer, Berlin
Linear Algebra Software
Jack Dongarra, Piotr Luszczek
University of Tennessee, Knoxville, TN, USA
Definition

Linear algebra software is software that performs numerical calculations aimed at solving a system of linear equations or related problems such as eigenvalue,
singular value, or condition number computations. Linear algebra software operates within the confines of finite-precision floating-point arithmetic and is characterized by its computational complexity with respect to the sizes of the matrices and vectors involved. An equally important metric is that of numerical robustness: linear algebra methods aim to deliver a solution that is as close as possible (in a numerical sense, taking into account the accuracy of the input data) to the true solution (obtained in sufficiently extended floating-point precision), if such a solution exists.
Discussion

The increasing availability of high-performance computers has a significant effect on all spheres of scientific computation, including algorithm research and software development in numerical linear algebra. Linear algebra – in particular, the solution of linear systems of equations – lies at the heart of most calculations in scientific computing. There are two broad classes of algorithms: those for dense, and those for sparse matrices. A matrix is called sparse if it has a substantial number of zero elements, making specialized storage and algorithms necessary. Much of the work in developing linear algebra software for high-performance computers is motivated by the need to solve large problems on the fastest computers available. For the past few decades, there has been a great deal of activity in the area of algorithms and software for solving linear algebra problems. The goal of achieving high performance on codes that are portable across platforms has largely been realized by the identification of linear algebra kernels, the Basic Linear Algebra Subprograms (BLAS). See the discussion of the EISPACK, LINPACK, LAPACK, and ScaLAPACK libraries, which are expressed in successive levels of the BLAS. The key insight to designing linear algebra algorithms for high-performance computers is that the frequency with which data are moved between different levels of the memory hierarchy must be minimized in order to attain high performance. Thus, the main algorithmic approach for exploiting both vectorization and parallelism in various implementations is the use of block-partitioned algorithms, particularly in conjunction with highly tuned kernels for performing
matrix–vector and matrix–matrix operations (the Level 2 and Level 3 BLAS). Algorithms for sparse matrices differ drastically because they have to take full advantage of sparsity and its particular structure. Optimization techniques differ accordingly due to changes in the memory access patterns and data structures. Commonly, then, the sparse methods are discussed separately.
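As a hedged illustration of what "block-partitioned" means in practice, the following sketch multiplies matrices tile by tile so that each NB × NB block is reused while it resides in fast memory. In a real library the innermost kernel would be a call to a tuned Level 3 BLAS routine rather than the naive triple loop written here; NB and the square n × n matrix shape are assumptions.

/* Blocked C += A*B over NB x NB tiles (row-major, square matrices). */
enum { NB = 64 };   /* assumed block size */

static void kernel(int n, int ib, int jb, int kb, int bs,
                   const double *A, const double *B, double *C) {
    /* naive stand-in for a tuned matrix-matrix (Level 3) kernel */
    for (int i = ib; i < ib + bs && i < n; i++)
        for (int k = kb; k < kb + bs && k < n; k++)
            for (int j = jb; j < jb + bs && j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

void blocked_gemm(int n, const double *A, const double *B, double *C) {
    for (int ib = 0; ib < n; ib += NB)
        for (int kb = 0; kb < n; kb += NB)
            for (int jb = 0; jb < n; jb += NB)
                kernel(n, ib, jb, kb, NB, A, B, C);
}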
Dense Linear Algebra Algorithms

Origins of Dense Linear Systems

A major source of large dense linear systems is problems involving the solution of boundary integral equations []. These are integral equations defined on the boundary of a region of interest. All examples of practical interest compute some intermediate quantity on a two-dimensional boundary and then use this information to compute the final desired quantity in three-dimensional space. The price one pays for replacing three dimensions with two is that what started as a sparse problem in O(n³) variables is replaced by a dense problem in O(n²). Dense systems of linear equations are found in numerous applications, including:

● Airplane wing design
● Radar cross-section studies
● Flow around ships and other off-shore constructions
● Diffusion of solid bodies in a liquid
● Noise reduction
● Diffusion of light through small particles
The electromagnetics community is a major user of dense linear systems solvers. Of particular interest to this community is the solution of the so-called radar cross-section problem. In this problem, a signal of fixed frequency bounces off an object; the goal is to determine the intensity of the reflected signal in all possible directions. The underlying differential equation may vary, depending on the specific problem. In the design of stealth aircraft, the principal equation is the Helmholtz equation. To solve this equation, researchers use the method of moments [, ]. In the case of fluid flow, the problem often involves solving the Laplace or Poisson equation. Here, the boundary integral solution is known as the panel methods [, ], so named from the quadrilaterals that discretize and approximate a
Linear Algebra Software
structure such as an airplane. Generally, these methods are called boundary element methods. Use of these methods produces a dense linear system of size O(N) by O(N), where N is the number of boundary points (or panels) being used. It is not unusual to see size N by N, because of three physical quantities of interest at every boundary element. A typical approach to solving such systems is to use LU factorization. Each entry of the matrix is computed as an interaction of two boundary elements. Often, many integrals must be computed. In many instances, the time required to compute the matrix is considerably larger than the time for solution. The builders of stealth technology who are interested in radar cross sections are using direct Gaussian elimination methods for solving dense linear systems. These systems are always symmetric and complex, but not Hermitian.
Basic Elements in Dense Linear Algebra Methods

Common operations involving dense matrices are the solution of linear systems Ax = b, the least squares solution of over- or underdetermined systems min_x ||Ax − b||₂, and the computation of eigenvalues and eigenvectors Ax = λx. Although these problems are formulated as matrix–vector equations, their solution involves a definite matrix–matrix component. For instance, in order to solve a linear system, the coefficient matrix is first factored as A = LU (or A = U^T U in the case of symmetry), where L and U are lower and upper triangular matrices, respectively. It is a common feature of these matrix–matrix operations that they take, on a matrix of size n by n, a number of operations proportional to n³, a factor n more than the number of data elements involved. It is possible to identify three levels of linear algebra operations:

● Level 1: vector–vector operations such as the update y ← y + αx and the inner product α = x^T y. These operations involve (for vectors of length n) O(n) data and O(n) operations.
● Level 2: matrix–vector operations such as the matrix–vector product. These involve O(n²) operations on O(n²) data.
● Level 3: matrix–matrix operations such as the matrix–matrix product C = AB. These involve O(n³) operations on O(n²) data.
These three levels of operations have been realized in a software standard known as the Basic Linear Algebra Subprograms (BLAS) [–]. Although BLAS routines are freely available on the net, many computer vendors supply a tuned, often assembly-coded, BLAS library optimized for their particular architecture. The relation between the number of operations and the amount of data is crucial for the performance of the algorithm.
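As a rough illustration of the three levels (a sketch using SciPy's BLAS wrappers; daxpy, dgemv, and dgemm are SciPy's bindings to the corresponding BLAS routines, and the problem sizes are arbitrary):

import numpy as np
from scipy.linalg import blas

n = 500
alpha = 2.0
x, y = np.random.rand(n), np.random.rand(n)
A, B = np.random.rand(n, n), np.random.rand(n, n)

y2 = blas.daxpy(x, y, a=alpha)   # Level 1: y + alpha*x, O(n) work on O(n) data
v  = blas.dgemv(1.0, A, x)       # Level 2: A x,        O(n^2) work on O(n^2) data
C  = blas.dgemm(1.0, A, B)       # Level 3: A B,        O(n^3) work on O(n^2) data

The Level 3 call is the one that can approach the peak rate of a machine, because each matrix element fetched from memory is reused on the order of n times.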
Optimized Implementations Due to its established position, the BLAS form the foundation for optimized dense linear algebra operations on homogeneous as well as heterogeneous multicore architectures. Similarly, the LAPACK API serves as the de facto standard for accessing various decompositional techniques. Currently active optimized implementations include ACML from AMD, cuBLAS from NVIDIA, LibSci from Cray, MKL from Intel, and the open-source ATLAS. In the distributed memory regime, the Basic Linear Algebra Communication Subprograms (BLACS), PBLAS, and ScaLAPACK serve the corresponding role of defining an API and are accompanied by vendor-optimized implementations.
Sparse Linear Algebra Methods Origin of Sparse Linear Systems The most common source of sparse linear systems is the numerical solution of partial differential equations. Many physical problems, such as fluid flow or elasticity, can be described by partial differential equations (PDE). These are implicit descriptions of a physical model, describing some internal relation such as stress forces. In order to arrive at an explicit description of the shape of the object or the temperature distribution, the PDE needs to be solved, and for this, numerical methods are required. Discretized Partial Differential Equations Several methods for the numerical solution of PDEs exist, the most common ones being the methods of finite elements, finite differences, and finite volumes. A common feature of these is that they identify discrete
points in the physical object, and give a set of equations relating these points. Typically, only points that are physically close together are related to each other in this way. This gives a matrix structure with very few nonzero elements per row, and the nonzeros are often confined to a “band” in the matrix.
Sparse Matrix Structure Matrices from discretized partial differential equations contain so many zero elements that it pays to find a storage structure that avoids storing these zeros. The resulting memory savings, however, are offset by an increase in programming complexity, and by decreased efficiency of even simple operations such as the matrix– vector product. More complicated operations, such as solving a linear system, with such a sparse matrix present a next level of complication, as both the inverse and the LU factorization of a sparse matrix are not as sparse, thus needing considerably more storage. Specifically, the inverse of a band sparse matrix is a full matrix, and factoring such a sparse matrix fills in the band completely.
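A small numerical experiment (a NumPy/SciPy sketch on a made-up banded matrix) illustrates both statements: the inverse of the banded matrix has no zero entries at all, while the LU factors acquire fill inside the band.

import numpy as np
from scipy.linalg import lu

n = 10
# banded matrix: nonzeros only on the main diagonal and at distances 1 and 3
A = 4.0 * np.eye(n) \
    - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1) \
    - np.diag(np.ones(n - 3), 3) - np.diag(np.ones(n - 3), -3)

print(np.count_nonzero(A))                  # nonzeros in the banded matrix itself
print(np.count_nonzero(np.linalg.inv(A)))   # 100: the inverse is a full matrix
_, l, u = lu(A)
print(np.count_nonzero(l + u))              # more nonzeros than A: fill-in inside the band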
Direct methods can have a BLAS component if they are a type of dissection method. However, in a given sparse problem, the denser the matrices are, the smaller they are on average. They are also not general full matrices, but only banded. Thus, very high performance should not be expected from such methods either.
Direct Sparse Solution Methods For the solution of a linear system, one needs to factor the coefficient matrix. Any direct method is a variant of Gaussian elimination. As remarked above, for a sparse matrix, this fills in the band in which the nonzero elements are contained. In order to minimize the storage needed for the factorization, research has focused on finding suitable orderings of the matrix. Re-ordering the equations by a symmetric permutation of the matrix does not change the numerical properties of the system in many cases, and it can potentially give large savings in storage. In general, direct methods do not make use of the numerical properties of the linear system, and thus, their execution time is affected in a major way by the structural properties of the input matrix.
Iterative Solution Methods Basic Elements in Sparse Linear Algebra Methods Methods for sparse systems use, like those for dense systems, vector–vector, matrix–vector, and matrix–matrix operations. However, there are some important differences. For iterative methods, there are almost no matrix–matrix operations. Since most modern architectures prefer these Level 3 operations, the performance of iterative methods will be limited from the outset. An even more serious objection is that the sparsity of the matrix implies that indirect addressing is used for retrieving elements. For example, in the popular row-compressed matrix storage format, the matrix–vector multiplication looks like
for i = 1 … n
  p ← pointer to row i
  for j = 1 … nᵢ
    yᵢ ← yᵢ + a(p + j) * x(c(p + j))
where nᵢ is the number of nonzeros in row i, and c(⋅) is an array of column indices. A number of such algorithms for several sparse data formats are given in [].
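In runnable form, the same loop might look as follows (a NumPy sketch; the names ptr, c, and a for the row pointers, column indices, and values are ours, not from any particular package):

import numpy as np

def csr_matvec(ptr, c, a, x):
    # y = A x for a matrix stored in row-compressed (CSR) form
    n = len(ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for j in range(ptr[i], ptr[i + 1]):
            y[i] += a[j] * x[c[j]]        # indirect access x[c[j]]
    return y

# 3x3 example: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
ptr = np.array([0, 2, 3, 5])
c   = np.array([0, 2, 1, 0, 2])
a   = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
print(csr_matvec(ptr, c, a, np.array([1.0, 1.0, 1.0])))  # [5. 3. 7.]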
Direct methods, as sketched above, have some pleasant properties. Foremost is the fact that their time to solution is predictable, either a priori, or after determining the matrix ordering. This is due to the fact that the method does not rely on numerical properties of the coefficient matrix, but only on its structure. On the other hand, the amount of fill can be substantial, and with it the execution time. For large-scale applications, the storage requirements for a realistic size problem can simply be prohibitive. Iterative methods have far lower storage demands. Typically, the storage, and the cost per iteration with it, is of the order of the matrix storage. However, the number of iterations strongly depends on properties of the linear system, and is at best known up to an order estimate; for difficult problems, the methods may not even converge due to accumulated round-off errors.
Basic Iteration Procedure In its most informal sense, an iterative method in each iteration locates an approximation to the solution of the problem, measures the error between the approximation and the true solution, and based on the
error measurement improves on the approximation by constructing a next iterate. This process repeats until the error measurement is deemed small enough.
Stationary Iterative Methods The simplest iterative methods are the "stationary iterative methods." They are based on finding a matrix M that is, in some sense, "close" to the coefficient matrix A. Instead of solving Ax = b, which is deemed computationally infeasible, Mx₁ = b is solved. The true measure of how well x₁ approximates x is the error e₁ = x₁ − x, but, since the true solution x is not known, this quantity is not computable. Instead, the "residual" r₁ = Ae₁ = Ax₁ − b is monitored, since it is a computable quantity. It is easy to see that the true solution satisfies x = A⁻¹b = x₁ − A⁻¹r₁, so, replacing A⁻¹ with M⁻¹ in this relation, x₂ = x₁ − M⁻¹r₁ can be defined, and the process can be repeated. Stationary methods are easily analyzed: it can be proven that rᵢ → 0 if all eigenvalues λ = λ(I − AM⁻¹) satisfy ∣λ∣ < 1. For certain classes of A and M, this inequality is automatically satisfied [, ].
Krylov Space Methods The most popular class of iterative methods nowadays is that of "Krylov space methods." The basic idea there is to construct the residuals such that the n-th residual rₙ is obtained from the first by multiplication by some polynomial in the coefficient matrix A, that is, rₙ = Pₙ₋₁(A)r₁. The properties of the method then follow from the properties of the actual polynomial [–]. Most often, these iteration polynomials are chosen such that the residuals are orthogonal under some inner product. From this, one usually obtains some minimization property, though not necessarily a minimization of the error. Since the iteration polynomials are of increasing degree, it is easy to see that the main operation in each iteration is one matrix–vector multiplication. Additionally, some vector operations, including inner products in the orthogonalization step, are needed.
Optimized Implementations Sparse BLAS used to be available from some of the vendors. The more common approach is to have the dedicated sparse packages available publicly and have the vendors tune them for their software offerings. Examples of direct sparse solvers include MUMPS, SuperLU, UMFPACK, and PARDISO. The iterative solvers in wide
use are PETSc and Trilinos. Also, autotuning efforts such as OSKI aim to provide basic building blocks for use with more complete iterative method frameworks.
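A minimal concrete instance of the stationary scheme x ← x − M⁻¹(Ax − b) described above, with M taken to be the diagonal of A (the Jacobi choice). This is only an illustration of the scheme, not the algorithm used by any of the packages named here; the tridiagonal test matrix is arbitrary.

import numpy as np

def stationary_solve(A, b, tol=1e-10, maxit=500):
    M_inv = 1.0 / np.diag(A)          # M = diag(A), inverted entrywise
    x = np.zeros_like(b)
    for k in range(maxit):
        r = A @ x - b                 # computable residual
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k
        x = x - M_inv * r             # x_{k+1} = x_k - M^{-1} r_k
    return x, maxit

n = 100
A = 4.0 * np.eye(n) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
b = np.ones(n)
x, iters = stationary_solve(A, b)
print(iters, np.linalg.norm(A @ x - b))   # converges: |lambda(I - A M^{-1})| < 1 here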
Available Software You will find a comprehensive list of open source software that is freely available for solving problems in numerical linear algebra, specifically dense, sparse direct and iterative systems, and sparse iterative eigenvalue problems, at: http://www.netlib.org/utk/people/JackDongarra/la-sw.html
Related Entries LAPACK LINPACK Benchmark Linear Algebra, Numerical ScaLAPACK
Bibliography . Edelman A () Large dense numerical linear algebra in : the parallel computing influence. Int J Supercomput Appl :– . Harrington R () Origin and development of the method of moments for field computation. IEEE Antennas and Propagation Magazine, June . Wang JJH () Generalized moment methods in electromagnetics. Wiley, New York . Hess JL () Panel methods in computational fluid dynamics. Annu Rev Fluid Mech :– . Hess L, Smith MO () Calculation of potential flows about arbitrary bodies. In: Kuchemann D (ed) Progress in aeronautical sciences, vol . Pergamon Press . Lawson CL, Hanson RJ, Kincaid D, Krogh FT () Basic linear algebra subprograms for fortran usage. ACM Trans Math Soft :– . Dongarra JJ, Du Croz J, Hammarling S, Hanson RJ () An extended set of FORTRAN basic linear algebra subroutines. ACM Trans Math Soft ():– . Dongarra JJ, Du Croz J, Duff IS, Hammarling S () A set of level basic linear algebra subprograms. ACM Trans Math Soft ():– . Barrett R, Berry M, Chan TF, Demmel J, Donato J, Dongarra J, Eijkhout V, Pozo R, Romine C, Vorst H () Templates for the solution of linear systems: building blocks for iterative methods. SIAM, Philadelphia PA, http://www.netlib.org/templates/ templates.ps. . Hageman LA, Young DM () Applied iterative methods. Academic, New York . Young DM, Jea KC () Generalized conjugate-gradient acceleration of nonsymmetrizable iterative methods. Lin Alg Appl :–
. Axelsson O, Barker AV () Finite element solution of boundary value problems. Theory and computation. Academic, Orlando, FL . Birkhoff G, Lynch RE () Numerical solution of elliptic problems. SIAM, Philadelphia . Chan T, van der Vorst H () Linear system solvers: Sparse iterative methods. In: Keyes D et al. (eds) Parallel numerical algorithms, Proc. of the ICASE/LaRC Workshop on Parallel Numerical Algorithms, May –, , pp –. Kluwer, Dordrecht, The Netherlands
Linear Algebra, Numerical Bernard Philippe , Ahmed Sameh Campus de Beaulieu, Rennes, France Purdue University, West Lafayette, IN, USA
Synonyms Matrix computations
Definition Numerical linear algebra (or matrix computations) is the design and analysis of matrix operations and algorithms. Matrix computations play a vital role in enabling the majority of computational science and engineering applications. Examples of such applications include: computational fluid dynamics, structural mechanics, fluid–structure interaction, computational electromagnetics, image processing, web search (PageRank), and information retrieval, just to list a few. Matrix computations can be divided into two classes: (a) dense matrix computations, and (b) sparse matrix computations, with the former being rich in data locality and hence able to readily achieve high performance on modern architectures. The basic standard problems in numerical linear algebra are: (i) solving linear systems of equations, (ii) solving linear least squares problems with or without constraints, (iii) solving standard and generalized eigenvalue problems, and (iv) obtaining singular values or singular triplets of a matrix.
Discussion Sparse Matrix Computations The overwhelming majority of science and engineering applications give rise to sparse matrix computations.
A sparse matrix is one that is populated mainly with zeros, and hence there is a need for storing only the nonzero elements, especially for large problems. In the context of sparse matrix computations, one usually refers to sparse matrices with irregular sparsity structure. Designing algorithms that realize high performance for structured sparse matrix computations is much easier than in the case of irregular sparsity. Irregular sparsity leads to lack of data locality and expensive memory references across the memory hierarchies of modern parallel architectures. While the creation of a standard set of Sparse Basic Linear Algebra Subroutines (SBLAS) was not as successful as the classical dense BLAS, the most important kernels in sparse matrix computations are: (i) sparse matrix–vector multiplication and (ii) sparse matrix–tall dense matrix multiplication. Next to generating and applying preconditioners, these two kernels have a substantial impact on the realizable performance of Krylov subspace iterative schemes on these architectures. One should also state that in Finite Element Analysis it has been the tradition to resort to matrix-free computations, where the sparse matrices involved are not stored explicitly but algorithms are provided to compute their action, or the action of functions of such matrices, on a vector or a block of vectors. Such a matrix-free approach makes it rather difficult to use effective preconditioners for solving large sparse linear systems with the needed iterative methods. Thus, matrix-free formulations may not be the optimal choice when using the emerging petascale parallel architectures with their abundant memories.
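A matrix-free operator in the sense described above can be sketched with SciPy's LinearOperator interface: only the action of the matrix on a vector is coded (here, a 1-D Laplacian chosen purely for illustration), and a Krylov method consumes it without the matrix ever being stored.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 1000

def apply_laplacian(u):
    # action of tridiag(-1, 2, -1) on u, without storing the matrix
    v = 2.0 * u
    v[:-1] -= u[1:]
    v[1:] -= u[:-1]
    return v

A = LinearOperator((n, n), matvec=apply_laplacian, dtype=np.float64)
b = np.ones(n)
x, info = cg(A, b)                               # conjugate gradients needs only A's action
print(info, np.linalg.norm(apply_laplacian(x) - b))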
Dense Matrix Computations Unlike sparse matrix computations, dense matrix algorithms benefit from the regular storage of the dense matrices involved to achieve high data locality and hence high performance through the fine-tuned three-level dense BLAS on parallel architectures. The third level of this set (BLAS-3) includes matrix–matrix multiplications, which are the procedures that reach the highest rate of computation on such architectures. Even when dealing with sparse matrix computations, the goal is often to transform algorithms that are originally vector-oriented into their block versions so as to involve dense matrix multiplications. Moreover, the very popular Krylov methods for solving sparse linear systems or eigenvalue problems consist of projecting the
initial sparse problem onto some subspace of reduced dimension, leading to a similar dense matrix problem.
Bibliographic Notes and Further Reading For examples of parallel dense matrix computations, see [, , ], and for examples of parallel sparse matrix computations, see [].
Bibliography . Blackford L S, Choi J, Cleary A, D’Azevedo E, Demmel J, Dhillon I, Dongarra J, Hammarling S, Henry G, Petitet A, Stanley K, Walker D, Whaley RC () ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, available on line at http://www.netlib.org/lapack/lug/ . Dongarra JJ, Duff IS, Sorensen DC, Van der Vorst H () Numerical Linear Algebra for High Performance Computers. SIAM, Philadelphia . Gallivan KA, Plemmons RJ, Sameh AH () Parallel algorithms for dense linear algebra computations. In: Gallivan KA, Heath MT, Ng E, Ortega JM, Peyton BW, Plemmons RJ, Romine CH, Sameh AH, Voigt RG (ed) Parallel algorithms computations. SIAM, Philadelphia, PA . Golub GH, Van Loan CF () Matrix computations, rd edn. John Hopkins, New York
Linear Equations Solvers Dense Linear System Solvers Multifrontal Method Numerical Libraries Preconditioners for Sparse Iterative Methods Sparse Direct Methods
Linear Least Squares and Orthogonal Factorization Bernard Philippe, Ahmed Sameh Campus de Beaulieu, Rennes, France Purdue University, West Lafayette, IN, USA
Synonyms Linear regression; Overdetermined systems; Underdetermined systems
Definition A linear least squares problem is given by: obtain x ∈ Rⁿ such that the two-norm ∥b − Ax∥₂ is minimized, ()
where A ∈ R^{m×n} and b ∈ R^m. The vector x is a solution if and only if the residual r = b − Ax is orthogonal to the range of A: Aᵀr = 0. When A is of full column rank n, there exists a unique solution. When rank(A) < n, the solution of choice is the one with the smallest two-norm.
Discussion Note: The numbers of floating-point operations (additions, multiplications, divisions, square roots) of the algorithms are simply called operations in what follows.
Full Rank Standard Problem Consider first the case in which n ≤ m and A is of full column rank n. General scheme. Since the two-norm is invariant under
orthogonal transformations, the most common approach for solving this least squares problem consists of computing the orthogonal factorization of A. The procedure amounts to premultiplying A and b by a finite number – say p – of orthogonal transformations Q₁, ⋯, Q_p, such that
Q_p ⋯ Q₂Q₁ A = \begin{pmatrix} R \\ 0 \end{pmatrix},   ()
where R ∈ R^{n×n} is a nonsingular upper triangular or upper trapezoidal matrix. Therefore,
∥b − Ax∥₂ = ∥Q_p ⋯ Q₁(b − Ax)∥₂ = ∥ \begin{pmatrix} c₁ \\ c₂ \end{pmatrix} − \begin{pmatrix} R \\ 0 \end{pmatrix} x ∥₂,
where c₁ ∈ Rⁿ, and the unique solution of () is given by solving Rx = c₁. In what follows, only the orthogonal reduction to triangular form is considered. For solving the resulting triangular system, please refer to [BLAS 12.c.ii.1.a, linear solvers 12.c.ii.2.a].
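In terms of widely available library routines, the scheme just described can be sketched as follows (NumPy/SciPy, with random data standing in for A and b):

import numpy as np
from scipy.linalg import qr, solve_triangular

m, n = 200, 50
A = np.random.rand(m, n)
b = np.random.rand(m)

Q, R = qr(A, mode='economic')        # A = Q R with Q having orthonormal columns
c1 = Q.T @ b
x = solve_triangular(R, c1)          # R x = c1
print(np.linalg.norm(A.T @ (b - A @ x)))  # ~0: residual orthogonal to range(A)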
Householder Reductions The sequential algorithm. Given a vector a ∈ Rs , a
vector v ∈ R^s can be determined such that the transformation H = H(v) = I_s − βvvᵀ, where β = 2/(vᵀv), yields Ha = ±∥a∥₂e₁, where e₁ is the first canonical vector of R^s (there are two such transformations, from which one is selected for numerical stability []). H is an elementary reflector, i.e., H is orthogonal and H² = I_s. Application of H to a given vector x ∈ R^s is given by H(v)x = x − β(vᵀx)v and is computed using two Level 1 BLAS routines [], namely DDOT and DAXPY, involving only O(s) operations rather than the s² of an explicit matrix–vector multiplication. The triangularization of A is obtained by successively applying such orthogonal transformations to A: A₁ = A and, for k = 1 : min(n, m−1), A_{k+1} = H_k A_k, where H_k = \begin{pmatrix} I_{k−1} & \\ & H(v_k) \end{pmatrix}, with v_k chosen such that all but the first entry of H(v_k)A_k(k : m, k) are zero. At step k, the multiplication H_k A_k is obtained from H(v_k)A_k(k : m, k : n), which involves O((n − k)(m − k)) operations. It can be implemented by two Level 2 BLAS routines [], namely the matrix–vector multiplication DGEMV and the rank-one update DGER. The procedure stores the matrix A_{k+1} in place of A_k. The total procedure involves 2n²(m − n/3) + O(mn) operations for obtaining R. To apply the transformations simultaneously to the right-hand side b of problem (), one may append that vector as the (n + 1)st column of A. When necessary for subsequent calculations, the sequence of the vectors (v_k), k = 1, …, n, can be stored, for instance, in the lower part of the transformed matrix A_k. This is the case when n ≪ m and an orthogonal basis Q̃ = [q₁, ⋯, q_n] ∈ R^{m×n} of the range of A is sought: the basis is obtained by premultiplying the matrix \begin{pmatrix} I_n \\ 0 \end{pmatrix} successively by H_n, H_{n−1}, ⋯, H₁. This process involves on the order of mn² operations. Block versions of the procedure have been introduced by Bischof and Van Loan [] (the WY form) and by Schreiber and Parlett [] (the GGT form). The generation and application of the WY form are implemented in the routines DLARFT and DLARFB of LAPACK []. The latter involves Level 3 BLAS primitives []. The WY form consists of considering a narrow window in which the single-vector algorithm is used
before applying the corresponding transformations to the remaining part of A: if s is the window width, s steps are accumulated in the form H_{k+s} ⋯ H_{k+1} = I + WYᵀ, where W, Y ∈ R^{m×s}. This expression allows the use of Level 3 BLAS in updating the remaining part of A. In LAPACK, the least squares problem () is solved by the driver routine DGELS, and the orthogonal factor can be generated explicitly with DORGQR. The parallel algorithm. Theoretically, by using O(mn)
processors, the computation may be performed in O(n log m) time steps. This is a direct extension to rectangular matrices of Sameh's result []: in each of the n update phases, each scalar product can be performed in O(log m) time steps on m processors, and there are at most n columns to be processed in parallel. Since, except for very special computing architectures (e.g., systolic arrays), independent tasks cannot be reduced to one operation, a plausible approach is to wrap the columns of A across a ring of processors as shown in []. The block version of the algorithm above is the preferred parallel scheme; it is implemented in ScaLAPACK [] via the routines PDGEQRF and PDGELS, where A and b are distributed on a two-dimensional grid of processes according to the block cyclic scheme [] (see [4. Parallel programming / i. Data distribution]). The block size is chosen large enough to allow Level 3 BLAS routines to achieve close to their maximum performance on each involved uniprocessor, as recommended by the authors of ScaLAPACK []. While increasing the block size improves the granularity of the computation, it may negatively affect concurrency. An optimal tradeoff therefore must be found depending on the architecture of the computing platform.
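A compact, unblocked version of the sequential Householder reduction described above (an illustrative NumPy sketch, not the LAPACK or ScaLAPACK code):

import numpy as np

def house(a):
    # Householder vector v and beta so that (I - beta v v^T) a = -+||a|| e1
    v = a.copy()
    v[0] += np.sign(a[0] if a[0] != 0 else 1.0) * np.linalg.norm(a)  # stable sign choice
    beta = 2.0 / (v @ v)
    return v, beta

def householder_qr(A):
    A = A.copy()
    m, n = A.shape
    vs = []
    for k in range(min(n, m - 1)):
        v, beta = house(A[k:, k])
        A[k:, k:] -= beta * np.outer(v, v @ A[k:, k:])   # apply H(v_k) to the trailing block
        vs.append((v, beta))
    return np.triu(A[:n, :n]), vs                        # R and the stored reflectors

A = np.random.rand(6, 4)
R, _ = householder_qr(A)
print(np.allclose(np.abs(np.linalg.qr(A)[1]), np.abs(R)))   # True, up to signs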
Givens Rotations The sequential algorithm. In the standard Givens' process, the orthogonal matrices in () consist of p = mn − n(n + 1)/2 plane rotations extended to the entire space R^m. The triangularization of A is obtained by successively applying plane rotations R_k, so that A_{k+1} = R_k A_k for k = 1 : mn − n(n + 1)/2, with A₁ = A. Each of these rotations is chosen to introduce a single zero below the diagonal of A: at step k, eliminating the entry a_{iℓ}^{(k)} of A_k, where 1 ≤ ℓ ≤ n and ℓ < i ≤ m, can be done by the rotation
R_{ij}^{(k)} =
\begin{pmatrix}
1 &        &        &        &        &        &   \\
  & \ddots &        &        &        &        &   \\
  &        &   c    & \cdots &   s    &        &   \\
  &        & \vdots & \ddots & \vdots &        &   \\
  &        &  -s    & \cdots &   c    &        &   \\
  &        &        &        &        & \ddots &   \\
  &        &        &        &        &        & 1
\end{pmatrix}   ()
(the rotation differs from the identity only in rows and columns j and i, which carry the entries c, s, −s, c), where 1 ≤ j ≤ n and c² + s² = 1 are such that
\begin{pmatrix} c & s \\ -s & c \end{pmatrix} \begin{pmatrix} a_{jℓ}^{(k)} \\ a_{iℓ}^{(k)} \end{pmatrix} = \begin{pmatrix} a_{jℓ}^{(k+1)} \\ 0 \end{pmatrix}.
Two such rotations exist, from which one is selected for numerical stability []. The construction of a rotation and its application to a couple of rows are implemented in the Level 1 BLAS [] via the procedures DROTG and DROT, respectively. The order of annihilation is quite flexible as long as applying a new rotation does not destroy a previously introduced zero. The simplest way is to organize columnwise eliminations. The total number of operations is 3mn² − n³ + O(mn), which is roughly 1.5 times that required by the Householder reduction.
Parallel algorithms. One parallel algorithm consists of organizing the mn − n(n + 1)/2 rotations into sweeps, which are sets of at most ⌊m/2⌋ parallel rotations. For square matrices, Sameh and Kuck [] and, later, Cosnard and Robert [] introduced such O(n) procedures using O(n²) processors. In the general situation of rectangular matrices, Cosnard et al. in [] proved the theoretical optimality of the greedy algorithm introduced by Modi and Clarke in [] over any other algorithm. More details on the chronological appearance of the Givens algorithms and their comparison with the Householder reductions are given in [].
Hybrid Algorithms (m ≫ n) Dense matrix. This is a situation that arises in some
computations like in the adjustment of geodetic networks [], or in the parallel construction of a Krylov basis, where often m is larger than a million and n is small []. The algorithm was introduced by Sameh []. It is depicted in Fig. for the case where two processors are used. It consists of partitioning A into contiguous blocks of r rows, where r ≥ n, with each block assigned to one processor. In the first step (Fig. a), each block is triangularized by Householder reduction. This step consumes the major part of the overall computation and is void of any communications. The second step is devoted to annihilating all the resulting triangular factors except the first one. There are several possibilities for achieving this goal. In [], for annihilating the entries of the triangular matrix T₂, the rotations are chosen as depicted in Fig. b. The first entry of the first row of T₂ is annihilated by rotating the first rows of T₁ and T₂. Then all the other entries of the first diagonal of T₂ can easily be annihilated by rotating rows of T₂. As soon as that first entry of T₂ has been annihilated, the next diagonal entry of T₂ can be annihilated, before the annihilation of the first diagonal of the matrix is completed. Rotations indicated with stars are the only ones which require synchronization.
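Each annihilation step in this second stage is an ordinary Givens rotation; constructing and applying one (the job done by DROTG and DROT) can be sketched as follows, with a made-up 2-by-3 example:

import numpy as np

def givens(a, b):
    # c, s with c^2 + s^2 = 1 such that the rotation zeroes b against a
    r = np.hypot(a, b)
    return a / r, b / r

A = np.array([[3.0, 1.0, 2.0],
              [4.0, 5.0, 6.0]])
c, s = givens(A[0, 0], A[1, 0])
G = np.array([[c, s], [-s, c]])
A = G @ A
print(A[:, 0])                   # approximately [5, 0]: the lower entry is annihilated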
Linear Least Squares and Orthogonal Factorization. Fig. Parallel annihilation (panels a and b; triangular factors T1 and T2)
By following that strategy, Sidje [] developed the software RODDEC, which performs the QR factorization on a ring of processors. Block angular matrix. In some applications, a domain
decomposition provides a matrix such that, under some appropriate numbering of the domains and corresponding interfaces, the matrix A is of the form:
A = \begin{pmatrix} B_1 & & & C_1 \\ & B_2 & & C_2 \\ & & \ddots & \vdots \\ & & B_r & C_r \end{pmatrix}   ()
where A, as well as each block B_i for i = 1, ⋯, r, is of full column rank. The special structure implies that the QR factorization of A yields an upper triangular matrix with the same block structure; this can be seen, as shown by Golub et al. in [], from the Cholesky factorization of AᵀA, which is a block arrowhead matrix. Therefore, similar to the first step of the algorithm previously described, for i = 1, ⋯, r, processor i first performs the QR factorization of the block B_i. The corresponding blocks of [B_i, C_i] are now transformed into
[ \begin{pmatrix} R_i \\ 0 \end{pmatrix}, \begin{pmatrix} D_i \\ D̃_i \end{pmatrix} ].
By virtually reordering the block rows, the resulting matrix can be expressed as:
\begin{pmatrix} R_1 & & & D_1 \\ & R_2 & & D_2 \\ & & \ddots & \vdots \\ & & R_r & D_r \\ & & & D̃_1 \\ & & & D̃_2 \\ & & & \vdots \\ & & & D̃_r \end{pmatrix}   ()
To complete the QR factorization of A, it is therefore necessary to obtain the QR factorization of the submatrix defined by the sub-blocks D̃_i for i = 1, ⋯, r. This is exactly the situation where the procedure RODDEC
can be used. Other approaches can also be considered, as mentioned in []. For instance, Golub et al. in [] introduce a version of the algorithm adapted to the hierarchical definition of the subdomains. The algorithm takes advantage of the locality of the frontiers to minimize communications. Gram-Schmidt procedures. The goal of these procedures
is to obtain the orthogonal factorization A = QR of a full rank matrix A = [a₁, ⋯, a_n] ∈ R^{m×n}, where Q = [q₁, ⋯, q_n] ∈ R^{m×n} has orthonormal columns, and R ∈ R^{n×n} is a nonsingular upper-triangular matrix with positive diagonal elements, thus ensuring the uniqueness of the factorization. This factorization is often referred to as the skinny QR factorization. Since orthogonality of the vectors q_j deteriorates rather quickly, reorthogonalization is often necessary. The general structure of the scheme is given by
Algorithm: Gram-Schmidt
do k = 0 : n−1,
  w = P_k(a_{k+1}) ;
  q_{k+1} = w/∥w∥₂ ;
end
where P_k is the orthogonal projector onto the orthogonal complement of the subspace spanned by {a₁, ⋯, a_k} (P₀ is the identity). R = QᵀA can be built step by step during the process. Two different versions of the algorithm are obtained by expressing P_k in distinct ways:
Classical_GS: for the Classical Gram-Schmidt, P_k = I − Q_k Q_kᵀ, where Q_k = [q₁, ⋯, q_k]. Unfortunately, this version has been proved by Bjorck [] to be numerically unreliable, except when it is applied two times, which makes it numerically equivalent to Modified_GS.
Modified_GS: for the Modified Gram-Schmidt, P_k = (I − q_k q_kᵀ) ⋯ (I − q₁q₁ᵀ). Numerically, the columns of Q are orthogonal up to a tolerance determined by the machine precision multiplied by the condition number of A []. By applying the algorithm a second time (i.e., with complete reorthogonalization), the columns of Q become orthogonal up to the tolerance determined by the machine precision parameter, similar to the Householder or Givens reductions.
The basic procedures involve 2mn² + O(mn) operations. Classical_GS is based on Level 2 BLAS procedures, but it must be applied two times, while Modified_GS is based on Level 1 BLAS routines. By inverting the two loops of Modified_GS, the procedure can proceed with a Level 2 BLAS routine. This is only possible when A is available explicitly (for instance, this is not possible with the Arnoldi procedure [Krylov methods 12.c.ii.3.f]). In order to proceed with Level 3 BLAS routines, a block version Block_GS is obtained by considering blocks of vectors instead of single vectors in Modified_GS and by replacing the normalizing step by an application of Modified_GS on the individual blocks. Jalby and Philippe [] proved that it is necessary to apply the normalizing step twice to reach the numerical accuracy of Modified_GS. However, with q blocks of r vectors (n = qr), the resulting Block2_GS procedure only involves on the order of mn²/q additional operations, which makes Block2_GS faster than Modified_GS as soon as q is not too small. Parallel implementations of the different forms of Gram-Schmidt orthogonalization are similar to those of Gauss elimination schemes, in which there are two nested loops of vector operations. To orthogonalize the columns of A on a cluster, the procedure Modified_GS can be expressed by
Algorithm: Parallel Modified_GS
do j = 1 : n,
  a_j = a_j/∥a_j∥₂ ;
  doall k = j+1 : n,
    a_k = a_k − (a_jᵀ a_k) a_j ;
  end
end
By inserting a synchronizing mechanism to ensure that normalizing a vector happens only after it is completely updated, it is even possible to transform the sequential outer loop into a doacross loop. On distributed memory platforms, this algorithm may also be used, but it might be more efficient to use a block partitioning similar to that of ScaLAPACK for either Modified_GS or Block2_GS.
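A direct transcription of Modified_GS (an illustrative NumPy sketch, not the blocked or distributed versions discussed above):

import numpy as np

def modified_gs(A):
    # Modified Gram-Schmidt: each remaining column is orthogonalized against q_k in turn
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(A[:, k])
        Q[:, k] = A[:, k] / R[k, k]
        for j in range(k + 1, n):
            R[k, j] = Q[:, k] @ A[:, j]
            A[:, j] -= R[k, j] * Q[:, k]      # apply (I - q_k q_k^T) to a_j
    return Q, R

A = np.random.rand(100, 10)
Q, R = modified_gs(A)
print(np.linalg.norm(Q.T @ Q - np.eye(10)), np.linalg.norm(Q @ R - A))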
Other Least Squares Problems Rank-deficient problems. When A is rank q < n, there is
an infinite set of solutions for (): for any solution x, the
set of all solutions is expressed by S = x + N(A), where N(A) is the (n − q)-dimensional null space of A. The Singular Value Decomposition (SVD) of A is given by
A = U \begin{pmatrix} Σ & 0_{q,n−q} \\ 0_{m−q,q} & 0_{m−q,n−q} \end{pmatrix} Vᵀ,   ()
where the diagonal entries of the matrix Σ = diag(σ₁, ⋯, σ_q) are positive, and U ∈ R^{m×m}, V ∈ R^{n×n} are orthogonal matrices. Let c = Uᵀb = \begin{pmatrix} c₁ \\ c₂ \end{pmatrix}, where c₁ ∈ R^q. Therefore, x ∈ S if and only if there exists y ∈ R^{n−q} such that x = V \begin{pmatrix} Σ⁻¹c₁ \\ y \end{pmatrix}, and the corresponding minimum residual norm is equal to ∥c₂∥₂, which is null when m = q. In S, the classically selected solution is the one with the smallest norm, i.e., the one with y = 0.
Householder QR with column pivoting. In order to avoid computing the singular value decomposition, which is quite expensive, the factorization () is often replaced by a similar factorization where Σ is not assumed to be diagonal. Such factorizations are called complete orthogonal factorizations. One way to compute such a factorization relies on the procedure introduced by Businger and Golub []. It computes the QR factorization by Householder reductions with column pivoting. Factorization () is now replaced by
Q_q ⋯ Q₂Q₁ A P₁P₂ ⋯ P_q = \begin{pmatrix} T \\ 0 \end{pmatrix},   ()
where the matrices P₁, ⋯, P_q are permutations and T ∈ R^{q×n} is upper-trapezoidal. The permutations are determined so as to insure that the absolute values of the diagonal entries of T are nonincreasing. To obtain a complete orthogonal factorization of A, the process performs first the factorization () and then computes an LQ factorization of T (equivalent to a QR factorization of Tᵀ). The process ends up with a factorization similar to () with Σ being a nonsingular lower-triangular matrix. In LAPACK, Householder QR with column pivoting is implemented in the routine DGEQPF, which only involves Level 1 and Level 2 BLAS routines. The new
routine DGEQP3 is proposed with Level 3 BLAS calculations. The rank-deficient linear least squares problems are solved by the routine DGELSX, which uses DGEQPF, or DGELSY, which, in turn, uses DGEQP3. Parallel implementation. In ScaLAPACK, Householder
QR with column pivoting and complete orthogonal factorization are both implemented via the routines PDGEQPF and PDTZRZF, respectively. Bischof introduced a method that avoids global pivoting to achieve a better parallel efficiency []. It consists of a local pivoting strategy limited to each block allocated to a single processor. This strategy is numerically acceptable as long as the condition number of a block remains below some threshold. In order to control it, the author uses an Incremental Condition Estimator (ICE) studied in []. Achieving adequate local pivoting strategies is a topic that is yet to be completely settled.
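The following sketch (SciPy, with a synthetic rank-2 matrix) shows the two ingredients discussed above: a pivoted QR factorization exposing the numerical rank in the diagonal of R, and an SVD-based least squares solver returning the minimum-norm solution.

import numpy as np
from scipy.linalg import qr, lstsq

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 4))   # 10-by-4 matrix of rank 2
b = rng.standard_normal(10)

Q, R, piv = qr(A, mode='economic', pivoting=True)
print(np.abs(np.diag(R)))            # two values well above the rest: numerical rank 2

x, res, rank, sv = lstsq(A, b)       # SVD-based LAPACK driver ('gelsd') is the default
print(rank, np.linalg.norm(x))       # reported rank 2 and the smallest-norm solution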
Generalized least squares and total least squares problems. In many applications, the right-hand side b of problem () is known only with some uncertainty: its error is known to follow a statistical law given by its covariance matrix Ω. In general, Ω is positive definite and its Cholesky factorization Ω = LLᵀ exists. The Generalized Least Squares problem consists of computing x ∈ Rⁿ such that ∥b − Ax∥_{Ω⁻¹} is minimized. It can therefore be recast into the regular least squares problem: compute x ∈ Rⁿ such that ∥L⁻¹(b − Ax)∥₂ is minimized. This approach is acceptable when (1) Ω is not ill-conditioned and (2) the factor L can be computed. When the matrix A suffers from uncertainty as well, the problem becomes one of Total Least Squares. The basic problem is then given by
minimize ∥[A; b] − [Â; b̂]∥_F subject to b̂ ∈ R(Â),
where ∥⋅∥_F denotes the Frobenius norm and R(Â) is the range of Â. Any x ∈ Rⁿ for which Âx = b̂ is a Total Least Squares (TLS) solution. Denoting by σ_n and σ̃_{n+1}, respectively, the smallest singular values of A and [A; b], and assuming that σ_n > σ̃_{n+1}, the only TLS solution is x̂ = (AᵀA − σ̃²_{n+1} I)⁻¹ Aᵀb. Total Least Squares theory, problems, as well as solution schemes are given by Van Huffel and Vandewalle in [].
Parallel algorithms for the Generalized or Total Least Squares problems are based on Linear Least Squares, Cholesky factorization, and Singular Value Decomposition. Kontoghiorghes considers the Seemingly Unrelated Regression Equations (SURE) models as Generalized Least Squares problems []. In subsequent work with his co-authors, he considers updating (adding equations to the model), down-dating (deleting equations), as well as parallel resolution of the models. Gang Lou and Shih-Ping Han introduced in [] a general class of parallel methods for solving other Least Squares problems with constraints.
Related Entries BLAS (Basic Linear Algebra Subprograms) Dense Linear System Solvers LAPACK Linear Algebra, Numerical PLAPACK ScaLAPACK
Bibliographic Notes and Further Reading Numerical methods for least squares problems: [] Parallel Algorithms for dense computations: [, ] Total least squares : []
Bibliography . Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J, Du Croz J, Greenbaum A, Hammarling S, McKenney A, Sorensen D () LAPACK users’ guide, rd edn. SIAM, Philadelphia, Library available at http://www.netlib.org/lapack/ . Bischof CH () A parallel QR factorization algorithm using local pivoting. In: Proceedings of the ACM/IEEE Conference on Supercomputing, Orlando, pp – . Bischof CH () Incremental condition estimation. SIAM J Matrix Anal Appl –:– . Bischof CH, Van Loan C () The WY representation for products of householder matrices. SIAM J Sci Stat Comp :s–s . Bjorck A () Solving linear least suares problems by GramScmidt orthogonalization. BIT :– . Bjorck A () Numerical methods for least squares problems. SIAM, Philadelphia . Blackford LS, Choi J, Cleary A, D’Azevedo E, Demmel J, Dhillon I, Dongarra J, Hammarling S, Henry G, Petitet A, Stanley K, Walker D, Whaley RC () ScaLAPACK users’ guide. SIAM, Philadelphia, Available on line at http://www.netlib.org/lapack/lug/
. Blackford LS, Choi J, Cleary A, D’Azevedo E, Demmel J, Dhillon I, Dongarra J, Hammarling S, Henry G, Petitet A, Stanley K, Walker D, Whaley RC () ScaLAPACK: a linear Algebra Library for Message-Passing Computers. Information Bridge: DOE Scientific and Technical Information, #. Available at http://www.osti.gov/bridge/servlets/purl/dBCNwX/webviewable/.pdf . Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S, Henry G, Heroux M, Kaufman L, Lumsdaine A, Petitet A, Pozo R, Remington K, Whaley RC () An updated set of basic linear algebra subprograms (BLAS). ACM Trans Math Soft –:– . Businger PA, Golub GH () Linear least squares solutions by householder transformations. Numer Math :– . Cosnard M, Muller J-M, Robert Y () Parallel QR decomposition of a rectangular matrix. Numer Math :– . Cosnard M, Robert Y () Complexity of parallel QR factorization. J ACM ():– . Dongarra JJ, Duff IS, Sorensen DC, Van der Vorst H () Numerical linear algebra for high performance computers. SIAM, Philadelphia . Gallivan KA, Plemmons RJ, Sameh AH () Parallel algorithms for dense linear algebra computations. In: Gallivan KA, Heath MT, Ng E, Ortega JM, Peyton BW, Plemmons RJ, Romine CH, Sameh AH, Voigt RG (eds) Parallel algorithms computations. SIAM, Philapdelphia . Golub GH, Plemmons RJ, Sameh AH () Parallel block schemes for large-scale least-squares computations. In: Wilhelmson RB (ed) High-speed computing, scientific applications and algorithm design. University of Illinois Press, Illinois, pp – . Golub GH, Van Loan CF () Matrix computations, rd edn. John Hopkins University Press, Baltimore . Jalby W, Philippe B () Stability analysis and improvement of the Block Gram-Schmidt algorithm. SIAM J Sci Stat Comput ():– . Kontoghiorghes EJ () Parallel algorithms for linear models: numerical methods and estimation problems. Advances in computational economics, vol , Springer, Boston, Dodrecht, London, Kluwer Academic Publishers . Lawson CL, Hanson RJ, Kincaid DR, Krogh FT () Basic linear algebra subprograms for FORTRAN usage. ACM Trans Math Soft –:– . Lou G, Han S-P () A parallel projection method for solving generalized linear least-squares problems. Numer Math ():– . Modi JJ, Clarke MRB () An alternative Givens ordering. Numer Math :– . Sameh AH () Numerical parallel algorithms – a survey. In: Kuck D, Lawrie D, Sameh A (eds) High speed computer and algorithm organization. Academic, New York, pp – . Sameh AH () Solving the linear leastsquares problem on a linear array of processors. In: Snyder L, Jamieson LH, Gannon DB, Siegel HJ (eds) Algorithmically specialized parallel computers. Academic, New York, pp –
. Sameh AH, Kuck D () On stable parallel linear system solvers. J ACM ():– . Schreiber R, Parlett BN () Block reflectors: theory and computation. SIAM J Numer Anal :– . Sidje RB () Alternatives for parallel Krylov subspace basis computation. Numer Linear Algebra Appl –:– . Van Huffel S, Vandewalle J () The total least squares problems: computational aspects and analysis. SIAM, Philadelphia
Linear Regression Linear Least Squares and Orthogonal Factorization
LINPACK Benchmark Jack Dongarra, Piotr Luszczek University of Tennessee, Knoxville, TN, USA
Definition The LINPACK benchmark is a computer benchmark that reports the performance for solving a system of linear equations with a general dense matrix of size 100 by 100, of size 1000 by 1000, and also of arbitrary size. The matrices, the calculations, and the solution must use 64-bit floating point arithmetic, and partial pivoting needs to be used for numerical stability. The allowed parallelism modes include automatic parallelization done by the compiler as well as manual parallelization that uses hardware-assisted shared memory or explicit message passing on a distributed memory machine.
Discussion The original LINPACK Benchmark is, in some sense, an accident. It was originally designed to assist users of the LINPACK package [] by providing information on execution times required to solve a system of linear equations. The first "LINPACK Benchmark" report appeared as an appendix in the LINPACK Users' Guide in 1979. The appendix comprised data for one commonly used path in the LINPACK software package. Results were provided for a matrix problem of size 100, on a collection of widely used computers (23 computers in all). This was done so users could estimate the time required to solve their matrix problem by extrapolation.
Over the years, additional performance data was added, more as a hobby than anything else, and today the collection includes well over a thousand different computer systems. In addition to the increasing number of computers, the scope of the benchmark has also expanded. The benchmark report describes the performance for solving a general dense matrix problem Ax = b at three levels of problem size and optimization opportunity: a 100 by 100 problem (inner loop optimization), a 1000 by 1000 problem (three-loop optimization – the whole program), and a scalable parallel problem. The LINPACK benchmark features two routines: DGEFA and DGESL (these are the double precision versions, usually 64-bit floating point arithmetic; SGEFA and SGESL are the single-precision counterparts, usually 32-bit floating point arithmetic); DGEFA performs the decomposition with partial pivoting and DGESL uses that decomposition to solve the given system of linear equations. Most of the execution time – O(n³) floating-point operations – is spent in DGEFA. Once the matrix has been decomposed, DGESL is used to find the solution; this requires O(n²) floating-point operations. DGEFA and DGESL in turn call three BLAS routines: DAXPY, IDAMAX, and DSCAL. By far the major portion of time at order 100 is spent in DAXPY. DAXPY is used to multiply a scalar, α, times a vector, x, and add the result to another vector, y. It is called approximately n²/2 times by DGEFA and 2n times by DGESL with vectors of varying length. The statement yᵢ ← yᵢ + αxᵢ, which forms an element of the DAXPY operation, is executed approximately n³/3 + n² times, which gives rise to roughly 2/3 n³ floating-point operations in the solution. Thus, the n = 100 benchmark requires roughly 2/3 million floating-point operations. The statement yᵢ ← yᵢ + αxᵢ, besides the floating-point addition and floating-point multiplication, involves a few one-dimensional index operations and storage references. While the LINPACK routines DGEFA and DGESL involve two-dimensional array references, the BLAS refer to one-dimensional arrays. The LINPACK routines in general have been organized to access two-dimensional arrays by column. In DGEFA, the call to DAXPY passes an address into the two-dimensional array A, which is then treated as a one-dimensional reference within DAXPY. Since the indexing is down
a column of the two-dimensional array, the references to the one-dimensional array are sequential with unit stride. This is a performance enhancement over, say, addressing across the column of a two-dimensional array. Since Fortran dictates that two-dimensional arrays be stored by column in memory, accesses to consecutive elements of a column lead to simple index calculations. References to consecutive elements differ by one word instead of by the leading dimension of the two-dimensional array.
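The role DAXPY plays inside the factorization can be sketched in a few lines (a Python illustration only: no pivoting, unlike the real DGEFA, and the 2-by-2 matrix is a made-up example):

import numpy as np

def daxpy(alpha, x, y):
    # y <- y + alpha * x, operating in place on the slice y
    y += alpha * x

def lu_inplace(A):
    n = A.shape[0]
    for k in range(n - 1):
        A[k + 1:, k] /= A[k, k]           # DSCAL-like step: compute the multipliers
        for j in range(k + 1, n):
            daxpy(-A[k, j], A[k + 1:, k], A[k + 1:, j])   # one DAXPY per remaining column
    return A

A = np.array([[4.0, 3.0], [6.0, 3.0]])
lu_inplace(A)
print(A)   # [[4., 3.], [1.5, -1.5]]: U on and above the diagonal, multipliers below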
Detailed Operation Counts The results reflect only one problem area: solving dense systems of equations using the LINPACK programs in a Fortran environment. Since most of the time is spent in DAXPY, the benchmark is really measuring the performance of DAXPY. The average vector length for the algorithm used to compute the LU decomposition with partial pivoting is 2/3 n. Thus in the benchmark with n = 100, the average vector length is 66. In order to solve this matrix problem, it is necessary to perform almost 700,000 floating point operations. The time required to solve the problem is divided into this number to determine the megaflops rate. The routine DGEFA calls IDAMAX, DSCAL, and DAXPY. Routine IDAMAX, which computes the index of a vector element with largest modulus, is called 99 times, with vector lengths running from 100 down to 2. Each call to IDAMAX gives rise to n double precision absolute value computations and n − 1 double precision comparisons, for a total of 5,049 double precision absolute values and 4,950 double precision comparisons. DSCAL is called 99 times, with vector lengths running from 99 down to 1. Each call to DSCAL performs n double precision multiplies, for a total of 4,950 multiplications. DAXPY does the bulk of the work. It is called 4,950 times, with vector lengths varying from 99 down to 1. Each call to DAXPY gives rise to one double precision comparison with zero, n double precision additions, and n double precision multiplications. This leads to 4,950 comparisons against 0, 328,350 additions, and 328,350 multiplications. In addition, DGEFA itself does 99 double precision comparisons against 0, and 99 double precision reciprocals. The total operation count for DGEFA is given in Table .
LINPACK Benchmark. Table Double precision operation counts for LINPACK's DGEFA routine (operation type versus operation count: add, multiply, reciprocal, absolute value, and comparisons)
LINPACK Benchmark. Table Double precision operation counts for LINPACK's DGESL routine (operation type versus operation count: add, multiply, divide, and negate)
LINPACK Benchmark. Table Double precision operation counts for the LINPACK benchmark (totals over DGEFA and DGESL)
Routine DGESL, which is used to solve a system of equations based on the factorization from DGEFA, does a much more modest amount of floating point work. In DGESL, DAXPY is called in two places, in each case with vector lengths running from 99 down to 1. This leads to a total of 9,900 double precision additions, the same number of double precision multiplications, and about 200 comparisons against zero. DGESL does divisions and negations as well. The total operation count for DGESL is given in Table . This leads to a total operation count for the LINPACK benchmark given in Table . (The LINPACK benchmark uses the nominal operation count of 2/3 n³ + 2n², which for n = 100 comes to about 686,667 floating point operations.) It is instructive to look just at the contribution due to DAXPY. Of these floating point operations, the calls to DAXPY from DGEFA and DGESL account for a total of 338,250 additions, 338,250 multiplications, and roughly 5,000 comparisons with zero; together these are the overwhelming majority of all the floating point operations that are executed. The total time is taken up with more than just arithmetic operations. In particular, there is quite a lot of time spent loading and storing the
operands of the floating point operations. We can estimate the number of loads and stores by assuming that all operands must be loaded into registers, but also assuming that the compiler will do a reasonable job of promoting loop-invariant quantities out of loops, so that they need not be loaded each time within the loop. Under these assumptions, DAXPY again dominates the statistics, accounting for the bulk of the double precision loads and stores; IDAMAX, DSCAL, and the code in DGEFA and DGESL outside of the BLAS calls contribute the remainder. The other overhead that must be accounted for is the loop indexing, address arithmetic, and the overhead of argument passing and calls.
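Putting the pieces together, the megaflops figure is just the nominal operation count divided by the measured time; a tiny sketch (the 0.05-second timing is a made-up number for illustration):

n = 100
ops = 2.0 / 3.0 * n**3 + 2.0 * n**2      # nominal LINPACK operation count
time_seconds = 0.05                      # hypothetical measured solve time
print(round(ops), round(ops / time_seconds / 1e6, 1))   # 686667 operations, 13.7 Mflop/s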
Precision In discussions of scientific computing, one normally assumes that floating-point computations will be carried out to full precision or 64-bit floating point arithmetic. Note that this is not an issue of single or double precision, as some systems have 64-bit floating point arithmetic as single precision. It is a function of the arithmetic used.
Related Entries Benchmarks HPC Challenge Benchmark Livermore Loops TOP500
Bibliography . Dongarra JJ, Bunch J, Moler C, Stewart GW () LINPACK user’s guide, SIAM, Philadelphia
Linux Clusters Clusters
*Lisp Guy L. Steele Jr. Oracle Labs, Burlington, MA, USA
Definition ∗ Lisp (pronounced "star-lisp") is a data-parallel dialect of Lisp developed around 1985 for Connection Machine supercomputers manufactured by Thinking Machines Corporation. It consists of Common Lisp (or, in its earliest implementations, Symbolics Zetalisp) augmented with a package of relatively low-level data-parallel operations. The language design draws a sharp distinction between data residing in the front-end Common Lisp processor and data residing in the massively parallel Connection Machine coprocessor. While pointers to any Lisp data could reside in the Connection Machine processors, parallel processing was most effective on arrays of numeric and character data.
Discussion
Of the four programming languages (∗Lisp, C∗ , CM Fortran, and CM-Lisp) provided by Thinking Machines Corporation for Connection Machine Systems, ∗ Lisp was the first to be implemented and the one most closely aligned to details of the Connection Machine architecture; essentially anything that could be done with the Connection Machine hardware could be expressed in ∗ Lisp, but the programmer was freed from much drudgery because expression evaluation made automatic use of a stack for parallel data structures and provided automatic conversions among data types and sizes as necessary. To quote from the Connection Machine Model CM-2 Technical Summary:
The parallel primitives of ∗ Lisp support a model of the Connection Machine in which each processor executes a subset of Common Lisp, with a single thread of control residing on the front-end computer. For most Common Lisp functions, ∗ Lisp provides a corresponding parallel function that can operate on all processors, or some selected subsets, simultaneously. In addition, the language provides Lisp-level operators for communicating between processors, both through pointers and in regular patterns. Sequential Common Lisp code, running on the front end, can be freely intermixed with the parallel code executed on the Connection Machine. Most ∗ Lisp functionality corresponds directly to underlying Paris [Connection Machine PARallel Instruction Set] instructions.… As a result … direct calls to Paris instructions and special purpose microcode blend in naturally with ∗ Lisp code [, p. ], [, p. ].
In later documentation, the phrase "most Common Lisp functions" had been amended to "many Common Lisp functions" [, p. ]. A ∗ Lisp pvar (parallel variable) was, in effect, an array indexed by processor number, so that each (virtual) Connection Machine processor contained one array element. A unary parallel operation would perform the same scalar operation on each element simultaneously; a binary parallel operation would operate on corresponding elements of two pvars. While, in general, a pvar could hold pointers to any Common Lisp objects, not all operations on Common Lisp objects were supported in parallel form. In particular, classic list operations such as CAR and CDR and ASSOC had no parallel versions ("∗ Lisp does not implement list pvars" [, p. ]). On the other hand, most atomic Common Lisp types were eventually supported, including integers and floating-point numbers of various fixed sizes, complex numbers, and characters; and many of the Common Lisp sequence functions were provided in parallel form for operating on many arrays simultaneously. "The eight basic pvar types are boolean, integer, floating-point, complex, character, structure, and front-end value" [, p. ]. The operation of indexing into a pvar to fetch or store a specified element was supported in both scalar and parallel forms; in this way either the front-end computer or the entire set of Connection Machine processors could freely fetch or store any element of a pvar.
*Lisp
including user-defined macros, and because it provided convenient access to all the facilities of the Connection Machine hardware, including (on the model CM-2) the multidimensional NEWS network, "backwards routing" in the hypercube network, the thousands of red LEDs that were a distinctive feature of the Connection Machine cabinetry, and operations on integers and floating-point numbers that could be of any specified length, for example floating-point formats with nonstandard exponent and significand widths (a flexibility that few users exploited in practice once the hardware floating-point accelerators came into use on the model CM-2). Eventually, ∗ Lisp grew to include several hundred functions and macros, listed in a ∗ Lisp Dictionary of over a thousand pages []. Compared with other dialects and extensions of Lisp, ∗ Lisp had a distinct visual style because of its consistent use of either a leading "∗ " or a trailing "!!" in names:
The names of functions and macros that return pvars as their values end in !!. This suffix, pronounced bang-bang, is meant to look like two parallel lines.… All ∗ Lisp functions that perform parallel computation and do not end in !! begin with ∗ (pronounced star); hence, the name ∗ Lisp [, p. ].
A later document further explains:
The characters "!!" are meant to resemble the mathematical symbol ∥, which means parallel. [, p. ]
(It should be noted that the lexical syntax of Common Lisp [, ] would have made it awkward to use two vertical-bar symbols “||” rather than two exclamation points.)
Figure shows an early example of a ∗ Lisp program that identifies prime numbers less than the number of virtual processors, by the method of the Sieve of Eratosthenes, taken (with minor corrections) from []. A set of virtual processors is assumed to have been established before the function find-primes is called. The special form ∗ all enables all virtual processors for activity. The special form ∗ let is analogous to Common Lisp let in that it binds local variables, but ∗ let binds local parallel variables, allocating them on a stack within the Connection Machine (actually, one stack per virtual processor). In this case, the local parallel variable prime is initially nil (false) within each processor, and the local parallel variable candidate is initially t (true) within each processor. The conditional form ∗ when is analogous to Common Lisp when, but operates by (a) evaluating a parallel test form; (b) temporarily disabling for execution any processors that computed nil for the test form in step (a); (c) executing the form(s) in the body of the ∗ when construct; and finally (d) reenabling any processors disabled in step (b). The function self-address!! returns the virtual processor address, which is a different integer in each virtual processor (starting from zero); the function !! broadcasts a scalar value from the front-end processor to all Connection Machine processors, so that (!! 2) is a parallel quantity with value 2 in every virtual processor; and function
*Lisp. Fig. Example ∗ Lisp program for identifying prime numbers
processor has the value nil for candidate, and (∗ min (self-address!!)), because it occurs within (∗ when candidate ...), will return the smallest integer for which candidate is true. The function pref is analogous to Common Lisp aref; it may be used by the front end to access a single element of a parallel value by supplying an integer index. (The parallel equivalent, not used in this example, is pref!!, which allows every Connection Machine processor to fetch a value by indexing into a parallel value; this is one way of performing interprocessor communication in ∗ Lisp.) The functions zerop!! and mod!! are parallel arithmetic operations analogous to Common Lisp zerop and mod.
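For comparison, the same data-parallel logic can be sketched in NumPy. This is an analogue written for illustration, not ∗ Lisp and not the code of the original figure; one array element stands in for one virtual processor, and 32 processors is an arbitrary choice.

import numpy as np

def find_primes(n_processors=32):
    self_address = np.arange(n_processors)         # like (self-address!!)
    prime = np.zeros(n_processors, dtype=bool)     # like (!! nil)
    candidate = np.ones(n_processors, dtype=bool)  # like (!! t)
    candidate[self_address < 2] = False            # 0 and 1 are not candidates
    while candidate.any():
        next_prime = self_address[candidate].min()         # *min over active processors
        prime[next_prime] = True                           # like (*setf (pref ...) t)
        candidate[self_address % next_prime == 0] = False  # (zerop!! (mod!! ...)) deactivates multiples
    return self_address[prime]

print(find_primes())   # [ 2  3  5  7 11 13 17 19 23 29 31]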
Related Entries C* Connection Machine Connection Machine Fortran Connection Machine Lisp
Bibliography . Thinking Machines Corporation. Connection Machine model CM- technical summary, technical report HA-. Cambridge, MA, April . Michael Hord R () Parallel supercomputing in SIMD architectures. CRC Press, Boca Raton . Thinking Machines Corporation. Connection Machine technical summary, version .. Cambridge, MA, May . Thinking Machines Corporation. Supplement to the ∗ Lisp reference manual, version .. Cambridge, MA, September . Thinking Machines Corporation. Connection Machine CM- technical summary, rd edn. Cambridge, MA, November . Steele GL Jr, Fahlman SE, Gabriel RP, Moon DA, Weinreb () Common Lisp: The language. Digital Press, Burlington . Steele GL Jr., Fahlman SE, Gabriel RP, Moon DA, Weinreb DL, Bobrow DG, DeMichiel LG, Keene SE, Kiczales G, Perdue C, Pitman KM, Waters RC, White JL () Common Lisp: The language, nd edn. Digital Press, Bedford . Thinking Machines Corporation. ∗ Lisp dictionary, version .. Cambridge, MA, February . Thinking Machines Corporation. ∗ Lisp reference manual, version .. Cambridge, MA, September
Lisp, Connection Machine Connection Machine Lisp
Little's Law John L. Gustafson Intel Labs, Intel Corporation, Santa Clara, CA, USA
Synonyms Little's lemma; Little's principle; Little's result; Little's theorem
Definition Little's Law says that in the long-term, steady state of a production system, the average number of items L in the system is the product of the average arrival rate λ and the average time W that an item spends in the system, that is, L = λW.
Discussion At first glance, Little’s Law looks like common sense. If items arrive faster than the system can process them, the system will overflow. Perhaps the earliest mention of the relation is by A. Cobham in , and he states it as a fact without proof []. However, the law is insightful in two ways. First, it does not depend on the probability distributions of any of the variables. Second, since arrival rates are generally less than maximum processing rates, it says what the capacity of the system must be to handle the queue in a system design. Figure shows a visualization of a queue in steady state that explains Little’s Law: Proofs of Little’s Law use the Law of Large Numbers, which is implicit in the phrase “steady state” of the system. The number of items in any time period T is the integral of the arrival rate over the period, so if there is a steady state average rate λ, then L = λW is the area of the rectangle under the dotted line for time period W. A more formal proof appears in [] or []. Notice that the units in the expression are like that of a speed equation: work equals rate times time. Here, “work” is items that complete processing, “rate” is the arrival rate, and “time” is the time spent in the system. The result is from queuing theory, and the meaning of L, λ, and W varies by application. In manufacturing, L might mean inventory, λ the rate of arrival of materials, and W the amount of time the materials spend
Little’s Law. Fig. Visualization of Little’s law
in inventory. In retail applications, L might mean the average number of customers in a store, λ the rate they arrive at the store, and W the average time a customer spends in the store.
Example Suppose a fast food restaurant has an average business of one customer every two minutes, and the average customer takes minutes to eat. For purposes of planning seating to accommodate the customers, Little’s Law says how many seats in the restaurant will be in use, on average. Here, λ = . customers per minute, and W = minutes; hence, L = . × = occupied seats. If the business doubles to λ = . customers per minute and the restaurant does not have enough seating area, then Little’s Law says there are two solutions: double the seating area, or halve the average amount of time customers spend eating to minutes, say, by making the seats less comfortable.
Compound Queues It is often useful to apply Little’s Law in compound ways. For example, a store has a rate at which customers arrive and are in the process of browsing, and after selecting an item, they enter the checkout queue within the store to pay for the item. Thus, the law predicts the length of the checkout line and how crowded the store will be. Alternatively, it can predict how many checkout tellers suffice to prevent the checkout lines from averaging more than a certain goal in length.
Example Suppose the same fast food business assesses the tradeoff between the cost of personnel and the effect waiting in line has on customer satisfaction. It then sets a goal that the average amount of time a customer wishing to order food should have to wait in line is minutes. Suppose also that there are two queues for ordering. How long will the line be to order food, on the average? Since the two queues must accommodate the total arriving customer rate of λ = . customers per minute, the rate for each food-ordering line is . customers per minute. Each line will then have an average length of L = λW = . × = customers.
Implications for Computer Design Little’s Law is of obvious value for computer design, especially for decoupled architectures. Computer architect Burton Smith first noted the following form of Little’s Law in : Concurrency = Latency × Bandwidth. Bandwidth is a rate of arrival of data into a processing queue. Latency is the time spent in the queue, and concurrency (or degree of parallelism) is the number of simultaneous data items to process. Some refer to this as Little’s Law of High Performance Computing []. Processing elements require holding areas for data (buffers), and Little’s Law provides guidance for how to match the processing rate, storage size of a buffer, and time a datum spends in the buffer. Just as more checkout tellers can reduce the average size of the checkout queue, parallel computing elements can reduce the amount of buffer space and the time spent in the buffer.
Example Suppose the processing units in a parallel computer can collectively process trillion ( ) items of data in main memory per second, and the latency between the
main memory and the processors is ns. Then Little’s Law says that the number of parallel elements for processing memory must be Concurrency = × ( × − ) = , processors. The parallel elements could be in the form of pipeline stages, data parallel units, or any other way that permits concurrent activity at the level of , items of data processed at once. Thus, if the system consists of , servers organized as a networked cluster, then each server must have an internal parallelism of a factor of five to accommodate the demands of the queue. Even within a single processor, Little’s Law is quite useful for answering architectural questions. For example, if the latency to the main memory is ns for one core of a multicore processor, and the core operates at . GHz, then the memory system must be able to accommodate (. × ) × ( × − ) = outstanding memory references, if each core is to make full use of its capability (that is, if the average arrival rate is the same as the maximum possible arrival rate). Another important computer application of Little’s Law is in assuring the fairness of benchmarks, especially those related to transaction processing. When measuring a system, if the benchmarking engineer asks it to process so many transactions (arrival rate λ) that the storage (number of items L in the system) is larger than the main memory and thus spills into mass storage, the performance will drop catastrophically.
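As a concrete illustration of both forms of the calculation, the short C program below applies Concurrency = Latency × Bandwidth directly; the bandwidth, latency, and request-rate figures in it are illustrative assumptions chosen for this sketch, not the values used in the examples above.

#include <stdio.h>

/* Little's Law for memory systems: concurrency = bandwidth x latency.
   All numbers below are illustrative assumptions. */
int main(void) {
    double bandwidth   = 1.0e12;  /* items processed per second (assumed)        */
    double latency     = 100e-9;  /* memory latency in seconds (assumed 100 ns)  */
    double concurrency = bandwidth * latency;   /* items that must be in flight  */
    printf("Required system-wide concurrency: %.0f items\n", concurrency);

    /* The same law applied to a single core: outstanding memory references
       needed to keep one core busy, under an assumed request rate and latency. */
    double request_rate = 2.0e9;  /* memory requests issued per second (assumed) */
    double mem_latency  = 100e-9; /* assumed 100 ns                              */
    printf("Outstanding references per core: %.0f\n", request_rate * mem_latency);
    return 0;
}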
Perspective Operations Research (OR) originated in the s, and courses on the subject rely heavily on queuing theory. Little was teaching formal queuing theory at Case Western Reserve University and noticed that the L = λW behavior was independent of the probability distribution of the arrival times. At the urging of one of his students (Sid Hess), Little published the proof of the generality of the law. Little’s theorem was one of the simplest yet most useful results from that early era. While formal queuing theory requires an understanding of concepts like Poisson distributions and a somewhat advanced level of statistics and mathematics, the simple formula L = λW is one that requires no advanced
training to understand and apply. There was no thought in those early years of OR about answering questions of computer design. The traditional textbook examples are for inventory, manufacturing, order entry, and so on. Computer designs of that era tended to be tightly coupled, and so deterministic that the only statistical queuing behavior that would enter into their use would be that of users submitting jobs to the batch queue and waiting for computed results. Burton Smith (Tera Computer) and David Bailey (NASA Ames Research) deserve credit for showing, in the late s, the applicability of Little’s Law to many issues of computing design. It seems obvious in retrospect that a law originally intended for use in the arena of business operations with human time scales could apply to computing design. Here the use is a subtle variation on the “average” behavior, in that the law assesses the optimum behavior. If the processing speed is to be some idealized maximum, and the latency is as small as possible, then the law predicts a lower bound on the capacity or parallelism required for near-peak performance. Little’s Law for High Performance Computing provides perhaps the simplest way to explain the necessity for the parallel computing approach. Latency tends to be difficult to reduce because of the laws of physics; concurrency is the product of latency and bandwidth (processing rate), so increasing bandwidth forces the need for more concurrency.
Related Entries Bandwidth-Latency Models (BSP, LogP) Dependences Dependence Abstractions Network Obliviousness NUMA Caches Pipelining
Bibliographic Notes and Further Reading Little’s original paper is of historic interest, as is a less general precedent text () by OR pioneer Philip Morse: Queues, Inventories and Maintenance []. A very readable account with many examples drawn from everyday situations is in []. The latter work also gives perspective on the impact of the law over four
decades, and provides examples from managing email, traffic created by toll booths, and the housing market. Bailey’s brief paper [] promulgated the use of Little’s Law in High Performance Computing, and credits Burton Smith for the first restatement of the law in computer architecture terms.
Bibliography . Bailey DH () Little’s law and high performance computing. RNR Technical Report, NAS applications and tools group, NASA Ames Research Center. At http://crd.lbl.gov/~dhbailey/ dhbpapers/little.pdf . Cobham A () Priority assignment in waiting line problems Oper Res ():– . Jewell WS () A simple proof of L = λW. Oper Res (): – . Little JDC () A proof of the queueing formula L = λW. Oper Res :– . Little JDC, Graves SC () Little’s law. In: Chhajed D, Lowe TJ (eds) Building intuition: insights from basic operations management models and principles. MIT Press, Cambridge. Available at http://web.mit.edu/sgraves/www/ papers/Little’s%Law-Published.pdf . Morse PM () Queues, inventories and maintenance: the analysis of operational systems. Original edition by Wiley, New York. ISBN ---. Reprinted by Dover Phoenix Editions,
Little’s Lemma Little’s Law
Little’s Principle Little’s Law
Little’s Result Little’s Law
Little’s Theorem Little’s Law
Livermore Loops Jack Dongarra, Piotr Luszczek University of Tennessee, Knoxville, TN, USA
Definition Livermore Loops are a set of Fortran DO-loops (The Livermore Fortran Kernels, LFK) extracted from operational codes used at the Lawrence Livermore National Laboratory [, ]. They have been used since the early s to assess the arithmetic performance of computers and their compilers. They are a mixture of vectorizable and nonvectorizable loops and test rather fully the computational capabilities of the hardware as well as the skill of the compiler in generating efficient, vectorized code. The main value of the benchmark is the range of performance that it demonstrates, and in this respect it complements the limited range of loops tested in the LINPACK benchmark.
Discussion As a benchmark, the Livermore Loops provide the individual performance of each loop, together with various averages (arithmetic, geometric, harmonic) and the quartiles of the distribution. However, it is difficult to give a clear meaning to these averages, and the value of the benchmark is more in the distribution itself. In particular, the maximum and minimum give the range of likely performance in full applications. The ratio of maximum to minimum performance has been called the instability or the specialty [], and is a measure of how difficult it is to obtain good performance from the computer, and therefore how specialized it is. The minimum or worst performance obtained on these loops is of special value, because there is much truth in the saying that “the best computer to choose is that with the best worst-performance.” The LFK set began in the early s with only kernels and was timed on the mainframes of the era with CPU-times of microsecond accuracy. By the early s the number of kernels in the set grew to 24 and, during the decade, statistical features of the results-reporting code were enhanced and adaptations were made to accommodate crude UNIX timers of only millisecond resolution. A port of LFK to C happened in the middle
of the s. By the beginning of the s, compilers were able to perform code hoisting, which necessitated changes to the code such as introducing functions to embed the critical kernels and hide them from the optimizer. The s brought new hardware from HP, SGI, and SUN which exhibited vast performance improvements but still suffered from low timer resolution, which required an increase in the repetition count to assure % accuracy of the reported timings. The origins or functions of each of the LFK kernels are the following:
1. Fragment of hydrodynamics code
2. ICCG (Incomplete Cholesky Conjugate Gradient) excerpt
3. Inner product calculation
4. Banded linear equations
5. Tri-diagonal elimination below the diagonal
6. General linear recurrence equations
7. Fragment of equation of state code
8. ADI (Alternating Direction Implicit) integration scheme for differential equations
9. Integrate predictor kernel
10. Difference predictor kernel
11. Summation with storage of partial sums
12. First difference scheme
13. 2-D PIC (Particle In Cell) code
14. 1-D PIC
15. A challenging (for the compiler) ordering of scalar operations in FORTRAN
16. Monte Carlo search loop
17. Implicit conditional computation
18. 2-D explicit hydrodynamics fragment
19. General linear recurrence equations
20. Discrete ordinates transport: recurrence
21. Matrix–matrix multiplication
22. Planckian probability distribution
23. 2-D implicit hydrodynamics fragment
24. A loop that finds the location of the first minimum in an array
In the LFK collection, scalar tests are numbered , , , , and [[], page ]. In the late s, the scalar features of the contemporary machines were important for supercomputing at Los Alamos (LANL) – as much as % of the code executed at the laboratory might have been scalar []. The LFK scalar codes are small, so their indicated performance is generally higher than
that measured with larger benchmarks. Nonetheless, a good correlation exists between the LFK scalar tests and others []. Vectorization at the compiler level was turned on and off in the LFK kernels through special comments that were recognized as compiler directives. Similarly, code directives were used to render selected kernels scalar and force their execution onto the scalar units of the processor. The scalar execution was used as the definition of the necessary computation and of the count of floating-point operations performed; the actual number of floating-point operations for vectorized execution could have been larger. For the purposes of calculating the performance results, each type of floating-point operation was weighted, the weight representing its execution time relative to a floating-point addition. For example, the weight of floating-point division was 4, which meant that a division takes four times as many cycles to complete as an addition of two floating-point numbers. The authors of LFK suggested using the results by weighting each kernel's performance in proportion to the actual usage of that category of computation in the total workload. For example, for a hydrodynamics code, kernels , , and would be the most relevant and consequently they should be given the greatest weight. Another important aspect of performance benchmarking stressed by the LFK report was the treatment of the results as a statistical quantity. A single number quoted as the performance rate was considered insufficient. Statistical sampling was encouraged, and each reported number was to be accompanied by the minimum and maximum values obtained as well as the equiweighted harmonic, geometric, and arithmetic means. In addition, the first, second, third, and fourth quartiles were used to perform sensitivity analysis of the harmonic mean rate with respect to various weight distributions. At the time of release of the LFK set, the available hardware included machines from Cray, DEC, IBM, and NEC. The clock frequencies were of the order of MHz, and the achieved performance varied from single-digit Mflop/s to many hundreds of Mflop/s depending on the kernel and the type of processor (vector or RISC).
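To give a concrete flavor of the kernels, the fragment below is a small C sketch in the spirit of kernel 3, the inner product (a C port of the set is mentioned above). It is illustrative only and not the official benchmark code: the loop length, the data, and the absence of the timing, repetition, and checksum harness are all simplifications.

#include <stdio.h>

#define N 1001   /* assumed loop length; the official kernels use several lengths */

int main(void) {
    static double x[N], z[N];
    double q = 0.0;
    for (int k = 0; k < N; k++) {     /* fill the vectors with arbitrary data */
        x[k] = 0.001 * k;
        z[k] = 0.002 * k;
    }
    for (int k = 0; k < N; k++)       /* inner product, as in LFK kernel 3 */
        q += z[k] * x[k];
    printf("q = %f\n", q);
    return 0;
}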
Related Entries Benchmarks HPC Challenge Benchmark
LINPACK Benchmark TOP
Bibliography . McMahon FH () The Livermore FORTRAN kernels: a computer test of the numerical performance range. Lawrence Livermore Laboratory technical report LLNL UCRL-. Lawrence Livermore National Laboratory, Livermore . McMahon FH () The Livermore Fortran kernels test of the numerical performance range. In: Martin JL (ed) Performance evaluation of supercomputers. Elsevier Science, North-Holland, Amsterdam, pp – . Hockney RW () Performance parameters and benchmarking of supercomputers. Parallel Comput :– . Barley D, Brooks E, Dongarra J, Hayes A, Heath M, Lyon G () Benchmarks to supplant export “FPDR” calculations. Technical report RNR--, National Bureau of Standards, Gaithersburg . Lubeck OM () Supercomputer performance: The theory, practice and results. Los Alamos National Laboratory Report LA-MS, pp
Load Balancing Load Balancing, Distributed Memory METIS and ParMETIS
Load Balancing, Distributed Memory Aaron Becker, Gengbin Zheng, Laxmikant V. Kalé University of Illinois at Urbana-Champaign, Urbana, IL, USA
Definition Load balancing in distributed memory systems is the process of redistributing work between hardware resources to improve performance, typically by moving work from overloaded resources to underloaded resources.
Discussion In a parallel application, when work is distributed unevenly so that underloaded processors are forced to wait for overloaded processors, the application is suffering from load imbalance. Load imbalance is one of
the key impediments to achieving high performance on large parallel machines, especially when solving highly dynamic and irregular problems. Load balancing is a technique that performs the task of distributing computation and communication load across the hardware resources of a parallel machine so that no single processor is overloaded. It can reduce processor idle time across the machine while also reducing communication costs by colocating related work on the same processor.
Balancing an application’s load involves making decisions about where to place newly created computational tasks on processors, or where to migrate existing work among processors. Orthogonally, some applications only require static load balancing: They have consistent behaviors over their lifetimes, and once they are balanced they remain balanced. Other applications exhibit dynamic changes in behavior and require periodic rebalancing. Typically it is far too expensive to compute an optimal distribution of work in any realistic application, but a wide variety of heuristic algorithms have been developed that are very effective in practice across a wide variety of application domains.
The load associated with an application can be conceptualized as a graph, where the nodes are discrete units of work and an edge exists between two nodes if there is communication between them. To solve the load balancing problem one must split this graph into parts, with each part representing the work to be associated with a particular hardware resource. This partitioning process is at the heart of load balancing.
Given the heuristic nature of load balancing, which makes perfect load balance an unrealistic goal, it is important to remember that the goal of load balancing is not so much to equalize the amount of work on each processor as it is to minimize the load of the most heavily loaded processor in the system. The most overloaded processor is the bottleneck that determines time to completion. A heuristic that leaves some processors idle is preferable to one that reduces the variance in load, as long as it reduces the load on the most loaded processor. Load balancing heuristics can also take factors beyond time to completion into account, for example by trying to minimize power usage over an application’s lifetime.
Modern HPC systems include clusters of multi-core nodes. The general load balancing strategies described
in this entry are still applicable at the level of nodes (i.e., work is assigned to nodes). Finer grained load balancing within a node (work assigned to cores) can be performed independently. In the simplest case, it can be done by applying the same load balancing strategies that are used for node-level load balancing, except that the cost of communication among cores within a node is much smaller than the internode communication, and the number of cores involved within a node is relatively small.
Periodic Load Balancing In a periodic load balancing scheme, computational tasks are persistent; load is balanced only as needed, and the balancing consists of migrating existing tasks and their associated data. With periodic load balancing, expensive load balancing decision making and data migration are not continuous processes. They occur only at distinct points in the application, when a decision to do balancing has been made. Periodic load balancing schemes are suitable for iterative scientific applications such as molecular dynamics, adaptive finite element simulation, and climate simulation, where the computation typically consists of a large number of time steps, iterations (as in iterative linear system solvers), or a combination of both. A computational task in these applications is executed for a long period of time, and tends to be persistent. During execution, at periodic intervals, partially executed tasks may be moved to different processors to achieve global load balance.
The Load Balancing Process In an application which rebalances its load periodically, there are four distinct steps in the load balancing process. The first step is load estimation, which tries to gauge what the load on each processor will be in the near future if no rebalancing is done. This may involve the use of a model that predicts future performance based on current conditions, or it may be based directly on past measured load, with the assumption that the near future will closely resemble the near past. The second step is making a decision of when to execute load balancing. This is a trade-off between the cost of load balancing itself (i.e., the cost of determining a new mapping of application tasks to hardware and the cost of migration) and the savings one can expect from a
better load distribution. The nature of the application being balanced and the cost of the load balancing method to be used factor heavily into this decision. The third step is determining how the load will be rebalanced. There are many algorithms devoted to computing a good mapping of work onto processors, each with its own strengths and weaknesses. This section describes some of the most widely used methods. The general problem of creating a mapping of work onto processors which minimizes time to completion is NP-hard, so all strategies we discuss are heuristic. The final step in the load balancing process is migration, the process by which work is actually moved. This may simply be a matter of relocating application data, but it can also encompass thread and process migration.
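The overall shape of a periodic rebalancing loop built from these four steps might be sketched as follows in C; every function name here is a placeholder for application- or framework-specific code, not part of any particular library.

/* Sketch of a periodic load balancing driver; all functions are hypothetical. */
typedef struct { double load; /* plus whatever statistics the estimator needs */ } LoadInfo;

extern void     run_timesteps(int n);               /* do useful work                   */
extern LoadInfo estimate_load(void);                /* step 1: load estimation          */
extern int      should_rebalance(LoadInfo info);    /* step 2: is rebalancing worth it? */
extern void     compute_new_mapping(LoadInfo info); /* step 3: decide where work goes   */
extern void     migrate_work(void);                 /* step 4: actually move data/tasks */

void main_loop(int total_steps, int lb_period) {
    for (int step = 0; step < total_steps; step += lb_period) {
        run_timesteps(lb_period);
        LoadInfo info = estimate_load();
        if (should_rebalance(info)) {
            compute_new_mapping(info);
            migrate_work();
        }
    }
}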
Initial Balancing For some classes of application, there is very little change in load balance over time once the application reaches a steady state. Thus, once good load balance is achieved, no further balancing need be done. However, load balancing is often still an important part of such applications. For example, consider the case of molecular dynamics, where one must distribute simulated particles onto processors. These applications do not experience significant dynamic load imbalance, but determining a good initial mapping may be difficult because of the difficulty of estimating the load of multiple computational tasks accurately a priori. In cases like this, one can simply start the application with an unoptimized distribution of work, run the application long enough to accurately measure the load, rebalance based on those measurements, and then continue without the need for any further balancing.
Classifying Load Balancers There are many families of load balancing approaches that can be classified according to how they answer the fundamental questions of load balancing: how should the load be estimated, at what granularity should it be balanced, and where should the load balancing decisions be made?
Load estimation underlies all load balancing algorithms. Load balancing is fundamentally a forward-looking task which aims to improve the future performance of the application. To do this effectively, the load balancer must have some model of what the
future performance of the application will be in different scenarios. There are two common approaches to estimating future load. The first is to measure the current load associated with each piece of work in the application and to assume that this load will remain the same after load balancing. This assumption is based on the principle of persistence, which posits that, for certain classes of scientific and engineering applications, computational loads and communication patterns tend to persist over time, even in dynamically evolving computations. This approach has several advantages: It can be applied to any application, it accurately and automatically accounts for the particular characteristics of the machine on which the application is running, and it removes some of the burden of load estimation from the application developer. The alternative is to build some model of application performance which will provide a performance estimate for any given distribution of work. These models can be sophisticated enough to take into account dynamic changes in application performance that a measurement-based scheme will not account for, but they must be created specifically to match the actual characteristics of the application they are used for. If the model does not match reality then poor load balancing may result. Load balancing may take place at any of several levels of granularity. The greatest flexibility in mapping work onto hardware resources can be achieved by exposing the smallest possible meaningful units of data to the load balancer. For example, these may be nodes in a finite element application or individual particles in a molecular simulation. Alternatively, one may group this data into larger chunks and only expose those chunks to the load balancing algorithm. This makes the load balancing problem smaller while guaranteeing respect for locality within the chunks. For example, in a molecular simulation the load balancer might only try to balance contiguous regions which may contain a large number of particles without being directly aware of the particles. It is also possible to balance load by migrating entire processes, avoiding the need for application developers to write code dedicated to migrating their data during the load balancing process. Load balancers may be further categorized according to where decisions are made: locally, globally, or
according to some hierarchical scheme. Global load balancers collect global information about load across the entire machine and can use this information to make decisions that take the entire state of the application into account. The advantage of these schemes is that load balancing decisions can take a global view of the application, and all decisions about which objects should migrate to which processors can be made without further coordination during the load balancing process. However, these schemes inherently lack scalability. As problem sizes increase, the object communication graph may not even fit in memory on a single node, making global algorithms infeasible. Even if the load balancing process is not constrained by memory, the time required to compute an assignment for very large problems may preclude the use of a global strategy. However, with coarse granularity, it is possible to use global strategies on up to several thousand processors, especially if load balancing is relatively infrequent.
Parallelized versions of global load balancing schemes are also possible. These use the same principles for partitioning as a serial global scheme, but to improve scalability the task graph is spread across many processors. This introduces some overhead in the partitioning process, but allows much larger problems to be solved than a purely serial scheme. One example of a parallel global solver is ParMETIS, the parallel version of the METIS graph partitioner.
Distributed load balancing schemes lie at the opposite end of the scale from global schemes. Distributed schemes use purely local information in their decision-making process, typically looking to offload work from overloaded processors to their less-loaded immediate neighbors in some virtualized topology. This leads to a sort of diffusion of work through the system as objects gradually move away from overloaded regions into underloaded regions. These schemes are very cheap computationally, and do not require global synchronization. However, they have the disadvantage of redistributing work slowly compared to global schemes, often requiring many load balancing phases before achieving system-wide load balance, and they typically perform more migrations before load balance is achieved.
Hybrid load balancing schemes represent a compromise between global and distributed schemes. In a hybrid load balancer, the problem domain is split
into a hierarchy of sub-domains. At the bottom of the hierarchy, global load balancing algorithms are applied to the small sub-domains, redistributing load within each sub-domain. At higher levels of the hierarchy, the load balancing strategies operate on the sub-domains as indivisible units, redistributing these larger units of work across the machine. This keeps the size of the partitioning problem to be solved low while retaining some global decision-making capability.
Algorithms An application’s load balancing needs can vary widely depending on its computational characteristics. Accordingly, a wide variety of load balancing algorithms have been developed. Some do careful analysis of an application’s communication patterns in an effort to reduce internode communication; others only try to minimize the maximum load on a node. Some try to do a reasonably good job as quickly as possible; others take longer to provide a higher quality solution. Some are even tailored to meet the needs of particular application domains. Despite the wide variety of algorithms available, there are a few in wide use that demonstrate the range of available techniques.
Parallel Prefix In some cases, the pattern of communication between objects is unimportant, and we care only about keeping an equal amount of work on each processor. In this scenario, load may be effectively distributed using a simple parallel prefix strategy (also known as scan or prefix sum). The input to the parallel prefix algorithm is simply a local array of work unit weights on each processor, reflecting the amount of work that each unit represents. The classic parallel prefix algorithm computes the sum of the first i units for each of the n units in the list. Once the prefix sum is computed, all work can be equally distributed across processors by sending each unit to the processor whose index is the unit’s prefix-sum value divided by the average work per processor (the total work divided by the number of processors). The primary advantage of the parallel prefix scheme is its fast, inexpensive nature. The computation can be shared across p processors using only log p communications, and efficient prefix algorithms are widely available in the form of library functions like MPI_Scan. An assignment of work units to processors based on parallel
prefix can be computed virtually instantaneously even for extremely large processor counts, regardless of variations in work unit weights. The corresponding disadvantage of using the parallel prefix algorithm for load balancing is its simplicity. It does not account for the structure of communication between processors and will not attempt to minimize the communication volume of the resulting partition. It also does not attempt to minimize the amount of migration needed.
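A minimal sketch of this strategy using MPI is given below. It assumes each process already holds an array of work-unit weights and an output array for destination ranks; MPI_Exscan and MPI_Allreduce are used to obtain each unit's global prefix sum, from which its destination processor follows as described above.

#include <mpi.h>

/* Assign work units to processors by prefix sum of their weights (a sketch).
   'weights' and 'dest' are assumed to be allocated by the caller. */
void assign_by_prefix(const double *weights, int nlocal, int *dest, MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    double local_sum = 0.0;
    for (int i = 0; i < nlocal; i++) local_sum += weights[i];

    /* Exclusive prefix sum of per-process totals gives this rank's starting offset. */
    double offset = 0.0;
    MPI_Exscan(&local_sum, &offset, 1, MPI_DOUBLE, MPI_SUM, comm);
    if (rank == 0) offset = 0.0;      /* MPI_Exscan leaves rank 0's result undefined */

    double total = 0.0;
    MPI_Allreduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
    double chunk = total / nprocs;    /* ideal amount of work per processor */

    /* Each unit goes to the processor indicated by its global prefix sum. */
    double prefix = offset;
    for (int i = 0; i < nlocal; i++) {
        prefix += weights[i];
        int p = (int)(prefix / chunk);
        dest[i] = (p < nprocs) ? p : nprocs - 1;   /* guard against rounding at the end */
    }
}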
Recursive Bisection Recursive bisection is a divide-and-conquer technique that reduces the partitioning problem to a series of bisection operations. The strategy of recursive bisection is to split the object graph into two approximately equal parts while minimizing communication between the parts, then to subdivide each half recursively until the required number of partitions is achieved. This recursive process introduces parallelism into the partitioning process itself. After the top-level split is completed, the two child splits are independent and can be computed in parallel, and after n levels of bisection there are 2^n independent splitting operations to perform. In addition, geometric bisection operations can themselves be parallelized by constructing a histogram of the nodes to be split.
There are many variations of the recursive bisection algorithm, based on the algorithm used to split the graph. Orthogonal recursive bisection (ORB), shown in Fig. , is a geometric approach in which each object is associated with coordinates in some spatial domain. The bisection process splits the domain in two using an axis-aligned cutting plane. This can be a fast way of creating a good bisection if most communication between objects is local. However, this method does not guarantee that it will produce geometrically connected sub-domains, and it may also produce sub-domains with high aspect ratios. Further variations exist which allow cutting planes that are not axis-aligned. An alternate approach is to use spectral methods to do the bisection. This involves constructing a sparse matrix from the communication graph, finding an eigenvector of that matrix, and using it as a separator field to do the bisection. This method can produce superior results compared with ORB, but at a substantially higher computational cost.
Load Balancing, Distributed Memory. Fig. Orthogonal recursive bisection (ORB) partitions by finding a cutting plane that splits the work into two approximately equal pieces and recursing until the resulting pieces are small enough []
For many problems, recursive bisection is a fast way of computing good partitions, with the added benefit that the recursive procedure naturally exposes parallelism in the partitioning process itself. The advantages and disadvantages of recursive bisection depend greatly on the nature of the bisection algorithm used, which can greatly affect partition quality and computational cost.
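A serial sketch of ORB over two-dimensional points is shown below. For simplicity it alternates the cut axis at each level and splits at a position that balances the number of objects (weighted by how many partitions each side must receive), rather than choosing the longest dimension or balancing measured load; the Point type and function names are mine.

#include <stdlib.h>

/* Orthogonal recursive bisection over 2-D points: a serial illustration,
   not a production partitioner. */
typedef struct { double x, y; int part; } Point;

static int cut_axis;   /* 0: compare x, 1: compare y (serial use only) */

static int cmp_point(const void *a, const void *b) {
    const Point *p = a, *q = b;
    double d = cut_axis ? (p->y - q->y) : (p->x - q->x);
    return (d > 0) - (d < 0);
}

/* Assign partition ids first_part .. first_part + nparts - 1 to points[0..n-1]. */
void orb(Point *points, int n, int nparts, int first_part, int depth) {
    if (nparts <= 1) {
        for (int i = 0; i < n; i++) points[i].part = first_part;
        return;
    }
    cut_axis = depth % 2;                       /* alternate the cutting plane  */
    qsort(points, n, sizeof(Point), cmp_point); /* median split along that axis */
    int left_parts = nparts / 2;
    int split = (int)((long long)n * left_parts / nparts);
    orb(points, split, left_parts, first_part, depth + 1);
    orb(points + split, n - split, nparts - left_parts, first_part + left_parts, depth + 1);
}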
Space-Filling Curve A space-filling curve is a mathematical function that maps a line onto the entire unit square (in two dimensions, or the entire unit N-cube in N dimensions). There are many such curves, and they are typically defined as the limit of a sequence of simple curves, so that closer and closer approximations to the space-filling curve can be constructed using an iterative process. The value of space-filling curves for load balancing is that they can be used to convert n-dimensional spatial data into one-dimensional data. Once the higher-dimensional data has been linearized, it can be easily partitioned by splitting the line into pieces with equal numbers of elements. Consider the case where each object in the application has a two-dimensional coordinate associated with it. Starting with a coarse approximation to the space-filling curve, a recursive refinement process can be used to create closer and closer approximations until each object is associated with its own segment of the approximated curve. This creates a linear ordering of the objects, as depicted in Fig. , which can then be split into even partitions.
Load Balancing, Distributed Memory. Fig. To partition data using a space-filling curve, a very coarse approximation of the curve is replaced with finer and finer approximations until each unit of work is in its own segment. The curve is then split into pieces, with each piece containing an equal number of work units. This figure shows a Hilbert curve []
A good choice of space-filling curve will tend to keep objects that are close together in the higher-dimensional space close together in the linear ordering. This property is necessary so that the partitions that result respect the locality of the data. Many different curves have been used for load balancing purposes, including the Peano curve, the Morton curve, and the Hilbert curve. The Hilbert curve is a common choice of space-filling curve for partitioning applications because it provides good locality.
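The sketch below linearizes two-dimensional objects with the Morton (Z-order) curve rather than the Hilbert curve, simply because a Morton key can be computed by interleaving coordinate bits; as noted above, the Hilbert curve would give better locality. The types and function names are mine.

#include <stdint.h>
#include <stdlib.h>

/* Interleave the bits of 16-bit x and y coordinates into a Morton (Z-order) key. */
uint32_t morton2d(uint16_t x, uint16_t y) {
    uint32_t key = 0;
    for (int b = 0; b < 16; b++) {
        key |= (uint32_t)((x >> b) & 1u) << (2 * b);
        key |= (uint32_t)((y >> b) & 1u) << (2 * b + 1);
    }
    return key;
}

typedef struct { uint16_t x, y; uint32_t key; int part; } Obj;

static int cmp_key(const void *a, const void *b) {
    uint32_t ka = ((const Obj *)a)->key, kb = ((const Obj *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Sort objects along the curve, then cut the curve into nparts equal pieces. */
void sfc_partition(Obj *objs, int n, int nparts) {
    for (int i = 0; i < n; i++) objs[i].key = morton2d(objs[i].x, objs[i].y);
    qsort(objs, n, sizeof(Obj), cmp_key);
    for (int i = 0; i < n; i++) objs[i].part = (int)((long long)i * nparts / n);
}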
Graph Partitioning The pattern of communication in any parallel program can be represented as a graph, with nodes representing discrete units of work, and weighted edges representing communication. Graph partitioning is a general-purpose technique for splitting such a graph into pieces, typically with the goal of creating one piece for each processor. There are two conflicting goals in graph partitioning: achieving equal-size partitions, and minimizing the total amount of communication across partition boundaries, known as the edge cut. Finding an optimal solution to the graph partitioning problem is NP-hard, so heuristic algorithms are required. Because it is computationally infeasible to attempt to partition the whole input graph at once, the graph partitioning algorithm proceeds by constructing a series of
coarser representations of the input graph by combining distinct nodes in the input graph into a single node in the coarser graph. By retaining information about which nodes in the original graph have been combined into each node of the coarse graph, the algorithm maintains basic information about the graph’s structure while reducing its size enough that computing a partitioning is very simple. Once the coarsest graph has been partitioned, the coarsening process is reversed, breaking combined nodes apart. At each step of this uncoarsening process, a refinement algorithm such as the Kernighan–Lin algorithm can be used to improve the quality of the partition by finding advantageous swaps of nodes across partition boundaries. Once the uncoarsening process is complete, the user is left with a partitioning for the original input mesh. This technique can be further generalized to operate on hypergraphs. Whereas in a graph an edge connects two nodes, in a hypergraph a hyperedge can connect any number of nodes. This allows for the explicit representation of relationships that link several nodes together as one edge rather than the standard graph representation of pairwise edges between each of the nodes. For example, in VLSI simulations hypergraphs can be used to more accurately reflect circuit structure, leading to higher quality partitions. Hypergraphs can be partitioned using a variation of the basic graph partitioning algorithm which coarsens the initial hypergraph until it can be easily partitioned, then refining the partitioned coarse hypergraph to obtain a partitioning for the full hypergraph. The primary advantages of load balancing schemes based on graph partitioning are their flexibility and generality. An object communication graph can be constructed for any problem without depending on any particular features of the problem domain. The resulting partition can be made to account for varying amounts of work per object by weighting the nodes of the graph and to account for varying amounts of communication between objects by weighting the edges. This flexibility makes graph partitioners a powerful tool for load balancing. The primary disadvantage of these schemes is their cost. Compared to the other methods discussed here, graph partitioning may be computationally intensive, particularly when a very large number of partitions
is needed. The construction of an explicit communication graph may also be unnecessary for problems where communication may be inferred from domain-specific information such as geometric data associated with each object.
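The two quantities that partition refinement reasons about, the edge cut of a partition and the gain of moving a single node, are easy to state in code. The sketch below uses a compressed sparse row adjacency structure; it is only the bookkeeping behind a Kernighan–Lin-style pass, not a full multilevel partitioner, and the array names are assumptions of this sketch.

/* Graph in compressed sparse row form: node v's neighbors are
   adjncy[xadj[v] .. xadj[v+1]-1], with edge weights in adjwgt.
   part[v] is the partition currently holding node v. */

/* Total weight of edges crossing partition boundaries (the edge cut). */
double edge_cut(int nvtxs, const int *xadj, const int *adjncy,
                const double *adjwgt, const int *part) {
    double cut = 0.0;
    for (int v = 0; v < nvtxs; v++)
        for (int e = xadj[v]; e < xadj[v + 1]; e++)
            if (part[v] != part[adjncy[e]])
                cut += adjwgt[e];
    return cut / 2.0;   /* each cut edge is seen from both of its endpoints */
}

/* Reduction in edge cut obtained by moving node v into partition 'to'. */
double move_gain(int v, int to, const int *xadj, const int *adjncy,
                 const double *adjwgt, const int *part) {
    double gain = 0.0;
    for (int e = xadj[v]; e < xadj[v + 1]; e++) {
        int u = adjncy[e];
        if (part[u] == to)           gain += adjwgt[e];  /* edge becomes internal */
        else if (part[u] == part[v]) gain -= adjwgt[e];  /* edge becomes cut      */
    }
    return gain;
}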
Rebalancing with Refinement Versus Total Reassignment There are two ways to approach the problem of rebalancing the computational load of an application. The first approach is to take an existing data distribution as a baseline and attempt to incrementally improve load balance by migrating small amounts of data between partitions while leaving the majority of data in place. The second approach is to perform a total repartitioning, without taking the existing data distribution into account. Incremental load balancing approaches are desirable when the application is already nearly balanced, because they impose smaller costs in terms of data motion and communication. Incremental approaches are more amenable to asynchronous implementations that operate using information from only a small set of processors.
Software Frameworks Many parallel dynamic load balancing libraries and frameworks have been developed for specialized domains. These libraries are often particularly useful because they allow application developers to specify the structure of their application once and then easily try a number of different load balancing strategies to see which is most effective in practice, without needing to reformulate the load measurement and task graph construction process for each new algorithm.
The Zoltan toolkit [] provides a suite of dynamic load balancing and parallel repartitioning algorithms, including geometric, hypergraph, and graph methods. It provides a simple interface to switch between algorithms, allowing straightforward comparisons of algorithms in applications. The application developers provide an explicit cost function and communication graph for the Zoltan algorithms to use.
Charm++ [] adopts a migratable object-based load balancing model. As such, Charm programs have a natural grain size determined by their objects, and can be load balanced by redistributing their objects among processors. Charm provides facilities for automatically
measuring load, and provides a variety of associated measurement-based load balancing strategies that use the recent past as a guideline for the near future. This avoids the need for developers to explicitly specify any cost functions or information about the application’s communication structure. Charm++ includes a suite of global, distributed, and hierarchical load balancing schemes. DRAMA [] is a library for parallel dynamic load balancing of finite element applications. The application must provide the current distributed mesh, including information about its computation and communication requirements. DRAMA then provides it with all necessary information to reallocate the application data. The library computes a new partitioning, either via direct mesh migration or via parallel graph re-partitioning, by interfacing to the ParMetis or Jostle graph partitioning libraries. This project is no longer under active development. The Chombo [] package was developed by Lawrence Berkeley National Lab. It provides a set of tools including load balancing for implementing finite difference methods for the solution of partial differential equations on block-structured adaptively refined rectangular grids. It requires users to provide input indicating the computational workload for each box, or mesh partition.
Task Scheduling Methods Some applications are characterized by the continuous production of tasks rather than by iterative computations on a collection of work units. These tasks, which are continually being created and completed as the application runs, form the basic unit of work for load balancing. For these applications, load balancing is essentially a task scheduling or task allocation problem. This task pool abstraction captures the execution style of many applications such as master/worker and state-space search computations. Such applications are typically non-iterative. Load balancing strategies in this category can be classified as centralized, fully distributed, or hierarchical. Hierarchical strategies are hybrids that aim to combine the benefits of centralized and distributed methods. In centralized strategies, a dedicated “central” processor gathers global information about the state of the entire machine and uses it to make global
load balancing decisions. On the other hand, in a fully distributed strategy, each processor exchanges state information only with other processors in its neighborhood. Fully distributed load balancing received significant attention from the early days of parallel computing. One way to categorize such schemes is based on which processor initiates the movement of tasks. Sender-initiated schemes assign newly created tasks to some processor, chosen randomly or from among the neighbors in a physical or virtual topology. The decision to assign a task to another processor, instead of retaining it locally, may also be made randomly or based on a load metric such as queue size. Random assignment has some good statistical properties but suffers from high communication costs, since it requires that most tasks be sent to remote processors. With neighborhood strategies, global balancing is achieved as tasks from heavily loaded neighborhoods diffuse onto lightly loaded processors. In the adaptive contraction within neighborhood (ACWN) scheme, tasks always travel to topologically adjacent neighbors with the least load, but only if the difference in loads is more than a predefined threshold. In addition, ACWN performs saturation control by classifying the system as being either lightly, moderately, or heavily loaded. In receiver-initiated schemes, the underloaded processors request load from heavily loaded processors. The victim from which to request work may be chosen randomly, via a round-robin policy, or from among “neighboring” processors. Randomized work stealing is yet another distributed dynamic load balancing technique, used in some runtime systems such as Cilk. The distinction between sender and receiver initiation is blurred when processors exchange load information, typically with neighbors in a virtual topology. Although the decision to send work is taken by the sender, it is taken in response to load information from the receiver. For example, in neighborhood averaging schemes, after periodically exchanging load information with neighbors in a virtual (and typically low-diameter) topology, each processor that is overloaded compared with its neighbors sends equalizing work to its lower-loaded neighbors. Such policies tend to be proactive compared with work stealing, trading better load balance for extra communication.
Another strategy in this category, and one of the oldest ones, is the gradient model. Here each processor participates in a continuous fixed-point computation with its neighbors to identify the neighbor that is closest to an idle processor. Overloaded processors then send work toward the idle processors via that neighbor. The gradient model is a demand-driven approach. In gradient schemes, underloaded processors inform other processors of their state, and overloaded processors respond by sending a portion of their load to the nearest lightly loaded processor. The resulting effect is a gradient map that guides the migration of tasks from overloaded to underloaded processors.
The dimensional exchange method is a distributed strategy in which balancing is performed in an iterative fashion by “folding” a P-processor system into log P dimensions and balancing one dimension at a time; that is, in phase i, each processor exchanges load with its neighbor in the ith dimension so as to equalize the load between the two. After log P phases, the load is balanced across all the processors. This scheme is conceptually designed for a hypercube system but may be applied to other topologies with some modification.
Several hierarchical schemes have been proposed that avoid the bottleneck of centralized strategies, while retaining some of their ability to achieve global balance quickly. Typically, processors are organized in a two-level hierarchy. Managers at the lower level behave like masters in centralized strategies for their domain of processors, and interact with other managers as processors do in a distributed strategy. Alternatively, load balancing is initiated at the lowest levels in the hierarchy, and global balancing is achieved by ascending the tree and balancing the load between adjacent domains at each level in the hierarchy.
Often, priorities are associated with tasks, especially for applications such as branch-and-bound or searching for one solution in a state-space search. Task balancing for these scenarios is complicated because of the need to balance load while ensuring that high-priority work does not get delayed by low-priority work. Sender-initiated random assignment as well as some hierarchical strategies have shown good performance in this context.
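The dimensional exchange idea can be sketched with MPI as follows, assuming the number of processes is a power of two and that each processor's load can be summarized as a single number; transferring the actual work units that correspond to the equalized load is left abstract.

#include <mpi.h>

/* Dimensional exchange: fold a P-processor system (P a power of two) into
   log2(P) dimensions and equalize load along one dimension per phase.
   'my_load' is an abstract scalar; real task migration is not shown. */
double dimensional_exchange(double my_load, MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    for (int bit = 1; bit < nprocs; bit <<= 1) {
        int partner = rank ^ bit;    /* hypercube neighbor in this dimension */
        double partner_load;
        MPI_Sendrecv(&my_load, 1, MPI_DOUBLE, partner, 0,
                     &partner_load, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        my_load = 0.5 * (my_load + partner_load);   /* equalize along this dimension */
    }
    return my_load;   /* after log2(P) phases every rank holds the average load */
}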
Related Entries Chaco Graph Partitioning Hypergraph Partitioning Reduce and Scan Scan for Distributed Memory, Message-Passing Systems Space-filling Curves Task Graph Scheduling Topology Aware Task Mapping
Bibliographic Notes and Further Reading There is a rich history of load balancing literature that spans decades. This entry is only able to cite a very small amount of this literature, and attempts to include modern papers with broad scope and great impact. Kumar et al. [] describe the scalability and performance characteristics of several task scheduling schemes, and Devine et al. [] gives a good overview of the range of techniques used for periodic balancing in modern scientific applications and the load balancing challenges faced by these applications. Xu and Lau [] give an in-depth treatment of distributed load balancing articles, covering both mathematical theory and actual implementations. For practical information on integrating load balancing frameworks into real applications, literature describing principles and practical use of such systems as DRAMA [], Charm++ [, ], Chombo [], and Zoltan [] is a crucial resource. More information is available on the implementation of task scheduling methods, whether they are centralized [, ], distributed [–], or hierarchical []. Work stealing is described in detail by Kumar et al. [], while associated work on Cilk is described in Frigo et al. []. Dinan et al. [] extended work stealing to run on thousands of processors using ARMCI.
Bibliography . Devine K, Hendrickson B, Boman E, St. John M, Vaughan C () Design of dynamic load-balancing tools for parallel applications. In: Proceedings of the international conference on supercomputing, Santa Fe, New Mexico, May . ACM Press, New York
. Kale LV, Zheng G () Charm++ and AMPI: adaptive runtime strategies via migratable objects. In: Parashar M (ed) Advanced computational infrastructures for parallel and distributed applications. Wiley-Interscience, Hoboken, pp – . Basermann A, Clinckemaillie J, Coupez T, Fingberg J, Digonnet H, Ducloux R, Gratien JM, Hartmann U, Lonsdale G, Maerten B, Roose D, Walshaw C () Dynamic load balancing of finite element applications with the DRAMA library. Appl Math Model :– . Chombo Software Package for AMR Applications. (October ) http://seesar.lbl.gov/anag/chombo/. Accessed October . Kumar V, Grama AY, Vempaty NR () Scalable load balancing techniques for parallel computers. J Parallel Distrib Comput ():– . Devine KD, Boman EG, Heaphy RT, Hendrickson BA, Teresco JD, Faik J, Flaherty JE, Gervasio LG () New challenges in dynamic load balancing. Appl Numer Math (–): – . Xu C, Lau FCM () Load balancing in parallel computers theory and practice. Kluwer Academic Publishers, Boston . Zheng G () Achieving high performance on extremely large parallel machines: performance prediction and load balancing . Zheng G, Meneses E, Bhatele A, Kale LV () Hierarchical load balancing for Charm++ applications on large supercomputers. In: Proceedings of the third international workshop on parallel programming models and systems software for high-end computing (PS), San Diego . Chow YC, Kohler WH () Models for dynamic load balancing in homogeneous multiple processor systems. IEEE Trans Comput c-:– . Ni LM, Hwang K () Optimal load balancing in a multiple processor system with many job classes. IEEE Trans Software Eng SE-:– . Corradi A, Leonardi L, Zambonelli F () Diffusive load balancing policies for dynamic applications. IEEE Concurrency ():– . Willebeek-LeMair MC, Reeves AP () Strategies for dynamic load balancing on highly parallel computers. IEEE Trans Parallel Distrib Syst ():– . Sinha A, Kalé LV () A load balancing strategy for prioritized execution of tasks. In: International parallel processing symposium, New Port Beach, CA, April . IEEE Computer Society, Washington, DC, – . Frigo M, Leiserson CE, Randall KH () The implementation of the Cilk- multithreaded language. In: ACM SIGPLAN ’ conference on programming language design and implementation (PLDI), Montreal, vol of ACM SIGPLAN notices, ACM, New York, pp – . Dinan J, Larkins DB, Sadayappan P, Krishnamoorthy S, Nieplocha J () Scalable work stealing. In: SC ’: proceedings of the conference on high performance computing networking, storage and analysis, Portland, OR. ACM, New York, pp –
Locality of Reference and Parallel Processing Keshav Pingali The University of Texas at Austin, Austin, TX, USA
Definition The term locality of reference refers to the fact that there is usually some predictability in the sequence of memory addresses accessed by a program during its execution. Programs are said to exhibit temporal locality if they access the same memory location several times within a small window of time. Programs that access nearby memory locations within a small window of time are said to exhibit spatial locality. Many parallel processors are nonuniform memory access (NUMA) machines in which the time required by a given processor to access a memory location may depend on the address of that location. Parallel programs that exploit this nonuniformity are said to exhibit network locality. Exploiting all types of locality is critical in parallel processing because the performance of most programs is limited by the latency of memory accesses. Hardware mechanisms for exploiting locality include caches and pre-fetching. Software mechanisms include program and data transformations, software pre-fetching, and topology-aware mappings of computations and data. Multithreading can reduce the impact of memory latency on programs that have a lot of parallelism but little locality of reference.
Discussion Introduction The sequence of memory addresses accessed by a program or a thread during its execution is called its memory trace. Memory traces are not random number sequences since there is some predictability to the addresses in most traces, which arises from fundamental features of the von Neumann model of computation. For example, instructions between branches are executed sequentially, so if an instruction at address i is executed and it is not a branch
instruction, the instruction at address i + 1 must be executed next. This kind of predictability in address traces is called locality of reference, and exploiting it is key to obtaining good performance on sequential and parallel machines. There are three kinds of locality of reference: temporal, spatial, and network.
Temporal Locality A program execution exhibits temporal locality if the occurrences of a given memory address in its memory trace occur close to each other. A program is said to exhibit temporal locality if its executions for most inputs of interest exhibit temporal locality. Temporal locality in instruction addresses arises mainly from the execution of loops (in particular, innermost loops in loop nests). For example, when multiplying two N × N matrices using the standard algorithm shown below, the body of the innermost loop is executed N times back to back, resulting in temporal locality in instruction addresses. Temporal locality in instruction addresses may also arise from the execution of "hot" methods that are called repeatedly during program execution.

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
Temporal locality in data addresses may arise from accesses to constants and memory locations accessed repeatedly within loops. For example, in the matrix multiplication code shown above, every iteration of the innermost loop for given outer loop indices (i, j) accesses the same array location C[i][j], giving rise to temporal locality. Some compilers can exploit this temporal locality by loading C[i][j] into a register just before the innermost loop begins execution, and performing the additions to this register rather than to the memory location C[i][j] (the value in the register is written back to memory when the loop finishes execution). In contrast, there is little temporal locality in the references to the B array since successive accesses to a given element of B are separated by O(N²) other addresses in the trace.
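For concreteness, the register-promotion transformation described above might look as follows in C; the function name and the use of C99 variable-length array parameters are choices made for this sketch, not part of the original code.

/* Illustrative sketch: scalar replacement of C[i][j], so that the
   accumulation happens in a register and the memory location is
   read once and written once per (i, j) pair. */
void matmul_scalar_replaced(int N, double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double c = C[i][j];             /* load C[i][j] once        */
            for (int k = 0; k < N; k++)
                c += A[i][k] * B[k][j];     /* accumulate in a register */
            C[i][j] = c;                    /* store back once          */
        }
}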
Spatial Locality A program execution exhibits spatial locality if the occurrences in its trace of nearby memory addresses
occur close together in that trace. A program is said to exhibit spatial locality if its executions for most inputs of interest exhibit spatial locality. Spatial locality in instruction accesses arises because instructions between branches are executed sequentially, so if an instruction at address i is executed and it is not a branch, the instruction at address i + 1 must be executed next. Instruction accesses within innermost loops therefore exhibit both temporal and spatial locality. Programs that perform operations on vectors exhibit spatial locality if successive vector elements are accessed during the computation, such as the following program, which adds the elements of two vectors:

for (i = 0; i < N; i++)
  C[i] = A[i] + B[i];
Notice that in this program, there are no fewer than four sources of spatial locality: the accesses to the three vectors, and the instructions in the loop body. Programs that access vector elements with some small stride also exhibit spatial locality. Programs that access arrays of two or more dimensions will exhibit spatial locality if array elements are accessed successively (or with some small non-unit stride) in the order in which they are stored in memory. For example, in the C programming language, arrays are stored in row-major order, so the elements of the matrix are stored row by row in memory. If a program accesses such arrays row by row, these accesses will exhibit spatial locality. In the matrix multiplication code shown above, the accesses to arrays A and C have spatial locality of reference, since these arrays are accessed row by row, but the accesses to array B do not since this array is accessed by columns.
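To illustrate the row-major argument above, the following C sketch contrasts the two traversal orders; the function names are invented for this example and are not taken from the text.

/* Row-wise traversal touches consecutive addresses (good spatial
   locality in C's row-major layout); column-wise traversal jumps
   N doubles between consecutive accesses. */
double sum_by_rows(int N, double M[N][N])
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += M[i][j];          /* consecutive memory locations */
    return s;
}

double sum_by_cols(int N, double M[N][N])
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += M[i][j];          /* stride of N * sizeof(double) bytes */
    return s;
}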
Network Locality In a parallel computer in which memory addresses are distributed across the processors, a processor may have faster access to memory locations mapped locally than to memory locations mapped to other processors. There may also be differences in the access times to memory locations mapped to different processors because most communication networks for parallel computers are multi-stage networks in which communication packets from a given processor may go through different numbers of stages to get to different processors, and each
stage adds some latency to the communication. Parallel computers that exhibit these kinds of nonuniform memory access times are called Non-Uniform Memory Access (NUMA) parallel computers. A parallel program that attempts to exploit features of NUMA parallel computers is said to exhibit network locality.
Exploiting Locality: Caches Cache memories exploit temporal and spatial locality of reference to provide programmers with the performance characteristics of a large and fast memory system but at reasonable cost. The key is a hierarchical organization of the storage, known usually as the memory hierarchy and shown pictorially below. On modern processors, there may be many memory hierarchy levels such as the registers, Level 1 (L1) cache, Level 2 (L2) cache, Level 3 (L3) cache, and main memory. The virtual memory system can be considered to be another level in the memory hierarchy. L1 cache memory is small, fast, and usually implemented in SRAM, while main memory is large, relatively slow, and implemented in DRAM; the other levels of cache fall somewhere in between these two extremes. For example, on the Intel Nehalem multicore processor, each core has 32 KB L1 instruction and data caches and a 256 KB L2 cache; all the cores share an 8 MB L3 cache. The size of the main memory may be many GB. On most processors, accesses to the L1 cache typically take only a cycle or two, and the latency of memory accesses increases by roughly an order of magnitude in going from one level to the next (Fig. ).
[Figure: memory hierarchy pyramid, from registers (part of the pipeline, ~32 entries) through the L1 cache (~32 KB, ~1 cycle), L2 cache (~256 KB, ~10 cycles), and L3 cache (~10 MB, ~50 cycles) down to main memory (~1 GB or more, ~250 cycles).]
Given such a memory hierarchy, the effective latency of memory accesses as observed by the processor will be considerably less than DRAM access time provided most of the memory accesses are satisfied by the faster cache levels. One way to accomplish this is to store the addresses and contents of the most recently accessed memory locations in the cache memories. When a processor issues a memory request, the fastest cache level in which that address is cached responds to the request, so the long-latency trip to memory is required only if the address is not cached anywhere. Programs that have good temporal locality obviously benefit from this caching strategy. When a memory request goes all the way to main memory, most memory systems will fetch not just the contents of the desired address but the contents of a block of addresses, one of which is the desired address (this block of addresses is called the cache block or cache line containing that address). Because the latency of fetching a cache block from memory is almost the same as the latency of fetching the contents of a single address, programs with spatial locality benefit from this caching policy. The design of caches becomes more complex in parallel machines because of the cache-coherence problem. A major advantage of caches is that they are transparent to the programmer in the sense that although caches may confer performance benefits, they do not change the output of the program. Ensuring this transparency in a shared-memory multiprocessor is complicated by the fact that a memory location may be cached in the local caches of several cores/processors, so it is necessary to ensure that updates to that address made by different processors are coordinated in some way to preserve the transparency of caching to the programmer. This is referred to as the cache-coherence problem, and many solutions to this problem have been explored in the literature.
Locality of Reference and Parallel Processing. Fig. Memory hierarchy of a typical processor
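To make the lookup-and-fetch behavior just described concrete, here is a minimal C sketch of a direct-mapped cache model; the line size, cache size, and function name are arbitrary assumptions for the example, not a description of any particular processor.

#include <stdbool.h>
#include <stdint.h>

/* Toy direct-mapped cache: each address maps to exactly one line.
   A hit occurs when the stored tag matches; on a miss the whole
   line is (notionally) fetched, which is what rewards spatial locality. */
#define LINE_BYTES 64
#define NUM_LINES  512                     /* 512 * 64 B = 32 KB */

static uint64_t tag_of[NUM_LINES];
static bool     valid[NUM_LINES];

bool access_hits(uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;     /* discard the offset within the line */
    uint64_t set  = line % NUM_LINES;      /* which cache line it maps to        */
    uint64_t tag  = line / NUM_LINES;      /* identifies the memory block        */

    if (valid[set] && tag_of[set] == tag)
        return true;                       /* temporal or spatial reuse          */
    valid[set]  = true;                    /* miss: install the fetched line     */
    tag_of[set] = tag;
    return false;
}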
Program Transformations for Exploiting Caches While caches are transparent to the programmer, obtaining the performance benefits of caching may require the programmer to be aware of the cache hierarchy and to transform programs to become more "cache-friendly" by enhancing temporal and spatial locality. For the most part, these transformations can be divided
into two classes: (1) rescheduling of operations and (2) reordering of data structures. The goal of rescheduling operations is to ensure that operations that access the same memory address (or more generally, the same cache line) are executed more or less contemporaneously, thereby improving locality. In dense matrix computations, loop permutation and loop tiling are the most important transformations for enhancing locality. For example, consider the matrix multiplication program discussed above. As is well known, all six permutations of the three loops produce the same output; however, their locality characteristics are very different. If the j loop is permuted into the innermost position in the loop nest, the references to arrays C and B enjoy good spatial locality since the inner loop accesses these arrays row by row, and the reference to array A enjoys good temporal locality since all iterations of the inner loop for given outer loop indices access the same element of A. Conversely, permuting the i loop into the innermost position is bad for exploiting spatial locality since none of the matrices will be accessed by row in the innermost loop. While loop permutation is useful, the most important transformation for enhancing locality is loop tiling. A tiled version of matrix multiplication is shown below. The effect of this transformation is to produce block matrix programs, in which the inner loops can be viewed as operating on blocks of the original matrices. In the tiled code shown below, BLOCK_SIZE is a parameter that determines the size of the blocks multiplied in the inner three loops. The value of this parameter must be chosen carefully so that the working set of the three innermost loops fits in cache. If there are multiple cache levels, it may be necessary to tile for some or all of these levels. Choosing an optimal tile size is a difficult problem in general, and it may be necessary to execute the program for different tile sizes and determine the best one by measurement. This approach is called empirical optimization and it is used by the ATLAS system, which generates high-performance Basic Linear Algebra Subroutines (BLAS). It is often possible to use simple machine models to reduce the search space.

for (ii = 0; ii < N; ii += BLOCK_SIZE)
  for (kk = 0; kk < N; kk += BLOCK_SIZE)
    for (jj = 0; jj < N; jj += BLOCK_SIZE)
      for (i = ii; i < ii + BLOCK_SIZE && i < N; i++)
        for (k = kk; k < kk + BLOCK_SIZE && k < N; k++)
          for (j = jj; j < jj + BLOCK_SIZE && j < N; j++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
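As a rough, hedged starting point for the tile-size choice discussed above (and not the model actually used by ATLAS), one can require that three BLOCK_SIZE × BLOCK_SIZE blocks of doubles fit in the target cache:

#include <math.h>
#include <stddef.h>

/* Back-of-the-envelope initial tile size: three BxB blocks of doubles
   (one each from A, B, and C) should fit in a cache of the given size.
   Empirical search around this value is still advisable. */
int initial_block_size(size_t cache_bytes)
{
    return (int)sqrt((double)cache_bytes / (3.0 * sizeof(double)));
}

For a 256 KB cache this formula suggests a block size of roughly 100; measurement, as noted above, would refine that value.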
Spatial locality can also be improved by changing the layout of data structures in memory. For example, in the (i, j, k) order of the three-nested-loop version of matrix multiplication, changing the layout of B to column-major order will improve spatial locality for the references to array B. For data transformations to be useful, the overhead of transforming the data layout must be amortized by the benefits of improved spatial locality in computing with that data. For example, when performing the tiled matrix multiplication shown above, high-performance implementations will usually copy blocks of the matrices into contiguous memory locations because the overhead of copying the data is amortized by the benefits of improved spatial locality when performing the block matrix multiplication. Many techniques for changing the layouts of other kinds of data structures have been explored in the literature. For example, records and structures that span multiple cache lines can be allocated so that fields that are often accessed contemporaneously are packed together into the same cache line, if possible, rather than being allocated in different cache lines. In managed languages like Java and C#, the garbage collector can improve locality by moving data items that are accessed contemporaneously into the same page of virtual memory.
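A hedged sketch of the block-copying idea mentioned above (the function and its arguments are invented for illustration):

/* Copy the BxB block whose top-left corner is (i0, j0) of a row-major
   NxN matrix into a contiguous buffer, so that the subsequent block
   multiplication walks consecutive addresses. */
void copy_block(int N, int B, int i0, int j0, const double *M, double *buf)
{
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++)
            buf[i * B + j] = M[(i0 + i) * N + (j0 + j)];
}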
Cache-Oblivious Programs Divide-and-conquer algorithms for many problems are naturally cache-friendly because they repeatedly generate sub-problems of smaller size until they reach a base case, and then assemble the solution to the problem from the solutions to these sub-problems. If the base case of the divide-and-conquer algorithm is chosen so that the working set of that sub-problem fits in the cache, and the data movement required for assembling the solution from the solutions to the sub-problems is small, the program may have excellent locality. Divide-and-conquer algorithms for problems like matrix multiplication, FFT, and Cholesky factorization enjoy this
property and they are known as cache-oblivious programs because they have good temporal locality even though they are not explicitly optimized for any particular cache size or structure; this is in contrast to iterative algorithms for these problems, which require tiling to be made cache-friendly. Spatial locality in cache-oblivious matrix multiplication programs can be enhanced by using space-filling curves to order the storage of matrix elements.
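For illustration only, a recursive matrix multiplication in this cache-oblivious style might be sketched in C as follows; the splitting strategy, base-case size, and row-major layout with a leading dimension are assumptions of the sketch rather than a description of any particular published algorithm.

/* C += A * B, where A is n x m, B is m x p, C is n x p, all stored
   row-major with leading dimension ld. The largest dimension is split
   recursively until the sub-problem is small enough to fit in cache. */
#define BASE 32

void rec_matmul(int n, int m, int p,
                const double *A, const double *B, double *C, int ld)
{
    if (n <= BASE && m <= BASE && p <= BASE) {
        for (int i = 0; i < n; i++)            /* small problem: plain loops */
            for (int k = 0; k < m; k++)
                for (int j = 0; j < p; j++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
    } else if (n >= m && n >= p) {             /* split the rows of A and C  */
        rec_matmul(n/2, m, p, A, B, C, ld);
        rec_matmul(n - n/2, m, p, A + (n/2)*ld, B, C + (n/2)*ld, ld);
    } else if (m >= p) {                       /* split the inner dimension  */
        rec_matmul(n, m/2, p, A, B, C, ld);
        rec_matmul(n, m - m/2, p, A + m/2, B + (m/2)*ld, C, ld);
    } else {                                   /* split the columns of B, C  */
        rec_matmul(n, m, p/2, A, B, C, ld);
        rec_matmul(n, m, p - p/2, A, B + p/2, C + p/2, ld);
    }
}

Because the recursion adapts to whatever cache sizes exist, the same code can exploit every level of the hierarchy without knowing their sizes, which is the cache-oblivious property described above.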
Prefetching Caches can be classified as latency-avoidance mechanisms since the goal is to avoid long-latency memory operations by exploiting locality of reference. The performance impact of long-latency memory operations can also be reduced by latency-tolerance techniques such as prefetching. If the memory accesses of a program can be predicted well in advance of when the values are actually required in the computation, the processor might be able to prefetch these values from memory into the cache so that they are available immediately when they are needed for the computation. This is an instance of a more general optimization technique in parallel programming called overlapping communication with computation: the key idea is to perform data movement in parallel with computation to reduce the effective overhead of the data movement. Prefetching is particularly useful for streaming programs since these programs touch large volumes of data and have predictable data accesses but have little temporal locality, so caches are not effective for reducing the effective memory latency. In some processors, prefetching is under the control of the programmer. For example, the IA-32 SSE instruction set has prefetch instructions for prefetching data into some or all cache levels. The programmer or compiler is responsible for inserting prefetch instructions into the appropriate places in the program. In other processors, prefetching is implemented in hardware. Usually, the prefetch unit monitors the stream of memory addresses from the processor and attempts to detect sequences of fixed-stride memory accesses. If it finds such a sequence, it uses that sequence to determine memory addresses for prefetching. To avoid unnecessary page faults, prefetching usually does not cross page boundaries.
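As a sketch of programmer-inserted prefetching, the following C fragment uses the __builtin_prefetch intrinsic available in GCC and Clang; the prefetch distance is a tuning parameter chosen arbitrarily here, and the function itself is invented for the example.

/* Request the operands a few iterations ahead while computing on the
   current ones. The second argument (0) marks a read access; the third
   is a low-temporal-locality hint. */
#define PREFETCH_DIST 16

double dot_product(int n, const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n) {
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1);
            __builtin_prefetch(&b[i + PREFETCH_DIST], 0, 1);
        }
        s += a[i] * b[i];
    }
    return s;
}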
Exploiting Network Locality For techniques to exploit network locality, see the Encyclopedia entry on Topology Aware Task Mapping.
Using Parallelism to Tolerate Memory Latency When programs have little locality of reference and also make unpredictable memory accesses, neither caching nor prefetching is effective in reducing the performance impact of long-latency memory accesses. Many irregular programs such as graph computations fall in this category. For such programs, it is possible to tolerate latency by exploiting parallelism, provided the processor can switch rapidly between parallel threads of execution. There are many implementations of this scheme, but a simple approach is the following. The processor issues instructions from a thread until that thread becomes idle waiting for a memory load to complete. At that point, the processor switches to a different thread that is ready to execute and begins issuing instructions from that thread, and so on. As long as there is enough parallelism in the program and the processor can switch quickly between threads, the performance impact of long-latency loads from memory is minimized. In other designs, the processor switches between ready threads at every cycle. This approach requires specially designed processors since the processor needs to be able to switch between threads very rapidly. Tera's MTA processor is an example: each processor can support 128 simultaneous threads. Simultaneous multithreading (SMT) is an implementation of this idea in general-purpose processors. The same principle also underlies the design of dataflow computers, although threads in that case are very fine-grained, comprising a single instruction.
Related Entries Numerical Linear Algebra Code Generation Loop Nest Parallelization Nonuniform Memory Access (NUMA) Machines Parallelization, Automatic Parallelization, Basic Block Scheduling Algorithms Topology Aware Task Mapping Unimodular Transformations
Bibliographic Notes and Further Reading Although the importance of exploiting locality of reference in programs was recognized from the very start of computing by pioneers like von Neumann, the first machine to incorporate cache memory was the IBM 360 Model 85 []. Work on transforming programs to optimize them for parallelism and locality of reference began at the University of Illinois, Urbana-Champaign in the 1970s [, ]. The automation of loop and data transformations was enabled by the development of powerful integer linear programming tools in the 1990s [, , ]. For the most part, these techniques are restricted to dense matrix programs. Data-shackling is a more general data-centric approach to reasoning about locality []. Data transformations for non-array data structures are described in []. Determining optimal values for parameters such as tile sizes is an active topic of research. The empirical optimization approach is incorporated in the popular ATLAS system, which produces automatically tuned Basic Linear Algebra Subroutines (BLAS) []. An approach using simple performance models and a detailed comparison with the empirical optimization approach can be found in []. Cache-oblivious algorithms [] are based on the notion of I/O complexity []. Latency tolerance through multithreading was implemented in the Tera computer []. Network locality and network-oblivious algorithms are explored in [].
Bibliography
. Automatically tuned linear algebra software (ATLAS). http://math-atlas.sourceforge.net/
. Allen R, Kennedy K () Optimizing compilers for modern architectures. Morgan Kaufmann Publishers Inc, San Francisco
. Alverson G, Kahan S, Korry R, McCann C, Smith B () Scheduling on the Tera MTA. In: Proceedings of the workshop on job scheduling strategies for parallel processing. Lecture notes in computer science, vol . Springer, Berlin
. Bilardi G, Pietracaprina A, Pucci G, Silvestri F () Network-oblivious algorithms. In: Proceedings of the st international parallel and distributed processing symposium. IEEE Publishers, New York
. Chilimbi TM, Larus JR () Using generational garbage collection to implement cache-conscious data placement. In: Proceedings of the st international symposium on memory management, ISMM. ACM Publishers, New York
. Cierniak M, Li W () Unifying data and control transformations for distributed shared memory machines. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation, PLDI. ACM Publishers, New York
. Feautrier P () Some efficient solutions to the affine scheduling problem, I, one dimensional time. Int J Parallel Prog ():–
. Frigo M, Leiserson C, Prokop H, Ramachandran S () Cache-oblivious algorithms. In: Proceedings of the th IEEE symposium on foundations of computer science (FOCS), New York
. Hennessy J, Patterson D () Computer architecture: a quantitative approach. Morgan Kaufmann Publishers Inc, San Francisco
. Hong J-W, Kung H () I/O complexity: the red-blue pebble game. In: Proceedings of the th annual ACM symposium on the theory of computing, Milwaukee
. Intel 64 and IA-32 architectures software development manual. Order number –US, December . Intel Corporation
. Kodukula I, Ahmed N, Pingali K () Data-centric multi-level blocking. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation, PLDI. ACM Publishers, New York
. Kuck D, Kuhn R, Padua D, Leasure B, Wolfe M () Dependence graphs and compiler optimizations. In: Proceedings of the th ACM SIGPLAN-SIGACT symposium on principles of programming languages, Williamsburg, – Jan. POPL. ACM Publishers, New York
. Wolf M, Lam M () A data locality optimizing algorithm. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation, PLDI. ACM Publishers, New York
. Yotov K, Li X, Ren G, Garzaran M, Padua D, Pingali K, Stodghill P () Is search really necessary to generate high-performance BLAS? Proc IEEE ():–
Lock-Free Algorithms Non-Blocking Algorithms
Locks Synchronization Transactional Memory
Logarithmic-Depth Sorting Network AKS Network
Logic Languages
Manuel Carro, Manuel Hermenegildo
Universidad Politécnica de Madrid, Madrid, Spain
IMDEA Software Institute, Madrid, Spain
Synonyms Concurrent logic languages; Distributed logic languages; Prolog
Definition The common name Parallel Logic Languages groups those languages which are based on logic programming and which, while respecting the declarative semantics as much as possible, have an operational semantics that exploits parallelism or concurrency, either explicitly or implicitly, to gain in efficiency or expressiveness.
Discussion Logic Programming The application of logic and mechanized proofs to express problems and their solutions is at the origins of computer science []. The basis of this approach is to express the knowledge about some problem (e.g., how a sorted tree is organized) as a consistent theory in some logic and to model the desired goal (e.g., storing an item in the tree) as a formula. If the formula is true in the theory capturing the problem conditions, then the objective expressed by such a formula is achievable (i.e., the item can be stored). The formulas which are true in the theory are those which state the objectives achievable (computable) by the program. This constitutes the declarative semantics of logic programs. A usual proof strategy is to add to the initial theory a formula stating the negation of the desired goal and to prove the inconsistency of the resulting theory by generating a counterexample. The valuation(s) of the free variable(s) in the counterexample constitute a solution to the initial problem. As an example, Fig. presents a set T of formulas describing insertion in a sorted tree, where constants, predicate names, and term names are written
in lowercase, and variables start with uppercase letters, in conformance with logic programming languages. Arithmetic predicates are assumed to be available. void represents the empty tree. An argument of the form tree(L, I, R) represents a tree node with item I and left and right children L and R, respectively. A fact insert(T₁, I, T₂) can be inferred from this theory if T₂ is the resulting tree after inserting item I in tree T₁. Given a term T₁ representing a sorted tree and an item I to be stored, the inconsistency of T ∧ ¬∃T₂. insert(T₁, I, T₂) can be proved with a counterexample for T₂, which will be, precisely, the tree which results from inserting I into T₁. From the procedural point of view, the method used to generate a counterexample is basically a search, where at each step one of the inference rules of the logic is applied, maybe leaving other rules pending to be applied. Making this search as efficient as possible is paramount in order to make logic a practical programming tool. This can be achieved by reducing the set of inference rules to be applied and the number of places where they can be applied. Concrete logic programming languages specify the deduction method (operational semantics) further, as well as the syntax and (possibly) subsets of the logic. The most successful combination is resolution (only one inference rule) on Horn clauses (conjunctions of non-negated literals implying a non-negated literal, as in Fig. ). In particular, the usual operational semantics (SLD resolution) for the logic programming language Prolog uses the resolution principle with two rules to decide which literals the inference rule is applied to. SLD resolution in Prolog:
● Explores conjunctions in every implication from left to right, as written in the program text (computation rule).
● Tries to match and evaluate clauses from top to bottom (search rule).
From a goal like insert(tree(void, , void), , T₂), SLD resolution selects one of the implications with a matching head, renames apart its parameters (i.e., variables with new names which do not clash with previous ones are created), and expands the right-hand side. The
(∀I. insert(void, I, tree(void, I, void)))  ∧  (1)
(∀L, I, R, It, Il. insert(tree(L, I, R), It, tree(Il, I, R)) ← It < I ∧ insert(L, It, Il))  ∧  (2)
(∀L, I, R, It, Ir. insert(tree(L, I, R), It, tree(L, I, Ir)) ← It > I ∧ insert(R, It, Ir))  ∧  (3)
(∀L, I, R, It. insert(tree(L, I, R), It, tree(L, I, R)) ← It = I)  (4)
Logic Languages. Fig. Insertion in a sorted tree
first matching clause is (2), and expanding its body results in: T₂ = tree(Il, , void) ∧ < ∧ insert(void, , Il). This conjunction of goals is called a resolvent. The leftmost one is a unification [] (i.e., an equation which expresses syntactic equality between terms) which, when successful, generates bindings for the variables in it. In this case, as T₂ is a variable, it just succeeds by assigning the term tree(Il, , void) to it. That leaves the resolvent: < ∧ insert(void, , Il). The next goal to be solved is the arithmetic comparison, which is false, and therefore the whole conjunction is false and another alternative for the last selection is to be taken. Bindings made since the last selection are undone, and clause (3) is selected: T₂ = tree(void, , Ir) ∧ < ∧ insert(void, , Ir). In this case, the arithmetic predicate succeeds and the resolvent is reduced to insert(void, , Ir). This matches the head of the first clause and succeeds with the binding Ir = tree(void, , void). The initial goal is then proved with the output binding T₂ = tree(void, , tree(void, , void)), which represents a tree with the original item at its root and the newly inserted item in its (unique) right subtree. Logic programming languages include additional nonlogical constructs to perform I/O, guide the execution, and provide other facilities. They also offer a compact, programming-oriented syntax (e.g., avoiding quantifiers as much as possible; compare the rest of the
programs in this text with the mathematical representation of Fig. ).
Operational View and Comparison with Other Programming Paradigms Due to the fixed computation and search rules, SLD resolution can be viewed as a relatively simple operational semantics that has significant similarities to that of other programming paradigms. This provides a highly operational view of resolution-based theorem proving. In fact, in the procedural view, SLD is akin to a traditional execution that traverses clause bodies sequentially from left to right, in the same way execution traverses procedure bodies in traditional languages. Predicates take the role of procedures, clause head arguments are procedure arguments, clauses are similar to case statements, etc. Predicate invocation takes the role of procedure calling and unification amounts to parameter passing (both for input and output). Logical variables are similar to (declarative) pointers that can be used to pass data or pointers inside data structures. Thus, pure deterministic logic programs can be seen as traditional programs with pointers and dynamic memory allocation but without destructive update. In addition, in nondeterministic programs clause selection offers a built-in backtracking-based search mechanism not present in other paradigms, which allows exploring the possible ways to accomplish a goal. Finally, unification provides a rich, bidirectional pattern matching-style parameter passing and data access mechanism. Similarly, logic programming (especially variants that support higher order) can be seen as a generalization of functional programming where pattern matching is bidirectional, logic variables are allowed, and more than one definition is possible for each function. See, for example, [] for a more detailed discussion of the correspondence between logic programming and other programming paradigms.
Toward Parallelism and Concurrency Parallel logic programming languages aim at retaining the observable semantics while speeding up the execution w.r.t. a sequential implementation. Concurrent logic languages (Sect. Concurrency in Logic Programming) focus more on expressiveness and often adopt an alternative semantics. Opportunities for parallel and concurrent execution come from relaxing the left-to-right, top-to-bottom order of the computation and search rules: it is, in principle, possible to select several different literals in a logical conjunction to resolve against simultaneously (and-parallelism), or to explore several different clauses at the same time (or-parallelism) [, ]. The distinction between parallel and concurrent logic languages is not as clear-cut as implied so far. In fact, parallel execution models have been proposed for concurrent logic languages in which independent parts of the execution of concurrent processes are executed on different processors simultaneously. Conversely, the language constructs developed for parallelism have also been used for concurrency and distributed execution. The current trend is toward designing languages that combine in useful ways parallelism, concurrency, and search capabilities. Most parallel logic programming languages have been implemented for shared-memory multiprocessors. There is no a priori reason not to execute a logic program in a distributed fashion, as demonstrated by distributed implementations []. However, the need to maintain a consistent binding store seen by all the agents involved in an execution imposes an overhead which does not appear in the shared-memory case. The focus will, therefore, be on linguistic constructions which, although not necessarily requiring shared-memory computers, have been mainly exploited on this kind of architecture.
Exploiting Parallelism in Logic Programming Independently of whether parallelism is exploited among conjunctions of literals or among different clauses, it can be detected in several ways:
Implicit parallelism: The compiler and run-time system decide what can be executed in parallel with no programmer intervention, analyzing, for example, data dependencies. One disadvantage of this approach is that it may impose considerable run-time overhead, and the programmer cannot give clues as to what can (or should) be executed in parallel.
Explicit parallelism: The compiler and run-time system know beforehand which clauses/goals are to be executed in parallel, typically via programmer annotations. This has the disadvantage that it puts the responsibility of parallelization (a hard task) solely in the hands of the programmer.
Combinations: This is realized in practice as a combination of handwritten annotations, static analysis, and run-time checks.
The last option is arguably the one with the most potential, as it aims at both reducing run-time effort and simplifying the task of the programmer, while still allowing programs to be annotated for parallel execution. In the following sections, variants of these sources of parallelism will be reviewed briefly, assuming that annotations to express such parallelism at the source language level are available and the programs have been annotated with them. Methods for introducing these annotations automatically (briefly mentioned in Sect. Automatic Detection of Parallelism) are the subject of another entry (Parallelization, Automatic).
And-Parallelism And-parallelism stems from simultaneously selecting several literals in a resolvent (i.e., several steps in a procedure body). While the semantics of logic programming makes it possible to safely select any goal from the resolvent, practical issues make it necessary to carefully choose which goals can be executed in parallel. For example, the evaluation of arithmetic expressions in classical Prolog needs variables to be adequately instantiated, and therefore the sequential execution order should be kept in some fragments; input/output also needs to follow a fixed execution order; and tasks which are too small may not benefit from parallel execution []. Whether a sequential execution order has to be respected (either for efficiency or correctness) can be decided based on goal independence, a general notion which captures whether two goals can proceed without mutual interference. Deciding whether the amount of
work inside a goal is enough to benefit from parallel execution can be done with the use of (statically inferred) cost (complexity) functions. With them, the expected sequential and parallel execution time can be compared at compile time or at run time (but before executing the parallel goals). An Illustrating Example
The code in Fig. implements a parallel matrix by matrix multiplication. Commas denote sequential conjunctions (i.e., A, B means “execute goal A first, and when successfully finished, execute goal B”), while goals separated by an ampersand (A & B) can be safely executed in parallel. Matrices are represented using nested lists, and matrix multiplication is decomposed into loops performing vector by matrix and vector by vector multiplication, which are in turn expressed by means of recursion. Every iteration step can be performed independently of the others in the same loop and therefore parallelism can be expressed by stating that predicate calls in the body of the recursive clauses (e.g., mat_vec_multiply in mat_mat_multiply) can be executed in parallel with the corresponding recursive call (mat_mat_multiply). Note that the arithmetic operation R is Prod + NewRes is performed after executing the other goals in the clause, because it needs NewRes to be computed.
mat_mat_multiply([], _, []).
mat_mat_multiply([V0|Rest], V1, [R|Os]) :-
    mat_vec_multiply(V1, V0, R) &
    mat_mat_multiply(Rest, V1, Os).

mat_vec_multiply([], _, []).
mat_vec_multiply([V0|Rest], V1, [R|Os]) :-
    vec_vec_multiply(V0, V1, R) &
    mat_vec_multiply(Rest, V1, Os).

vec_vec_multiply([], [], 0).
vec_vec_multiply([H1|T1], [H2|T2], R) :-
    Prod is H1*H2 &
    vec_vec_multiply(T1, T2, NewRes),
    R is Prod + NewRes.

Logic Languages. Fig. Matrix multiplication
The & construct is similar to a FORK-JOIN, where the existence of A & B implicitly marks a FORK for the goals A and B and a JOIN after these two goals have finished their execution. Such parallel execution can be nested to arbitrary depths. The strict fork-join structure is also not compulsory and other, more flexible parallelism primitives are also available. A relevant difference with procedural languages is the possible presence of search, that is, a goal can yield several solutions on backtracking, corresponding to the selection of different clauses, or even fail without any solution. Parallel execution should cope with this: when A & B can yield several solutions, an adequate operational semantics (and underlying execution mechanism) has to be adopted. Several approaches exist depending on the relationship between A and B. Perhaps, the simplest one is to execute in parallel and in separate environments A and B, gather all the solutions, and then build the cross product of the solutions. However, even if A and B can be safely executed in parallel, this approach is often not satisfactory for performance reasons. Independent And-Parallelism
Independent and-parallelism (IAP) only allows the parallel execution of goals which do not interfere through, for example, bindings to variables, input/output, assertions onto the shared database, etc. In absence of side effects, independence is customarily expressed in terms of allowing only compatible accesses to shared variables. Independent and-parallelism can be strict or nonstrict. The strict variant (SIAP) only allows parallel execution between goals which do not share variables at run time. The non-strict variant (NSIAP) relaxes this requirement to allow parallel execution of goals sharing variables as long as at most one of the parallel goals attempts to bind/check them or they are bound to compatible values. The matrix multiplication (Fig. ) features SIAP if it is assumed that the input data (the first two matrices) does not contain any variable at all (i.e., these matrices are ground) and therefore independence is ensured as no variables can be shared. As a further example, Fig. , left, is an implementation of the QuickSort algorithm, where it is assumed that a ground list of elements is given in order to be sorted. In this case, partition splits the input list into two lists which are recursively sorted and joined
qsort([], []).
qsort([X|Xs], Sorted) :-
    partition(Xs, X, Large, Small),
    qsort(Small, SmallSorted),
    qsort(Large, LargeSorted),
    append(SmallSorted, [X|LargeSorted], Sorted).

qsort([], []).
qsort([X|Xs], Sorted) :-
    partition(Xs, X, Large, Small),
    qsort(Small, SmallSorted) &
    qsort(Large, LargeSorted),
    append(SmallSorted, [X|LargeSorted], Sorted).

% Partition is common to all QuickSort versions
partition([], _P, [], []).
partition([X|Xs], P, [X|Ls], Ss) :-
    X > P,
    partition(Xs, P, Ls, Ss).
partition([X|Xs], P, Ls, [X|Ss]) :-
    X =< P,
    partition(Xs, P, Ls, Ss).

Logic Languages. Fig. QuickSort: sequential version (left) and SIAP version (right)
qsort(Unsrt, Sorted) :-
    qsort(Unsrt, Sorted, []).

qsort([], S, S).
qsort([X|Xs], SortedSoFar, Tail) :-
    partition(Xs, X, Large, Small),
    qsort(Small, SortedSoFar, [X|Rest]),
    qsort(Large, Rest, Tail).

qsort(Unsrt, Sorted) :-
    qsort(Unsrt, Sorted, []).

qsort([], S, S).
qsort([X|Xs], SortedSoFar, Tail) :-
    partition(Xs, X, Large, Small),
    qsort(Small, SortedSoFar, [X|Rest]) &
    qsort(Large, Rest, Tail).

Logic Languages. Fig. QuickSort, using difference lists and annotated for non-strict and-parallelism
afterward. The code on the right is the SIAP parallelization, correct under these assumptions, where the two calls to QuickSort are scheduled for parallel execution. A perhaps more interesting example is the implementation of QuickSort in Fig. . The sequential version on the left uses a technique called difference lists. The sorted list is, in this case, not ended with a nil, but with a free variable (a “logical pointer”) which is carried around. When this variable is instantiated, the tail of the difference list is instantiated. This makes it possible to append two lists in constant time. A procedural counterpart would carry around a pointer to the last element of a list, so that it can be modified directly. The two calls to Qsort share a variable and cannot be executed in parallel following the SIAP scheme. However, since only the second call will attempt to bind it they can be scheduled for NSIAP execution. A last example of IAP (this time including some search) appears in Fig. . It is a solution to the problem
of placing N queens on an N × N chessboard so that no two queens attack each other. It places the queens column by column, picking (with select/3) a position from a list of candidates. not_attack/3 checks that this candidate does not attack already placed queens. Search is triggered by the multiple solutions of select/3, and the program can return all the solutions to the queens problem via backtracking. Checking that the new queen does not attack already placed queens is independent of trying to place the rest of the queens, and so the two goals can be run in parallel, performing some speculative work. Unlike the previous examples, the not_attack/3 check can fail and force the failure of a series of parallel goals, which should be immediately stopped. Applying the execution scheme sketched in Sect. An Illustrating Example (collecting solutions and making a cross product) is not appropriate: if a candidate queen cannot be placed on the chessboard, the work invested in trying to build the rest of the solutions is
queens(N, Qs) :-
    range(1, N, Ns),
    queens(Ns, [], Qs).

queens([], Qs, Qs).
queens(UnplacedQs, SafeQs, Qs) :-
    select(UnplacedQs, UnplacedQs1, Q),
    not_attack(SafeQs, Q, 1) &
    queens(UnplacedQs1, [Q|SafeQs], Qs).

not_attack([], _, _).
not_attack([Y|Ys], X, N) :-
    X =\= Y+N, X =\= Y-N,
    N1 is N+1,
    not_attack(Ys, X, N1).

select([X|Xs], Xs, X).
select([Y|Ys], [Y|Zs], X) :-
    select(Ys, Zs, X).
Logic Languages. Fig. N-Queens annotated for and-parallelism
wasted. If only one solution is needed and the candidate queen is correctly placed, the work spent in finding all the solutions is wasted. One workaround is to adapt the sequential operational semantics to the parallel case and perform recomputation. Assuming the conjunction A & B & C, forward execution is done as in a FORK-JOIN scheme, and backtracking is performed from right to left. New solutions are first sought for C and combined with the existing ones for A and B. When the solutions for C are exhausted, B is restarted to look for another solution while C is relaunched from the beginning. When the goal A finitely fails, the whole parallel conjunction fails. This generates the solutions in the same order as the sequential execution, which may be interesting for some applications. Additionally, in the case of IAP, if one of the goals (say, B) fails without producing any solution, the parallel conjunction can immediately fail. This is correct because B does not see bindings from A, and therefore no solution for A can cause B not to fail. This semi-intelligent backtracking (combined with some constraints on scheduling) guarantees the no-slowdown property: an IAP execution will never take longer than a sequential one (modulo overheads due to scheduling and goal launching). Dependent And-Parallelism
IAP’s independence restriction has its roots in technical difficulties: a correct, efficient implementation of a language with logical variables and backtracking and in which multiple processing units are concurrently trying to bind these variables is challenging – that is the scenario in Dependent And-Parallelism (DAP). On one
hand, every access to (potentially) shared variables and their associated internal data structures has to be protected. Additionally, when two processes disagree on the value given to a shared variable, one of them has to backtrack to generate another binding. Parallel goals sharing the variable whose binding is to be undone may have to backtrack if the binding was used to select which clause to take. However, since DAP subsumes IAP, it offers a priori more opportunities for parallelism: in any of the QuickSort examples, the lists generated by partition can be directly fed to the qsort goals, which could run in parallel with each other and with the partition predicate. When a new item is generated, the corresponding qsort call will wake up and advance the sorting process. If the goals to be executed in parallel do not bind shared variables differently (i.e., they collaborate rather than compete), the problem of ensuring binding coherence disappears (it is, in fact, NSIAP). Moreover, if they are deterministic, there is no need to handle backtracking, and the parallel goals act as if they were producers and consumers. This is called (deterministic) stream and-parallelism, and can be (efficiently) exploited. The QuickSort program is an example when it is called with a free variable as its second argument, because this ensures that sorting will not fail and will produce a single solution. On the other hand, an initial goal such as ?- qsort([1,2], [2,1]) will fail, and cannot be executed using stream parallelism. Despite its complexity, proposals which fully deal with DAP exist. One approach to avoid considering all variables as potentially shared is to mark possibly shared variables at compile time, so that only these are tested at
run time [] (note that in the example below X and Y are aliased after the execution of q):

p(X, Y) :-
    q(X, Y),
    dep([X, Y]) => r(X) & s(Y) & t(Y).
q(A, A).

Shared variables are given a special representation to, for example, make it possible to lock them. In order to control conflicts in the assignment to variables, goals are dynamically classified as producers or consumers of variables as follows: for a given variable X, the leftmost goal which has access to X is its producer, and can bind it. The rest of the goals are consumers of the variable, and they suspend should they try to bind the variable. In the previous code, r would initially be the producer of X (which is aliased to Y by q/2) and s and t its consumers. When r finishes, s becomes the producer of Y. The overhead of handling this special variable representation is large and impacts sequential execution: experiments show a noticeable reduction in sequential speed. Goal-Level and Binding-Level Independence
While IAP and DAP differ in the kind of problems they attack and the techniques needed to implement them, there are unifying concepts behind them: both IAP and DAP need, at some level, operations which are executed atomically and without interference. In the case of DAP, binding a shared variable and testing that this binding is consistent with those made by other processes is an atomic operation. Stream and-parallelism avoids testing
for consistency by requiring that bindings by different processes do not interfere; thereby it exhibits independence at the level of individual bindings, while in IAP noninterference appears at the level of complete goals. The requirement that variables are not concurrently written to makes variable locking unneeded for IAP. Other proposals (Sect. Eventual and Atomic Tells) have a notion of independence at the level of groups of unifications explicitly marked in the program source.
Or-Parallelism Or-parallelism originates from executing in parallel the different clauses which can be used to solve a goal in the resolvent. Multiple clauses, as well as the continuation of the calling goal, can thus be simultaneously tried. Or-parallelism allows searching for solutions to a query faster, by exploring in parallel the search space generated by the program. For example, in the queens program of Fig. the two clauses of the select/3 predicate can proceed in parallel, as well as the calls to not_attack/3 and queens/3 which correspond to every solution of select/3: branches continue in parallel after the predicate which forked the execution has finished and into the continuations of the clauses calling such a predicate. This means that each possible selection of a queen will be explored in parallel. In principle, all the branches of the search tree of a program are independent and thus or-parallelism can apparently always be exploited. In practice, dependencies appear due to side effects and solution-order-dependent control structures (such as the "cut"), and must be taken into account. Control constructs that are more appropriate for an or-parallel context have also
:- parallel select/3.

queens(N, Qs) :-
    range(1, N, Ns),
    queens(Ns, [], Qs).

queens([], Qs, Qs).
queens(UnplacedQs, SafeQs, Qs) :-
    select(UnplacedQs, UnplacedQs1, Q),
    not_attack(SafeQs, Q, 1),
    queens(UnplacedQs1, [Q|SafeQs], Qs).

not_attack([], _, _).
not_attack([Y|Ys], X, N) :-
    X =\= Y+N, X =\= Y-N,
    N1 is N+1,
    not_attack(Ys, X, N1).

select([X|Xs], Xs, X).
select([Y|Ys], [Y|Zs], X) :-
    select(Ys, Zs, X).

Logic Languages. Fig. N-Queens annotated for or-parallelism
been proposed. Much work has also been devoted to developing efficient task scheduling algorithms and, in general, or-parallel Prolog systems [, , ]. Or-parallelism frequently arises in applications that explore a large search space via backtracking such as in expert systems, optimization and relaxation problems, certain types of parsing, natural language processing, scheduling, or in deductive database systems.
Data Parallelism and Unification Parallelism Most work in parallel logic languages assumed an underlying MIMD execution model of parallel programs which were initially written for sequential execution. However, several approaches to exploiting SIMD parallelism exist. And-data-parallelism, as in [], tries to detect loops which apply a given operation to all the elements of a data structure in order to unroll them and apply the operation in parallel to every data item in the structure. In order to quickly access the different elements in the data structure, parallel unification algorithms were developed. This kind of data parallelism can be seen as a highly specialized version of independent and-parallelism []. Or-data-parallelism, as in [], changes the standard head unification to keep several environments simultaneously. Unifications of several heads can then be executed in parallel. Combinations of Or- and And-Parallelism Combining or- and and-parallelism is possible, and in principle it should be able to exploit more opportunities for parallel execution. As an example, Fig. shows the Queens program annotated to exploit
and+or-parallelism, by combining the annotations in Figs. and . The efficient implementation of and+or-parallelism is difficult due to the antagonistic requirements of both cases: while and-parallelism needs variables shared between parallel goals to be accessible by independent threads, or-parallelism needs shared variables to have distinct instantiations in different threads. Restricted versions of and+or-parallelism have been implemented: for example, in [], teams of processors execute deterministic dependent and-parallel goals. When a nondeterministic goal is to be executed, the team moves to a phase which executes the goal using or-parallelism. Every branch of the goal is deterministic, from the point of view of the agent executing it, so deterministic dependent and-parallelism can be used again.
Automatic Detection of Parallelism A number of parallelizing compilers have been developed for (constraint) logic programming systems [, , ]. They exemplify a combined approach to exploiting parallelism (Sect. Exploiting Parallelism in Logic Programming): the compiler aims at automatically uncovering parallelism in the program by producing a version annotated with parallel operators, which is then passed to another stage of the compiler to produce lower-level parallel code. At the same time, programmers can introduce parallel annotations in the source program that the parallelizer will respect (and check for correctness, possibly introducing run-time checks for independence). The final parallelized version could always have been directly written by hand, but of course the possibility of being able to automatically parallelize (parts of)
:- parallel select/3.

queens(N, Qs) :-
    range(1, N, Ns),
    queens(Ns, [], Qs).

queens([], Qs, Qs).
queens(UnplacedQs, SafeQs, Qs) :-
    select(UnplacedQs, UnplacedQs1, Q),
    not_attack(SafeQs, Q, 1) &
    queens(UnplacedQs1, [Q|SafeQs], Qs).

not_attack([], _, _).
not_attack([Y|Ys], X, N) :-
    X =\= Y+N, X =\= Y-N,
    N1 is N+1,
    not_attack(Ys, X, N1).

select([X|Xs], Xs, X).
select([Y|Ys], [Y|Zs], X) :-
    select(Ys, Zs, X).
Logic Languages. Fig. N-Queens annotated for and- and or-parallelism
programs initially written for sequential execution is very appealing. The parallelization can leave some independence checks for run time, also combining static and dynamic techniques. Experimental results have shown that this approach can uncover significant amounts of parallelism and achieve good speedups in practice [, ] with minimal human intervention. See the entry Parallelization, Automatic for more details on this topic.
Extensions to Constraint Logic Programming Constraint logic programming extends logic programming with constraint solving capabilities by generalizing unification while preserving the basic syntax and resolution-based semantics of traditional logic programming. These extensions open a large class of new applications and also bring improved semantics for certain aspects of traditional logic programming, such as fully relational support for arithmetic operations. Constraint logic programs also explore a search space and Or-parallelism is directly applicable in this context. Parallelism can also be exploited within the constraint solving process itself in a similar way to unification parallelism. Finally, and-parallelism can also be exploited, and the constraints point of view brings new, interesting generalizations of the concepts of independence.
Concurrency in Logic Programming Concurrent logic programming languages [, ] do not aim primarily at improving execution speed but rather at increasing language expressiveness. As a consequence, their concurrency capabilities remain untouched in single-processor implementations. A strong point of these languages is the capability to express reactive systems. This affects their view of non-determinism: sequential logic programs adopt don’t know non-determinism, where predicate clauses express different ways to achieve a goal, but it is not known which one of these clauses leads to a solution for a given problem instance before their execution successfully finishes. Unsuccessful branches can be discarded (i.e., backtracked over) at any moment and their intermediate results are undone, except for side effects such as input/output or changes to the internal database. These computations are not reactive, as their only “real” result is the substitutions produced after success.
Don't care non-determinism reflects the fact that, among the clauses that are eligible to generate a solution, any of them will do the job, and it does not matter which one is selected. Once a clause is committed to, the computation it performs is visible to the rest of the processes collaborating in the computation – hence the reactiveness of the language – and backtracking cannot make this effect disappear.
Syntax and (Intuitive) Semantics A general, simplified syntax which is adopted in several proposals for concurrent logic languages will be used here. A concurrent logic program is a collection of clauses of the form

Head :- Ask : Tell | Body.

where variables have been abstracted. Their meaning is:
● Head is a literal whose formal arguments are distinct variables.
● Ask, the clause guard, is a conjunction of constraints (equality, disequality, simple tests on variables, and arithmetic predicates in the simplest case). These constraints are used to check the state of the head variables and do not introduce new bindings. If the Ask constraints succeed, the clause can be selected. Otherwise, the clause suspends until it becomes selectable or until another clause of the same predicate is selected to be executed (only one clause of a predicate can be selected to execute a given goal). If all clauses for all pending goals are suspended, the computation is said to deadlock.
● Tell is a conjunction of constraints which, upon clause selection, are to be executed to create new bindings, which are seen inside and outside the clause.
● Body is a conjunction of goals which can be true (in which case it can be omitted) to express the end of a computation, a single literal to express the transformation of a process into another process, or a conjunction of literals to denote spawning of a series of processes interconnected by means of shared variables.
Fig. (taken from []), left, shows an example. A nondeterministic merger accepts inputs from two lists of elements and generates an output list taking elements from both lists while respecting the relative order in them. Clauses are selected first depending on the
merge(In1, In2, Out) :-
    In1 = [X|Ins] : Out = [X|Rest] | merge(Ins, In2, Rest).
merge(In1, In2, Out) :-
    In2 = [X|Ins] : Out = [X|Rest] | merge(In1, Ins, Rest).
merge(In1, In2, Out) :-
    In1 = [] : Out = In2 | true.
merge(In1, In2, Out) :-
    In2 = [] : Out = In1 | true.
use_merge(L) :-
    skip_starting(0, 2, Evens),
    skip_starting(1, 2, Odds),
    merge(Evens, Odds, L).

skip_starting(Base, Skip, List) :-
    List = [Base|Rest] |
    NextBase is Base + Skip,
    skip_starting(NextBase, Skip, Rest).
Logic Languages. Fig. Stream merger (left) and how to use it (right)
availability of elements in the input lists: when there is an element available in some input list (e.g., In1 = [X|Ins]), it is used to construct stepwise the output list (Out = [X|Rest]), and the process is recursively called with new arguments. The use of such a merger is shown in Fig. , right. Predicate use_merge is expected to return in L the list of natural numbers. Three processes are spawned: two of them generate a list of even numbers starting at 0 (resp. odd numbers starting at 1) and the third one merges these lists. The shared variables Evens and Odds act as communicating channels which are read by merge as they are instantiated. A note on syntax: other concurrent languages have adopted a different syntax to distinguish Tell from Ask constraints and, in some cases, to specify input and output modes. For example, the original syntax by Ueda [] uses head unification for Ask constraints, and the first clause of merge would therefore be:

merge([X|Ins], In2, Out) :-
    Out = [X|Rest] | merge(Ins, In2, Rest).

Other languages (e.g., Parlog) include mode declarations to specify which formal arguments are going to be read from or written to, and require the programmer to provide them to the compiler. Another example of the expressiveness of concurrent logic languages (which is in some sense related to that of lazy functional languages) appears in Fig. , which shows a solution [] to the generation of the Hamming sequence: the ordered stream of numbers of the form 2ⁱ 3ʲ 5ᵏ, where i, j, k are natural numbers, without repetition. The standard procedural solution keeps
a pool of candidates to be the next number and outputs the smallest one when it is sure that no smaller candidate can be produced. Fig. assumes that head unifications are Ask guards. The concurrent logic solution starts with a seed for i = j = k = 0 and spawns three multiply processes which generate, from an existing partial stream, streams X2, X3, and X5 whose last element depends on the previous last one. These streams are merged, filtering duplicates, by two ord_merge predicates. The result of the last ord_merge predicate is fed back to the three multiply processes, which can then take a step to generate the next candidate item in the answer stream every time Xs is further instantiated.
Variations on Semantics There are some variability points on the semantics of concurrent logic languages which, in addition to slightly different syntax, cause large differences among the languages. These semantic differences also have an important impact on the associated implementation techniques. It is remarkable that the history of concurrent logic languages was marked by a “simplification” of the semantics, which brought about less expressiveness of the language (such as, for example, eliminating the possibility of generating multiple solutions) and, accordingly, less involved implementations []. This has been partly attributed to such very expressive mechanisms not being needed for the tasks these languages were mainly used for, and in part to the complication involved in their implementation. Other research aimed at including concurrency without giving up multiple solutions includes some concurrent constraint languages such as Oz and AKL, as well as more traditional
ord_merge([X|Xs], [X|Ys], Out) :- Out = [X|Os] | ord_merge(Xs, Ys, Os).
ord_merge([X|Xs], [Y|Ys], Out) :- X < Y : Out = [X|Os] | ord_merge(Xs, [Y|Ys], Os).
ord_merge([X|Xs], [Y|Ys], Out) :- X > Y : Out = [Y|Os] | ord_merge([X|Xs], Ys, Os).
ord_merge([], Ys, Out) :- Out = Ys.
ord_merge(Xs, [], Out) :- Out = Xs.
hamming(Xs) :-
    multiply([1|Xs], 2, X2),
    multiply([1|Xs], 3, X3),
    multiply([1|Xs], 5, X5),
    ord_merge(X2, X3, X23),
    ord_merge(X23, X5, Xs).

multiply(In, N, Out) :- In = [] : Out = [].
multiply(In, N, Out) :- In = [X|Xs] : Out = [Y|Ys], Y is X * N | multiply(Xs, N, Ys).

Logic Languages. Fig. Generation of the Hamming sequence
logic programming systems which include extensions for distributed and concurrent execution (e.g., Ciao, SWI, or Yap).
Deep and Flat Guards
Flat guard languages restrict the Ask guards to simple primitive operations (i.e., unifications, arithmetic predicates, etc.). Deep guards, on the other hand, can invoke arbitrary user predicates. Any bindings generated by deep guards have to be isolated from the rest of the environment (e.g., the caller) in order not to influence other guards which may be suspended or under evaluation. Deep guards are more expressive than flat guards, but the implementation techniques are much more complex. The necessary technology is related to that of the implementation of or-parallel systems, as bindings to variables in deep guards need to be kept in separate environments, just as in the or-parallel execution of logic programs.
Fairness
The question of whether processes which can progress will effectively do so is decomposed into two different concepts, or-fairness and and-fairness, respectively, related to which clause (among those satisfying their guards) and process (among the runnable ones) is selected to execute. Or-fairness can be exemplified with the merger in Fig. . For infinite stream producers, it does not
guarantee that items from any of the input streams will eventually be written onto its output stream. Whether this happens will depend on the concrete semantics under which it is being executed. If this is guaranteed to eventually happen, then the language has or-fairness. And-fairness concerns the selection of which process can progress among the runnable ones. In Fig. , right, the two instances of skip_starting do not need to suspend waiting for any condition. Therefore, it would be possible that one of them runs without interruption, generating a stream of numbers which are consumed by merge, which would forward them to its output list. Had ord_merge been used instead of merge, the stream producers would alternately suspend and resume.
Eventual and Atomic Tells
Tell guards can be executed at once, before the goals in the body, or be separately scheduled. In the former case, Tells are said to be atomic, because all the bindings in the Tell are generated at once. In the latter case, they are said to be eventual, as they are handled according to the and-fairness rules of the language. Atomic Tell makes it possible to write some algorithms more elegantly than eventual Tell. However, eventual Tells make it possible to treat unifications and constraints more homogeneously, and are more adequate for reasoning about the actual properties of distributed implementations of concurrent logic languages.
Related Entries
Functional Languages
Parallelization, Automatic
Prolog Machines
Bibliographic Notes and Further Reading Pointers to a brief selection of the related literature have been provided within the text and are included in the references. For a much more detailed coverage of this very large topic the reader is referred to the comprehensive surveys in [, , ].
Bibliography . Ali K, Karlsson R () The Muse approach to or-parallel Prolog. Int J Parallel Program ():– . Bevemyr J, Lindgren T, Millroth H () Reform Prolog: the language and its implementation. In: Warren DS (ed) Proceedings of tenth international conference on logic programming. MIT Press, Cambridge, pp – . Bueno F, García de la Banda M, Hermenegildo M (March ) Effectiveness of abstract interpretation in automatic parallelization: a case study in logic programming. ACM Trans Program Lang Syst ():– . Conery JS () Parallel execution of logic programs. Kluwer, Norwell . Costa VS, Warren DHD, Yang R () Andorra-I compilation. New Gener Comput ():– . Green C () Theorem proving by resolution as a basis for question-answering systems. In: Machine intelligence, . Gupta G, Pontelli E, Ali K, Carlsson M, Hermenegildo M (July ) Parallel execution of Prolog programs: a survey. ACM Trans Program Lang Syst ():– . Hermenegildo M (December ) Parallelizing irregular and pointer-based computations automatically: perspectives from logic and constraint programming. Parallel Comput (– ):– . Hermenegildo M, Carro M (July ) Relating data-parallelism and (and-) parallelism in logic programs. Comput Lang J (/):– . Hermenegildo M, Greene K () The &-Prolog system: exploiting independent and-parallelism. New Gener Comput (,):– . Kacsuk P () Distributed data driven Prolog abstract machine (DPAM). In: Kacsuk P, Wise MJ (eds) Implementations of distributed Prolog. Wiley, Aachen, pp – . Lloyd JW () Foundations of logic programming, nd extended edn. Springer, New York
. Lusk E, Butler R, Disz T, Olson R, Stevens R, Warren DHD, Calderwood A, Szeredi P, Brand P, Carlsson M, Ciepielewski A, Hausman B, Haridi S () The aurora or-parallel Prolog system. New Gener Comput (/):– . Muthukumar K, Bueno F, García de la Banda M, Hermenegildo M (February ) Automatic compile-time parallelization of logic programs for restricted, goal-level, independent and-parallelism. J Log Program ():– . Shapiro EY (September ) The family of concurrent logic programming languages. ACM Comput Surv ():– . Shen K (November ) Overview of DASWAM: exploitation of dependent and-parallelism. J Log Program (–):– . Smith D () MultiLog and data or-parallelism. J Log Program (–):– . Tick E (May ) The deevolution of concurrent programming languages. J Log Program ():– . Ueda K () Guarded Horn clauses. In: Shapiro EY (ed) Concurrent Prolog: collected papers. MIT Press, Cambridge, pp –
LogP Bandwidth-Latency Model Bandwidth-Latency Models (BSP, LogP)
Loop Blocking Tiling
Loop Nest Parallelization Utpal Banerjee University of California at Irvine, Irvine, CA, USA
Synonyms Parallelization
Definition Loop Nest Parallelization refers to the problem of finding parallelism in a perfect nest of sequential loops. It typically deals with transformations that change the execution order of iterations of such a nest L by creating a different nest L′ , such that L′ is equivalent to L and some of its loops can run in parallel.
Discussion Introduction For a better understanding of the current essay, the reader should first go through the essay Unimodular Transformations in this encyclopedia. The program model is a perfect nest L of m sequential loops. An iteration of L is an instance of the body of L. The program consists of a certain set of iterations that are to be executed in a certain sequential order. This execution order imposes a dependence structure on the set of iterations, based on how they access different memory locations. A loop in L carries a dependence if the dependence between two iterations is due to the unfolding of that loop. A loop can run in parallel if it carries no dependence. In this essay, one considers transformations that change L into a perfect nest L′ with the same set of iterations executing in a different sequential order. (The number of loops may differ from L to L′ .) The new nest L′ is equivalent to the old nest L (i.e., they give the same results), if whenever an iteration depends on another iteration in L, the first iteration is executed after the second in L′ . If a loop in L′ carries no dependence, then that loop can run in parallel. The goal is to transform L into an equivalent nest L ′ , such that some inner and/or outer loops of L′ can run in parallel. Two major loop nest transformations are discussed here: unimodular and echelon. Unimodular transformations have already been introduced in this encyclopedia in the essay under that name. There it is shown by examples how to achieve inner and outer loop parallelization via unimodular transformations. Detailed algorithms for achieving those goals are given in this essay. Echelon transformations are explained after that, and it is pointed out how a combination of both transformations can maximize loop parallelization.
Mathematical Preliminaries Lexicographic order and Fourier’s method of elimination, as explained in the essay Unimodular Transformations, will be used in the following. In this section, some of the elementary definitions and algorithms of matrix theory are stated, customized for integer matrices. From now on, a matrix is an integer matrix, unless explicitly designated to be otherwise.
The transpose of a matrix A is denoted by A′. The m × m unit matrix is denoted by Im. A submatrix of a given matrix A is the matrix obtained by deleting some rows and columns of A. If A is an m × p matrix and B an m × q matrix, then the column-augmented matrix (A; B) is the m × (p + q) matrix whose first p columns are the columns of A and whose last q columns are the columns of B. For example, if A is 3 × 2 and B is 3 × 1, then (A; B) is the 3 × 3 matrix whose first two columns are those of A and whose third column is B.
A row-augmented matrix is defined similarly. The (main) diagonal of an m × n matrix (aij) is the vector (a11, a22, . . . , app), where p = min(m, n). The rank of a matrix A, denoted by rank(A), is the maximum number of linearly independent rows (or columns) of A. For a given m × n matrix A, let ℓi denote the column number of the leading (first nonzero) element of row i. (For a zero row, ℓi is undefined.) Then A is an echelon matrix if, for some integer ρ with 0 ≤ ρ ≤ m, the following holds:
1. The rows 1 through ρ are nonzero rows.
2. The rows ρ + 1 through m are zero rows.
3. For 1 ≤ i ≤ ρ, each element in column ℓi below row i is zero.
4. ℓ1 < ℓ2 < ⋯ < ℓρ.
If there is such a ρ, then ρ = rank(A). A zero matrix is an echelon matrix for which ρ = 0.
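As an illustration (this particular matrix is an example constructed here, not one of the original figures), the following 4 × 5 matrix is an echelon matrix with ρ = 3: its leading elements lie in columns ℓ1 = 1, ℓ2 = 3, ℓ3 = 4, every entry below a leading element in its column is zero, and the last row is a zero row:

    ⎛ 2 −1 3 0  5 ⎞
    ⎜ 0  0 4 1  2 ⎟
    ⎜ 0  0 0 6 −3 ⎟
    ⎝ 0  0 0 0  0 ⎠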
For any given matrix, there are three types of elementary row operations:
1. Reversal: multiply a row by −1.
2. Interchange: interchange two rows.
3. Skewing: add an integer multiple of one row to another row.
The elementary column operations are defined similarly. An elementary matrix is any matrix obtained from a unit matrix by one elementary row operation. Thus, there are three types of elementary matrices. The elementary matrices can also be derived by column operations. For example, the same elementary matrix is obtained from the unit matrix whether one interchanges two given rows or the corresponding two columns. A unimodular matrix is a square integer matrix with determinant 1 or −1. Each elementary matrix is unimodular, and each unimodular matrix can be expressed as the product of a finite number of elementary matrices (Lemma . in []). For a given matrix A, performing an elementary row operation is equivalent to forming a product of the form EA, where E is the corresponding elementary matrix. Applying a finite sequence of row operations to A is the same as forming a product of the form UA, where U is a unimodular matrix. Reducing an m × n matrix A to echelon form means finding an m × m unimodular matrix U and an m × n echelon matrix S, such that UA = S. Figure gives an algorithm that reduces any given matrix A to echelon form. Note that applying the same row operation to the two matrices U and S separately is equivalent
to applying that operation to the column-augmented matrix (U; S). For a given matrix A, one may need to find a unimodular matrix V and an echelon matrix S such that A = VS. This V could be the inverse of the unimodular matrix U of Algorithm 1. However, if U is not needed, Algorithm 1 can be modified to yield directly a unimodular V and an echelon S such that A = VS. This is the algorithm of Fig. . In Algorithm 2, whenever a row operation is applied to S, it is followed by a column operation applied to V. The row operation is equivalent to forming a product of the form ES, where E is an elementary matrix. The column operation that follows is equivalent to forming the product VE⁻¹. Hence, the relation A = VS remains unchanged, since VS = (VE⁻¹)(ES). Each algorithm can be modified by using extra reversal operations, as needed, so that any given rows of the computed echelon matrix are nonnegative.
The program model for all transformations is a perfect nest of loops L = (L1, L2, . . . , Lm) shown in Fig. . The limits pr and qr are integer-valued linear functions of I1, I2, . . . , Ir−1. All distance vectors are assumed to be uniform. If there is no distance vector d with d ≻r 0, then the loop Lr carries no dependence and is able to run in parallel.
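As a concrete illustration of the program model (the loop below is an example constructed here, not one from the original text), the following double nest has the single uniform distance vector (1, −1): iteration (i, j) reads the element written by iteration (i − 1, j + 1). Since (1, −1) ≻1 0 but there is no distance vector d with d ≻2 0, the outer loop carries the dependence and the inner loop could run in parallel.

for (i = 1; i < n; i++)
    for (j = 0; j < n - 1; j++)
        X[i][j] = X[i-1][j+1] + 1;   /* distance vector (1, -1) */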
Unimodular Transformations Let U denote an m×m unimodular matrix. The unimodular transformation of the loop nest L induced by U is
Algorithm 1  Given an m × n integer matrix A, this algorithm finds an m × m unimodular matrix U and an m × n echelon matrix S = (sij), such that UA = S. It starts with U = Im and S = A, and works on the matrix (U; S). Let i0 denote the row number in which the last processed column of S had a nonzero element.
set U ← Im, S ← A, i0 ← 0
do j = 1, n, 1
   if there is at least one nonzero sij with i0 < i ≤ m then
      set i0 ← i0 + 1
      do i = m, i0 + 1, −1
         do while sij ≠ 0
            set σ ← sgn(si−1,j ∗ sij)
                z ← ⌊|si−1,j| / |sij|⌋
            subtract σz times row i from row (i − 1) in (U; S)
            interchange rows i and (i − 1) in (U; S)
Loop Nest Parallelization. Fig. Echelon Reduction Algorithm
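A small worked instance (constructed here for illustration, and one valid factorization of the kind the algorithms compute rather than necessarily their exact output): for the 3 × 2 matrix A below, the unimodular matrix U and the echelon matrix S satisfy UA = S, and V = U⁻¹ gives the factorization A = VS of the kind produced by Algorithm 2; here rank(A) = ρ = 2.

    A = ⎛  1  2 ⎞    U = ⎛ 1 0 0 ⎞    S = ⎛ 1 2 ⎞    V = U⁻¹ = ⎛  1  0 0 ⎞
        ⎜ -2 -1 ⎟        ⎜ 2 1 0 ⎟        ⎜ 0 3 ⎟              ⎜ -2  1 0 ⎟
        ⎝  1 -1 ⎠        ⎝ 1 1 1 ⎠        ⎝ 0 0 ⎠              ⎝  1 -1 1 ⎠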
Algorithm 2  Given an m × n integer matrix A, this algorithm finds an m × m unimodular matrix V and an m × n echelon matrix S = (sij), such that A = VS. It starts with V = Im and S = A, and works on the matrices V and S. Let i0 denote the row number in which the last processed column of S had a nonzero element.
set V ← Im, S ← A, i0 ← 0
do j = 1, n, 1
   if there is at least one nonzero sij with i0 < i ≤ m then
      set i0 ← i0 + 1
      do i = m, i0 + 1, −1
         do while sij ≠ 0
            set σ ← sgn(si−1,j ∗ sij)
                z ← ⌊|si−1,j| / |sij|⌋
            subtract σz times row i from row (i − 1) in S
            add σz times column (i − 1) to column i in V
            interchange rows i and (i − 1) in S
            interchange columns i and (i − 1) in V
Loop Nest Parallelization. Fig. Modified Echelon Reduction Algorithm

L1 :   do I1 = p1, q1
L2 :     do I2 = p2, q2
           . . .
Lm :       do Im = pm, qm
             H(I)

Loop Nest Parallelization. Fig. Loop Nest L

L1 :   do K1 = α1, β1
L2 :     do K2 = α2, β2
           . . .
Lm :       do Km = αm, βm
             H′(K)

Loop Nest Parallelization. Fig. Loop Nest L′
a loop nest L′ of the form shown in Fig. , where K = IU and H′(K) = H(KU⁻¹). The transformed program has the same set of iterations as the original program, executing in a different sequential order. The transformation L ⇒ L′ is valid if and only if dU ≻ 0 for each distance vector d of L. In that case, the vectors dU are precisely the distance vectors of the new nest L′.

Inner Loop Parallelization
Let D denote the set of distance vectors of L. For the transformation L ⇒ L′ to be valid, the matrix U must be so chosen that dU ≻ 0 for each d ∈ D. After a valid transformation by U, the set of distance vectors of L′ is {dU : d ∈ D}. A loop L′r of L′ can run in parallel if there is no d ∈ D satisfying dU ≻r 0. Thus, a unimodular matrix U will transform the loop nest L into an equivalent loop nest L′ where the loops L′2, L′3, . . . , L′m can all run in parallel, if and only if U satisfies the condition

    dU ≻1 0    (d ∈ D).                                   (1)

The following informal algorithm shows in detail that a unimodular matrix with this property can always be constructed.

Algorithm 3  Given a perfect nest L of m loops, as shown in Fig. , with a set of distance vectors D, this algorithm finds an m × m unimodular matrix U = (urt) that transforms L into an equivalent loop nest L′, such that the inner loops L′2, L′3, . . . , L′m of L′ can all run in parallel.
1. If D = ∅, then all loops of L can execute in parallel; take for U the m × m unit matrix and exit. Otherwise, go to step 2.
2. Condition (1) is equivalent to the following system of inequalities

    d1 u1 + d2 u2 + ⋯ + dm um ≥ 1    (d ∈ D),             (2)

where d = (d1, d2, . . . , dm) and the fixed second subscript of ur1 has been dropped for convenience. For each r in 1 ≤ r ≤ m, define the set Dr = {d ∈ D : d ≻r 0}. At least one of these sets is nonempty. Since d1 = d2 = ⋯ = dr−1 = 0 for each d ∈ Dr, one can break up the system (2) into a sequence of subsystems of the following type:

    dm um ≥ 1                                              (d ∈ Dm)
    dm−1 um−1 + dm um ≥ 1                                  (d ∈ Dm−1)
      ⋮
    d1 u1 + d2 u2 + ⋯ + dm−1 um−1 + dm um ≥ 1              (d ∈ D1).

Since dr > 0 for d ∈ Dr, this sequence can be rewritten as

    um ≥ 1/dm                                              (d ∈ Dm)
    um−1 ≥ (1 − dm um)/dm−1                                (d ∈ Dm−1)
      ⋮
    u1 ≥ (1 − d2 u2 − ⋯ − dm um)/d1                        (d ∈ D1).

3. If Dm = ∅, then take um = 0. Otherwise, take um = 1. Suppose um, um−1, . . . , ur+1 have been chosen, where 1 ≤ r < m. If Dr = ∅, then take ur = 0. Otherwise, ur has a set of lower bounds of the form:
    ur ≥ (1 − dr+1 ur+1 − dr+2 ur+2 − ⋯ − dm um)/dr    (d ∈ Dr).
Take for ur the smallest nonnegative integer that would work:
    ur = ⌈ max over d ∈ Dr of {(1 − dr+1 ur+1 − dr+2 ur+2 − ⋯ − dm um)/dr} ⌉.
4. Find s in 1 ≤ s ≤ m such that us is the first nonzero element in the sequence um, um−1, . . . , u1. Note that us = 1. Set U to be the m × m matrix such that row s of U is (1, 0, 0, . . . , 0), its first column is (u1, u2, . . . , um), and the submatrix obtained by deleting row s and the first column of U is the (m − 1) × (m − 1) unit matrix. Since det U = ±1, U is unimodular.
5. Note that the inverse of U is the m × m matrix U⁻¹ such that column s of U⁻¹ is (1, −u1, . . . , −us−1, 0, . . . , 0), row 1 is (0, . . . , 0, 1, 0, . . . , 0) with the 1 in position s, and the submatrix obtained by deleting column s and the first row is the (m − 1) × (m − 1) unit matrix. From the equation (I1, I2, . . . , Im) = (K1, K2, . . . , Km)U⁻¹, it follows that

    I1 = K2
     ⋮
    Is−1 = Ks
    Is = K1 − u1 K2 − ⋯ − us−1 Ks
    Is+1 = Ks+1
     ⋮
    Im = Km.

6. In the system of inequalities pr(I1, . . . , Ir−1) ≤ Ir ≤ qr(I1, . . . , Ir−1) (1 ≤ r ≤ m), substitute for each Ir the corresponding expression in K1, K2, . . . , Km. Get the loop limits αr(K1, . . . , Kr−1) and βr(K1, . . . , Kr−1) of Fig. by
solving this system using Fourier's method of elimination.
7. Now each distance vector d of L satisfies the condition dU ≻1 0. Hence, the transformed nest L′ of Fig. created by this matrix U is equivalent to the original nest L, and the inner loops of L′ can all execute in parallel.
Remark  There exist infinitely many vectors u = (u1, u2, . . . , um) satisfying (2) and gcd(u1, u2, . . . , um) = 1. For each such vector u, there exist infinitely many unimodular matrices with u as the first column. (See Sect. . in [].) Any such matrix will satisfy the goal of Algorithm 3. Among all possible choices for u, an ideal vector is one that minimizes the number of iterations (β1 − α1 + 1) of the outermost loop L′1 that must run sequentially. This poses an optimization problem; see Chap. in [] for details. Since K = IU and (u1, u2, . . . , um) is the first column of U, one has K1 = u1 I1 + u2 I2 + ⋯ + um Im. An equation of the form K1 = c, where c is a constant, represents a hyperplane in the vector space Rm with coordinate axes I1, I2, . . . , Im. The integral values of K1 from α1 to β1 then represent a packet of parallel hyperplanes through the index space of the loop nest L, perpendicular to the vector u. Geometrically speaking, Algorithm 3 finds a sequence of parallel hyperplanes, such that the planes are taken sequentially, and iterations for all index points on each plane are executed in parallel. This algorithm is derived from Lamport's CACM paper []. The name Hyperplane Method, often used to describe it, also comes from [].
Example  To better understand how Algorithm 3 creates the matrix U from a given set of distance vectors D, consider a loop nest L = (L1, L2, L3, L4) with the set: D = {(, , , ), (, , , −), (, , −, ), (, −, , ), (, , , ), (, , −, )}. The only loop that carries no dependence is L . Hence, only L can run in parallel. The goal is to find a unimodular matrix U that will transform L into an equivalent loop nest L′, where all the inner loops can execute in parallel. The first column (u1, u2, u3, u4) of U must satisfy the inequality d1 u1 + d2 u2 + d3 u3 + d4 u4 ≥ 1
for each (d , d , d , d ) ∈ D. After simplification, one gets a set of six inequalities: ⎫ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ( − u + u )/ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ( + u )/ ⎪ ⎪ ⎬ ⎪ ⎪ ⎪ + u − u − u ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ( − u )/ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ( + u − u )/. ⎪ ⎪ ⎭
u ≥ / u ≥ u ≥ u ≥ u ≥ u ≥
For each ur , select the smallest possible nonnegative integral value that will work. So, take u = . Since D = / (i.e., there is no distance vector d with d ≻ ), take u = . The constraints on u are then u ≥ and u ≥ /. Take u = . Finally, the lower bounds on u are given by u ≥ , u ≥ , and u ≥ , so that one takes u = . The unimodular matrix U as constructed in Step of Algorithm is ⎛ ⎜ ⎜ ⎜ ⎜ U=⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝
⎞ ⎟ ⎟ ⎟ ⎟ ⎟. ⎟ ⎟ ⎟ ⎟ ⎟ ⎠
Here s = . Row of U is (, , , ), its first column is (u , u , u , u ) = (, , , ), and the submatrix obtained by deleting the fourth row and the first column is the × unit matrix. The set of distance vectors of the equivalent loop nest L′ = (L′ , L′ , L′ , L′ ) is the set of vectors dU for d ∈ D, that is, the set {(, , , ), (, , , ), (, , , −), (, , −, ), (, , , ), (, , , −)}. None of the inner loops L′ , L′ , L′ carries a dependence, and hence they can all run in parallel.
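To see the construction on a smaller case (an example worked out here, not part of the original text), consider a double loop with distance vectors D = {(1, −1), (0, 1)}. The system (2) is u1 − u2 ≥ 1 and u2 ≥ 1. Here D2 = {(0, 1)}, so u2 = 1; the bound for u1 is then (1 − (−1)·1)/1 = 2, so u1 = 2. Scanning u2, u1 for the first nonzero element gives s = 2 and

    U = ⎛ 2 1 ⎞
        ⎝ 1 0 ⎠

with det U = −1. The transformed distance vectors are (1, −1)U = (1, 1) and (0, 1)U = (1, 0); both are ≻1 0, so the transformation is valid, and since neither is ≻2 0, the inner loop of the transformed nest carries no dependence and can run in parallel.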
Outer Loop Parallelization
For a given m × n matrix A = (aij), the rows are denoted by a1, a2, . . . , am and the columns by a^1, a^2, . . . , a^n. Note that the proof of the following theorem uses both the set D of distance vectors and the distance matrix D of L, while the previous algorithm used only D.
Theorem  Consider a nest L of m loops. Let D denote the distance matrix of L and ρ the rank of D. Then there exists a valid unimodular transformation L ⇒ L′ such that the outermost (m − ρ) loops and the innermost (ρ − 1) loops of L′ can execute in parallel.
Proof  Let D denote the set of distance vectors of L. A unimodular matrix U will induce a valid transformation L ⇒ L′ if and only if dU ≻ 0 for each d ∈ D. After a valid transformation by U, the distance vectors of L′ are the vectors dU, where d ∈ D. The outermost (m − ρ) loops and the innermost (ρ − 1) loops of L′ can execute in parallel, if and only if dU ≻r 0 is false for each d ∈ D when r ≠ m − ρ + 1. Let n = m − ρ. Then each d ∈ D must satisfy dU ≻n+1 0.
The transpose D′ of D has m rows and its columns are the distance vectors of L. By Algorithm 1, find an m × m unimodular matrix V and an echelon matrix S such that VD′ = S. (S and D′ have the same size.) The number of nonzero rows of S is the rank ρ of D, and the number of zero rows is n = m − ρ. Since the lowest n rows of S are zero vectors, it follows that each of the lowest n rows of V is orthogonal to each d ∈ D. Next, find an m-vector u by Algorithm 3 such that du > 0 for each d ∈ D. Let A denote the m × (n + 1) matrix whose columns a^1, a^2, . . . , a^n are the lowest n rows of V, and a^(n+1) = u. Then each d ∈ D satisfies the equations:

    da^1 = 0,  da^2 = 0,  . . . ,  da^n = 0,  da^(n+1) > 0.        (3)

By Algorithm 2, find an m × m unimodular matrix U and an m × (n + 1) echelon matrix

        ⎛ t11 t12 ⋯ t1n t1,n+1   ⎞
        ⎜  0  t22 ⋯ t2n t2,n+1   ⎟
        ⎜  ⋮   ⋮  ⋱   ⋮     ⋮     ⎟
    T = ⎜  0   0  ⋯ tnn tn,n+1   ⎟
        ⎜  0   0  ⋯  0  tn+1,n+1 ⎟
        ⎜  ⋮   ⋮  ⋱   ⋮     ⋮     ⎟
        ⎝  0   0  ⋯  0     0     ⎠

such that A = UT and the diagonal element tn+1,n+1 is nonnegative. The unimodular transformation L ⇒ L′ induced by U satisfies the theorem. After writing the relation A = UT in the form
    (a^1, a^2, . . . , a^n, a^(n+1)) = (u^1, u^2, . . . , u^n, u^(n+1), u^(n+2), . . . , u^m) ⋅ T,
where u^1, . . . , u^m denote the columns of U, it becomes clear that

    a^1     = t11 u^1
    a^2     = t12 u^1 + t22 u^2
     ⋮
    a^n     = t1n u^1 + t2n u^2 + ⋯ + tnn u^n
    a^(n+1) = t1,n+1 u^1 + ⋯ + tn,n+1 u^n + tn+1,n+1 u^(n+1).      (4)

Since a^1, a^2, . . . , a^n form distinct rows of a unimodular matrix V, they are linearly independent. Hence, the diagonal elements t11, t22, . . . , tnn of T must be nonzero. Multiply the equations of (4) by any d ∈ D and use (3) to get

    t11 (du^1)                                                = 0
    t12 (du^1) + t22 (du^2)                                   = 0
     ⋮
    t1n (du^1) + t2n (du^2) + ⋯ + tnn (du^n)                  = 0
    t1,n+1 (du^1) + ⋯ + tn,n+1 (du^n) + tn+1,n+1 (du^(n+1))   > 0.
Since t , t , . . . , tnn are all nonzero and tn+,n+ ≥ , this implies du = , du = , . . . , dun = , dun+ > , that is, dU ≻n+ , for each d ∈ D. This completes the proof. ∎ Example Consider a loop nest L = (L , L , L ) whose distance matrix D and its transpose D′ are given by ⎛ ⎞ ⎜ ⎟ ⎜ ⎟ ⎜ D = ⎜ − ⎟ ⎟ ⎜ ⎟ ⎜ ⎟ ⎝ ⎠
and
⎛ ⎞ ⎜ ⎟ ⎜ ⎟ ′ ⎜ D =⎜ ⎟ ⎟. ⎜ ⎟ ⎜ ⎟ ⎝ − ⎠
Here m = . By Algorithm , find two matrices ⎛ ⎞ ⎜ ⎟ ⎜ ⎟ ⎟ V=⎜ ⎜ − ⎟ ⎜ ⎟ ⎜ ⎟ ⎝ − − ⎠
and
⎛ − ⎞ ⎜ ⎟ ⎜ ⎟ ⎟ S=⎜ ⎜ − ⎟ , ⎜ ⎟ ⎜ ⎟ ⎝ ⎠
such that V is unimodular, S is echelon, and VD′ = S. Then ρ = and n = . Since the bottom row of S is zero, it follows that the bottom row (, −, −) of V is orthogonal to each column of D′ , that is, to each distance vector d of L. This will be the first column of a × matrix A that is being constructed. Next, one needs a vector u such that du > , or equivalently, du ≥ for each distance vector d. The set of inequalities to be satisfied is: ⎫ ⎪ u + u + u ≥ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ u − u ≥ ⎬ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ u + u ≥ ⎪ ⎪ ⎭
⎛ ⎞ ⎜ ⎟ ⎜ ⎟ ⎟ A=⎜ ⎜ − ⎟ . ⎜ ⎟ ⎜ ⎟ ⎝ − ⎠ By Algorithm , find two matrices ⎛ − ⎜ ⎜ U=⎜ ⎜ ⎜ ⎜ ⎝
⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠
and
⎛ − ⎞ ⎜ ⎟ ⎜ ⎟ ⎜ T=⎜ ⎟ ⎟, ⎜ ⎟ ⎜ ⎟ ⎝ ⎠
such that U is unimodular, T is echelon (with a nonnegative diagonal element on second row), and A = UT. Since (, , )U = (, , ), (, , −)U = (, , ), and (, , )U = (, , )
are all positive vectors, the transformation (L , L , L ) ⇒ (L′ , L′ , L′ ) induced by U is valid. The distance vectors of (L′ , L′ , L′ ) are (, , ), (, , ) and (, , ). Since the loops L′ and L′ do not carry a dependence, they can run in parallel.
Echelon Transformation
Some background needs to be prepared before echelon transformations can be defined. Let N denote the number of distance vectors of the loop nest L of Fig. , and D its distance matrix. Apply Algorithm 2 to the N × m matrix D to find an N × N unimodular matrix V and an N × m echelon matrix S = (str) with nonnegative rows, such that D = VS. Let ρ denote the number of positive rows of S, so that ρ = rank(D). The top ρ rows of S are positive rows and the bottom (N − ρ) rows are zero rows. Let Ŝ denote the ρ × m submatrix of S consisting of the positive rows, and V̂ the N × ρ submatrix of V consisting of the leftmost ρ columns. Then, D can be written in the form:

    D = V̂ Ŝ.                                              (5)
or
⎫ ⎪ ⎪ u ≥ + u ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ u ≥ ( − u − u )/ ⎬ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ u ≥ − u . ⎪ ⎭ Algorithm returns the vector u = (, , ). This will be the second column of the matrix A. Thus,
Lemma 1  The rows of V̂ are positive vectors.
Proof  Let v denote any row of V̂. Then vŜ = d, where d is a distance vector. Since d must be positive, v cannot be the zero vector. Hence, it has the form v = (0, . . . , 0, vt, . . . , vρ), where 1 ≤ t ≤ ρ and vt ≠ 0. Let ℓ = ℓt denote the column number of the leading (first nonzero) element on row t of Ŝ. Then stℓ > 0, column ℓ of Ŝ has the form (s1ℓ, s2ℓ, . . . , stℓ, 0, . . . , 0), and each column j for 1 ≤ j < ℓ has the form (s1j, s2j, . . . , st−1,j, 0, . . . , 0). Hence, the product vŜ has the form (0, . . . , 0, vt stℓ, . . . ). Now, vt stℓ ≠ 0 since vt ≠ 0 and stℓ > 0. However, being the leading element of a distance vector, vt stℓ must be positive. This means vt > 0, since stℓ > 0. Thus, v is a positive vector. ◻
Lemma 2  For 1 ≤ t ≤ ρ, let ℓt denote the column number of the leading element on row t of Ŝ. For each I ∈ Z^m, there exists a unique Y ∈ Z^m and a unique K ∈ Z^ρ, such that

    0 ≤ Yℓt ≤ stℓt − 1    (1 ≤ t ≤ ρ)                      (6)

and

    I = Y + KŜ.                                            (7)
Proof  Note that the leading elements stℓt are all positive by construction. Let I = (I1, I2, . . . , Im) ∈ Z^m. Define K = (K1, K2, . . . , Kρ) ∈ Z^ρ by

    K1 = ⌊Iℓ1 / s1ℓ1⌋
    K2 = ⌊(Iℓ2 − s1ℓ2 K1) / s2ℓ2⌋
     ⋮
    Kρ = ⌊(Iℓρ − s1ℓρ K1 − s2ℓρ K2 − ⋯ − sρ−1,ℓρ Kρ−1) / sρℓρ⌋.

After defining K, define Y ∈ Z^m by Y = I − KŜ. Then Y and K are well defined and (7) holds. For any real number x, one has 0 ≤ x − ⌊x⌋ < 1. If I and s are integers with s > 0, then 0 ≤ I/s − ⌊I/s⌋ < 1 implies 0 ≤ I − s⌊I/s⌋ ≤ s − 1. Hence, if I = Y + s⌊I/s⌋, then 0 ≤ Y ≤ s − 1. Since (7) implies

    Iℓ1 = Yℓ1 + s1ℓ1 K1
    Iℓ2 − s1ℓ2 K1 = Yℓ2 + s2ℓ2 K2
     ⋮
    Iℓρ − s1ℓρ K1 − s2ℓρ K2 − ⋯ − sρ−1,ℓρ Kρ−1 = Yℓρ + sρℓρ Kρ,

it follows from the definition of (K1, K2, . . . , Kρ) that Yℓ1, Yℓ2, . . . , Yℓρ satisfy (6). ◻
The index variables I1, I2, . . . , Im of the loop nest of Fig. satisfy the constraints: pr(I1, I2, . . . , Ir−1) ≤ Ir ≤ qr(I1, I2, . . . , Ir−1) (1 ≤ r ≤ m). For I1, I2, . . . , Im in these inequalities, substitute the corresponding expressions in Y1, Y2, . . . , Ym, K1, K2, . . . , Kρ from (7). Add (6) to that system to get a system of inequalities in Y1, Y2, . . . , Ym, K1, K2, . . . , Kρ. Apply Fourier's method to the combined system and eliminate the variables
    Kρ, Kρ−1, . . . , K1, Ym, Ym−1, . . . , Y1
(in this order) to get bounds of the form:

    α1 ≤ Y1 ≤ β1
    α2(Y1) ≤ Y2 ≤ β2(Y1)
     ⋮
    αm(Y1, Y2, . . . , Ym−1) ≤ Ym ≤ βm(Y1, Y2, . . . , Ym−1)
    αm+1(Y) ≤ K1 ≤ βm+1(Y)
    αm+2(Y, K1) ≤ K2 ≤ βm+2(Y, K1)
     ⋮
    αm+ρ(Y, K1, K2, . . . , Kρ−1) ≤ Kρ ≤ βm+ρ(Y, K1, K2, . . . , Kρ−1),

where Y = (Y1, Y2, . . . , Ym). For each index point I of L, there is a unique (m + ρ)-vector (Y, K) satisfying these constraints, and conversely. The nest L′ of (m + ρ) loops shown in Fig. , where I = Y + KŜ and H′(Y, K) = H(I), has the same set of iterations as L, but the order of execution is different. The transformation L ⇒ L′ is called the echelon transformation of the loop nest L. Two iterations H(i) and H(j) in L become iterations H′(y, k) and H′(z, l), respectively, in L′, where i = y + kŜ and j = z + lŜ. The transformed nest L′ is equivalent to the original nest L, if whenever H(j) depends on H(i) in L, H′(y, k) precedes H′(z, l) in L′, that is, (y, k) ≺ (z, l).
Theorem  The loop nest L′ obtained from a nest L by echelon transformation is equivalent to L, and the m outermost loops of L′ can run in parallel.
Proof  To prove the equivalence of L′ to L, consider two iterations H(i) and H(j) of L, such that H(j) depends on H(i). Let (y, k) denote the value of (Y, K) corresponding to the value i of I, and (z, l) the value corresponding to j. Since d = j − i is a distance vector of L, it follows from (5) that d = vŜ for some row v of V̂.
L1 :     do Y1 = α1, β1
           . . .
Lm :       do Ym = αm, βm
Lm+1 :       do K1 = αm+1, βm+1
               . . .
Lm+ρ :         do Kρ = αm+ρ, βm+ρ
                 H′(Y, K)
Loop Nest Parallelization. Fig. Echelon Transformation
Then j = i + d = y + kŜ + vŜ = y + (k + v)Ŝ. Since y satisfies (6), (y, k + v) is the image of j under the mapping I ↦ (Y, K). But the image of j is also (z, l) by assumption. Since this mapping is well defined, it follows that z = y and l = k + v. Hence, (z, l) − (y, k) = (z − y, l − k) = (0, v). Since v is positive by Lemma 1, this implies (y, k) ≺ (z, l). Hence, L′ is equivalent to L by definition. It is clear from the above discussion that the distance vectors of the nest L′ are of the form (0, v), where v is a row of V̂. This means the m outermost Y-loops carry no dependence, and hence they can run in parallel. ◻
Corollary  In L′, the distance matrix of the innermost nest of K-loops is V̂.
Example  Consider the loop nest L:
L1 :
L2 :
L3 :
⎛ ⎜ ⎜ D=⎜ ⎜ ⎜ ⎜ ⎝ −
⎞ ⎟ ⎟ ⎟. ⎟ ⎟ ⎟ ⎠
⎛ − ⎞ ⎟ , () (I , I , I ) = (Y , Y , Y ) + (K , K )⋅ ⎜ ⎜ ⎟ ⎝ ⎠ and the constraints ≤ Y ≤
⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠
and
⎛ − ⎞ ⎜ ⎟ ⎜ ⎟ ⎜ S=⎜ ⎟ ⎟, ⎜ ⎟ ⎜ ⎟ ⎝ ⎠
such that V is unimodular, S is echelon, and D = VS. The leading elements of nonzero rows of S are positive. It is clear that ρ = rank(D) = rank(S) = . The submatrix ̂ S of S consisting of its nonzero rows is given by ⎛ − ̂ S=⎜ ⎜ ⎝
⎞ ⎟. ⎟ ⎠
and
≤ Y ≤ .
()
Equation () is equivalent to the system: ⎫ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ = Y − K + K ⎬ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ = Y . ⎪ ⎭
I = Y + K I I
()
Substituting for I , I , I from () into the constraints defining the loop limits of L, one gets the following set of inequalities: ≤ Y + K ≤ Y − K + K Y + K ≤ Y
⎫ ⎪ ≤ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ≤ ⎬ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ≤ . ⎪ ⎪ ⎭
L ()
Eliminate the variables K , K , Y , Y , Y from () and () by Fourier’s method: ⎫ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ min( − Y , Y − Y ) ⎪ ⎪ ⎪ ⎪ ⎪ ⎬ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ . ⎭
⌈( − Y + K )/⌉ ≤ K ≤ ⌊( − Y + K )/⌋ − Y
By Algorithm , find two matrices ⎛ ⎜ ⎜ V=⎜ ⎜ ⎜ ⎜ ⎝
Here sℓ = and sℓ = . Define a mapping (I , I , I ) ↦ (Y , Y , Y , K , K ) of the index space of L into Z by the equation
do I = , do I = , do I = I , X(I , I , I ) = X(I − , I − , I ) + X(I − , I − , I ) + X(I − , I + , I )
whose distance matrix is
L
≤ K
≤
≤ Y
≤
≤ Y
≤
≤ Y
≤
Using the fact that Y = , simplify expressions and get the following loop nest L′ equivalent to L: L : L : L : L :
do Y = , do Y = , do K = , min(, Y ) do K = ⌈( − Y + K )/⌉, ⌊( − Y + K )/⌋ X(K , Y − K + K , Y ) = X(K − , Y − K + K − , Y ) + X(K − , Y − K + K − , Y ) + X(K − , Y − K + K + , Y )
In L′ , the Y and the Y loops can run in parallel. (The Y -loop with a single iteration has been omitted.) The distance matrix of the nest of K-loops is the submatrix consisting of the two leftmost columns of V, that is, the matrix: ⎛ ⎞ ⎜ ⎟ ⎜ ⎟ ⎟ ̂=⎜ V ⎜ ⎟. ⎜ ⎟ ⎜ ⎟ ⎝ ⎠ Since the K -loop carries no dependence, it can run in parallel. Remark When ρ = m, no valid unimodular transformation can give a parallel outer loop (see Corollary to Theorem . in []). A simple example would be a nest of two loops with two distance vectors (, ) and (, ), for which ρ = = m. However, echelon transformation of this nest will yield a parallel Y -loop with iterations and a parallel Y -loop with iterations. If both transformations are available, the best strategy would be to apply the echelon transformation first to get a nest of m parallel outermost Y-loops and ρ sequential innermost K-loops. Then apply Algorithm to the nest of K-loops to get one sequential outermost loop followed by (ρ − ) parallel inner loops.
Related Entries
Code Generation
Parallelism Detection in Nested Loops, Optimal
Parallelization, Automatic
Unimodular Transformations
Bibliographic Notes and Further Reading Unimodular Transformation. Research on this topic goes back to Leslie Lamport’s paper in [], although he did not use this particular term. This essay is based on the theory of unimodular transformations as developed in the author’s books on loop transformations [, ]. The general theory grew out of the theory of unimodular transformations of double loops described in []. (Watch for some notational differences between these references and the current essay.) The paper by Michael Wolf and Monica Lam [] covers many useful aspects of unimodular transformations. See also
Michael Dowling [], Erik H. D’Hollander [], and François Irigoin and Rémi Triolet [, ]. Consider a sequential loop nest where each iteration (other than the first) depends on the previous one. The Hyperplane method would still transform it into a nest where the outermost loop is sequential and all inner loops are parallel. There is no great surprise here. It simply means that each inner loop in the new nest will have a single iteration. (Such a loop can run in parallel according to the definition given.) See [] for the case when there are too many distance vectors. Echelon Transformation. In his PhD thesis [], David Padua developed the greatest common divisor method for finding a partition of the index space. Peir and Cytron [] described a partition based on minimum dependence distances. Shang and Fortes [] offered an algorithm for finding a maximal independent partition. D’Hollander’s paper [] builds on and incorporates these approaches. The general theory of echelon transformations presented here is influenced by these works. See also Polychronopoulos []. See [] for details on unimodular, echelon, and other transformations used in loop nest parallelization, and more references.
Bibliography . Banerjee U () Unimodular transformations of double loops. In: Proceedings of the third workshop on languages and compilers for parallel computing, Irvine, – August . Available as Nicolau A, Gelernter D, Gross T, Padua D (eds) () Advances in languages and compilers for parallel computing. MIT, Cambridge, pp – . Banerjee U () Loop transformations for restructuring compilers: the foundations. Kluwer Academic, Norwell . Banerjee U () Loop transformations for restructuring compilers: loop parallelization. Kluwer Academic, Norwell . D’Hollander EH (July ) Partitioning and labeling of loops by unimodular transformations. IEEE Trans Parallel Distrib Syst ():– . Dowling ML (Dec ) Optimal code parallelization using unimodular transformations. Parallel Comput (–):– . Irigoin F, Triolet R (Jan ) Supernode partitioning. In: Proceedings of th annual ACM SIGACT-SIGPLAN symposium on principles of programming languages, San Diego, pp – . Irigoin F, Triolet R () Dependence approximation and global parallel code generation for nested loops. Parallel Distrib Algorithms (Cosnard M et al. eds), Elsivier (North-Holland), New York, pp –
. Lamport L (Feb ) The parallel execution of DO loops. Commun ACM ():– . Padua DA (Nov ) Multiprocessors: discussions of some theoretical and practical problems. PhD thesis, Report , Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana . Peir J-K, Cytron R (Aug ) Minimum distance: a method for partitioning recurrences for multiprocessors. IEEE Trans Comput C-():– . Polychronopoulos CD (Aug ) Compiler optimizations for enhancing parallelism and their impact on architecture design. IEEE Trans Comput C-():– . Shang W, Fortes JAB (June ) Time optimal linear schedules for algorithms with uniform dependences. IEEE Trans Comput ():– . Wolf ME, Lam MS (Oct ) A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans Parallel Distrib Syst ():–
Loop Tiling Tiling
Loops, Parallel Roland Wismüller University of Siegen, Siegen, Germany
Synonyms
Doall loops; Forall loops

Definition
Parallel loops are one of the most widely used concepts to express parallelism in parallel languages and libraries. In general, a parallel loop is a loop whose iterations are executed at least partially concurrently by several threads or processes. There are several different kinds of parallel loops with different semantics, which will be discussed below. The most prominent kind is the Doall loop, where the iterations are completely independent.

Discussion
Introduction
In a task parallel program, where the execution of different pieces of code is distributed to parallel processors, there are two principal ways of specifying parallel activities. The first one is to specify several different code regions, i.e., tasks, which should be executed in parallel. This construct is typically called parallel regions or parallel case. It offers only a very limited scalability, since the maximum degree of parallelism is determined by the number of regions. The second way is to use parallel loops. In a parallel loop, the parallel processors execute the same code region, namely, the loop body, but with different data. Thus, parallel loops are a special kind of SPMD programming. Typically, parallel loops are used within a shared memory programming model, for example, OpenMP and Intel's Threading Building Blocks. However, they have also been included in some distributed memory programming models, for example, Occam. In a sequential loop, the loop body is executed for each element in the loop's index domain in a fixed order implied by the domain. For example, in
for (i=0; i < n; i++) { ... }
the iterations are executed one after the other, in the order i = 0, 1, . . . , n−1. A parallel loop relaxes this ordering constraint, so that different iterations may be executed concurrently.
Types of Parallel Loops Today’s parallel programming interfaces offer different constructs for parallel loops, which typically fall into one of the following classes which are described by Polychronopoulos [] and Wolfe []: First, there are parallel loops which can be viewed as sequential loops extended by additional properties. A Doall loop assures that its iterations can be executed completely independently, while a Doacross loop may contain forward dependences. These loops can also be executed sequentially without changing their semantics. In addition, some parallel languages offer a Forall loop; however, the term is used with a nonuniform meaning. For example, the Forall loop in Vienna Fortran [] is actually a Doall loop, while in High Performance Fortran (HPF)
and Fortran, it has a special nonsequential execution semantics, which is described below.
Doall Loops
In a Doall loop, every iteration can be executed independently from any other iteration. This means that the iterations can be executed in any sequential order or concurrently. More formally, a Doall loop is not allowed to contain loop carried dependences. This means that when W(i) denotes the set of variables the loop's body writes to in some iteration i, and R(i) denotes the set of variables read in iteration i, the following conditions, often called Bernstein's conditions, must hold for any two different iterations i1 and i2 of the loop:
W(i1) ∩ W(i2) = ∅,  R(i1) ∩ W(i2) = ∅,  and  W(i1) ∩ R(i2) = ∅.
In this context, the term variable means either a scalar variable or an element of an array or some other structured data type. As an example, consider the following loops:
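The first loop below satisfies these conditions, since each iteration i writes only a[i] and reads only b[i] and c[i]; the second one violates them, because iteration i reads the element a[i-1] written by iteration i−1 (the concrete statements here are only an illustration):

for (i = 0; i < n; i++)
    a[i] = b[i] + c[i];      /* independent iterations: a valid Doall loop */

for (i = 1; i < n; i++)
    a[i] = a[i-1] + b[i];    /* loop-carried dependence: not a valid Doall loop */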
When a Doall loop is to be executed by a fixed number of processors (or threads), the iterations can be arbitrarily scheduled to the processors without changing the loop's computational results. However, the scheduling can have a large impact on the performance. With today's cache-based memory systems, it is beneficial if each processor executes a contiguous block of loop iterations, since this will maximize the locality of the memory accesses on each processor. For example, when a loop iterating from 1 to 100 is executed by four processors, the first one would execute iterations 1–25, the second one 26–50, and so on. This situation is visualized in Fig. . The blockwise scheduling works well, however, only when all iterations have approximately the same execution time. When the execution time differs largely between iterations, blockwise scheduling can lead to a poor performance. A typical example for this situation is a triangular loop nest, where the outer loop is parallel:
for (i=1; i<=100; i++)
    for (j=1; j<=i; j++)
        a[i][j] = some_computation(i,j);
With a blockwise scheduling, processor 4 gets the largest amount of work. Since there is typically a global synchronization (barrier) between all processors at the end of a parallel loop, the other three processors will have to wait, which results in a load imbalance. In such cases, a cyclic schedule, where single iterations are assigned to the processors in a round-robin fashion, offers a better load balancing, as is shown in Fig. . However, the cache performance of such an execution may be poor. As a compromise between cache performance and load balancing, blocks of iterations may be assigned to the processors in a cyclic manner.
Loops, Parallel
Loops, Parallel. Fig. Blockwise and cyclic schedule for a triangular loop (a) blockwise schedule, (b) cyclic Schedule
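The difference between the two static schedules can be seen in the index arithmetic each thread applies. In the following sketch, n is the number of iterations, nthreads and tid (with 0 ≤ tid < nthreads) are assumed to be provided by the runtime, and body(i) stands for the loop body:

/* Blockwise schedule: thread tid executes one contiguous chunk of iterations. */
int chunk = (n + nthreads - 1) / nthreads;
int lo = tid * chunk;
int hi = (lo + chunk < n) ? lo + chunk : n;
for (int i = lo; i < hi; i++)
    body(i);

/* Cyclic schedule: thread tid executes iterations tid, tid+nthreads, tid+2*nthreads, ... */
for (int i = tid; i < n; i += nthreads)
    body(i);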
In addition to the static scheduling discussed above, dynamic scheduling may be used when the iterations’ execution times vary irregularly. In this case, each processor fetches an iteration from a global work pool, executes it, and then fetches the next one, until the work pool is empty. Dynamic scheduling achieves very good load balancing at the price of the additional overhead for the synchronized access to the global work pool. Like in the static case, processors may fetch contiguous blocks of iterations from the pool in order to increase the cache performance and to decrease the overhead.
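A dynamic schedule can be sketched with a shared counter from which every thread atomically fetches the next block of iterations; in this sketch, body() is assumed to be the loop body and the atomic fetch-and-add is written with a GCC builtin:

int next_index = 0;                      /* shared work pool: next unassigned iteration */

void worker(int n, int block) {
    for (;;) {
        /* atomically reserve the next block of iterations */
        int lo = __sync_fetch_and_add(&next_index, block);
        if (lo >= n)
            break;                       /* work pool is empty */
        int hi = (lo + block < n) ? lo + block : n;
        for (int i = lo; i < hi; i++)
            body(i);
    }
}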
Doacross Loops
A Doacross loop is a parallel loop, where individual iterations are scheduled to the processors in a round-robin fashion, either dynamically or statically. The iterations are started in the same order as in a sequential loop and then execute overlappingly in a pipelined fashion. This allows Doacross loops to possess forward dependences between iterations. The following example shows a best case for a Doacross loop:
for (i=1; i<n; i++) {
    a[i] = f(i);
    b[i] = a[i] - a[i-1];
}
Each iteration reads a[i-1], which is produced at the very beginning of the previous iteration, so consecutive iterations can overlap almost completely (Fig. ). The following loop, in contrast, shows a worst case:
for (i=1; i<n; i++) {
    a[i] = b[i-1];
    b[i] = f(i);
}
Here, the first statement of iteration i needs the value b[i-1], which is computed only by the last statement of iteration i−1, so the iterations cannot overlap at all and the loop effectively executes sequentially (Fig. ).
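One way to realize the pipelined execution is to associate a flag with every iteration that is posted as soon as the value needed by the successor has been produced. The sketch below does this for the best-case loop above, using busy waiting and a GCC memory barrier; N, f(), and the initialization of a[0] and done[0] are assumed:

#define N 1000
double a[N], b[N];
double f(int i);                     /* assumed to be defined elsewhere             */
volatile int done[N];                /* done[i] == 1 means a[i] has been written;
                                        done[0] and a[0] are preset before the loop */

void doacross_iteration(int i) {     /* executed for i = 1 .. N-1, round-robin      */
    a[i] = f(i);                     /* produce the value the next iteration needs  */
    __sync_synchronize();            /* make the write visible before posting       */
    done[i] = 1;                     /* post                                        */
    while (!done[i-1])               /* wait for the value of iteration i-1         */
        ;                            /* busy wait; a real runtime would block       */
    __sync_synchronize();
    b[i] = a[i] - a[i-1];
}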
Forall Loops The Fortran D language [] introduced a parallel loop, called FORALL, which deals with dependences in a special way: A read operation on a variable will return a new value only if that value was assigned in the current loop iteration, otherwise, it returns the old value of the variable at the time the loop was entered. Thus, conceptually each iteration works on its own copy of the data space. In High Performance Fortran (HPF) [], this construct was inherited with some modifications under the
Loops, Parallel. Fig. Best case execution of a Doacross loop: full overlap
Loops, Parallel. Fig. Worst case execution of a Doacross loop: no overlap
name FORALL construct, which was later included into the Fortran language standard. The statements inside a FORALL construct are executed in their sequential order; however, each statement is executed for all elements of the index domain by first performing all read operations and only afterward doing the write operations. For example, in the construct
FORALL ( i=2:n-1 )
    a(i) = a(i-1) + a(i+1)
    b(i) = a(i)
END FORALL
first the expression a(i-1) + a(i+1) will be evaluated for all values of i, and only afterward all the assignments to a(i) will occur. Then, the second statement is executed in a similar way to copy a into b. Thus, the FORALL construct in HPF and Fortran actually is rather a convenient way to express a sequence of vectorizable and/or parallelizable array assignments than a (parallel) loop.
Variables in Parallel Loops In a shared memory environment, variables in a parallel loop may either be shared between all threads executing the loop or they may be private, which means that each thread has its own copy of the variable. Typically, the data the loop is working on is shared, while the loop variables of the parallel loop and all its inner loops must be private. Private variables can also be used to compute thread local intermediate results inside a loop, which are combined only after the loop terminated.
A special case of this situation is a reduction variable. A reduction variable is a variable v such that each assignment in the loop has the form v = v ⊕ ⋯ , with ⊕ being an associative operator, and v is not otherwise read in the loop. A typical example is the computation of the integral of a function:
double s = 0;
for (i=0; i < n; i++)
    s = s + h * f(a + i*h);   /* rectangle rule; illustrative completion */
Although s is both read and written in every iteration, the loop still yields a deterministic result when executed in parallel, as long as the updates of s are performed mutually exclusively, because the order of the additions does not matter (apart from rounding). Figure shows three ways of executing such a loop in parallel: as a Doacross loop, where the updates of s happen in the sequential order; as a Doall loop, where each update of s is protected by mutual exclusion; and as a Doall loop where every processor first accumulates a private partial sum, which is added to the shared variable s only once, at the end of the loop. The last variant avoids most of the synchronization overhead. Most parallel programming models therefore allow the declaration of reduction variables and operations directly for
Loops, Parallel. Fig. Three ways of implementing a parallel reduction (a) Doacross loop, (b) Doall loop, (c) Doall loop with local summation
parallel loops and transparently introduce a local reduction variable as outlined above. A code example using OpenMP is given in the next section.
Parallel Loops in OpenMP OpenMP is a parallel programming model for shared memory computers, which extends traditional sequential languages with directives controlling the parallel execution. The main construct is the parallel region, which indicates that the region’s code should be executed by a prespecified number of threads. Inside such a parallel region, loops may be designated as parallel loops, resulting in the loop iterations being distributed across the available threads. In C and C++, this is done using the directive #pragma omp for when the loop is already contained in a parallel region, or the combined directive #pragma omp parallel for which additionally creates a parallel region around the loop. These directives must be placed immediately before the loop. OpenMP puts a couple of restrictions on a parallel loop: it must be a for-loop with an
integer loop variable, constant increment, and a simple termination test. The loop must not exit prematurely, for example, via a break or return statement or an exception. In addition, the loop body must neither modify the loop variable nor the loop bounds. The detailed behavior of the loop can be controlled by a number of optional clauses, which may be appended to the directive: ●
The schedule clause controls how loop iterations are scheduled to threads. The default behavior is a static blockwise schedule as shown in Fig. . Using the clause schedule(static,size ), a fixed size for the blocks may be specified. The blocks then are assigned to threads in a round-robin fashion. Thus, for example, schedule(static,1) results in a cyclic schedule like in Fig. b. It is also possible to specify a dynamic scheduling, again with an optional block size. The schedule type guided uses dynamic scheduling with exponentially decreasing block sizes, which can help to improve the load balancing while keeping the overhead reasonably. Finally, the definition of the schedule type can be postponed to the runtime, in order to quickly experiment with different schedules.
●
Several other clauses concern the variables used inside the loop. The shared and private clauses explicitly define variables as being shared between threads or being thread private. The loop variable and all local variables declared inside the loop body are always private. Variables declared outside the loop are shared by default, although this behavior can be changed by using the default clause. The value of a private variable is undefined at the beginning of the loop and is lost at its end. The clauses firstprivate and lastprivate allow the initialization and finalization of the value, respectively. The reduction clause is used to specify a reduction variable together with the operation used in the reduction. The reduction will then be implemented as shown in Fig. c. ● Finally, two clauses control the synchronisation. By default, at the end of each parallel loop, all threads synchronize using a barrier. The nowait clause instructs OpenMP to omit this synchronization. The ordered clause indicates that the loop contains an ordered directive, which is discussed below. OpenMP executes parallel loops as Doall loops. However, OpenMP does not check whether the iterations are independent, thus, parallel loops in OpenMP may actually contain dependences. It is the responsibility of the programmer to ensure that the results of the parallel execution are correct. A means to achieve this is the use of synchronization constructs inside the loop. For mutually exclusive access to shared variables, OpenMP provides the omp critical and omp atomic directives, where the latter is an optimized form for statements like x += expr . Another synchronization directive is omp ordered, which ensures that the statement(s) controlled by the directive are executed in the same order as in the sequential loop. This directive can only be used in loops having an ordered clause and roughly results in the loop behaving like a Doacross loop. Such a loop should always use a (static,1) schedule to achieve a reasonable performance. In order to illustrate the usage of the mentioned directives, the following code examples show how the
parallel reduction outlined before can be implemented using OpenMP.
(a) As Doacross loop:
double s = 0;
#pragma omp parallel for \
    ordered schedule(static,1)
for (i=0; i < n; i++) {
    double v = h * f(a + i*h);   /* the expensive part can overlap between iterations */
    #pragma omp ordered
    s += v;                      /* updates of s happen in sequential order           */
}
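The other two variants of the figure could be sketched along the following lines (these are minimal sketches rather than the original code; n, h, a, and f() are taken from the integration example above):
(b) As Doall loop with mutually exclusive updates of s:
double s = 0;
#pragma omp parallel for
for (i=0; i < n; i++) {
    double v = h * f(a + i*h);
    #pragma omp atomic
    s += v;                      /* updates may occur in any order */
}
(c) As Doall loop with local summation, using the reduction clause:
double s = 0;
#pragma omp parallel for reduction(+:s)
for (i=0; i < n; i++)
    s += h * f(a + i*h);         /* each thread accumulates a private copy of s */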
Parallel Loops in Intel(R) Threading Building Blocks The Threading Building Blocks (TBB) are a template library for ISO C++, which includes a variety of constructs supporting the parallel programming with shared memory. Among these constructs are two different kinds of Doall loops, as well as a reduction operation. Since TBB is just a library and does not require any compiler support, parallel loops must be expressed by calling a library function which receives the loop body in the form of a functor, that is, a function object. For example, the body of the Doall loop shown in the beginning of this entry can be expressed in a class like this: class Body1 { double *a; double *b; public: void operator()(const tbb:: blocked_range
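<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); ++i)
            a[i] = a[i] + b[i];              /* illustrative loop body; the entry's
                                                introductory Doall loop is assumed here */
    }
    Body1(double *a_, double *b_) : a(a_), b(b_) {}   /* constructor (sketch) */
};
The parallel loop itself is then started by a call such as
tbb::parallel_for(tbb::blocked_range<size_t>(0, n), Body1(a, b));
(a sketch of the usual call pattern), which recursively splits the index range [0, n) into subranges; for each subrange, the ()-operator of the body object is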
L
invoked. The execution of each subrange forms a task, which is dynamically scheduled to one of the available worker threads. By default, their number is one less than the number of CPU cores, thus reserving one core for the manager thread. The way how the index range is partitioned can be controlled by passing a partitioner as an optional argument to the parallel_for function. The default partitioner uses a heuristics to determine a suitable size of the subranges. It can be finetuned using an optional parameter in the constructor of blocked_range, which allows to specify a minimum size for the subranges. In order to achieve a good speedup, it should be chosen such that the ()-operator executes at least some , instructions. For parallel loops which are executed repeatedly, TBB provides an additional partitioner that tries to improve cache affinity. A second kind of parallel loop construct available in TBB implements loops based on C++ iterators. Iterators are extensively used in C++ class libraries in order to iterate over all elements of a collection, for example, a linked list. Although the parallel_do loop must usually fetch the work items serially, since most iterators only support a strictly sequential traversal of a collection, the processing of these work items in the loop body may be initiated in an arbitrary order. Thus, also parallel_do implements a Doall loop where all iterations must be independent. However, the body of a parallel_do is allowed to extend the iteration space by inserting items into the collection. This is illustrated in the following example: class Body2 { public: void operator()(Item& item, tbb:: parallel_do_feeder
tbb::parallel_do(l.begin(), l.end(), Body2());

The loop iterates over all elements of a linked list and invokes the function processItem on each of these elements. The processing of an item may result in a new item being generated, which is added to the collection via a parallel_do_feeder object passed as an optional argument to the ()-operator. Since TBB makes extensive use of templates, the items may have an arbitrary data type.

Besides independent loops, TBB allows one to implement parallel loops with reductions by using the parallel_reduce function. Reduction loops are executed by recursively splitting the index range into subranges, computing the local values for the leaf subranges, and then recursively combining these local results into the global one. Since the initialization of the local result and the combining operation depend on the concrete loop, the class defining the loop's body must provide an additional constructor and a join operation. The following code implements the integration used as an example before:

class Sum {
public:
  double sum;
  void operator()(const tbb::blocked_range
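A minimal sketch of such a reduction body is given below; the integrand f and the interval width h stand in for the integration example referred to above and are illustrative assumptions, not the entry's original code.

#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"

double f(double x);   // placeholder for the integrand

class Sum {
  double h;           // width of one integration interval
public:
  double sum;
  Sum(double h_) : h(h_), sum(0) {}
  // Splitting constructor: creates a body for a stolen subrange.
  Sum(Sum& s, tbb::split) : h(s.h), sum(0) {}
  // Accumulate the local result for one subrange.
  void operator()(const tbb::blocked_range<size_t>& r) {
    for (size_t i = r.begin(); i != r.end(); ++i)
      sum += f((i + 0.5) * h) * h;
  }
  // Combine the result of a finished subrange into this one.
  void join(const Sum& other) { sum += other.sum; }
};

double integrate(size_t n, double h) {
  Sum s(h);
  tbb::parallel_reduce(tbb::blocked_range<size_t>(0, n), s);
  return s.sum;
}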
The dummy argument of type tbb::split is used to distinguish the splitting constructor from a copy constructor.
Related Entries
Array Languages
Dependence Analysis
HPF (High Performance Fortran)
Intel Threading Building Blocks (TBB)
Loop Nest Parallelization
Metrics
OpenMP
Parallelization, Automatic
Reduce and Scan
SPMD Computational Model
Bibliographic Notes and Further Reading
In the s, many parallel and vector supercomputers, for example, Cray X-MP or Alliant FX/, started to offer parallel loops as a programming construct, which was implemented by their proprietary compilers. Karp [] provides an overview of the state of the art of parallel programming at that time, including a discussion of parallel loops. Since parallel loops are typically supported by a compiler, either by the use of directives as in OpenMP, or by autoparallelization, in-depth discussions can be found in books on parallelizing compilers. Wolfe presents a very good overview of data dependences and techniques for loop parallelization in []. In [], Polychronopoulos discusses different methods for scheduling parallel loops in detail. Since Fortran is the predominant language in the area of numerical computations, there have been many research efforts to integrate parallel loops into this language, most of them addressing distributed memory parallel computers. The group around Ken Kennedy at Rice University developed Fortran D [], while in parallel Hans Zima and his coworkers devised Vienna Fortran []. Both languages influenced the High Performance Fortran language, whose first specification appeared as a special issue of the Scientific Computing journal []. Programming with OpenMP is discussed in ample detail in several books [, , ]. There is also a textbook [] about the Intel Threading Building Blocks, which
contains many examples. Additional information about OpenMP and TBB can be found in the related entries of this encyclopedia.
Bibliography
. Chandra R et al () Parallel programming in OpenMP. Academic Press, San Diego
. Chapman B, Jost G, van der Pas R () Using OpenMP – portable shared memory parallel programming. MIT Press, Cambridge, MA
. Chapman BM, Mehrotra P, Zima HP () Programming in Vienna Fortran. Sci Prog ():–
. High Performance Fortran Forum () Section : Data parallel statements and directives. Sci Comput (–):–
. Hiranandani S, Kennedy K, Koelbel C, Kremer U, Tseng C-W (Aug ) An overview of the Fortran D programming system. Languages and compilers for parallel computing. Lecture Notes in Computer Science, vol . Springer, Berlin/Heidelberg, pp –
. Karp AH () Programming for parallelism. Computer ():–
. Polychronopoulos CD () Parallel programming and compilers. Kluwer Academic, Norwell
. Reinders J () Intel threading building blocks. O’Reilly, Sebastopol, CA
. Wilkinson B, Allen M () Parallel programming – techniques and applications using networked workstations and parallel computers, 2nd edn. Pearson Education, Prentice-Hall, Englewood Cliffs
. Wolfe M () Optimizing supercompilers for supercomputers. Pitman, London
LU Factorization
Dense Linear System Solvers
Sparse Direct Methods
M

Manycore
Green Flash: Climate Machine (LBNL)
MapReduce
Massive-Scale Analytics
MasPar

Synonyms
SIMD (Single Instruction, Multiple Data) Machines

MASPAR [] were SIMD machines built by MasPar Computer Corporation. The first system was delivered in .
Related Entries
Flynn’s Taxonomy
Bibliography
. Blank T () Compcon Spring ’: the MasPar MP- architecture, San Francisco, CA, USA, Feb–Mar, pp –
Massively Parallel Processor (MPP)
Distributed-Memory Multiprocessor
Massive-Scale Analytics
Amol Ghoting, John A. Gunnels, Mark S. Squillante
IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
IBM Corp., Yorktown Heights, NY, USA
IBM, Yorktown Heights, NY, USA
Synonyms
Deep analytics; Large-scale analytics; MapReduce
Definition
Massive-scale analytics refers to a combination of mathematical methods and high-performance computational methods for the analysis of vast amounts of data in order to gain crucial insights and facilitate decision making across a broad spectrum of application domains, sometimes in an automated data-driven fashion. This requires a unified approach that addresses the two key dimensions of massive-scale analytics: data and computation.
Discussion

Introduction
The volumes of data available to various organizations throughout society have been growing at an explosive rate over many years. Important advances in computer sciences and technologies have made this possible, which in turn have resulted in the development and growth of data warehouses and reporting capabilities to view information concerning various aspects of an enterprise. These basic reporting capabilities have helped organizations improve the accuracy and timeliness of the data used for the purposes of enterprise decision making. However, significantly increasing the rewards for investments in data collection and data warehouse construction well beyond basic reporting requires the development of massive-scale analytics that provide deeper insights and understanding of the enterprise and optimization of its performance. The overall goal of massive-scale analytics is to provide organizations with data-driven decision-making capabilities based on a combination of advanced mathematical and computational methods, including automated tools and applications. There are many challenges to realizing the full benefits of massive-scale analytics. This includes important challenges along the two key dimensions of applying massive-scale analytics against huge volumes of data, namely the data dimension, which involves machine
learning, knowledge discovery, and data management, together with the computational dimension which involves high-performance and massively parallel computing. The main challenges for the data dimension entail extracting knowledge and insight from complex queries against massive data sets in an efficient and effective manner. The main challenges for the computational dimension entail efficient access to massive data sets and efficient parallel algorithms for computing advanced analytics over these data sets. Far too often, massive-scale analytics are pursued solely from the perspective of one of these dimensions. However, it is only through a general approach which simultaneously addresses these two key dimensions in a unified fashion that the full potential of massive-scale analytics can be realized. In addition to leveraging and combining solutions along both dimensions for creating and developing predictive models, this general massive-scale analytics approach includes the stochastic analysis and optimization of such predictive models to support decision making, scenario analysis, and optimization across a wide variety of applications in a domain-specific manner. This additional aspect of massive-scale analytics raises a number of important challenges to obtain scalable solutions for complex multidimensional stochastic models and decision processes. Some of the application areas where massive-scale analytics are being actively pursued include advertising and marketing, business intelligence, energy (power grid), health care (genomics, public health), natural sciences (biology), security (internet, national), social networks, workforce management, and transportation. The list of such applications continues to grow at a rapid pace and this trend is expected to endure for the foreseeable future.
Data Dimension
The massive-scale analytics challenge along the data dimension is to provide knowledge and insight into data sets using methods that scale gracefully with increasing data set sizes. Research on this front has been conducted in the following distinct areas.
Platforms for Scalable Data Management and Analysis
Successfully processing and analyzing massive data sets requires a platform that can store and allow one to
process data efficiently. Early platforms for scalable data management were provided through database systems. Database systems typically provide a relational data model for data representation, a system for data storage and indexing, and a query processor that can answer queries against the data stored in the system. Parallel database systems improve performance through parallelization of various operations, such as loading data, building indexes, and evaluating queries []. Although data may be stored in a distributed fashion, the distribution is governed solely by performance considerations. Such systems improve processing by using multiple CPUs and disks in parallel. Additional programmability is afforded through user-defined functions – these allow users to embed their custom parallel computations inside a database system. Parallel database systems have been successfully used in the scalable implementation of analytics solutions that go beyond simple database queries []. More recently, researchers have also delved into the problem of building database systems that are optimized for data warehouse workloads [] rather than transactional workloads.

While parallel relational database systems work very well for structured data, more recently, with the advent of applications that need to digest unstructured data (such as plain text, webpages, XML files, RSS feeds) and the need for increased programmability, there has been the development of new systems, programming abstractions, and query languages to store, process, and analyze such data. An example of such an effort is the increasingly popular Apache Hadoop project that uses the MapReduce programming model [] to provide a simplified means of processing large files on a parallel system through user-defined Map and Reduce primitives. A MapReduce job consists of two phases – a Map phase and a Reduce phase. During the Map phase, the user-defined Map primitive transforms the input data into (key, value) pairs in parallel. These pairs are stored and then sorted by the system so as to accumulate all values for each key. During the Reduce phase, the user-defined Reduce primitive is invoked on each unique key with a list of all the values for that key; usually, this phase is used to perform aggregations. Finally, the results are output in the form of (key, value) pairs. Each key can be processed in parallel during the Reduce phase. A user can implement a variety of parallel analytics computations by submitting one or more MapReduce jobs to
the system. Such systems have been designed to run on large commodity clusters and to recover from both data and compute node failures. For many domains, the MapReduce paradigm is too low level, and custom data management and analysis platforms have been developed on top of MapReduce, ranging from key-value stores like Cassandra to machine learning platforms such as System-ML [].
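To make the (key, value) flow described above concrete, the following self-contained sketch simulates a word-count job sequentially in C++; the in-memory std::map plays the role of the distributed sort/shuffle step, an illustrative simplification rather than the API of any particular MapReduce system.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Map phase: emit one (word, 1) pair per word of an input line.
std::vector<std::pair<std::string, int>> mapFn(const std::string& line) {
  std::vector<std::pair<std::string, int>> pairs;
  std::istringstream in(line);
  std::string word;
  while (in >> word) pairs.push_back({word, 1});
  return pairs;
}

// Reduce phase: aggregate all values emitted for one key.
int reduceFn(const std::vector<int>& values) {
  int sum = 0;
  for (int v : values) sum += v;
  return sum;
}

int main() {
  std::vector<std::string> input = {"a rose is a rose", "is a rose"};
  // Shuffle: group emitted values by key (done in parallel by the
  // framework in a real MapReduce system).
  std::map<std::string, std::vector<int>> grouped;
  for (const auto& line : input)
    for (const auto& kv : mapFn(line))
      grouped[kv.first].push_back(kv.second);
  for (const auto& kv : grouped)
    std::cout << kv.first << " " << reduceFn(kv.second) << "\n";
  return 0;
}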
Algorithms for Scalable Knowledge Discovery and Data Mining
Spurred by advances in data collection technologies, the field of data mining has emerged, merging ideas from statistics, machine learning, databases, and high performance computing. The main challenge in data mining is to extract knowledge and insight from massive data sets in a fast and efficient manner. Data mining methods must scale near-linearly with the data set size. Thus, as data set sizes increase, these techniques must allow one to process increasing volumes of data and at the same time provide a reasonable response time by employing parallelism. Data mining commonly involves four classes of tasks: association rule mining, classification, clustering, and regression. Classification, clustering, and regression have been well studied in the machine learning and statistics communities, but the algorithms were not always designed to scale linearly with the data set size. Over the past decades, the data mining community has borrowed methods from these communities and devised new methods to provide actionable knowledge from large data sets efficiently. Parallel solutions have also been devised that scale using data parallelism. Furthermore, when data parallelism is not an option, techniques that learn on partitions of the data independently and then use the resulting models collectively as an ensemble have also been devised []. The reader can look at the entry on parallel data mining for further details.

Approximate Data Processing and Mining
An alternate strategy to deal with increasing data set sizes is to develop approximate versions of existing algorithms. The approximation should be tunable in the sense that the process should take less time if one can make do with a less accurate answer. This allows one to process increasingly larger data sets in the same amount of time by requesting a less accurate answer.
Database and data mining researchers have developed a suite of approximation techniques to process very large data sets in linear and sometimes even sub-linear time. These techniques make use of summary structures and/or sampling to provide approximate answers, oftentimes with a deterministic or probabilistic error bound. Approximate data processing and mining techniques have also been extended to process data streams where one can only perform one pass over the data. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. The reader can refer to the papers by Garofalakis et al. [] and Babcock et al. [] for an introduction to approximate query processing techniques and the application of these techniques to processing data streams, respectively, and the book by Charu Aggarwal [] for an introduction to issues and solutions related to data stream mining.
Computational Dimension

Architectural Resources
The peak computational ability of systems has been growing at a rate greater than that indicated by Moore’s Law, doubling more frequently than every months. For example, the fastest disclosed computer in the world was running just over one teraflop ( FLOPS) in according to the TOP list []. In the analogous system at the top of this list broke the petaflop ( FLOPS) barrier []. Thus, for traditional scientific computations, e.g., physics simulations, floating-point arithmetic is a bountiful resource. As analytics, even deep analytics, applications are rarely more computationally intensive than such scientific computations, it is unlikely that these algorithms’ performance will be limited by floating-point resource availability. Memory tends to be the bottleneck for analytics, as it is often the limiting factor for scientific applications. On large HPC systems memory access has a complex characterization. On any system, the standard performance measures of memory are the latency to access and the bandwidth available, and both characteristics tend to degrade as one’s focus moves from the on-chip cache to the larger main memory and the attached high-capacity nonvolatile memory (typically a hard disk drive). With large, distributed-memory systems, off-chip, off-node, and off-rack memory characteristics come into play. It
is typically the case that as one increases the working set of the problem under consideration it is necessary to use the memory systems of other nodes and to carry out implicit or explicit message passing between those nodes to share data. It is usually the case that the larger the domain under consideration, the greater the latency and the lower the bandwidth available. However, this relationship is not always true. For example, the bandwidth of a hard drive system, especially when viewed as the shared resource that it is almost assured to be, can be lower than the intranode network on a large HPC system. Further, there are architectural features that can ameliorate the situation. Examples include massive multithreading and out-of-order execution, which both tend to greatly mask issues related to high-latency accesses.
HPC System Utilization Modes
There are two fundamental ways in which large HPC systems are currently leveraged for analytics applications. It is usually the case that a small part of a larger system is devoted to a single program execution. This allows analytics to leverage high performance computing systems using current software and, given the nature of stochastic modeling, is an excellent way to leverage such systems. More rarely, analytics codes are rewritten to take advantage of highly parallel (,-way and beyond) systems in a unified fashion. This usually requires new algorithmic insights or the revival of algorithms that were once dismissed as impractical (due to their serial performance) as well as significant software engineering [, ].
Software and Programmability
In order to efficiently utilize high performance computing resources for analytics, the algorithms employed must be parallel. Software that is already executing in parallel, through automatically (compiler) extracted parallelism, lightweight explicit parallelism (e.g., OpenMP, HPF, and X [, , ]), or somewhat intrusive and highly orchestrated parallelism (e.g., Pthreads and MPI [, ]), must be extended to encompass larger (higher thread and node count) systems. This can present significant challenges as most analytics codes contain sequential portions (due to the complexity of the algorithm or assumptions about resources available, such as the amount of memory available in a single coherence domain). There is also a good deal of analytics software designed and written for strictly sequential execution. For the most part, this software must be heavily redesigned and the algorithms employed, rethought. As the number of processes increases, Amdahl’s law makes it necessary to parallelize an ever-increasing percentage of the overall algorithm. Thus, in either case, a good deal of work is likely required in order to enable analytics software for this domain. This has motivated research into such enabling technologies as the Partitioned Global Address Space (PGAS) languages [] as well as investment in creating software tools in this space (e.g., IBM’s High Performance Computing Toolkit [] and Intel’s Parallel Studio []).
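The constraint imposed by Amdahl's law can be stated compactly: if a fraction f of the total work can be parallelized and p processes are used, the achievable speedup is bounded by

S(p) = \frac{1}{(1 - f) + f/p} \;\le\; \frac{1}{1 - f},

so the serial fraction 1 - f, however small, eventually dominates as p grows (the notation f and p here is the standard textbook formulation).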
Stochastic Decomposition
In addition to deep insights and understanding of the enterprise, a unified approach that addresses both data and computational dimensions of massive-scale analytics will provide an organization with predictive modeling, scenario analysis, and decision-making capabilities. This involves the stochastic analysis and optimization of an underlying mathematical representation of such predictive models and decision processes. The multidimensional aspects of these underlying stochastic processes are a major source of complexity, often involving various dependencies and dynamic interactions among the different dimensions of the multidimensional process. Hence, the stochastic analysis and optimization of predictive models and decision processes in a scalable manner as part of massive-scale analytics will often need to leverage a number of additional mathematical and computational methodologies. This includes Monte Carlo methods, a class of computational algorithms based on the use of repeated probabilistic sampling to compute the results of interest; the reader is referred to the entry on Monte Carlo methods for further details. Another related general methodology is based on stochastic decomposition, which essentially consists of decomposing the complex multidimensional stochastic process into a combination of various forms of simpler processes with reduced dimensionality. All of the mathematical and computational methodologies that support stochastic analysis and optimization
for massive-scale analytics can also provide benefits to, and exploit benefits from, the areas covered in the sections Data Dimension and Computational Dimension.
Nearly Completely Decomposable Systems
One general class of stochastic decomposition approaches is based on the theory of nearly completely decomposable stochastic systems. Consider a discrete-time Markov process with transition probability matrix such that block submatrices along the main diagonal consist of relatively large probability mass, while the elements of the remaining block submatrices are very small in comparison. Stochastic matrices of this type are called nearly completely decomposable [], in which case the transition probability matrix can be written in a form comprised of stochastic and completely decomposable micro (conditional) models and stochastic macro (unconditioning) models. When the system size is very large, computing the stationary distribution (as well as functionals of the stationary distribution) directly from the transition probability matrix can be prohibitively expensive in both time and space. On the other hand, the overall solution for nearly completely decomposable stochastic matrices can be efficiently computed based on extensions of the Simon–Ando approximations for the stationary distribution of the corresponding Markov process. As a specific example, Aven, Coffman, and Kogan [] consider model instances where the functional of the stationary distribution represents the stationary page fault probability for a stochastic computer program model and a finite storage capacity. A solution for instances of such computer storage models arising in parallel computing environments, where standard nearly completely decomposable methods break down and yield improper solutions, is derived by Peris et al. [] and Squillante [] based on first passage times, recurrence times, taboo probabilities, and first entrance methods.

Fixed-Point Based Solutions
Another general class of stochastic decomposition approaches is based on models of each dimension of the multidimensional process in isolation together with a
set of fixed-point equations to capture the dependencies and dynamic interactions among the multiple dimensions. One of the most well-known examples of this general approach is the Erlang fixed-point approximation developed to address the computational complexity of the exact product-form solution for the classical Erlang loss model. This approximate solution is based on a stochastic decomposition in which the multidimensional exact solution is replaced by a system of nonlinear equations in terms of an exact solution for each dimension in isolation. It is well known that there exists a solution of the Erlang fixed-point equations and that this solution converges to the exact solution of the original Erlang loss model in the limit as the number of dimensions grows large under a certain type of scaling; refer to, e.g., Kelly []. The corresponding capacity planning problem to optimize an objective function of the stationary loss probabilities and capacities has also been considered within this context by Kelly []. The asymptotic exactness of the Erlang fixed-point approximation and optimization based on this approximation is an important aspect of this general decomposition approach for the analysis and optimization of complex multidimensional stochastic processes.
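For concreteness, the Erlang fixed-point equations for a loss network can be written in their standard textbook form (the notation below is illustrative and not taken from the original entry). With Erlang's blocking formula

E(\nu, C) = \frac{\nu^{C} / C!}{\sum_{k=0}^{C} \nu^{k} / k!},

the blocking probability B_j of link j with capacity C_j satisfies

B_j = E\!\left(\sum_{r \ni j} \nu_r \prod_{i \in r,\, i \neq j} (1 - B_i),\; C_j\right),

where \nu_r is the load offered to route r; each link is thus analyzed in isolation, with the other links entering only through the reduced offered load.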
Exploiting Priority Structural Properties
Yet another general class of stochastic decomposition approaches is based on exploiting various priority structural properties to reduce the dimensionality of the stochastic process in a recursive manner. Although this general approach was originally developed for single-server queueing systems with multiple queues under a priority scheduling discipline (see, e.g., [, ]), it has been extended and generalized in many different ways for the stochastic analysis and optimization of multidimensional stochastic systems. The basic idea consists of a recursive mathematical procedure starting with the two highest priority dimensions of the process that involves: (a) analyzing the probabilistic behavior of the so-called completion-time process, which characterizes the intervals between consecutive points of interaction between the starts of busy periods for the lower priority dimension; (b) obtaining the distributional characteristics of related busy-period processes through an analysis of associated stochastic processes and modified dimensional probability distributions in isolation; and
(c) determining the solution of the two-dimensional priority process from a combination of these results. These steps are repeated to obtain the solution for the (n+1)-dimensional priority process using the results for the n-dimensional priority process, until reaching the final solution for the original multidimensional stochastic process. Several related extensions of this general approach have been developed for the stochastic analysis and optimization of various parallel computing environments. These approaches generally exploit distinct priority structures in the underlying multidimensional stochastic process together with the probabilistic behavior of dependence structures and dynamics resulting from the complex massively parallel computing environment; for one such example, refer to [].
Related Entries
Data Mining
Bibliographic Notes and Further Reading
For a comprehensive reference on the application of popular machine learning methods to large amounts of data, with a heavy emphasis on parallelization using modern parallel computing frameworks (MPI, MapReduce, CUDA, DryadLINQ, etc.), the reader can refer to the book by Bekkermann et al. []. The interested reader is referred to the book by Zaki and Ho [] for more details on parallelization of data mining algorithms for association rule mining, clustering, and classification. For additional details on high-performance computational algorithms and architectures, the interested reader is referred to []. The reader can refer to [] for a brief overview of HPC operating system issues. For additional details on the stochastic analysis and optimization of multidimensional stochastic processes, the interested reader is referred to [] and the references cited therein. Nearly completely decomposable stochastic systems and their solutions have received considerable attention in the literature; see, e.g., [, ]. Kelly [] studies the Erlang loss model and the so-called Erlang fixed-point approximation. Gaver [] and Jaiswal [] consider stochastic decomposition based on priority structural properties.
Bibliography
. Aggarwal C () Data streams: models and algorithms. Springer, New York
. Aven OI, Coffman EG Jr, Kogan YA () Stochastic analysis of computer storage. D. Reidel, Amsterdam
. Babcock B, Babu S, Datar M, Motwani R, Widom J () Models and issues in data stream systems. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, ACM, New York, pp –
. Beckman P, Iskra K, Yoshii K, Coghlan S () Operating system issues for petascale systems. SIGOPS Oper Syst Rev :–
. Bekkermann R, Bilenko M, Langford J () Scaling up machine learning. Cambridge University Press, New York
. Bikshandi G, Almasi G, Kodali S, Saraswat V, Sur S () A comparative study and empirical evaluation of global view HPL program in X. In: rd conference on partitioned global address space programming models (PGAS), ACM, New York
. Chapman B, Jost G, van der Pas R () Using open MP: portable shared memory parallel programming (scientific and engineering computation). The MIT Press, Cambridge, MA
. Chung I-H, Seelam S, Mohr B, Labarta J () Tools for scalable performance analysis on petascale systems. Parallel Distrib Process Symp, Int :–
. Courtois PJ () Decomposability. Academic, New York
. Dean J, Ghemawat S () MapReduce: simplified data processing on large clusters. Commun ACM ():–
. DeWitt D, Gray J () Parallel database systems: the future of high performance database systems. Commun ACM ():–
. Dietterich T () Ensemble methods in machine learning. In: Kittler J, Roli F (eds) First International Workshop on multiple classifier systems. Springer, New York, pp –
. Garofalakis M, Gibbons P () Approximate query processing: taming the terabytes. In: Proceedings of very large databases, Rome, Italy
. Gaver DP Jr () A waiting line with interrupted service, including priorities. J R Stat Soc, Ser B :–
. Ghoting A, Krishnamurthy R, Pednault E, Reinwald B, Sindhwani V, Tatikonda S, Tian Y, Vaithyanathan S () SystemML: declarative machine learning on MapReduce. In: Proceedings of the IEEE international conference on data engineering
. Ghoting A, Makarychev K () Proceedings of the ACM/IEEE conference on high performance computing, SC , – Nov , Portland, OR. In: SC. ACM, New York
. Gropp W, Lusk E, Thakur R () Using MPI-: advanced features of the message passing interface. MIT Press, Cambridge, MA
. Gunnels J, Lee J, Margulies S () Efficient high-precision matrix algebra on parallel architectures for nonlinear combinatorial optimization. Math Program Comput :–. doi: ./s---
. Gupta M, Midkiff S, Schonberg E, Seshadri V, Shields D, Wang K-Y, Ching W-M, Ngo T () An HPF compiler for the IBM SP. In: Proceedings of the ACM Conference on Supercomputing, December
. Intel Corporation () Intel parallel studio – compiler, libraries and analysis tools. http://software.intel.com/en-us/articles/intel-parallel-studio-home/
. Jaiswal NK () Priority queues. Academic, New York
. Kelly FP () Loss networks. Ann Appl Probab ():–
. Kistler M, Gunnels J, Brokenshire D, Benton B () Programming the Linpack benchmark for the IBM PowerXCell i processor. Sci Program :–
. Kumar V, Grama A, Gupta A, Karypis G () Introduction to parallel computing: design and analysis of algorithms. Benjamin-Cummings, Redwood City
. Nichols B, Buttlar D, Farrell JP () Pthreads programming. O’Reilly & Associates, Sebastopol
. Peris VG, Squillante MS, Naik VK () Analysis of the impact of memory in distributed parallel processing systems. In: Proceedings of ACM SIGMETRICS conference on measurement and modeling of computer systems, ACM, New York, pp –
. Sarawagi S, Thomas S, Agrawal R () Integrating association rule mining with relational database systems: alternatives and implications. Data Min Knowl Discov ():–
. Squillante MS () Stochastic analysis of resource allocation in parallel processing systems. In: Gelenbe E (ed) Computer system performance modeling in perspective: a tribute to the work of Prof. Sevcik KC. Imperial College Press, London, pp –
. Squillante MS () Stochastic analysis and optimization of multiserver systems. In: Ardagna D, Zhang L (eds) Run-time models for self-managing systems and applications, chapter Springer, Berlin
. Squillante MS, Zhang Y, Sivasubramaniam A, Gautam N, Franke H, Moreira J () Modeling and analysis of dynamic coscheduling in parallel and distributed environments. In: Proceedings of ACM SIGMETRICS conference on measurement and modeling of computer systems, ACM, New York, pp –
. Stonebraker M, Abadi D, Batkin A, Chen X, Cherniack M, Ferreira M, Lau E, Lin A, Madden S, O’Neil E () C-store: a column-oriented DBMS. In: Proceedings of the st international conference on very large data bases, VLDB Endowment, Trondheim, Norway, pp –
. Su J, Yelick K () Automatic communication performance debugging in PGAS languages. In: Adve V, Garzarán MJ, Petersen P (eds) Languages and compilers for parallel computing. Springer, Berlin/Heidelberg, pp –
. TOP Organization Performance development – TOP supercomputing sites. http://www.top.org/lists///performance_development. Accessed November
. Zaki M, Ho C () Large-scale parallel data mining. Springer, New York
Matrix Computations
Linear Algebra, Numerical
Maude
José Meseguer
University of Illinois at Urbana-Champaign, Urbana, IL, USA
Synonyms
Rewriting logic
Definition
Maude [, ] is a declarative language whose modules are theories in rewriting logic [], and whose computation is logical deduction by concurrent rewriting modulo the equational structural axioms of each theory. Since virtually all models of concurrent computation can be naturally and easily expressed as rewrite theories [–], Maude and its tool environment can be used both as a parallel language, and as a flexible semantic framework to define, execute, formally analyze, and implement other parallel languages.
Discussion

Rewriting Logic and Maude in a Nutshell
Many different models of concurrency have been proposed and used: Petri nets, parallel functional programming, Actors, the π-calculus and other process calculi, and so on. Rewriting logic [] is not just one more model of concurrency. It is instead a logical framework for concurrency in which different models and different parallel languages can be formally specified, executed, and analyzed as rewrite theories. This can be done without imposing a representational bias that favors some specific concurrency model, and without complex encodings that distort and obscure the specific features of each model. The reasons why this natural expression of different concurrency models within rewriting logic is possible are essentially twofold, namely, the ease and generality with which: (1) different kinds of distributed states and (2) different kinds of local concurrent transitions can be specified. A rewrite theory (Σ, E, R) consists of an equational theory (Σ, E) and a set of rewrite rules R, and specifies a concurrent system. The distributed states of such a system are specified by the equational theory
(Σ, E), where Σ is a collection of typed operators which includes the state constructors that build up a distributed state out of simpler state components, and where E specifies the algebraic identities that such distributed states enjoy. That is, the distributed states are specified as an algebraic data type defined as the initial algebra semantics (see, e.g., []) of the equational theory (Σ, E). Concretely, this means that a distributed state is mathematically represented as an E-equivalence class [t]E of terms (i.e., algebraic expressions) built up with the operators declared in Σ, modulo provable equality using the equations E, so that two state representations t and t′ describe the same state if and only if one can prove the equality t = t′ using the equations E. Instead, the local concurrent transitions of the concurrent system specified by (Σ, E, R) are described by its set R of rewrite rules. Each rewrite rule in R has the form t → t′ , where t and t′ are algebraic expressions in the syntax of Σ. The lefthand side t describes a local firing pattern, and the righthand side t′ describes a corresponding replacement pattern. That is, any fragment of a distributed state which is an instance of the firing pattern t can perform a local concurrent transition in which it is replaced by the corresponding instance of the replacement pattern t ′ . Both t and t′ are typically parametric patterns, describing not single states, but parametric families of states. The parameters appearing in t and t ′ are precisely the mathematical variables that t and t′ have, which can be instantiated to different concrete expressions by a mapping θ, called a substitution, sending each variable x to a term θ(x). The instance of t by θ is then denoted θ(t). The most basic logical deduction steps in a rewrite theory (Σ, E, R) are precisely atomic concurrent transitions, called atomic rewrites, corresponding to applying a rewrite rule t → t′ in R to a state fragment that is an instance of the firing pattern t by some substitution θ. That is, up to E-equivalence, the state is of the form C[θ(t)], where C is the rest of the state not affected by this atomic transition. Then the resulting state is precisely C[θ(t ′ )], so that the atomic transition has the form C[θ(t)] → C[θ(t′ )]. Rewriting is intrinsically concurrent, because many other atomic rewrites can potentially take place in the rest of the state C (and in the substitution θ), at the same time that the local atomic transition θ(t) → θ(t′ ) happens. That is, in general one may have complex concurrent transitions of the form C[θ(t)] → C′ [θ ′ (t ′ )], where the rest of the state C has evolved to C′ and the substitution θ has evolved to θ ′
by other (possibly many) atomic rewrites simultaneous with the atomic rewrite θ(t) → θ(t′). The rules of deduction of rewriting logic [, ] (which in general allow rules in R to be conditional) precisely describe all the possible, complex concurrent transitions that a system can perform, so that concurrent computation and logical deduction coincide. Maude [, ] is a language and system implementation directly based on rewriting logic so that: (1) a Maude module is exactly a rewrite theory (Σ, E, R) satisfying some simple executability conditions that make it easy to compute with the rules R modulo the equations E (see [] for details) and (2) computation in Maude is logical deduction by rewriting.

A Simple Example. All the ideas discussed above can be illustrated by means of a simple example of a concurrent object-based system specified in Maude. Maude’s syntax is user-definable; therefore, all state-building operators can be declared by the user with any desired “mixfix” syntax. A concurrent state made up of objects and messages can be thought of as a “soup” in which objects and messages are freely floating and can come into contact with each other in communication events. Mathematically, this means that the concurrent state, called a configuration, is modeled as a multiset or bag built up by a multiset union operator which satisfies the axioms of associativity and commutativity, with the empty multiset as its identity element. In Maude one can, for example, choose to express multiset union with empty syntax, that is, just by juxtaposition. One can achieve this by declaring the type (called a sort) Configuration of configurations, which contains the sorts Object and Msg as subsorts, and the empty configuration, say none, and the configuration union operator by the declarations:

sorts Object Msg Configuration .
subsorts Object Msg < Configuration .
op none : -> Configuration [ctor] .
op _ _ : Configuration Configuration ->
  Configuration [ctor config assoc comm id: none] .
Each operator is declared with the op keyword, followed by its syntax, the list of its argument sorts, an arrow ->, and its result sort. And all declarations are finished by a space and a period. In the case of the configuration union operator, there are two argument positions, which are marked by underbars. Before and/or after such underbars, any desired syntax
tokens can be declared. In this case an empty syntax (juxtaposition) has been chosen, so that no tokens at all are declared. If, instead, one wanted to describe configurations as comma-separated collections of objects and messages, one could give the alternative syntax declaration:

op _,_ : Configuration Configuration ->
  Configuration [ctor config assoc comm id: none] .
Note that constants like none are viewed as operators with no arguments. The keyword config declares that this is a union operator for configurations of objects and messages. This allows Maude to execute such configurations with an object and message fair strategy, so that all objects having messages addressed to them do eventually process such messages. The assoc comm id: none declarations declare the associativity axiom (x y) z = x (y z), the commutativity axiom x y = y x, and the identity axiom x none = x. Maude then supports rewriting modulo such axioms, so that a rule can be applied to a configuration regardless of parentheses, and regardless of the order of arguments. The ctor keyword declares that both none and _ _ are state-building constructors, as opposed to functions defined on constructors such as, for example, a cardinality function, say |_|, that counts the total number of objects and messages present in a configuration, and which could be defined with the following syntax and equations: op |_| : Configuration -> Nat . var O : Object . var M : Msg . Configuration . eq | none | = 0 . eq | O C | = 1 + | C | . eq | M C | = 1 + | C | .
var C :
where Nat is the sort of natural numbers in the NAT module. Note that the above equations are applied by Maude modulo the associativity, commutativity, and identity axioms for _ _. For example, by instantiating C to none, the second equation defines the cardinality of a single object O to be 1 + | none | = 1. Consider an object-based system containing three classes of objects, namely, Buffer, Sender, and Receiver objects, so that a sender object sends to the corresponding receiver a sequence of values (say natural numbers) which it reads from its own buffer, while the receiver stores the values it gets from the sender in its
M
own buffer. In Maude’s Full Maude language extension (see Part II of []), such object classes can be declared as subsorts of the Object sort in class declarations, which specify the names and sorts of the attributes of objects in the class. The above three classes can be defined with class declarations: class Buffer | q : NatList, owner : Oid . class Sender | cell : Nat?, cnt : Nat, receiver : Oid . class Receiver | cell : Nat?, cnt : Nat .
An object of a class Cl declared with attributes a1 of sort A1, . . ., an of sort An, is a record-like structure of the form: < oid : Cl | a1 : v1, ... , an : vn >
where each vi is a term of sort Ai. For example, assuming that one uses quoted identifiers as object identifiers (by importing the QID module and giving the subsort declaration Qid < Oid), and that the supersort Nat? of Nat containing an empty value mt, and the sort NatList of lists of natural numbers have been declared as: sorts Nat? NatList . subsorts Nat < Nat? NatList . op mt : -> Nat? [ctor] . op nil : -> NatList [ctor] . op _._ : NatList NatList -> NatList [ctor assoc id: nil] .
then the following is an initial configuration of a sender and a receiver object, each with their own buffer, and each with their cell currently empty: < ’a : Buffer | q : 1 . 2 < ’b : Sender | cell : mt receiver : ’d > < ’c : Buffer | q : nil , < ’d : Receiver | cell
. 3 , owner : ’b > , cnt : 0 , owner : ’d > : mt , cnt : 1 >
A sender object can send messages to its corresponding receiver object. The programmer has complete freedom to define the format of such messages by declaring an operator of sort Msg, using the msg keyword instead of the more general op keyword to emphasize that the resulting terms are messages. For example, one can choose the following format: msg to_::_from_cnt_ : Oid Nat Oid Nat -> Msg.
where a message, say, to ’d::3 from ’b cnt 1, means that ’b sends to ’d the data item , with counter ,
M
M
Maude
indicating that this is the first element transmitted. This last information is important, since message passing in a configuration is usually asynchronous, so that messages could be received out of order. Therefore, receiver objects need to use the counter information to properly reassemble a list of transmitted data. Of course, out-of-order communication is just one possible situation that can be modeled. If, instead, one wanted to model in-order communication, the distributed state could contain channels, similar for example to the buffer objects, so that axioms of associativity and identity are satisfied when inserting messages into a channel, but not commutativity, which is the axiom allowing out-oforder communication in an asynchronous configuration of objects and messages. All that has been done up to now is to define the distributed states of this object-based system as the algebraic data type associated to an equational theory, namely, the equational theory (Σ, E), where Σ is the signature whose sorts have been declared with the sort (and class) keywords, with subsort relations declared with the subsort keyword, and whose operators have been declared with the op (or msg) keywords. And where the equations E have been declared in one of two ways: () either as explicit equations, specified with the eq keyword, or as () equational axioms of associativity and/or commutativity and/or identity associated to specific operators, declared with the assoc, comm and id: keywords. What about the concurrent transitions for buffers, senders, and receivers? As already mentioned, they are specified by a set R of rewrite rules which describe the local concurrent transitions possible in the system. One can, for example, define such transitions by the following four rewrite rules (note that object attributes not changed by a rule need not be mentioned in it): vars X Y Z : Oid . vars N E : Nat . vars L L’ : NatList . rl [read] : < X : Buffer | owner : Y > < Y : Sender cnt : N > => < X : Buffer | q : L, < Y : Sender | cell : E,
q : L . E , | cell : mt, owner : Y > cnt : N + 1 >
rl [write] : < X : Buffer | q Y > < Y : Receiver | cell : => < X : Buffer | q : E . L < Y : Receiver | cell : mt
.
: L , owner : E > , owner : Y > > .
rl [send] : < Y : Sender | cell : E , cnt : N , receiver : Z > => < Y : Sender | cell : mt, cnt : N > (to Z :: E from Y cnt N) . rl [receive] : < Z : Receiver | cell : mt , cnt : N > (to Z :: E from Y cnt N) => < Z : Receiver | cell : E , cnt : N + 1 > .
That is, senders can read data from the buffer they own and update their count; and receivers can write their received data in their own buffer. Also, each time a sender has a data element in its cell, it can send it to its corresponding receiver with the appropriate count; and a receiver with an empty cell can receive a data item from its sender, provided it has the correct counter. Note that rewriting is intrinsically concurrent; for example, ’b could be sending the next data item to ’d at the same time that ’d is receiving the previous data item or is writing it into its own buffer. Note also that the rules send and receive describe the asynchronous message passing communication between senders and receivers typical of the Actor model []. Instead, the read and write rewrite rules describe synchronization events, in which a buffer and its owner object synchronously transfer data between each other. This illustrates the flexibility of rewriting logic as a semantic framework: No assumption of either synchrony or asynchrony is built into the logic. Instead, many different styles of concurrency and of in-order or out-of-order communication can be directly modeled without any difficulty. The behavior of the above system can be simulated by means of Maude’s rewrite (abbreviated rew) command. For example, one can see the result of a terminating execution of the initial configuration described above by giving to Full Maude the following command: Maude> (rew < ’a : Buffer | q : 1 . 2 . 3 , owner : ’b > < ’b : Sender | cell : mt , cnt : 0 , receiver : ’d > < ’c : Buffer | q : nil , owner : ’d > < ’d : Receiver | cell : mt , cnt : 1 > .) result Configuration : < ’a : Buffer | owner : ’b,q : nil > < ’b : Sender | cell : mt,cnt : 3,receiver : ’d > < ’c : Buffer | owner : ’d,q :(1 . 2 . 3)> < ’d : Receiver | cell : mt,cnt : 4 >
Maude
Since a concurrent system can be non-terminating, a rewrite command may never terminate; for this reason, the rewrite command can be given an extra numerical parameter, indicating the maximum number of rewrite steps that should be simulated. Furthermore, a concurrent system can often be highly nondeterministic, so that the result of a rewrite command provides just one fair simulation of the system among possibly many others. Instead, functions defined by equations with the eq keyword (as opposed to the concurrent transitions defined by rules with the rl keyword) should have a single final result. The evaluation of functional expressions to their unique final result using the module’s equations from left to right is supported by a different command, namely, the reduce (abbreviated red) command. For example, one can compute the cardinality of the above initial configuration by giving to Full Maude the command: Maude> (red | < ’a : Buffer | q : owner : ’b > < ’b : Sender | cell : mt , cnt : receiver : ’d > < ’c : Buffer owner : ’d > < ’d : Receiver | cell : mt , cnt
1 . 2 . 3 , 0 , | q : nil ,
M
where an equation declared with the “otherwise” (owise) keyword abbreviates a conditional equation that is applied only when all other equations defining the same function cannot be applied (see Sect. .. of []). One can then give the following search command to Full Maude to verify the desired invariant by searching for a state violating such an invariant (note Maude’s support for on-the-fly introduction of mathematical variables by appending to a variable’s name a colon and the sort of the variable, as done below for C:Configuration and L:NatList): Maude> (search < ’a : Buffer | q : 1 . 2 . 3 , owner : ’b > < ’b : Sender | cell : mt , cnt : 0 , receiver : ’d > < ’c : Buffer | q : nil , owner : ’d > < ’d : Receiver | cell : mt , cnt : 1 > =>* C:Configuration < ’c : Buffer | q : L:NatList , owner : ’d > such that postfix(L:NatList,1 . 2 . 3) = false .)
: 1 > | .) No solution.
result NzNat : 4
Simulating a system with the rewrite command is useful but insufficient. One would like, for example, to verify that for all behaviors some undesired situation (for example a deadlock) does not happen; or one might like to find what different states of a certain kind (final states, or states satisfying a certain pattern and/or condition) can be reached from a given initial state. This can be achieved by using Maude’s search command, which performs breadth-first search from an initial state looking for states specified by a pattern (a constructor term with variables, so that the desired states are instances of it), and possibly by an additional semantic condition specified with the such that keyword. For example, one may wish to check the invariant that all states reachable from our initial configuration exhibit in-order reception of the list 1 . 2 . 3 of numbers initially stored in the sender’s buffer; that is, that any list stored in the receiver’s buffer is always a postfix of the list 1 . 2 . 3. This can be done by first equationally defining a postfix predicate on pairs of lists as follows: op postfix : NatList NatList -> Bool . eq postfix(L,L’ . L) = true . eq postfix(L,L’) = false [owise] .
In this example, the number of reachable states from the initial configuration is finite, so that the search command can exhaustively search all reachable states. However, the search command can be given even when the number of reachable states is infinite. It will then return all the states it finds satisfying the given pattern and semantic condition. In particular, if one searches for the violation of an invariant and the invariant is indeed violated, it will find such a violation in finite time. Besides the possibility of using the search command for verifying invariants, whenever the set of states reachable form an initial state is finite, Maude also provides the possibility of model checking any desired linear-time logic (LTL) formula. This can be easily done by equationally defining the desired state predicates used as atomic propositions in the given LTL formula, importing Maude’s MODEL-CHECKER module, and then giving a command to model check the desired formula from the given initial state. For example, for the above initial configuration one may wish to model check the liveness property that all system behaviors eventually terminate with the sender’s buffer empty, the sender’s and receiver’s cells empty, and the receiver’s buffer containing the list originally stored in the sender’s
M
M
Maude
buffer. A detailed description of Maude’s LTL model checking capabilities can be found in Chap. of [].
Rewriting Logic Semantics of Programming Languages The flexibility of rewriting logic to naturally express many different models of concurrency can be exploited not just at the theoretical level, for expressing such models both deductively, and denotationally in the model theory of rewriting logic [, ]: It can also be applied to give formal definitions of concurrent programming languages by specifying the concurrent model of a language L as a rewrite theory (ΣL , EL , RL ), where: () the signature ΣL specifies both the syntax of L and the types and operators needed to specify semantic entities such as the store, the environment, input-output, and so on; () the equations EL can be used to give semantic definitions for the deterministic features of L (a sequential language typically has only deterministic features and can be specified just equationally as (ΣL , EL )); and () the rewrite rules RL are used to give semantic definitions for the concurrent features of L such as, for example, the semantics of threads. By specifying the rewrite theory (ΣL , EL , RL ) as a Maude module, it becomes not just a mathematical definition but an executable one, that is, an interpreter for L. Furthermore, one can leverage Maude’s generic search and LTL model checking features to automatically endow L with powerful program analysis capabilities. For example, the search command can be used in the module (ΣL , EL , RL ) to detect any violations of invariants, e.g., a deadlock or some other undesired state, of a program in L. Likewise, for terminating concurrent programs in L one can model check any desired LTL property. All this can be effectively done not just for toy languages, but for real ones such as Java and the JVM, and with performance that compares favorably with state-of-the-art tools such as Java PathFinder [, ]. There are essentially three reasons for this surprisingly good performance. First, rewriting logic’s distinction between equations EL , used to give semantics to deterministic features of L, and rules RL , used to specify the semantics of concurrent features, provides in practice an enormous state space reduction. Note that a state of (ΣL , EL , RL ) is, by definition, an EL -equivalence class [t]EL , which in practice is represented as the state of the program’s execution after all deterministic execution steps possible at a given
stage have been taken. That is, the equations EL have the effect of “fast forwarding” such an execution by skipping all intermediate deterministic steps until the next truly concurrent interaction is reached. For example, for L = Java, EJava has hundreds of equations, but RJava has just rules. The second reason is of course that the underlying Maude implementation is a high-performance one that can reach millions of rewrite steps per second. The third reason is that the intrinsic flexibility of rewriting logic means that it does not prescribe a fixed style for giving semantic definitions. Instead, many different styles, for example, small-step or big-step semantics, reduction semantics, CHAM-style semantics, or continuation semantics, can all be naturally supported []. But not all styles are equally efficient; for example, small-step semantics makes heavy use of conditional rewrite rules, insists on modeling every single computation step as a rule in RL , and is in practice horribly inefficient. Instead, the continuation semantics style described in [] and used in [] is very efficient. The good theoretical and practical advantages of using rewriting logic to give semantic definitions to programming languages have stimulated an international research effort called the rewriting logic semantics project (see [, , ] for some overview papers). Not only have semantic definitions allowing effective program analyses been given for many real languages such as Java, the JVM, Scheme, and C, for hardware description languages such as ABEL and Verilog, and for software modeling languages such as MOF, Ptolemy and AADL: It has also been possible to build a host of sophisticated program analysis tools for real languages based on different kinds of abstract semantics. The point is that instead of a “concrete semantics” (Σ L, EL , RL ), describing the actual execution of programs in a language L, one can just as easily define A A A ) describing any , EL , RL an “abstract semantics” (ΣL desired abstraction A of L. To give just two examples, one can give an abstract semantics for type checking by replacing operations on concrete values by operations on their corresponding types []; and one can likewise give an abstract semantics to scientific programs by operating not on their numerical values, but on their intended units of measurement []. All this means that many different forms of program analysis, much more scalable than the kind of search and model checking based on a language’s concrete semantics,
become available essentially for free by using Maude to execute and analyze one’s desired abstract semantics (ΣL^A, EL^A, RL^A). Two further developments of the rewriting logic semantics project, both pioneered by Grigore Roşu with several collaborators, are worth mentioning. One is the K semantic framework for programming language definitions [], which provides a very concise and highly modular notation for such definitions. The K-Maude tool then automatically translates language definitions in K into their corresponding rewrite theories in Maude for execution and program analysis purposes. Another is matching logic [], a program verification logic with substantial advantages over both Hoare logic and separation logic. It uses a language’s rewriting logic semantics, including the possibility of using patterns to symbolically characterize sets of states, to mechanize the formal verification of programs, including programs that manipulate complex data structures using pointers.
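As a schematic illustration of the split between EL and RL for a small imperative language with threads (a sketch invented here, not taken from the cited Java or JVM definitions): a deterministic step such as arithmetic evaluation inside one thread is given by an equation, while a write to the shared store, whose interleaving with other threads matters, is given by a rule.

```latex
% Equation in E_L: a deterministic step (arithmetic evaluation inside one thread)
\[ \langle\, t[n_1 + n_2] \parallel R,\ \sigma \,\rangle \;=\;
   \langle\, t[n_1 +_{\mathrm{Int}} n_2] \parallel R,\ \sigma \,\rangle \]

% Rule in R_L: a truly concurrent step (one thread commits a write to the shared store)
\[ \langle\, (x := n;\ s) \parallel R,\ \sigma \,\rangle \;\longrightarrow\;
   \langle\, s \parallel R,\ \sigma[x \mapsto n] \,\rangle \]
```

Search and model checking then only branch on steps of the second kind, which is exactly the state-space reduction described above.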
Maude’s Implementation and Formal Environment Maude is implemented in C++ as a high-performance semi-compiled interpreter. Both equations and rules are (semi-)compiled into matching and replacement automata [, ]. This makes it possible to trace every single step of rewriting with both equations and rules. In the sequential Maude implementation, the execution of concurrent systems is simulated by the rewrite or frewrite commands, which each apply a module’s rules with a specific fair strategy []. However, because of its built-in support for sockets (see Sect. . in []), Maude is also a parallel language, which can be used to program interesting distributed applications, for example, mobile languages, in a declarative, high-level way (see Chap. of []). Using sockets, a concurrent configuration of objects and messages, which at the sequential level would be represented by a single multiset data structure, can now be distributed and executed over several machines, where each machine holds its own local configuration of objects and messages as a multiset and rewrites it sequentially in a fair way. Objects in a local configuration can send messages not only to other objects in the same local configuration, but also to remote objects in other machines using sockets. The fact that the rewriting of objects and messages is automatically fair for each object means that
no explicit scheduling is needed; that is, by the intrinsic fairness with which rewriting of objects and messages is implemented, all objects in the different local configurations always get to process the messages sent to them, either from other objects in the same local configuration or from remote objects. A next generation of Maude, currently under development at SRI International and the University of Illinois, will add to Maude’s present capabilities for distributed computing the ability to exploit the finer-grained parallelism provided by multicore machines. This means that the local configurations of objects and messages running in each multicore machine will themselves also be executed concurrently, with different objects in the same local configuration executing on different cores. Besides the search and LTL model checking features directly supported by the Maude implementation, which allow reachability analysis and LTL formal verification of Maude programs, Maude also has a formal environment of tools supporting other types of formal verification, including: () inductive theorem proving for equational theories (called functional modules in Maude) using its ITP tool; () proofs of termination using its MTT tool; () proofs of local confluence for equations with its CRC tool (which ensures that functional evaluations with equations using the reduce command are deterministic), and of coherence of rules with respect to equations with its ChC tool (which ensures that execution of equations and of rules “commutes” in an appropriate sense); () proofs of sufficient completeness with its SCC tool, ensuring that enough equations have been given to fully define each nonconstructor function symbol; and () execution and model checking analysis of real-time systems with its Real-Time Maude tool. A common feature of all these tools (and also of the Full Maude language extension, see Part II of []) is their systematic use of reflection to manipulate the Maude modules that they analyze as terms within rewriting logic itself. This powerful form of logical reflection is based on the fact that rewriting logic and equational logic are both reflective logics [], and is efficiently supported by Maude’s META-LEVEL module (see Chaps. – of []). Besides the formal tools used to analyze Maude modules, Maude can likewise be used to build domain-specific tools for specific languages or logical systems, including the already-mentioned tools for programming languages.
Some of these other tools, as well as the Maude formal tools, are discussed in Chap. of []. Maude itself, its documentation, and its formal tools are all freely available at maude.cs.uiuc.edu.
Related Entries
Actors
CSP (Communicating Sequential Processes)
Functional Languages
Linda
Logic Languages
Petri Nets
Pi-Calculus
Process Algebras
Bibliography . Agha G () Actors: model of concurrent computation in distributed systems. MIT Press, Cambridge, MA . Bruni R, Meseguer J () Semantic foundations for generalized rewrite theories. Theor Comput Sci (–):– . Chen F, Ro¸su G, Venkatesan RP () Rule-based analysis of dimensional safety. In: Nieuwenhuis R (ed) Rewriting techniques and applications: th international conference (RTA’: Proc.), Valencia, – June . Lecture notes in computer science, vol . Springer, Berlin, pp – . Clavel M, Durán F, Eker S, Lincoln P, Martí-Oliet N, Meseguer J, Quesada J () Maude: specification and programming in rewriting logic. Theor Comput Sci :– . Clavel M, Durán F, Eker S, Meseguer J, Lincoln P, Martí-Oliet N, Talcott C () All about Maude – a high-performance logical framework. Lecture notes in computer science, vol . Springer, Berlin . Clavel M, Meseguer J, Palomino M () Reflection in membership equational logic, manysorted equational logic, Horn logic with equality, and rewriting logic. Theor Comput Sci :– . Eker S () Fast matching in combination of regular equational theories. In: Meseguer J (ed) Proc. First Intl. Workshop on Rewriting Logic and its Applications, Pacific Grove, – Sept . Electronic Notes in Theoretical Computer Science, vol . Elsevier, Amsterdam . Eker SM () Associative-commutative rewriting on large terms. In: Nieuwenhuis R (ed) Rewriting techniques and application: th international conference (RTA’: Proc.), Valencia, – June . Lecture notes in computer science, vol . springer, Berlin, pp – . Ellison C, Serbanuta T, Rosu G () A rewriting logic approach to type inference. In: Proc. of WADT , Pisa, – June . Lecture notes in computer science, vol . Springer, Berlin, pp – . Farzan A, Cheng F, Meseguer J, Ro¸su G () Formal analysis of Java program in JavaFAN. In: Alur R, Peled DA (eds) Proc. CAV’: th international conference, Boston, – July .
Lecture notes in computer science, vol . Springer, Berlin, pp – Farzan A, Meseguer J, Ro¸su G () Formal JVM code analysis in JavaFAN. In: Proc. AMAST’: th international conference, Stirling, – July . Lecture notes in computer science, vol . Springer, Berlin, pp – Martí-Oliet N, Meseguer J () Rewriting logic: roadmap and bibliography. Theor Comput Sci :– Meseguer J () Conditional rewriting logic as a unified model of concurrency. Theor Comput Sci ():– Meseguer J () Rewriting logic as a semantic framework for concurrency: a progress report. In: Proc. CONCUR’: th international conference, Pisa, – Aug. . Lecture notes in computer science, vol . Springer, Berlin/New York, pp – Meseguer J, Goguen J () Initiality, induction and computability. In: Nivat M, Reynolds J (eds) Algebraic methods in semantics. Cambridge University press, Cambridge, pp – Meseguer J, Ro¸su G () Rewriting logic semantics: from language specifications to formal analysis tools. In: Basin D, Rusinowitsch M (eds) Proc. Intil. Joint Conf. on Automated Reasoning IJCAR’, Cork, – July . Lecture notes in computer science, vol . Springer, Berlin, pp – Meseguer J, Ro¸su G () The rewriting logic semantics project. Theor Comput Sci :– Ro¸su G, Ellison C, Schulte W () Matching logic: an alternative to Hoare/Floyd logic. In: AMAST : th international conference, Lac-Beauport, – June . Lecture notes in computer science, vol . Springer, Berlin Ro¸su G, Serbanuta T () An overview of the K semantic framework. J Log Algebra Program ():– Serbanuta T, Ro¸su G, Meseguer J () A rewriting logic approach to operational semantics. Inf Comput ():–
Media Extensions Vector Extensions, Instruction-Set Architecture (ISA)
Meiko James H. Cownie , Duncan Roweth Intel Corporation (UK) Ltd., Swindon, UK Cray (UK) Ltd., UK
Synonyms Computing surface; CS-
Definition Meiko was a UK-based parallel computer company that was founded in by six employees of Inmos. It
grew to a peak of employees based in Bristol and Boston Massachusetts. It ceased to trade in ; shortly afterward, some of the communications technology and a small engineering team were transferred to Quadrics.
Discussion Meiko machines were all distributed memory clusters. There were three distinct machine generations during the years of Meiko’s life: Transputer systems, hybrids that used Transputers for communication and an Intel i or SPARC processor, and, finally, SPARC systems in which the Transputers had been replaced by Meiko-designed communications ASICs. At the time of their construction, the takeover of high performance computing by cluster machines had yet to take place, the fastest machines at the time being vector supercomputers from Cray and Japanese vendors. Meiko machines contributed to the demonstration that it was possible to use cluster machines with no shared memory for high performance computing.
Customers Meiko had over different customers; some of the more important are mentioned here.
HPC Users Major users of Meiko machines for high performance scientific computing included:
● CERN who used the machines in the NA experiment data recording (a project continued by Quadrics)
● Edinburgh Parallel Computer Centre who used the machines for general scientific supercomputing, and QCD calculations
● Hewlett Packard printer division who used the machines for simulation of inkjet printer heads
● Lawrence Livermore National Laboratory who used the machines in the stockpile stewardship program
● Southampton University who used the machines in the EECS department
● Toyota who used the machines for producing photo-realistic, ray traced images of cars from design data
Many other customers in Europe also used the machines, some of which were funded as part of
the EEC’s Esprit program, the GPMIMD project in particular.
Transputer-Based Machines Meiko’s first-generation machines, known as “The Computing Surface,” were based on the Inmos T Transputer. Early systems consisted of up to -bit integer processors connected by MBit/s serial links. They were programmed in Occam. With the introduction of the Inmos T Transputer, which had an integrated FPU, floating point performance increased to MFlops per CPU, MFlops for the system as a whole. Customers wanted to be able to program these machines in conventional programming languages, so that existing codes could be more easily ported to the machine. Recognizing this, Meiko produced Fortran and C compilers for the Transputer. In these languages, communication was achieved by calls to message passing libraries that mapped directly onto the underlying Transputer hardware channel communication instructions. These early machines were normally connected to a host (a DEC micro-VAX, or Sun server), which booted the user program directly into the Computing Surface, and provided I/O. These machines were therefore really attached compute servers, rather than stand-alone computers. In the initial T machines, the topology of the Transputer link network was set by plugging wires in the back of the machine. Later machines used a Meiko designed ASIC to allow the link configuration and hence the network topology to be changed between jobs. The topology of the machine was, however, fixed for the duration of a single job – Inmos added adaptive routing support to later generations of product. Since Transputers had no memory protection, or operating system, multitasking was achieved by physically partitioning the machine and running separate jobs on separate processors.
Hybrid Machines When Intel released the i () processor in , Meiko built boards that used these in conjunction with Transputers. The main program executed on the i, with communication handled by the Transputers. While the peak performance of the system was high (- MFlops per processor), achieving this proved
difficult, with some hand-coded assembler routines achieving – MFlops but most compiled code getting less than . Meiko also produced SPARC boards with Transputer communications. These were normally used to replace an external host machine, running SunOS (and later Solaris). These boards could all be used in machines with Transputers as nodes, providing a simple upgrade path for existing customers. Inmos’ follow-up Transputer design, the T, was not a success, and Meiko switched to designing its own interconnect for use with conventional microprocessors.
Sun SPARC-Based Machines In , Meiko introduced the “Computing Surface ” (CS-) machines. These were based on MHz SPARC processors with Meiko’s own interconnect, using two custom ASICs, the Elan network interface and the Elite switch. The CS- system had an asymptotic bandwidth of MBytes per second per link, a put latency of μs, and a message passing latency of μs. The communication network used by the CS- was built from Elite switches, each of which was a sixteen-way full crossbar. Since the links were unidirectional, these provided an eight-way bidirectional crossbar. The network was source routed, with each NIC being able to select at random (or in round-robin order) from a table of four possible routes to each destination. Packets in the network were routed by byte-stripping, each Elite stripping the first byte from the incoming message and using that to determine the outgoing route. The Elite chip also included the ability to broadcast an incoming message to a contiguous range of outgoing links, and to combine acknowledgments from a range of links into a single message. In the CS- machine, the Elite switches were connected in a full fat-tree configuration to provide a scalable interconnect with linear increase in bisection bandwidth and logarithmic increase in latency as the machine size increased. In addition to the SPARC boards, Meiko produced a board using a SPARC processor and a pair of Fujitsu vector processors. The node had a double precision peak performance of MFlop with MBytes of
memory and . GB/s of memory bandwidth per vector unit (three words per flop!) and an additional MB/s port shared by the SPARC and the Elan. The high peak performance was attractive (particularly to LLNL, the first customer) but the node design complicated the programming model – a new vectorizing compiler was developed by The Portland Group. In the initial design, the vector processors were not cache coherent with the SPARCs, despite sharing memory with them. This made programming hard, or forced the SPARC processor to execute with most of the memory marked as uncacheable. Of course, this slowed the non-vectorizable portions of the code, making overall performance poor. (A simple application of Amdahl’s Law with a slowdown applied to the non-vectorized component shows that if the slowdown is S, even with an infinitely fast vector system a proportion (1 − 1/S) of the program must vectorize to achieve parity with the scalar system; for a slowdown of × this leads to a requirement that more than % of the computation vectorizes.) Later versions implemented cache coherence, but the complexity of the logic and PCB manufacture required made the design uneconomic. Meiko had a competing board design that used four SPARC CPUs. It was lower cost, much easier to program, and generally sustained higher performance, but it lacked the high floating point peak and only sold to a small number of customers. The CS- system pictured comprised modules each with four boards. It was typically configured with either vector or scalar nodes together with I/O card and switching. The largest system shipped (to LLNL) consisted of five such racks with a total of , mostly vector, nodes. A fourth-generation system based on SPARC processors and a new network was under development in when the company ran out of money. Significant elements of this design were used in QsNet, although Quadrics moved away from the SPARC/MBUS design to the recently standardized PCI host interface. Ever since then, network adapters for HPC systems have been on the “wrong side” of a commodity host interface, with reduced bandwidth and increased latency as a consequence.
Meiko. Fig. Meiko CS- “Hugh” at Lawrence Livermore National Lab (Photograph courtesy of Lawrence Livermore National Laboratory)
Uses Outside High Performance Computing Meiko worked with Oracle to port their database (Oracle Parallel Server) to the CS-, and a number of database server machines were sold late in the life of the company. These systems were used on early data warehousing applications by UK retailers including WH Smiths and Bass Taverns. Sales data was transferred from the company mainframe overnight, enabling marketing users to develop and test complex decision support queries on recent data. Other users ran a mix of online transaction processing (OLTP) and decision support on their CS- systems. Meiko was unable to capitalize on this product – customers for database systems of its class were not generally willing to buy them from a small UK company.
Influence The experience of the Meiko designs has spread widely. In , ex-employees of Meiko are to be found at companies such as AMD, Cray, HP, Intel, and Sun. They
have influenced designs as diverse as AMD’s “HyperTransport” interconnect, current database machines, and parallel file system design. For instance, the technical lead for the Lustre parallel file system product at Sun is one of the founders of Meiko who led the parallel Oracle project years earlier.
Contributions Direct User Space Communication The Meiko CS- machine was one of the first machine designs to implement direct, cache-coherent, secure multi-user communication from user address space to user address space with no OS intervention. Other contemporary machines like the Thinking Machines CM- (also based on SPARC chips with a “fat-tree” interconnect) allowed only single user use of the compute nodes, did not execute a full OS on the node, and used the CPU to transfer data. Avoiding OS intervention on message passing is critical if low latency is to be achieved, as simply entering and leaving kernel mode is expensive. Since
most message passing codes have many small messages, achieving high compute performance requires low-latency message passing. OS intervention is normally required in I/O operations for these reasons:
● Protection: to ensure that the user is not attempting to transmit (or receive into) parts of their address space which are unmapped, or that the user is not attempting to access a device which they have no permission to access.
● Pinning: to ensure that the physical address of data which is being communicated does not change while the communication is occurring, since DMA I/O devices traditionally deal with physical addresses.
● Device Sharing: to ensure that multiple users do not incorrectly interleave access to the same physical device, all access is serialized through the OS device driver.
The Elan communications interface was designed to avoid these issues. The protection and pinning issues were avoided by giving the Elan its own set(s) of page tables, so that all requests (both local and remote) could be made in terms of user virtual addresses. The Elan could then convert these into physical addresses, or (if the page was not present) raise a local page fault to cause the OS to load the page (if it was paged out), or kill the process (if the access was invalid). The page tables in the Elan were maintained by the Solaris kernel as it modified its own page tables. Device sharing and security were achieved by giving the Elan multiple, duplicate copies of its control registers at different physical addresses. When a process initially requested access to the Elan, the resource manager would allocate a set of control registers to the process, and map them directly into the process’ address space. It would also program the Elan so that it knew which page tables to use when it received a request through that set of control registers, and what global identifier it should use to identify incoming and outgoing messages from that process. Thus when a remote request arrived at the Elan, it could match the global identifier and route it into the correct local process’ address space with no OS intervention. The Elan chip sat on the SPARC MBUS, so could participate in the cache-coherency traffic in the local node. There was therefore no requirement that data to
be transmitted be flushed to memory, or that data must be received into uncached store. To eliminate the involvement of the CPU in processing message protocols, the Elan included its own processor (which executed a subset of the SPARC instruction set) and its own DMA engine. Using the techniques already outlined, this processor could execute code in the context of a user process, without requiring that the code be in a kernel-space device driver. Thus, each parallel job in the system could use whatever protocol it wished, the code to implement it being loaded as a normal shared library.
Remote Memory Access The fundamental communication primitive provided by the Elan was a remote memory access (put or get). This allows a less synchronized style of communication than message passing: since communication occurs when the initiator of the request requires it, it does not require any cooperation from the target. This leads to a programming style similar to that of the partitioned global address space languages today (UPC, Co-Array Fortran, Titanium), rather than a message passing style. The message passing libraries (such as MPI) were built on top of the remote memory access, rather than building remote memory access on top of message passing, as one would do in an active message style system. However, the model here differs from a fully shared memory system because remote memory is not available to the local processor directly through normal load and store operations. Rather, accesses to remote memory must be explicitly requested from the network interface. Communication here is thus explicit, whereas in a shared memory system it is implicit in normal memory accesses. This model fits well with the “partitioned global address space” languages, since there the compiler always knows whether a memory access is local or remote, and can generate appropriate code to fetch remote data as required. Network Collective Operations The Elite chip was designed to support network collective operations, broadcasting packets to a contiguous range of links, combining the results and returning a single answer to the initiating process. This technique
was used to implement fast broadcast and barrier functions, without the use of a separate network as in the CM-. The combining operation could be used, for instance, to test whether all processes had the same value at a specific virtual address, by sending a single message and receiving a single acknowledgment. The aggregation of the result occurred in the network, rather than requiring aggregation up a tree of processors, with each one combining results from those below it. The root process performed network conditionals until it received the acknowledgment that all processes had entered the barrier. It then broadcast a completion notification. These techniques were developed further by Quadrics in future generations of the Elite router.
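The put/get style survives essentially unchanged in today's MPI one-sided operations, which can serve as a rough illustration of this programming model (a modern analogue, not the Elan API; the function shown is hypothetical):

```cpp
// Sketch: depositing data directly into a neighbor's memory, with no matching
// receive on the target side (MPI one-sided operations; illustrative only).
#include <mpi.h>

void put_to_neighbor(double *local_window, int n, int neighbor_rank) {
    MPI_Win win;
    // Expose local_window[0..n-1] for remote put/get by other ranks.
    MPI_Win_create(local_window, (MPI_Aint)(n * sizeof(double)), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double value = 3.14;
    MPI_Win_fence(0, win);                       // open an access epoch
    MPI_Put(&value, 1, MPI_DOUBLE, neighbor_rank,
            /*target_disp=*/0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                       // close the epoch: transfer complete
    MPI_Win_free(&win);
}
```

As in the Elan model, the initiator decides when communication happens, and higher-level message passing can be layered on top of such puts and gets.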
Software Development Tools For the Transputer machines, Meiko implemented their own development environment, which allowed user code to configure the machine topology and then load code into it. For the SPARC-based machines, Meiko ported the TotalView debugger at the request of LLNL, and enhanced it to interact with the resource manager of the CS-. The SPARC systems included an early parallel filesystem implemented as a Solaris virtual filesystem. Meiko pioneered development of C and Fortran compilers for Transputers and later HPF compilers for the SPARC systems. It supported PVM and PARMACS on the CS-, contributed to the MPI standard, and the MPICH MPI implementation. Operating System Meiko implemented kernel patches to the Solaris operating system to maintain the Elan page tables in synchrony with the kernel page tables. Quadrics continued to develop these for Linux, and their ultimate descendant was finally adopted into Linus’ kernel in .
Bibliographic Notes and Further Reading The definitive papers on the design and programming of the Meiko CS- are [] and []. The Inmos Transputer is described in []. The Thinking Machines CM-, which forms an interesting comparison with the Meiko CS-, is described in []. UPC, one of the first PGAS languages, is described in []. The first message passing interface (MPI) standard is described in [].
Bibliography . Barton E, Cownie J, McLaren M () Parallel Comput : – . Beecroft J, Homewood M, McLaren M () Parallel Comput :– . El-Ghazawi T, Carlson W, Sterling T, Yelick K () UPC: distributed shared-memory programming. Wiley-Interscience, Hoboken . Hillis WD, Tucker LW () Commun ACM :– . MPI () Technical report Knoxville, TN, USA . Whitby-Strevens C () In: ISCA ’: Proceedings of the th annual international symposium on computer architecture, IEEE Computer Society, Los Alamitos, pp –
Memory Consistency Models Memory Models
Memory Models Sarita V. Adve , Hans J. Boehm University of Illinois at Urbana-Champaign, Urbana, IL, USA HP Labs, Palo Alto, CA, USA
Synonyms
Fences; Memory consistency models; Memory ordering

Related Entries
CSP (Communicating Sequential Processes)
HPF (High Performance Fortran)
MPI (Message Passing Interface)
PGAS (Partitioned Global Address Space) Languages
PVM (Parallel Virtual Machine)

Definition
In the context of shared-memory systems, the memory model specifies the values that a shared-memory read in a program may return. It is part of the interface between
a shared-memory program and any hardware or software that may transform that program – it both specifies the possible behaviors of the memory accesses for the programmer and constrains the legal transformations and executions for the implementer. For high-level language programs, the memory model describes the behavior of accesses to shared variables or heap objects, while for machine code, the memory model describes the behavior of hardware instructions that access shared memory. Without an unambiguous memory model, it is not possible to reason about the correctness of a shared-memory program, compiler, dynamic optimizer, or hardware.
Discussion Sequential Consistency In a single-threaded program, a read is expected to return the value of the last write to the same address, where last is uniquely defined by the program text or program order. This allows the programmer to view the memory accesses as occurring one at a time (atomically) in program order. It also allows the hardware or compiler to aggressively reorder memory accesses as long as program order is preserved between a write and other accesses to the same address as the write. A natural extension for shared-memory programs is the sequential consistency memory model [], which offers simple interleaving semantics. With sequential consistency, all memory accesses appear to be performed in a single total order preserving the program order of each thread. Each read access to a memory location yields the value stored by the last preceding write access to the same memory location. Although sequential consistency appears to be an appealing programming model, it suffers from two deficiencies:
● It can be difficult or inefficient to implement. Hardware often makes memory accesses visible to other threads out of order, for example, by buffering stores. Compilers often do the same, for example, by moving an access to a loop-invariant variable out of the loop. Unlike with single-threaded programs, these optimizations can be observed by the program, and may thus violate sequential consistency for multithreaded programs (see the sketch after this list).
● The definition of sequential consistency is based on the interleaving of individual memory accesses, and is thus only meaningful if the programmer understands how memory is accessed, for example, a byte versus a word at a time. This is sensible at the machine architecture level, but may not be realistic at the programming language level.
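The classic illustration is the following two-thread program, written here as a C++ sketch (invented for illustration, not taken from this entry). Sequential consistency forbids the outcome r1 == 0 && r2 == 0, yet store buffering in hardware or reordering by the compiler can produce exactly that:

```cpp
// Sketch: a program whose plain (non-atomic) accesses can violate the
// programmer's sequentially consistent intuition. It deliberately contains
// data races, so its C++ behavior is formally undefined; it is shown only to
// make the reordering problem concrete.
#include <thread>

int x = 0, y = 0;   // shared flags
int r1 = 0, r2 = 0; // results observed by each thread

void thread1() { x = 1; r1 = y; }   // write x, then read y
void thread2() { y = 1; r2 = x; }   // write y, then read x

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // Under sequential consistency, at least one of r1, r2 must be 1.
    // On real machines, buffered stores can let both reads return 0.
    return 0;
}
```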
Relaxed Memory Models Recognizing the limitations of sequential consistency, more relaxed or weak memory models have been proposed. Until the late s, most of this work was in the context of hardware instruction set architectures and implementations. These implementation-centric models are mostly driven by specific optimizations desirable for hardware. The most common class relaxes the program order requirement of sequential consistency. For example, the SPARC Total Store Order (TSO) [] and the x models [] allow reordering a write followed by a read (to a different location) in program order. The IBM PowerPC memory model [] allows reordering both reads and writes (to different locations). Such models additionally provide various forms of fence instructions to enable programmers to explicitly impose program orders not guaranteed by default. Other optimizations exploited by such models include subtle relaxations of the memory model to allow writes from multiple threads to become visible in inconsistent orders to different observer threads, as well as the ability to make memory references visible to other threads out of order even if a data- or control dependence exists between the references. Although most of the work on this genre of implementation-centric models is in the context of hardware, some programming environments such as OpenMP . [] also express their memory models in terms of fence instructions to ensure explicit program ordering. These relaxed models provide performance benefits over sequential consistency (through both hardware and compiler optimizations), but are inappropriate as programming interfaces. They often require description of subtle interactions and/or are not well-matched to software requirements, making them too complex to reason and/or suboptimal for performance.
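As an illustration of how such fences are used (a sketch in C++11 notation, standing in for the hardware-level fence instructions described above), inserting a heavyweight fence between each thread's write and subsequent read rules out the reordering shown in the earlier sketch:

```cpp
// Sketch: relaxed accesses model the freedom a TSO-like machine has to reorder
// a write followed by a read; the seq_cst fences explicitly restore the order.
#include <atomic>

std::atomic<int> x{0}, y{0};

int thread1() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // "store-load" fence
    return y.load(std::memory_order_relaxed);
}

int thread2() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return x.load(std::memory_order_relaxed);
}
// With both fences present, the two calls cannot both return 0;
// with the fences removed, that outcome is allowed.
```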
The Data-Race-Free Memory Models The data-race-free models [, ] (also referred to as properly labeled models []) take a programmercentric approach. They observe that good programming practice requires programs to be well synchronized or data-race-free. In such programs, concurrent non-read-only accesses to the same variable are either explicitly marked as synchronization or are separated by explicit synchronization. This ensures that no thread can ever observe another thread between synchronization points, allowing the implementation to aggressively transform code between synchronization points while still maintaining the appearance of sequential consistency. The data-race-free models formally define the notion of data-race-free programs and guarantee sequential consistency only for those programs. For full implementation flexibility, they do not provide any guarantees for programs that contain data races. This approach allows nearly standard optimizations, while providing a simple model for programmers and was the approach adopted by the Ada programming language [] and the POSIX standard [] (albeit informally).
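The discipline is easiest to see in code. In the following C++ sketch (invented for illustration), the communication flag is explicitly marked as synchronization by declaring it atomic; the program is therefore data-race-free, and a conforming implementation must give it sequentially consistent behavior even though the payload itself is an ordinary variable:

```cpp
// Sketch of a data-race-free program: conflicting accesses to 'ready' are
// marked as synchronization (std::atomic), and the accesses to 'payload'
// are separated by that synchronization.
#include <atomic>
#include <thread>

int payload = 0;                 // ordinary data
std::atomic<bool> ready{false};  // synchronization variable

void producer() {
    payload = 42;        // ordinary write, before the synchronization
    ready.store(true);   // default ordering is memory_order_seq_cst
}

void consumer() {
    while (!ready.load()) { /* spin */ }
    int v = payload;     // guaranteed to observe 42
    (void)v;
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```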
State-of-the-art in In the last decade, there has been much work to standardize high-level language programming models, specifically in the context of Java and C++. There is now convergence toward data-race-free as the model of choice. For all practical purposes, the Java memory model [, ] and the upcoming C and C++ standards [, ] provide the data-race-free model, but with two unfortunate caveats. First, Java’s safety and security properties require that some semantics need to be provided for all programs, including programs with data races. It has been extraordinarily difficult to develop reasonable semantics for racy programs, while still preserving full implementation flexibility. The current formal specification of the Java memory model is therefore quite complex and has an unresolved bug. Second, although data-race-free maps well to current implementation-centric hardware models for most purposes, there are useful programming idioms where available hardware-level fences are too heavyweight for
language-level synchronization mechanisms. For this reason, Java and C++ provide low-level synchronization constructs that provide potentially better performance on some current hardware, but at the cost of significant complexity to the programming model. Although most language specifications are converging to the data-race-free model, current implementations often lag. In particular, it is still quite common for C and C++ compilers to effectively introduce copies of an otherwise unmodified variable to itself, for example, by overwriting adjacent structure fields when a small field is modified []. These can introduce visible data races, and are hence incompatible with the model. In summary, although data-race-free provides the best trade-off between performance and ease-of-use today, it falls short for safe languages and current implementations.
Bibliographic Notes and Further Reading A more complete overview and retrospective on memory models is provided in []. It also surveys some possible approaches for overcoming the deficiencies of current data-race-free models. A tutorial on hardwareoriented relaxed memory models can be found in []. A description of the x hardware memory model, together with a discussion of some of the issues arising in precisely defining such models, is given in [].
Bibliography . Adve SV () Designing memory consistency models for shared-memory multiprocessors. Ph.D. thesis, University of Wisconsin-Madison . Adve SV, Boehm HJ () Memory models: a case for rethinking parallel languages and hardware. Commun ACM ():– . Adve SV, Gharachorloo K () Shared memory consistency models: a tutorial. IEEE Comput ():– . Adve SV, Hill MD () Weak ordering – a new definition. In: Proceedings of th international symposium on computer architecture, Seattle, pp – . Boehm HJ () Threads cannot be implemented as a library. In: Proceedings of conference on programming language design and implementation, Chicago . Boehm HJ, Adve SV () Foundations of the C++ concurrency memory model. In: Proceedings of conference on programming language design and implementation, Tucson, pp –
. C ++ Standards Committee, Becker P () Programming Languages – C ++ (final committee draft). C ++ standards committee paper WG/N=J/-, http://www.open-std.org/ JTC/SC/WG/docs/papers//n.pdf, March . Gharachorloo K () Memory consistency models for shared memory multiprocessors. Ph.D. thesis, Stanford University . IBM Corporation () Power ISA Version . Revision B. http://www.power.org/resources/downloads/PowerISA_V. B_V_PUBLIC.pdf, . IEEE and The Open Group () IEEE Standard .- . Intel Corporation () Intel and IA- Architectures software developer’s manual, vol a, system programming guide, Part. http://www.intel.com/products/processor/manuals/, . Lamport L () How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans Comput C-():– . Manson J, Pugh W, Adve SV () The Java memory model. In: Proceedings symposium on principles of programming languages, Long Beach . Pugh W, JSR Expert Group () The Java memory model. http://www.cs.umd.edu/_pugh/java/memoryModel/ and referenced pages, July . Sewell P, Sarkar S, Owens S, Nardelli FZ, Myreen MO () x-TSO: a rigorous and usable programmer’s model for x multiprocessors. Commun ACM ():– . The OpenMP ARB () OpenMP application programming interface: Version .. http://www.openmp.org/mp-documents/ spec.pdf, May . United States Department of Defense (). Reference manual for the Ada programming language: ANSI/MILSTD-A- standard .-, Springer . Weaver DL, Germond T () The SPARC architecture manual, version . Prentice Hall, Englewood Cliffs
Memory Ordering Memory Models
Memory Wall Sally A. McKee , Robert W. Wisniewski Chalmers University of Technology, Goteborg, Sweden IBM, Yorktown Heights, NY, USA
Synonyms Data starvation crisis
Definition The memory wall describes implications of the processor/memory performance gap that has grown steadily
over the last several decades. If memory latency and bandwidth become insufficient to provide processors with enough instructions and data to continue computation, processors will effectively always be stalled waiting on memory. The trend of placing more and more cores on chip exacerbates the situation, since each core enjoys a relatively narrower channel to shared memory resources. The problem is particularly acute in highly parallel systems, but occurs in platforms ranging from embedded systems to supercomputers, and is not limited to multiprocessors.
Discussion Introduction The term memory wall was coined in a short, controversial note that William A. Wulf and Sally A. McKee published in an issue of the ACM SIGArch Computer Architecture News (“Hitting the Memory Wall: Implications of the Obvious” []). At the time, most computer architects focused entirely on increasing processor speed, believing that improving and creating more latency-tolerating techniques would suffice to keep these ever-faster processors fed with instructions and data. The High Performance Computing (HPC) community had already recognized the problem, but years after John Ousterhout published an article entitled “Why Aren’t Operating Systems Getting Faster As Fast as Hardware?” [], the mainstream computing community still believed that out-of-order execution, wider instruction issue widths, non-blocking caches, and increased speculation would continue to deliver performance improvements through increased Instruction Level Parallelism (ILP) and latency tolerance. Even Richard Sites’s article in the popular magazine Microprocessor Report (“It’s the Memory, Stupid!”) [] failed to change common opinion. In the memory wall article, the authors make the point that if the entire memory hierarchy, and not just the caches, were not improved in tandem with the microarchitecture, computer performance would cease to improve. They assume that every fifth instruction is a memory reference and that the number of cycles required to fill a cache line grows over time (looking at two different ratios of processor versus DRAM speedups, cache hit rates of % up to .%, and two initial cache miss/hit cost ratios). By graphing the number of cycles it takes to perform a cacheline
fill, they show that even at impossibly high hit rates and a less aggressive speed growth differential, processor improvements would soon no longer matter, since the cores would find themselves almost always waiting on memory. Their analysis necessarily contains many simplifying assumptions, but the four-page note eventually had impact, and memory wall has become a common term. Figures and show the average number of cycles required per memory access as cache miss/hit cost ratios rise. The x axis marks increases in these cost ratios assuming given starting points (i.e., a cache for which fills cost four times as much as hits, and one for which fills cost times as much). The log-based y axis indicates cycles per access. The curves in each graph indicate average access costs for different (unrealistic) cache hit rates from .% to .%. The lines at the top of each graph indicate where on the cycle-cost curves we would be at different points in time, where the top line marks years for an aggressive improvement rate (%) in processor speed every months, and the bottom marks years for a more modest improvement rate (%). As the original article says, “the specific values won’t change the basic conclusion of this note, namely that we are going to hit a wall in the improvement of system performance unless something basic changes” []. The memory wall also has a bandwidth component. One aspect of that is the limited number of pins that can be placed on ever-shrinking chips (formerly on the perimeter [], now on the surface). The note focuses on latency aspects of the memory wall, but latency and bandwidth are two sides of the same coin, and cannot be teased apart. Denali CEO Sanjay K. Srivastava again stated the obvious in an interview with Wall Street Reporter: “There has been a growing disparity between the bandwidth needs of the processing elements and the ability of the memory elements to deliver that bandwidth. This ‘data starvation crisis’ is now being recognized as the main bottleneck for all electronic system designs” []. Depending on application characteristics and the architectures on which the applications run, current programs may or may not experience a memory wall. Data access locality on many levels is important to good performance. Applications that access sparse data structures, exhibit random (as opposed to structured) access patterns, or have huge working sets that prevent them from enjoying high cache hit rates suffer most.
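The computation graphed in these two figures reduces to a weighted average; writing p for the cache hit rate and t_hit, t_miss for the hit and cacheline-fill costs in cycles (notation chosen here for illustration, not taken verbatim from the note):

```latex
\[ t_{\mathrm{avg}} \;=\; p \cdot t_{\mathrm{hit}} \;+\; (1 - p)\cdot t_{\mathrm{miss}} \]
```

As the miss/hit cost ratio grows, the (1 − p)·t_miss term dominates, so even hit rates well above 99% eventually leave the average access cost, and hence overall performance, governed by memory rather than by processor speed.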
Memory Wall. Fig. Trends for a cache miss/hit cost ratio of four []. (a) Very high hit rates; (b) more modest hit rates. Both panels plot average cycles per access against the cache miss/hit cost ratio.
Memory Wall. Fig. Trends for a cache miss/hit cost ratio of sixteen []. (a) Very high hit rates; (b) more modest hit rates. Both panels plot average cycles per access against the cache miss/hit cost ratio.
DRAM Basics Dynamic Random Access Memory (DRAM) is used in almost all main memories, as opposed to Static Random Access Memory (SRAM), which is used in almost all caches in the memory hierarchy. The term DRAM was coined to indicate that the technology allowed any (random) access to be performed in about the same amount of time, in contrast to tape storage and disks. In spite of the name, some accesses take longer than others. Improvements in DRAMs have largely increased storage capacity, rather than increasing speed, hence the processor/memory performance gap. DRAMs are organized into internal banks, and each bank, or storage array, is accessed by specifying a row and a column address. Separate wires are used to transmit commands, addresses, and data. When the memory controller issues a row address, the specified line is loaded from the storage array into a bank of sense amplifiers (also called a DRAM page or hot row) that boosts the signal but also behaves something like a line of cache. The data in the “open” page remains valid long enough to perform many accesses, and consecutive accesses that hit in the same row happen much faster than accesses to different rows. Exploiting reference
locality can thus improve memory performance in these Fast Page Mode (FPM) DRAMs. Reading the row from the storage array is destructive, and so before another row can be accessed, the current row must be written back, which is called “closing” the row. The sense amplifiers must then be recharged before another row can be read. Memory systems can be organized with either open or closed page policies. In the former, data are left in the sense amplifiers in hopes that subsequent accesses will exhibit enough locality to be satisfied by the open row. In the latter, the active page is closed after every access. Using a closed page policy is simpler and provides more predictable access times, and thus many memory controllers close the current page after every cacheline fill. Memory performance can also be improved by increasing bandwidth and using more chips in parallel. Separate chips are referred to as ranks. One common current packaging format uses Dual Inline Memory Modules (DIMMs), which have separate electrical connections on both sides of the printed circuit boards containing the DRAM circuits. Since the basic DRAM storage technology changes little with time (a bit is simply a capacitance stored at the intersection
of wires in the array), most improvements in DRAM performance come from innovations in the interface through which the memory controller communicates with the chip. DRAM chips were originally asynchronous, requiring a handshake protocol to coordinate data transmission across the memory bus. Eliminating asynchrony (so each action must happen after a specified number of memory clock cycles) and pipelining commands allows multiple accesses to be in flight at a time in Synchronous DRAMs (SDRAMs). Extended Data Out (EDO) DRAMs allow slightly more parallelism: The data output of a read is maintained while the command for the next access is being sent. One of the biggest performance improvements comes from increasing the speed at which data are transmitted. Double Data Rate (DDR) DRAMs transmit data on both the rising and falling edges of the memory clock. DDR halves the clock cycle to provide four clock edges per memory clock cycle, and DDR divides it further. Bruce Jacob et al. [] provide a good reference for all levels of the memory hierarchy. The bottom line is that changing the DRAM interface to incorporate synchrony, pipelining of data and commands, and double data rate transmission have had a significant impact on the design of memory system back ends. In spite of these improvements, DRAMs still cannot deliver the performance required to keep cores from stalling frequently. Given that HPC applications suffer most seriously from the memory wall phenomenon, the rest of this discussion is presented in that context.
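To make the open-page effect described above concrete, the average access latency under an open-page policy can be approximated with the standard DRAM timing parameters tRCD (row activate to column command), CL (column access latency), and tRP (precharge time); the breakdown below is a simplified sketch, not a formula taken from this entry:

```latex
\[ t_{\mathrm{access}} \;\approx\; h_{\mathrm{row}}\cdot t_{CL}
   \;+\; (1 - h_{\mathrm{row}})\cdot\bigl(t_{RP} + t_{RCD} + t_{CL}\bigr) \]
```

Here h_row is the fraction of accesses that hit the currently open row. A closed-page policy instead pays roughly t_RCD + t_CL on every access, giving up the best case in exchange for the simpler, more predictable timing described above.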
HPC Software Implications of Memory Trends Many types of applications suffer from inadequate memory performance, but the trends that created conditions leading to the memory wall have historically had the largest impact on HPC applications. These trends cause two broad categories of memory challenges. A confluence of economic and technological factors drives the use of less overall memory for a given amount of computation and the use of less memory bandwidth for the delivered computation FLOPS (Floating Point Operations per Second).
Existing Model for HPC Applications Shared memory paradigms such as OpenMP, UPC (Universal Parallel C), pthreads, Global Arrays, X,
Fortress, and Charm++ are coming into greater use. Nonetheless, the predominant programming model remains single-threaded MPI (Message Passing Interface). Reduced amounts of memory combined with increased core counts cause the standard model of a single thread per MPI task to require re-examination, and the reduced bandwidth will in turn cause application programmers to redesign the algorithm decomposition of their problems. A common program structure used by MPI programs alternates between computation and communication phases. All tasks in the application synchronize via either reduction operations or explicit barriers. The tasks proceed through each of these phases in lock step. Two consequences of this model are pertinent to memory effects. These are highlighted here and explored in detail below. The first consequence is that as memory per computation core is reduced, then for a given single-threaded MPI task, the amount of data on which that task can compute is reduced. This, in turn, can potentially imply that either the data needed by the algorithm decomposition no longer fit in memory, or that the amount of computation per phase is reduced. The reduction of computation per phase causes additional load balancing challenges: Tasks spend a greater percentage of time communicating instead of doing useful computation, thereby reducing the overall efficiency of the algorithm. The second consequence is that as the ratio of memory bandwidth per FLOP shrinks, the percentage of time the application spends moving data to and from memory grows. Though communication/synchronization time is inevitable with decreasing bandwidth ratios, this limitation is imposed in part by the MPI model, which prevents the application from leveraging more efficient intra-node optimizations.
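The structure described above looks roughly like the following C++/MPI sketch (illustrative only; the function, array, and loop body are hypothetical), with every single-threaded rank alternating local computation with a global synchronization:

```cpp
// Flat, single-threaded MPI: a compute phase followed by a communication/
// synchronization phase, repeated in lock step by every task.
#include <mpi.h>

void timestep_loop(double *local_data, int n, int nsteps) {
    for (int step = 0; step < nsteps; ++step) {
        // Compute phase: each rank works only on its own portion of the data.
        double local_residual = 0.0;
        for (int i = 0; i < n; ++i) {
            local_data[i] *= 0.5;              // stand-in for real work
            local_residual += local_data[i];
        }

        // Communication phase: all ranks synchronize via a reduction.
        double global_residual = 0.0;
        MPI_Allreduce(&local_residual, &global_residual, 1,
                      MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }
}
```

With less memory per core, n shrinks, the compute phase gets shorter, and a larger fraction of each iteration is spent in the reduction, which is exactly the efficiency loss described above.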
Amount of Memory Figure shows the increasing cost of memory relative to the cost of computing. This trend has led to less available memory for a given amount of computation on large supercomputers. Although there are some technologies on the horizon that may provide a different cost point – for example, PCM (Phase Change Memory) — there will be challenges in effectively utilizing them. More importantly, even if designers can incorporate them, they will only provide a one-time improvement in the cost ratio. Overall, the community consensus holds that
Memory Wall. Fig. Memory cost trends ("GFLOPS vs DRAM price reductions": relative $/GFLOPS and memory $/MB, 2003–2010)
even in the presence of occasional improvements in the memory/computation cost ratio, the general trend will continue, and the gap will widen.
Memory Bandwidth The trends in memory bandwidth are at least as problematic. Though improvements have been made in terms of the number of pins and later balls, there is still a physical limitation to the size and thus number of off-chip connections. As with the amount of memory, there are a couple of technologies on the horizon, such as chip stacking, that have the potential to provide a one-time gain. However, as with the amount of memory, the consensus among experts indicates that the trend will continue. Though not memory bandwidth related, there is an increasing gap in communication bandwidth versus computation power. This trend leads to similar challenges. The effect of decreasing memory bandwidth per computation is examined in the next section. Future Directions U.S. Department of Energy roadmaps project that the next generation of supercomputers will see O(M) cores with the count climbing to a projected O(M) cores in the exascale era near the end of the decade (–). One implication of all these memory trends is that the core algorithms in HPC applications must be redesigned to handle not just the challenges introduced by memory trends but also the scale of the supercomputers of coming generations. Although researchers recognize the necessity of doing this, it is a daunting task, and in many areas there has not been
significant progress. More work has been applied on the programming model front. The trends provide a strong indication that most applications written in a flat (only MPI), single-threaded programming model will not be able to take advantage of the computing power that next-generation and coming exascale machines will offer. Many programmers have recognized this, and some are in the process of modifying applications. Many different avenues are being pursued, but most are based on executing more than a single software thread of execution within a node. This strategy directly attacks the challenge of the amount of memory by providing many threads of computation per given amount of memory. The most prevalent option is combining OpenMP with MPI. While this is a readily available model, researchers have already identified difficulties with the existing specification. The ability to run multiple threads within a given MPI task indirectly addresses the memory bandwidth and communication bandwidth challenges. Such a strategy releases the threads running within the MPI task from being held to MPI algorithm limitations of interaction. For example, threads can take advantage of shared memory, and (more importantly) shared L or even L caches, or other intra-chip optimizations such as atomic operations. The disadvantage of this two-tiered or hybrid programming model is its increased complexity. Programming languages such as UPC, X, Chapel, and Fortress have been introduced to address this complexity, providing an integrated programming model across the entire system. They explicitly support multiple levels of parallel execution, and recognize that memory access is non-uniform: Required data will be stored physically closer to some threads than others (and thus data locality becomes important on yet another level). It seems likely that truly addressing all these trends and the challenges they impose will benefit from a new programming paradigm. The challenge is analogous to the transition to MPI: Changes must be embraced by the community at large. Porting applications to new programming models is extremely expensive, and will only be undertaken once sufficient confidence in an emergent programming model exists. Memory trends must be addressed in both hardware and software by a unified community.
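A minimal hybrid sketch of the MPI-plus-OpenMP option discussed above (illustrative only; it assumes MPI was initialized with at least MPI_THREAD_FUNNELED support, and the function and variable names are hypothetical):

```cpp
// One MPI rank per node; OpenMP threads share that rank's memory and caches
// for the compute phase, and a single thread performs the inter-node MPI call.
#include <mpi.h>
#include <omp.h>

void hybrid_timestep(double *node_data, long n) {
    double local_sum = 0.0;

    // Intra-node parallelism: threads cooperate through shared memory,
    // with no message passing between them.
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = 0; i < n; ++i) {
        node_data[i] *= 0.5;          // stand-in for real work
        local_sum += node_data[i];
    }

    // Inter-node communication: one (larger) participant per node.
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
}
```

Compared with the flat MPI version, the reduction involves far fewer, larger participants, and the threads within a node can exploit shared caches, which is the bandwidth argument made above.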
Related Entries Buses and Crossbars Cache Coherence Chapel (Cray Inc. HPCS Language) CHARM++ Cilk Coarray Fortran Cray TE Cray Vector Computers Cray XMT Cray XT and Cray XT Series of Supercomputers Cray XT and Seastar -D Torus Interconnect Denelcor HEP Earth Simulator Exascale Computing Fortran and Its Successors Fortress (Sun HPCS Language) Functional Languages Green Flash: Climate Machine (LBNL) IBM Blue Gene Supercomputer Instruction-Level Parallelism Vector Extensions, Instruction-Set Architecture (ISA) Intel Core Microarchitecture, x Processor Family Latency Hiding MPI (Message Passing Interface) Nonuniform Memory Access (NUMA) Machines OpenMP Power Wall Processors-in-Memory PVM (Parallel Virtual Machine) Reconfigurable Computers Roadrunner Project, Los Alamos Shared-Memory Multiprocessors SoC (System on Chip) Top Transactional Memories UPC Vector Extensions, Instruction-Set Architecture (ISA)
Bibliographic Notes and Further Reading Publication of “Hitting the Memory Wall: Implications of the Obvious” inspired a flurry of responses, both in private email and in print. Most of these put forth arguments for why the memory wall did not and would
not exist (for example, see Maurice Wilkes’s “The Memory Wall and the CMOS End-Point” [], also in ACM SIGArch Computer Architecture News). A simple online search yields many such articles, and it is interesting to review the numerous then-popular assumptions and to examine which ones have held over time. Chun Liu et al. [] discuss the problem in the context of the chip multiprocessors (CMPs) that began to emerge over the past decade. Thus, in spite of the community’s original skepticism about the very existence of the problem, a few researchers have continued to address different aspects of the memory wall, keeping the term in common computer parlance. More recently, Richard Murphy [] examines memory performance (both latency and bandwidth) for traditional and emerging supercomputer applications, finding that performance for both kinds of workloads is dominated by the memory system, but that the emerging applications are even more sensitive to memory latency and bandwidth limitations than their traditional counterparts. He notes that for massively parallel processors, the memory wall manifests itself due to the limited number of independent channels to memory compared to increases in numbers of cores. Thus, the problem predicted by Doug Burger et al. [] in their article, “Memory Bandwidth Limitations of Future Microprocessors,” has been realized for this class of computers. Samuel K. Moore emphasizes Murphy’s findings with respect to informatics codes in a short IEEE Spectrum article, “Multicore Is Bad News for Supercomputers.” He quotes Sandia National Laboratory’s director of computation, James Peery: “After about eight cores, there’s no improvement." Nonetheless, the debate regarding the conclusions of both articles continues: For example, see Ars Technica [] for lengthy conversations on the memory wall for all types of CMP systems. Brian Rogers et al. [] developed an analytic model of CMP memory bandwidth to analyze constraints imposed by gap between increasing numbers of cores and limited memory bandwidth, and they used this model to assess the likely effectiveness of combinations of known techniques such as link compression, cache compression, DRAM caches, and stacked (D) caches, concluding that by using appropriate combinations of approaches to reduce main memory bandwidth
requirements, the number of cores on chip is likely to reach – or higher (for future technology generations). The philosophy behind this work is the classic one that has allowed us to address the memory wall problem as well as we have thus far: We must use every technique available, and every combination thereof, to overcome the memory wall. Rogers et al. project that in this way, CMPs four technology generations from now may support over cores. At that point, cost, power, and thermal considerations are likely to represent even larger walls. The interdependences of this confluence of problems place even greater constraints on how any individual component can be addressed: The memory wall is no longer just about latency and bandwidth, but about cost, power, area, and temperature, and thus will likely remain a prominent problem in computer system design for generations to come.
Bibliography
Burger D, Goodman J, Kägi A () Memory bandwidth limitations of future microprocessors. In: Proceedings of the rd International Symposium on Computer Architecture, Philadelphia, – May . IEEE/ACM, Los Alamitos/New York, pp –
Jacob B, Ng S, Wang D () Memory systems: cache, DRAM, disk, st edn. Elsevier/Morgan Kaufmann, Burlington/San Francisco
Liu C, Sivasubramaniam A, Kandemir M () Organizing the last line of defense before hitting the memory wall for CMPs. In: Proceedings of the th IEEE Symposium on High Performance Computer Architecture, Madrid, – Feb . IEEE, Los Alamitos, pp –
Murphy R () On the effects of memory latency and bandwidth on supercomputer application performance. In: Proceedings of the IEEE International Symposium on Workload Characterization, Boston, – Sept . IEEE, Piscataway, pp –
Ousterhout J () Why aren’t operating systems getting faster as fast as hardware? In: Proceedings of the Summer USENIX Technical Conference, June , pp –
Rogers B, Krishna A, Bell G, Vu K, Jiang X, Solihin Y () Scaling the bandwidth wall: challenges in and avenues for CMP scaling. In: Proceedings of the th International Symposium on Computer Architecture, Austin, – June . IEEE/ACM, Los Alamitos/New York, pp –
Sites R () It’s the memory, stupid. Microprocessor Rep ():–
Srivastava S () CEO interview. Wall Street Reporter, Aug
Stokes J () Analysis: more than cores may well be pointless. In: Ars Technica. Condé Nast Digital. Available http://arstechnica.com/hardware/news/// analysis-more-than--cores-may-well-be-pointless.ars, Dec.
Wilkes M () The memory wall and the CMOS end-point. Comput Archit News ():–
Wulf W, McKee S () Hitting the memory wall: implications of the obvious. Comput Archit News ():–
MEMSY Erlangen General Purpose Array (EGPA)
Mesh Hypercubes and Meshes Networks, Direct
Mesh Partitioning METIS and ParMETIS
Message Passing MPI (Message Passing Interface) PVM (Parallel Virtual Machine)
Message Passing Interface (MPI) MPI (Message Passing Interface)
Message-Passing Performance Models Bandwidth-Latency Models (BSP, LogP)
METIS and ParMETIS George Karypis University of Minnesota, Minneapolis, MN, USA
Synonyms Decomposition; Fill-reducing orderings; Graph partitioning; Load balancing; Mesh partitioning
Definition METIS and ParMETIS are serial and parallel software packages for partitioning and repartitioning large irregular graphs and unstructured meshes and for computing fill-reducing orderings of sparse matrices.
Introduction Algorithms that find a good partitioning of highly irregular graphs and unstructured meshes are critical for developing efficient solutions for a wide range of problems in many application areas on both serial and parallel computers. For example, large-scale numerical simulations on parallel computers, such as those based on finite element methods, require the distribution of the finite element mesh to the processors. This distribution must be done so that the number of elements assigned to each processor is the same, and the number of adjacent elements assigned to different processors is minimized. The goal of the first condition is to balance the computations among the processors. The goal of the second condition is to minimize the communication resulting from the placement of adjacent elements on different processors. Graph partitioning can be used to successfully satisfy these conditions by first modeling the finite element mesh by a graph, and then partitioning it into equal parts. Graph partitioning algorithms are also used to compute fill-reducing orderings of sparse matrices. These fill-reducing orderings are useful when direct methods are used to solve sparse systems of linear equations. A good ordering of a sparse matrix dramatically reduces both the amount of memory and the time required to solve the system of equations. Furthermore, the fill-reducing orderings produced by graph partitioning
algorithms are particularly suited for parallel direct factorization as they lead to a high degree of concurrency during the factorization phase.
METIS METIS is a serial software package for partitioning large irregular graphs, partitioning large meshes, and computing fill-reducing orderings of sparse matrices. METIS has been developed at the Department of Computer Science and Engineering at the University of Minnesota and is freely distributed. Its source code can be downloaded directly from http://www.cs.umn.edu/~metis, and is also included in numerous software distributions for Unix-like operating systems such as Linux and FreeBSD. The algorithms implemented in METIS are based on the multilevel graph partitioning paradigm [, , ], which has been shown to quickly produce high-quality partitionings and fill-reducing orderings. The multilevel paradigm, illustrated in Fig. , consists of three phases: graph coarsening, initial partitioning, and uncoarsening. In the graph coarsening phase, a series of successively smaller graphs is derived from the input graph. Each successive graph is constructed from the previous graph by collapsing together a maximal-size set of adjacent pairs of vertices. This process continues until the size of the graph has been reduced to just a few hundred vertices. In the initial partitioning phase, a partitioning of the coarsest, and hence smallest, graph is computed using relatively simple approaches such as the algorithm developed by Kernighan and Lin []. Since the coarsest graph is usually very small, this step is very fast. Finally, in the uncoarsening phase, the partitioning of the smallest graph is projected to the successively larger graphs by assigning the pairs of vertices that were collapsed together to the same partition as that of their corresponding collapsed vertex. After each projection step, the partitioning is refined using various heuristic methods to iteratively move vertices between partitions as long as such moves improve the quality of the partitioning solution. The uncoarsening phase ends when the partitioning solution has been projected all the way to the original graph. METIS uses novel approaches to successively reduce the size of the graph as well as to refine the partition during the uncoarsening phase. During coarsening, METIS
METIS and ParMETIS. Fig. The three phases of multilevel k-way graph partitioning. During the coarsening phase, the size of the graph is successively decreased. During the initial partitioning phase, a k-way partitioning is computed. During the multilevel refinement (or uncoarsening) phase, the partitioning is successively refined as it is projected to the larger graphs. G0 is the input graph, which is the finest graph. Gi+1 is the next level coarser graph of Gi. G4 is the coarsest graph
employs algorithms that make it easier to find a high-quality partition at the coarsest graph. During refinement, METIS focuses primarily on the portion of the graph that is close to the partition boundary. These highly tuned algorithms allow METIS to quickly produce high-quality partitions and fill-reducing orderings for a wide variety of irregular graphs, unstructured meshes, and sparse matrices. METIS provides a set of stand-alone command-line programs for computing partitionings and fill-reducing orderings as well as an application programming interface (API) that can be used to invoke its various algorithms from C/C++ or Fortran programs. The list of stand-alone programs and API routines of the current version of METIS (version .x) is shown in Table . The API routines allow the user to alter the behavior of the various algorithms and provide additional routines that partition graphs into unequal-size partitions and compute partitionings that directly minimize the total communication volume.
Partitioning a Graph METIS provides the pmetis and kmetis programs for partitioning an unstructured graph into a user-specified number k of equal-size parts. The partitioning algorithm used by pmetis is based on multilevel
recursive bisection described in [], whereas the partitioning algorithm used by kmetis is based on multilevel k-way partitioning described in []. Both of these methods are able to produce high-quality partitions. However, depending on the application, one program may be preferable to the other. In general, kmetis is preferred when it is necessary to partition graphs into more than just a small number of partitions. For such cases, kmetis is considerably faster than pmetis. On the other hand, pmetis is preferable for partitioning a graph into a small number of partitions. The API routines corresponding to the pmetis and kmetis programs are METIS_PartGraphRecursive and METIS_PartGraphKway, respectively.
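As an illustration, the sketch below sets up a small graph in the compressed adjacency format used by METIS (xadj/adjncy arrays) and asks METIS_PartGraphKway for a two-way partitioning. The five-vertex graph is invented for the example, and the argument list follows the METIS 4.x manual as best recalled here; the exact prototype and the idxtype typedef should be verified against the metis.h header shipped with the distribution.

/* Sketch: a graph in METIS' compressed adjacency format, partitioned
 * with METIS_PartGraphKway.  The argument list is as recalled from the
 * METIS 4.x manual and should be checked against metis.h; the example
 * graph itself is made up for illustration. */
#include <stdio.h>
#include <metis.h>        /* defines idxtype and the METIS prototypes */

int main(void)
{
    /* A small undirected graph: vertex i's neighbors are
     * adjncy[xadj[i] .. xadj[i+1]-1]. */
    int     n        = 5;
    idxtype xadj[]   = {0, 2, 5, 7, 9, 12};
    idxtype adjncy[] = {1, 4,  0, 2, 4,  1, 3,  2, 4,  0, 1, 3};

    int     wgtflag = 0;      /* no vertex or edge weights   */
    int     numflag = 0;      /* C-style (0-based) numbering */
    int     nparts  = 2;      /* number of parts requested   */
    int     options[5] = {0}; /* 0 => use default options    */
    int     edgecut;
    idxtype part[5];

    METIS_PartGraphKway(&n, xadj, adjncy, NULL, NULL,
                        &wgtflag, &numflag, &nparts,
                        options, &edgecut, part);

    printf("edge-cut = %d\n", edgecut);
    for (int i = 0; i < n; i++)
        printf("vertex %d -> part %d\n", i, (int)part[i]);
    return 0;
}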
Alternate Partitioning Objectives The objective of the traditional graph partitioning problem is to compute a balanced k-way partitioning such that the number of edges (or, in the case of weighted graphs, the sum of their weights) that straddle different partitions is minimized. This objective is commonly referred to as the edge-cut. When partitioning is used to distribute a graph or a mesh among the processors of a parallel computer, the objective of minimizing the edge-cut is only an approximation of the true communication cost resulting from the partitioning [].
METIS and ParMETIS. Table An overview of METIS’ command-line and library interfaces
Operation | Stand-alone program | API routine
Partition a graph | pmetis | METIS_PartGraphRecursive
Partition a graph | kmetis | METIS_PartGraphKway, METIS_PartGraphVKway, METIS_mCPartGraphRecursive, METIS_mCPartGraphKway, METIS_WPartGraphRecursive, METIS_WPartGraphKway, METIS_WPartGraphVKway
Partition a mesh | partnmesh | METIS_PartMeshNodal
Partition a mesh | partdmesh | METIS_PartMeshDual
Compute a fill-reducing ordering of a sparse matrix | oemetis | METIS_EdgeND
Compute a fill-reducing ordering of a sparse matrix | onmetis | METIS_NodeND, METIS_NodeWND
Convert a mesh into a graph | mesh2nodal | METIS_MeshToNodal
Convert a mesh into a graph | mesh2dual | METIS_MeshToDual
The communication cost resulting from a k-way partitioning generally depends on the following factors: (a) the total communication volume; (b) the maximum amount of data that any particular processor needs to send and receive; and (c) the number of messages a processor needs to send and receive. METIS’ API provides the METIS_PartGraphVKway and METIS_WPartGraphVKway routines that can be used to directly minimize the total communication volume resulting from the partitioning (the first factor). In addition, METIS also provides support for minimizing the third factor (which essentially reduces the number of startups) and for indirectly (up to a point) reducing the second factor. This enhancement is provided as a refinement option for both the METIS_PartGraphKway and METIS_PartGraphVKway routines, and is the default option of kmetis and METIS_PartGraphKway.
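The two objectives can be made concrete with a few lines of code. The helper functions below compute the edge-cut and a simple total communication volume for a graph stored in the same xadj/adjncy format as above; the particular volume definition used (number of distinct remote parts adjacent to each vertex, summed over vertices) is a common one and is stated here as an assumption, not as METIS' internal formula.

/* Illustration of the two partitioning objectives, for a graph in the
 * xadj/adjncy format used earlier. */
#include <stdlib.h>

/* Number of edges whose endpoints lie in different parts. */
long edge_cut(int n, const int *xadj, const int *adjncy, const int *part)
{
    long cut = 0;
    for (int u = 0; u < n; u++)
        for (int j = xadj[u]; j < xadj[u + 1]; j++)
            if (part[u] != part[adjncy[j]])
                cut++;
    return cut / 2;                 /* each undirected edge seen twice */
}

/* Sum over vertices of the number of distinct foreign parts adjacent
 * to that vertex -- a simple model of the data each owner must send. */
long comm_volume(int n, int nparts, const int *xadj, const int *adjncy,
                 const int *part)
{
    long vol = 0;
    char *seen = calloc((size_t)nparts, 1);
    for (int u = 0; u < n; u++) {
        for (int j = xadj[u]; j < xadj[u + 1]; j++) {
            int p = part[adjncy[j]];
            if (p != part[u] && !seen[p]) { seen[p] = 1; vol++; }
        }
        for (int j = xadj[u]; j < xadj[u + 1]; j++)  /* reset marks */
            seen[part[adjncy[j]]] = 0;
    }
    free(seen);
    return vol;
}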
Support for Multiphase and Multi-physics Computations The traditional graph partitioning problem formulation is limited in the types of applications that it can effectively load balance because it specifies that only a single quantity be balanced. Many important types of multiphase and multi-physics computations require that multiple quantities be balanced simultaneously. This is because synchronization steps exist between the
different phases of the computations, and so, each phase must be individually load balanced. That is, it is not sufficient to simply sum up the relative times required for each phase and to compute a partitioning based on this sum. Doing so may lead to some processors having too much work during one phase of the computation (and so, these may still be working after other processors are idle), and not enough work during another. Instead, it is critical that every processor have an equal amount of work from each phase of the computation. METIS includes partitioning routines that can be used to partition a graph in the presence of such multiple balancing constraints. Each vertex is now assigned a vector of m weights and the objective of the partitioning routines is to minimize the edge-cut subject to the constraints that each one of the m weights is equally distributed among the domains. For example, if the first weight corresponds to the amount of computation and the second weight corresponds to the amount of storage required for each element, then the partitioning computed by the new algorithms will balance both the computation performed in each domain and the amount of memory that it requires. The multi-constraint partitioning algorithms and their applications are further described in []. The pmetis and kmetis programs invoke the multi-constraint partitioning routines whenever the input graph specifies more than one set of vertex weights. Support for multi-constraint
partitioning via METIS’ API is provided by the METIS_mCPartGraphRecursive and METIS_mCPartGraphKway routines, which are based on the multilevel recursive bisection and the multilevel k-way partitioning paradigms, respectively.
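The following sketch shows the multi-constraint idea from the user's point of view: every vertex carries ncon weights, and a partitioning is acceptable only if each weight component is balanced across the parts. The flat layout of the weight array (ncon values per vertex, stored consecutively) mirrors common practice but, along with the helper itself, is an illustrative assumption rather than part of the METIS API.

/* Sketch of a multi-constraint balance check.  Each vertex carries
 * ncon weights stored consecutively in vwgt (vwgt[v*ncon + c]); this
 * flat layout is assumed for the example.  The function returns, for
 * each constraint c, the maximum part load divided by the average
 * part load (1.0 == perfect balance). */
#include <stdlib.h>

void constraint_imbalance(int n, int ncon, int nparts,
                          const int *vwgt, const int *part,
                          double *imbalance /* length ncon */)
{
    /* load[p*ncon + c]: total weight of constraint c in part p */
    double *load = calloc((size_t)nparts * ncon, sizeof *load);
    double *tot  = calloc((size_t)ncon, sizeof *tot);

    for (int v = 0; v < n; v++)
        for (int c = 0; c < ncon; c++) {
            load[part[v] * ncon + c] += vwgt[v * ncon + c];
            tot[c]                   += vwgt[v * ncon + c];
        }

    for (int c = 0; c < ncon; c++) {
        double avg = tot[c] / nparts, max = 0.0;
        for (int p = 0; p < nparts; p++)
            if (load[p * ncon + c] > max)
                max = load[p * ncon + c];
        imbalance[c] = (avg > 0.0) ? max / avg : 1.0;
    }
    free(load);
    free(tot);
}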
Partitioning for Heterogeneous Parallel Computing Architectures Heterogeneous computing platforms containing processing nodes with different computational and memory capabilities are becoming increasingly common. METIS’ API provides the METIS_WPartGraphRecursive, METIS_WPartGraphKway, and METIS_WPartGraphVKway routines, which are designed to partition a graph into k parts such that each part contains a prespecified fraction of the total number of vertices (or vertex weight). By matching the weights specified for each partition to the relative computational and memory capabilities of the various processors, these routines can be used to compute partitionings that balance the computations on heterogeneous architectures. Partitioning a Mesh METIS provides the partnmesh and partdmesh programs for partitioning meshes arising in finite element or finite volume methods. These programs take as input the element node array of the mesh and compute a k-way partitioning for both its elements and its nodes. METIS currently supports four different types of mesh elements: triangles, tetrahedra, hexahedra (bricks), and quadrilaterals. These programs first convert the mesh into a graph, and then use kmetis to partition this graph. The difference between these two programs is that partnmesh converts the mesh into a nodal graph (i.e., each node of the mesh becomes a vertex of the graph), whereas partdmesh converts the mesh into a dual graph (i.e., each element becomes a vertex of the graph). In the case of partnmesh, the partitioning of the nodal graph is used to derive a partitioning of the elements. In the case of partdmesh, the partitioning of the dual graph is used to derive a partitioning of the nodes. Both of these programs produce partitionings of comparable quality, with partnmesh being considerably faster than partdmesh. However, in some cases, partnmesh may produce partitions that have higher load imbalance
than partdmesh. The API routines corresponding to the partnmesh and partdmesh programs are METIS_PartMeshNodal and METIS_PartMeshDual, respectively.
Computing a Fill-Reducing Ordering of a Sparse Matrix METIS provides two programs, oemetis and onmetis, for computing fill-reducing orderings of sparse matrices. Both programs use multilevel nested dissection to compute a fill-reducing ordering []. The nested dissection paradigm is based on computing a vertex separator for the graph corresponding to the matrix. The nodes in the separator are moved to the end of the matrix, and a similar process is applied recursively for each one of the other two parts. These programs differ in how they compute the vertex separators. The oemetis program finds a vertex separator by first computing an edge separator using a multilevel algorithm, whereas the onmetis program uses the multilevel paradigm to directly find a vertex separator. The orderings produced by onmetis generally incur less fill than those produced by oemetis. In particular, for matrices arising in linear programming problems the orderings computed by onmetis are significantly better than those produced by oemetis. Furthermore, onmetis utilizes compression techniques to reduce the size of the graph prior to computing the ordering. Sparse matrices arising in many application domains are such that certain rows of the matrix have the same sparsity patterns. Such matrices can be represented by a much smaller graph in which all rows with identical sparsity pattern are represented by just a single vertex whose weight is equal to the number of rows. Such compression techniques can significantly reduce the size of the graph, whenever applicable, and substantially reduce the amount of time required by onmetis. However, when there is no reduction in graph size, oemetis is about –% faster than onmetis. Furthermore, for large matrices arising in three-dimensional problems, the quality of orderings produced by the two algorithms is quite similar. METIS’ API provides the METIS_EdgeND, METIS_NodeND, and METIS_NodeWND routines for computing fill-reducing orderings for sparse matrices. The first two routines provide the functionality
of the oemetis and onmetis programs, respectively, whereas the third routine is designed to compute fill-reducing orderings of graphs whose vertices have weights corresponding to the number of rows in the original sparse matrix with identical sparsity structure.
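The recursion underlying nested dissection can be summarized in a short sketch. Everything below is conceptual: find_vertex_separator stands in for the multilevel separator computation performed inside oemetis and onmetis, and the vset type, the SMALL cutoff, and the omission of memory management are assumptions made purely for illustration.

/* Conceptual sketch of nested dissection ordering.  The permutation
 * perm[] is filled so that, at every level, separator vertices are
 * numbered after the two halves they separate, as described above. */
#define SMALL 64                       /* assumed recursion cutoff */

typedef struct { int *v; int n; } vset;   /* a set of vertex ids */

/* hypothetical: splits verts into left/right/separator subsets */
void find_vertex_separator(const vset *verts, vset *left, vset *right,
                           vset *sep);

static void nested_dissection(const vset *verts, int *perm, int *next)
{
    if (verts->n <= SMALL) {               /* order small pieces as-is */
        for (int i = 0; i < verts->n; i++)
            perm[verts->v[i]] = (*next)++;
        return;
    }
    vset left, right, sep;
    find_vertex_separator(verts, &left, &right, &sep);

    nested_dissection(&left,  perm, next);  /* order the two halves  */
    nested_dissection(&right, perm, next);  /* recursively           */
    for (int i = 0; i < sep.n; i++)         /* separator goes last   */
        perm[sep.v[i]] = (*next)++;
}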
Converting a Mesh into a Graph METIS provides two programs: mesh2nodal and mesh2dual for converting a mesh into the graph format used by METIS. In particular, mesh2nodal converts the element node array of a mesh into a nodal graph; that is, each node of the mesh corresponds to a vertex in the graph and two vertices are connected by an edge if the corresponding nodes are connected by lines in the mesh. Similarly, mesh2dual converts the element node array of a mesh into a dual graph; that is, each element of the mesh corresponds to a vertex in the graph and two vertices are connected if the corresponding elements in the mesh share a face. The API routines corresponding to these two programs are METIS_MeshToNodal and METIS_MeshToDual, respectively. Since METIS does not provide API routines that can directly compute a multi-constraint partitioning of a mesh or compute a partitioning that minimizes its overall communication cost, these routines can be used to first convert the mesh into a graph, which can then be used as input to METIS’ graph partitioning routines to obtain such partitionings.
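For simplicial meshes (triangles and tetrahedra), where every pair of nodes within an element is joined by a mesh line, the nodal-graph construction performed by mesh2nodal amounts to the loop sketched below. The element-node array layout (npe node ids per element, stored consecutively) and the dense 0/1 adjacency output are simplifying assumptions; METIS itself uses a compressed representation and also handles bricks and quadrilaterals, whose element edges do not connect every node pair.

/* Sketch of nodal-graph construction for simplicial elements: two
 * mesh nodes become adjacent graph vertices whenever they appear in
 * the same element.  adj is an nnodes x nnodes 0/1 matrix, allocated
 * and zeroed by the caller. */
void mesh_to_nodal(int nelems, int npe, int nnodes,
                   const int *elmnts, char *adj)
{
    for (int e = 0; e < nelems; e++)
        for (int i = 0; i < npe; i++)
            for (int j = i + 1; j < npe; j++) {
                int a = elmnts[e * npe + i];
                int b = elmnts[e * npe + j];
                adj[a * nnodes + b] = 1;   /* undirected edge a-b */
                adj[b * nnodes + a] = 1;
            }
}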
ParMETIS ParMETIS is an MPI-based parallel library that implements a variety of algorithms for partitioning and repartitioning unstructured graphs and meshes and for computing fill-reducing orderings of sparse matrices.
ParMETIS has been developed at the Department of Computer Science and Engineering at the University of Minnesota and is freely distributed. Its source code can be downloaded directly from http://www.cs.umn.edu/~metis. The algorithms in ParMETIS are based on the corresponding algorithms implemented in METIS. However, ParMETIS extends METIS’ functionality by including routines that are especially suited for parallel computations and large-scale numerical simulations. The complete list of operations that ParMETIS’ routines can perform and the associated API routines of the current version of ParMETIS (version .x) is shown in Table . ParMETIS’ partitioning and repartitioning routines provide support for single- and multi-constraint partitioning and for parallel computational platforms with heterogeneous computing capabilities. Note that unlike METIS, ParMETIS does not provide any stand-alone parallel programs for parallel partitioning and ordering.
METIS and ParMETIS. Table An overview of ParMETIS’ API routines
Operation | API routine
Partition a graph | ParMETIS_V3_PartKway, ParMETIS_V3_PartGeom, ParMETIS_V3_PartGeomKway
Partition a mesh | ParMETIS_V3_PartMeshKway
Repartition a graph corresponding to an adaptively refined mesh | ParMETIS_V3_AdaptiveRepart
Refine the quality of an existing partitioning | ParMETIS_V3_RefineKway
Compute a fill-reducing ordering of a sparse matrix | ParMETIS_V3_NodeND
Convert a mesh into a graph | ParMETIS_V3_MeshToDual
Partitioning a Graph ParMETIS_V3_PartKway is the routine in ParMETIS that is used to partition unstructured graphs. This routine takes a graph and computes a k-way partitioning that minimizes the edge-cut. ParMETIS_V3_PartKway makes no assumptions about how the graph is initially distributed among the processors. It can effectively partition a graph that is randomly distributed as well as a graph that is well distributed. The parallel graph partitioning algorithm used in ParMETIS_V3_PartKway is based on the serial multilevel k-way partitioning algorithm (implemented by METIS’ kmetis program and METIS_Part
GraphKway API routine) and parallelized in [, , , ]. When coordinate information is available about the vertices of the graph (e.g., when the vertices correspond to mesh nodes or elements), ParMETIS provides two additional routines that can be used to compute the partitioning. The first is ParMETIS_V3_PartGeom, which uses an approach based on space-filling curves. This routine is very fast but produces relatively low-quality solutions. The second is ParMETIS_V3_PartGeomKway, which first computes an initial partitioning using the method based on space-filling curves, redistributes the graph according to this partitioning, and then calls ParMETIS_V3_PartKway to compute the final high-quality partitioning of the graph. Even though this approach computes two different partitionings, because the second partitioning is computed using a graph that is already well distributed among the processors, it actually takes less time. Also, the quality of the final partitioning produced by ParMETIS_V3_PartGeomKway is comparable to that computed by ParMETIS_V3_PartKway. Note that in most of ParMETIS’ partitioning routines the number of sub-domains is decoupled from the number of processors. Hence, it is possible to use these routines to compute a k-way partitioning independent of the number of processors that are used.
Partitioning Adaptively Refined Meshes For large-scale scientific simulations, the computational requirements of techniques relying on globally refined meshes become very high, especially as the complexity and size of the problems increase. By locally refining and de-refining the mesh either to capture flow-field phenomena of interest or to account for variations in errors, adaptive methods make standard computational methods more cost effective. The efficient execution of such adaptive scientific simulations on parallel computers requires a periodic repartitioning of the underlying computational mesh. These repartitionings should minimize both the inter-processor communications incurred in the iterative mesh-based computation and the data redistribution costs required to balance the load. Repartitioning algorithms fall into two general categories. The first category balances the computation by incrementally diffusing load from those sub-domains
that have more work to adjacent sub-domains that have less work. These schemes are referred to as diffusive schemes. The second category balances the load by computing an entirely new partitioning, and then intelligently mapping the sub-domains of the new partitioning to the processors such that the redistribution cost is minimized. These schemes are generally referred to as remapping schemes. Remapping schemes typically lead to repartitionings that have smaller edge-cuts, while diffusive schemes lead to repartitionings that incur smaller redistribution costs. However, since these results can vary significantly among different types of applications, it can be difficult to select the best repartitioning scheme for the job. ParMETIS provides the ParMETIS_V3_AdaptiveRepart routine for repartitioning adaptively refined meshes. This routine assumes that the mesh is well distributed among the processors, but that (due to mesh refinement and de-refinement) this distribution is poorly load balanced. ParMETIS_V3_AdaptiveRepart is based on the Unified Repartitioning Algorithm [, ] and combines the best characteristics of remapping and diffusion-based repartitioning schemes. A key parameter used by this algorithm is the ITR Factor, which describes the ratio between the time required for performing the inter-processor communications incurred during parallel processing and the time required to perform the data redistribution associated with balancing the load. As such, it allows for the computation of a single metric that describes the quality of the repartitioning. The underlying algorithm of ParMETIS_V3_AdaptiveRepart follows the multilevel partitioning paradigm but uses a technique known as local coarsening that only collapses together vertices that are located on the same processor. On the coarsest graph, an initial partitioning need not be computed, as one can either be derived from the initial graph distribution (in the case when sub-domains are coupled to processors), or else one needs to be supplied as an input to the routine (in the case when sub-domains are decoupled from processors). However, this partitioning does need to be balanced. The balancing phase is performed on the coarsest graph by using both remapping- and diffusion-based algorithms []. A quality metric for each of these partitionings is then computed (using the ITR Factor) and the partitioning with the highest quality is selected.
This technique tends to give very good points from which to start multilevel refinement, regardless of the type of repartitioning problem or the value of the ITR Factor. Note that the fact that the algorithm computes two initial partitionings does not impact its scalability as long as the size of the coarsest graph is suitably small. Finally, multilevel refinement is performed on the balanced partitioning in order to further improve its quality. Since ParMETIS_V3_AdaptiveRepart starts from a graph that is already well distributed, it is very fast. Appropriate values for the ITR Factor can easily be determined depending on the times required to perform (a) all inter-processor communications that have occurred since the last repartitioning, and (b) the data redistribution associated with the last repartitioning/load balancing phase. Simply divide the first time by the second. The result is the correct ITR Factor. In case these times cannot be ascertained (e.g., for the first repartitioning/load balancing phase), our experiments have shown that values between and , work well for a variety of situations. ParMETIS_V3_AdaptiveRepart can be used to load balance the mesh either before or after mesh adaptation. In the latter case, each processor first locally adapts its mesh, leading to different processors having different numbers of elements. ParMETIS_V3_AdaptiveRepart can then compute a partitioning in which the load is balanced. However, load balancing can also be done before adaptation if the degree of refinement for each element can be estimated a priori. That is, if we know ahead of time into how many new elements each old element will subdivide, we can use these estimations as the weights of the vertices for the graph that corresponds to the dual of the mesh. In this case, the mesh can be redistributed before adaptation takes place. This technique can significantly reduce data redistribution times.
Improving the Quality of a Partitioning ParMETIS provides the ParMETIS_V3_RefineKway routine that can be used to improve the quality of an existing partitioning. ParMETIS_V3_RefineKway can be used to improve the quality of partitionings that are produced by other partitioning algorithms. ParMETIS_V3_RefineKway can also be used repeatedly to further improve the quality
of a partitioning. However, each successive call to ParMETIS_V3_RefineKway will tend to produce smaller improvements in quality.
Partitioning Meshes ParMETIS provides the ParMETIS_V3_PartMeshKway routine for directly partitioning the elements of an unstructured (or structured) mesh. This partitioning is computed by first converting the mesh into a graph whose vertices are the elements of the mesh and then using ParMETIS_V3_PartKway to compute the actual partitioning. Note that in the case of adaptive computations, ParMETIS does not provide routines that can directly repartition adaptively refined meshes. However, an application can use the ParMETIS_V3_MeshToDual routine to construct in parallel the dual graph of the mesh, which can then be provided as input to the ParMETIS_V3_AdaptiveRepart routine. Computing Fill-Reducing Orderings of Sparse Matrices ParMETIS_V3_NodeND is the routine provided by ParMETIS for computing fill-reducing orderings of sparse matrices. The algorithm implemented by ParMETIS_V3_NodeND is similar to that implemented by METIS’ onmetis stand-alone program and METIS_NodeND routine. To achieve high performance, ParMETIS_V3_NodeND first uses ParMETIS_V3_PartKway to compute a high-quality partitioning and redistributes the graph accordingly. Next it proceeds to compute the log p levels of the elimination tree concurrently. When the graph has been separated into p parts (where p is the number of processors), the graph is redistributed among the processors so that each processor receives a single subgraph, and METIS_NodeND is used to order these smaller subgraphs.
Related Entries
Chaco
Dense Linear System Solvers
Domain Decomposition
Graph Partitioning
Hypergraph Partitioning
Load Balancing, Distributed Memory
PaToH (Partitioning Tool for Hypergraphs)
Reordering
Sparse Direct Methods
Bibliographic Notes and Further Readings There is an extensive body of research related to the topic of graph and mesh partitioning and its applications to parallel and scientific computing. Schloegel et al. [] and Devine et al. [] provide comprehensive overviews of the various methods that have been developed and their applications to high performance scientific computing. In addition, besides METIS and ParMETIS, there are a number of other tools that have been developed and are actively maintained for serial and parallel graph partitioning.
Bibliography
Devine K, Boman EG, Karypis G () Partitioning and load balancing for emerging parallel applications and architectures. In: Heroux M, Raghavan P, Simon HD (eds) Parallel processing for scientific computing. SIAM
Hendrickson B () Graph partitioning and parallel solvers: has the emperor no clothes? In: Proceedings Irregular , Berlin
Karypis G, Kumar V () Multilevel algorithms for multi-constraint graph partitioning. In: Proceedings of Supercomputing , New York
Karypis G, Kumar V () Multilevel k-way partitioning scheme for irregular graphs. J Parallel Distrib Comput ():–
Karypis G, Kumar V () A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. J Parallel Distrib Comput ():–
Karypis G, Kumar V () A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Scientific Comput ():–
Karypis G, Kumar V () A coarse-grain parallel multilevel k-way partitioning algorithm. In: Proceedings of the eighth SIAM conference on parallel processing for scientific computing, Minneapolis, MN
Kernighan BW, Lin S () An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J ():–
Schloegel K, Karypis G, Kumar V () Parallel multilevel algorithms for multi-constraint graph partitioning. In: Proceedings of Europar . Distinguished Paper Award
Schloegel K, Karypis G, Kumar V () A unified algorithm for load-balancing adaptive scientific simulations. In: Proceedings Supercomputing ’, Santa Fe, New Mexico
Schloegel K, Karypis G, Kumar V () Wavefront diffusion and LMSR: algorithms for dynamic repartitioning of adaptive meshes. IEEE Trans Parallel Distrib Syst ():–
Schloegel K, Karypis G, Kumar V () Parallel static and dynamic multi-constraint partitioning. Concurrency Comput Pract Ex :–
Schloegel K, Karypis G, Kumar V () Graph partitioning for high performance scientific simulations. In: Dongarra J, Foster I, Fox G, Kennedy K, White A, Torczon L, Gropp W (eds) Sourcebook of parallel computing. Morgan Kaufmann
Metrics Allen D. Malony University of Oregon, Eugene, OR, USA
Synonyms Performance metrics
Definition A metric is a quantitative measure used to characterize, evaluate, and compare parallel performance. Many performance metrics are defined based on the variety of performance data and other information that can be obtained from a measurement of a parallel execution. Metrics can be absolute as well as derived from other metrics. Parallel metrics are used to understand the relationship of scalability and performance.
Discussion Introduction Performance is at the heart of parallel computing. Qualitatively, performance embodies those aspects that motivate parallel computing research and development, such as delivering greater speed and solving larger problems. However, for purposes of scientific and engineering discussion, it is necessary to describe performance in more quantitative terms. Metrics make it possible to characterize, evaluate, and compare performance based on methods for performance data measurement and analysis. Metrics are determined from performance measures. Where a measure constitutes a quantity or degree of performance, and implies a means for ascertaining (measuring) it, a metric is a “standard of measure,” a semantic representation of performance. Metrics are the language of parallel performance science. Metrics are mostly determined by the variety of performance data that can be obtained in measurements,
reflecting different features of interest in a parallel execution. Metrics can be of different types, depending on the purpose for their use. Where do metrics come from? What are the common metrics used in the parallel computing community? How are metrics applied? These are some of the questions discussed below.
Standard Metrics Performance metrics are based on measures of parallel operation. Two questions naturally follow. What are the features of parallel operation to quantify? How are those features measured in a real parallel program? The term “standard metric” is used here to reflect common performance data from measurement of parallel program execution.
Time Metrics Parallel computing is generally concerned with reducing the amount of time necessary to perform a computational task, whether that be the execution of a single parallel application or the running of many jobs together on a parallel machine. Thus, the most fundamental performance metric of interest to parallel computing is execution time. A time metric is calculated in different ways depending on what is available for time measurement, and there can be different types of execution time to observe. Time measurement requires a clock source (a clock reference), and different sources have different resolution and accuracy characteristics. Wallclock time regards the passage of real time as a clock basis, as one would use a clock on the wall. It is often used to quantify the actual time taken for a parallel execution to complete, from start to finish. One could use a watch to time a parallel program, but all operating systems provide some form of real time clock to keep track of time. The real time clock is created from a hardware source and can have a high degree of resolution. However, there are different components of time reflecting different states of execution. Operating systems normally keep track of CPU time, I/O time, and the time spent idle for a process. CPU time is typically broken down into user and system time. User time is the time the process spends in execution of the program code and system time is the time the operating system spends executing on behalf of the process. The CPU time measure (with user and system components)
is commonly referred to as process virtual time since it is maintained on a process basis and the virtual clock advances only when the process is executing in either user or system mode. Whereas it is helpful to see the breakdown of time for each individual process or thread of a parallel program, process virtual time cannot serve as a uniform time reference for the whole parallel execution, primarily because it ignores periods when the process is idle. In general, it is important to have a single, real time clock reference for measuring time on all processes to determine accurately the time from the beginning of the parallel computation to the end. The real time clock is also necessary to allow performance measurements of interactions between processes (threads). In practice, synchronized clocks are hard to achieve in a parallel machine, requiring local real time clocks to be used with software-based time correlation. In summary, time metrics can apply to the parallel program as a whole, in which case it is necessary to have a common time basis. Total execution time is the metric used most often for reporting the performance of a parallel execution. It represents the elapsed time from the beginning of the parallel computation to the end. However, further breakdown of execution time across the processes/threads of the parallel program can be important for performance analysis, especially of synchronization and communication operations. Multiple methods of measuring time are used to compute time metrics of concern. Note that it is also useful to sum time metrics from individual processes to get a single metric representing total time consumed overall. For instance, adding the time processes spend doing real computation could represent total work. Similarly, adding the time processes spend in synchronization and communication could represent total overhead.
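The distinction between wall-clock time and process virtual (CPU) time is easy to see with the small C program below, which times a routine that both computes and idles. The clock sources shown (clock_gettime with a monotonic clock and the C library's clock) are one common POSIX choice among several; the workload itself is a placeholder.

/* Wall-clock vs. CPU (process virtual) time for a single process. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void do_work(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 50000000L; i++)   /* burns CPU time */
        x += (double)i;
    sleep(1);                              /* idle: wall time advances,
                                              CPU time (mostly) does not */
}

int main(void)
{
    struct timespec w0, w1;
    clock_gettime(CLOCK_MONOTONIC, &w0);   /* wall-clock reference */
    clock_t c0 = clock();                  /* process CPU time     */

    do_work();

    clock_t c1 = clock();
    clock_gettime(CLOCK_MONOTONIC, &w1);

    double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9;
    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    printf("wall = %.3f s, cpu = %.3f s\n", wall, cpu);
    return 0;
}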
OS and Hardware Metrics Time is not the only metric used to characterize parallel execution. The operating system and processor hardware provide performance data that reflect how a program interacts with the parallel machine. From the operating system standpoint, it is possible to see information about virtual memory, paging, I/O, and networking. OS metrics typically capture performance data at the process level as counts of interesting events and other quantitative measures.
The availability of counters in modern computing hardware has made it possible to generate performance metrics about the operation of the processor, memory system, and other hardware components. All CPUs today have counters available covering a range of functionality from tracking instructions executed, to measuring cache operation, to basic timing. Access to the counters is made possible through the instruction set architecture, at the low level, and libraries such as the Performance API (PAPI) [], at higher levels. Using the counter interfaces, it is possible to calculate metrics such as the number of floating point operations (flops), cache misses at a given level (data, instruction), translation lookaside buffer misses, and so on. OS and hardware metrics are measured locally in a parallel system. For some, it makes sense to accumulate the performance data to compute a total value, as a way to convey an amount of work, such as in the case of total flops, or an average value, to elucidate an overall parallel inefficiency, as in the case of average cache misses per processor. Clearly, flops (typically reported as millions or billions of flops, Mflops or Gflops) is an important metric since it is used commonly to represent the amount of computational work in scientific applications.
Derived Metrics The types of metrics above one might regard as raw metrics since they come directly from measurement. New metrics can be calculated from other metrics to better characterize performance. A derived metric is the result of an arithmetic metric expression. The best known derived metric is the flops per second rate (also referred to by the acronym flops), determined by dividing the total number of floating point operations performed by the execution (summed over all processes) by the total execution time. The flops rate is the primary derived metric reported for high-performance scientific computing. Rate-based derived metrics are easily calculated from any raw metric. Any OS or hardware counter can be used to determine informative rate metrics, such as instructions per second, cache misses per second, or page faults per second. Sometimes it is necessary to instrument the program code to determine certain values. For instance, sustainable memory bandwidth and communication bandwidth in gigabytes per second
depends on knowing how much data is being referenced or transferred. However, the derived metric expression can be anything. One important derived metric is the computation-to-communication ratio, computed in two ways:
1. The number of calculations (determined by some computation metric) divided by the amount of communication (measured as a number of messages or as bytes)
2. The time spent calculating divided by the time spent communicating
The computation-to-communication ratio is helpful in gauging the efficiency of a parallel execution. Lower ratios suggest more communication overhead in the parallel execution. The ratio could also reflect the differences in performance of the processing and communications infrastructure. Improving communication network bandwidth and latency would result in an increase in the ratio. This type of derived metric could be called a relative metric since it is describing one metric with respect to another, generally as a ratio. If time is the denominator, the relative metric becomes a rate. Other expressions for deriving metrics might help in the aggregation or categorization of more specific performance, such as in summing the time spent in all MPI routines into a single MPI class.
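A few representative derived metrics are shown below as plain arithmetic over raw measurements that are assumed to have been collected already; the struct fields and names are illustrative, not a standard interface.

/* Simple derived metrics computed from already-gathered raw values. */
typedef struct {
    double flops;        /* floating point operations, summed over processes */
    double total_time;   /* elapsed wall-clock time of the run               */
    double comp_time;    /* summed time spent computing                      */
    double comm_time;    /* summed time spent communicating                  */
    double comm_bytes;   /* total bytes sent                                 */
} raw_metrics;

double flops_rate(const raw_metrics *m)               /* flop/s             */
{ return m->flops / m->total_time; }

double comp_to_comm_ratio_time(const raw_metrics *m)  /* variant 2 above    */
{ return m->comp_time / m->comm_time; }

double comm_bandwidth(const raw_metrics *m)           /* bytes per second   */
{ return m->comm_bytes / m->comm_time; }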
Parallelism Metrics The metrics discussed above can be thought of as describing aspects of parallel execution performance where the parallelism degree, problem size, and other parameters are fixed. Such an execution instance is typically called a performance experiment. Obviously, standard metrics from multiple performance experiments can be captured and compared. In this regard, new comparative metrics can be created to highlight certain cross-experiment performance aspects. By definition, each performance experiment reflects differences in how the experiment was constructed, on what platform it executed, the input data sets, and many other parameters of performance interest. A comparative metric can analyze performance along one or several parameter dimensions. A simple comparative metric would consider how a standard or derived performance metric varies for
changes in a single parameter. Consider varying the amount of calculations performed by a parallel program by increasing the size of the problem. The number of parallel processes used will remain fixed. This is a type of performance study called problem size scaling. After running the performance experiment for each problem size, the standard performance metrics can be compared with respect to the problem size choices. One problem size could be chosen as the base, for instance, and performance metrics for the other sizes normalized to the base metrics. Since the parallelism degree is fixed, the result of the comparative analysis gives insight to effects on performance due to the amount of computation, memory allocation, and other factors dependent on problem size. The term “parallelism metrics” will be used generally to characterize performance effects as a result of changes in parallel execution parameters. As comparative metrics, parallelism metrics have implicit and explicit relevance to how parallelism is manifesting in performance variation.
Speedup A fundamental parallelism metric is speedup. Simply put, speedup expresses the performance improvement as parallelism increases. If Ts is the sequential execution time of a program and Tp is the execution time when the program is run in parallel on p processors, speedup is the simple expression: S = Sp = Ts/Tp. The speedup metric conveys performance improvement when S > 1 and degradation when S < 1. Ideal speedup (also known as ideal scaling) is defined by Sp = p. Superlinear speedup is obtained when Sp > p. However, there are two well-known speedup formulations, determined by how execution time is modeled. The first is taken from Amdahl’s law [], which is a method to find the maximum expected improvement to an overall system when only part of the system is improved. If the number of processors is increased for use in the parallel portion of an application’s computation, Amdahl’s law is a speedup expression that identifies the theoretical maximum performance improvement possible. Let Ts be the execution time of a sequential computation. Let α be the fraction of the computation that cannot be parallelized. If there are p processors, then the time of the parallel computation, Tp, can be written
as Tp = α ∗ Ts + (1 − α) ∗ Ts/p, assuming the parallel execution is ideal and does not result in any overheads. Now the speedup equation can be written as: S = Ts/Tp = 1/(α + (1 − α)/p). Notice that as p → ∞, S is bounded by 1/α. In other words, the sequential portion of the computation limits the speedup that can be achieved. Speedup formulated in this manner is also known as strong speedup (or strong scaling), since the size of the computational work to be performed is defined by the sequential execution and does not change. The sequential fraction becomes a dominant, strong factor in the speedup metric. There was concern in the parallel computing community that large-scale parallel performance would be hard to achieve since even tiny portions of sequential execution place severe constraints on the actual speedup that can be obtained. For example, a sequential fraction of just 1% (α = 0.01) limits the speedup to at most 100. Suppose instead that the computational work is allowed to increase with increasing parallelism degree. This is the basis of Gustafson’s Law []. Let n be the problem size, p be the number of processors, and Tp be the execution time on a parallel computer. We can write Tp = a(n) + b(n), where a(n) is the part that executes sequentially and b(n) is the part that executes in parallel. To determine execution time on a single processor, we just multiply the parallel component by the number of processors used: Ts = a(n) + p ∗ b(n). (Again, overheads are assumed to be zero.) If Tp is normalized to 1, a(n) and b(n) then represent the sequential and parallel fractions of the total computation. The speedup equation can be written as: S = Ts/Tp = Ts = a(n) + p ∗ (1 − a(n)). Now assume increasing problem size. If the serial portion, a(n), diminishes (as a fraction of the total computation) as n increases, speedup will be bounded by the number of processors, p: S → p as n → ∞. The term “weak speedup” (or “weak scaling”) is used to describe this speedup metric based on Gustafson’s Law. The idea is that good speedups can be achieved with greater numbers of processors, as long as the amount of (relative) parallel work being done can be increased. This tends not to be difficult to do for many applications. As a result, the community became more confident that massively parallel computer systems could be productively utilized.
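The two formulations can be written out directly, which makes the contrast easy to explore numerically; the serial fraction of 0.01 used in the driver is just an example value.

/* Strong (Amdahl) vs. weak (Gustafson) speedup, as derived above. */
#include <stdio.h>

/* Strong speedup: serial fraction alpha of a fixed problem. */
double amdahl(double alpha, int p)
{
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

/* Scaled speedup: a is the serial fraction of the parallel run,
 * with the parallel part grown to occupy p processors. */
double gustafson(double a, int p)
{
    return a + p * (1.0 - a);
}

int main(void)
{
    for (int p = 1; p <= 1024; p *= 4)
        printf("p=%4d  amdahl(0.01)=%7.2f  gustafson(0.01)=%8.2f\n",
               p, amdahl(0.01, p), gustafson(0.01, p));
    return 0;
}

For a fixed serial fraction of 1%, the strong-scaling speedup saturates near 100 while the scaled speedup keeps growing with p, which is exactly the point of Gustafson's argument.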
Efficiency Speedup metrics are calculations based on time metrics. How does speedup itself change as a result of parallelism degree? An efficiency metric can be defined as the ratio of speedup to the number of processors used to obtain it: Ep = Sp/p. Efficiency is an example of a utilization metric. In essence, it represents a return on investment in parallelism. For instance, a parallel program might achieve a substantial speedup, but only by using a far larger number of processors to do so, in which case the efficiency is low. Efficiency can change as a result of any performance factor that is influenced by parallelism degree. Returning to problem size, suppose performance experiments are conducted with varying problem size and parallelism. When all experiments are completed, efficiency can be computed for each. To quantify the effect of changing problem size on parallel program efficiency, an isoefficiency function can be computed. Given a chosen level of efficiency, its purpose is to specify what size of problem must be solved on a given number of processors. Isoefficiency was created as a technique to analyze the scalability of parallel algorithms [].
Metric Use Metrics form the basis of performance evaluation methodology and are used for purposes of benchmarking, characterization, performance problem diagnosis, and optimization. The particular performance evaluation problem will determine which metrics are best to use. Sometimes convention defines the metric(s) of choice.
Peak Metrics It has been standard practice to rate parallel machine performance by peak speed, usually defined for scientific computing in terms of maximum flops per second a parallel machine can possibly deliver. It is computed by summing up the maximum performance of every processor that can be executing in parallel. Of course, peak speed is never attained by a parallel application, since there are overheads involved and peak speed assumes only certain instructions are used. However, it has been applied as an upper-bound threshold for evaluating parallel execution efficiency. The term “percentage of peak” represents an efficiency metric where the total attained flops per second from a parallel application is divided
by peak flops. Because of the strong influence of processor components and system architecture, percentage of peak can differ from machine to machine for the same application. Clearly, different applications have different parallelism behavior and thus will vary in percentage of peak on the same machine. In general, it is not uncommon to see low percentages of peak for many parallel applications.
Benchmark Metrics Benchmarking is the process of conducting controlled experiments using a suite of programs to test performance factors associated with machine platform, compilers, libraries, and other parameters, including parallelism degree and problem size. In contrast to just comparing performance to some absolute measure, benchmarking is intended to allow more informed evaluation with respect to program properties. Benchmark codes are designed to exercise certain execution aspects, whether it be processor, memory, or networking performance, as well as to be representative of application code behavior. There is a long history to benchmarking of parallel machines, particularly for high-performance computing. The most common metric used in scientific benchmarking is again flops per second, or just flops for short. It is the primary metric in the LINPACK benchmark suite [] and the basis for the High-Performance LINPACK (HPL) benchmark [] version used to rank supercomputers in the TOP500 [] list. The LINPACK benchmarks measure a system’s floating point computing power when solving a dense N×N system of linear equations, Ax = b, using Gaussian elimination with partial pivoting. The number of floating point operations is known, (2/3)N³ + 2N², so only the execution time needs to be measured to determine flops. To rank systems for the TOP500, HPL is run for different matrix sizes to find the maximal performance, Rmax, at an Nmax × Nmax matrix size. The peak performance of the system, Rpeak, is also reported. In addition to these, the metric N1/2 is calculated as the matrix dimension at which half of the performance (Rmax/2) is achieved. This metric is reminiscent of the famous n1/2 metric created by Hockney and Jesshope [] to convey the efficiency of vector pipelined architectures. Here n1/2 refers to the vector length at which half of the theoretical peak vector pipeline performance is achieved.
Larger values mean the vector hardware has a harder time amortizing the pipeline startup costs. There has been controversy over the years in the use of flops as a single overall parallel performance metric. Some argue that it overemphasizes the importance of floating point operations and obscures other factors more relevant to the resulting performance. There have also been questions about the use of LINPACK as a benchmark for ranking HPC machines because it represents only a particular type of parallel code. Such is the nature of benchmarking. Other benchmarks have been created to address these concerns. The NAS parallel benchmarks [] and the SPEC HPC benchmarks [], among others, provide a variety of parallel codes for evaluation. They support base experiments and tuned experiments where different optimizations are allowed. The focus is on scientific computing and the primary metric is still flops. However, they also calculate what might be called workload metrics as a function of the different benchmark results, for example, a geometric mean in the case of SPEC HPC. The HPC Challenge benchmark [] is different in that it was created to measure a broader spectrum of factors for performance evaluation. Certainly, flops is a metric for certain HPCC codes, but memory and communication bandwidth metrics are targeted with others, to gain a better sense of the capacity of the parallel system. To drill down more on memory performance, one benchmark code calculates billions of integer random updates to memory per second, called GUPS. Random memory performance is often a limiter to overall application performance because locality of reference is poor.
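As a small worked example of the HPL metric mentioned above, the helper below converts a matrix size and a measured run time into Gflop/s using the fixed operation count (2/3)N^3 + 2N^2; the N and time in the driver are hypothetical.

/* HPL-style flop rate: the operation count is fixed by the algorithm,
 * so only the measured run time is needed. */
#include <stdio.h>

double hpl_gflops(double n, double seconds)
{
    double flop = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    return flop / seconds / 1e9;
}

int main(void)
{
    /* e.g., a hypothetical N = 100000 solve that took 1800 s */
    printf("%.1f Gflop/s\n", hpl_gflops(100000.0, 1800.0));
    return 0;
}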
Tuning Metrics Metrics provide a means to characterize what, how, and where performance is manifested in parallel execution. Derived and parallelism metrics further quantify performance factors and give means to evaluate parallelism effects. As the goal of parallel computing is to deliver performance returns, there is strong interest in optimizing applications. Tuning the performance of a parallel application is an iterative process consisting of several steps involving the identification and localization of inefficiencies, instigating their repair, and validating performance improvement.
The idea of tuning metrics is to represent the inefficiencies that result in performance degradation so that the critical ones can be targeted and tuning methods can be selected. It is typical to speak of these inefficiencies as performance bottlenecks and of the process of characterizing and prioritizing bottlenecks as bottleneck analysis. Depending on the parallel computational model and parallel programming approach (e.g., shared memory multi-threading or distributed memory message passing), different performance bottlenecks will result. The goal is to find tuning metrics that help in bottleneck analysis. For instance, tuning methods for task-based parallel programs might apply critical path analysis techniques to find the path of parallel execution events that takes the longest time to execute. If optimizations can be made along this critical path to reduce its overall execution time, the parallel performance overall will be improved. However, the critical path may change as a result, meaning that any further improvements with respect to the original path will be ineffective. Critical path analysis can be applied to multi-threaded and message passing programs []. Tuning metrics include synchronization overheads, communication delays, and other costs associated with parallel execution. Ordering these metrics by severity can guide where optimization focus should be applied. Optimal large-scale parallel performance depends strongly on minimizing waiting time across the parallel system. Here, load balance is an important metric to consider, but it is a difficult one to actually measure directly. It should be kept in mind that performance metrics used for tuning represent complex interactions between multiple performance factors at the algorithm, software, hardware, and system levels. Diagnosing performance problems requires metrics to identify sources of inefficiencies, and applying optimizations can change performance effects, requiring further tuning iterations.
Summary
Parallel computing is an endeavor to improve performance through the use of parallel resources. To evaluate, characterize, diagnose, and tune parallel performance, it is necessary to quantify performance information in forms suited to critical review. This is the role of metrics.
Related Entries
Amdahl's Law
Benchmarks
Gustafson's Law
Performance Analysis Tools
Bibliography
Amdahl GM () Validity of the single-processor approach to achieving large-scale computing capabilities. In: Proceedings of the American Federation of Information Processing Societies Conference. AFIPS Press, New Jersey, pp –
Bailey D et al () The NAS parallel benchmarks. NAS Technical Report RNR--, NASA Ames Research Center, Moffett Field, March
Browne S, Dongarra J, Garner N, Ho G, Mucci P () A portable programming interface for performance evaluation on modern processors. Int J High Perform Comput Appl ():–
Dongarra J, Bunch J, Moler C, Stewart P () Linpack benchmark. http://www.netlib.org/linpack/
Grama AY, Gupta A, Kumar V () Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Concurr ():–
Gustafson JL () Reevaluating Amdahl's Law. Commun ACM ():–
Hockney () The science of computer benchmarking. SIAM, Philadelphia
HPCS HPC Challenge () HPC Challenge benchmark. http://icl.cs.utk.edu/hpcc/
Meuer H, Dongarra J, Strohmaier E, Simon H () Top500. http://www.top500.org/
Petitet A, Whaley R, Dongarra J, Cleary A () HPL – a portable implementation of the high-performance Linpack benchmark for distributed-memory computers. http://www.netlib.org/benchmark/hpl/
Standard Performance Evaluation Corporation () SPEC's benchmarks and published results: high performance computing, OpenMP, MPI. http://www.spec.org/benchmarks.html/
Yang CQ, Miller B () Critical path analysis for the execution of parallel and distributed programs. In: 8th International Conference on Distributed Computing Systems. IEEE Computer Society Press, San Jose, California, pp –
Microprocessors
AMD Opteron Processor Barcelona
DEC Alpha
EPIC Processors
IBM Power Architecture
Intel Core Microarchitecture, x86 Processor Family
Superscalar Processors
MILC Steven Gottlieb Indiana University, Bloomington, IN, USA
Synonyms MIMD lattice computation
Definition MILC is an acronym for MIMD Lattice Computation and is both the name of a long enduring physics collaboration, that is, the MILC Collaboration, and the code and the lattice gauge configurations that the MILC collaboration makes publicly available, that is, the MILC code and the MILC configurations, respectively.
Discussion History of the Collaboration The MILC collaboration was established in . The original members of the collaboration were Claude Bernard, Thomas DeGrand, Carleton DeTar, Steven Gottlieb, Alex Krasnitz, Michael Ogilvie, Robert Sugar, and Doug Toussaint. Many postdocs and students have joined the collaboration, and more senior members joining later include Urs Heller, Jim Hetrick, and James Osborn. Additional members can be found at the collaboration Web site [] or by consulting the list of publications [].
Physics Goals
Physicists know of four fundamental forces in Nature. They are gravitation, the electromagnetic force, the weak force, and the strong force. The Standard Model of elementary particle physics describes the latter three forces as gauge theories []. For a gauge theory, the forces are determined by a symmetry group. The electromagnetic symmetry group is called U(1) and the theory is called quantum electrodynamics (QED). The charge of the electron is small and allows an expansion in powers of the charge. This expansion is called perturbation theory. The electromagnetic and weak forces are closely related and the combined theory, called the
electroweak theory, is described by the symmetry group SU(2) × U(1). The combined theory can also be treated by perturbation theory. Applications of perturbation theory to QED date back many decades. For the strong force, the symmetry group is SU(3) and the theory is called quantum chromodynamics (QCD). Because this force is so strong, perturbation theory is not able to treat many of the interesting phenomena. The MILC Collaboration studies quantum chromodynamics (QCD) using the technique of lattice gauge theory. Lattice gauge theory was formulated by Kenneth Wilson in 1974 []. The lattice formulation of a quantum field theory allows a nonperturbative treatment of the theory, which is necessary for QCD and other theories with a strong coupling. According to QCD, matter consists of spin-1/2 particles that are called quarks. There are six different types of quarks and they bind together in combinations of three quarks (called baryons) or as quark–antiquark pairs (called mesons). The MILC collaboration studies the spectrum and decay properties of such bound states []. MILC also studies the properties of strongly interacting matter at very high temperature as it existed in the early universe very shortly after the Big Bang.
Characteristics of Lattice QCD as a Computational Problem
Lattice gauge theories are generally formulated on a regular hypercubic grid of points in space-time. Denote the grid size as Ns³ × Nt, where there are Ns grid points for each spatial direction and Nt grid points in the time direction. The spacing between grid points is denoted a, using units in which the speed of light c = 1. To get accurate results, the limit a → 0, called the continuum limit, must be taken. The variables that describe the quarks are multicomponent fields located at the grid points. There are several different ways to formulate the theory for quarks. The most compact method, called staggered or Kogut–Susskind quarks [], has three complex components: χi, where i = 1, 2, 3. The three components correspond to the symmetry group SU(3) of QCD. Another formulation, used in the original paper of Wilson [], more closely mimics the original Dirac formulation of fermions used in QED and has a second spinor index on the quark field: ψαi, where α = 1, 2, 3, 4 and i is as before. The MILC code supports calculations with either type of field.
The variables that describe the gauge fields are called gluons in QCD and are the analog of the photon in QED. In the continuum theory, these variables are elements of the Lie algebra of the symmetry group; however, in the lattice theory, elements of the symmetry group itself are used. In the case of QCD, the gauge fields are 3 × 3 complex, unitary matrices with unit determinant, that is, elements of SU(3). There is one such element for each link connecting nearest neighbor grid points. If x is a grid point in space-time and μ̂ is a unit vector in one of the four directions, then denote the matrix as U(x, x + μ̂) or Uμ(x). There is a very important property of these variables that U(x, y) = U†(y, x), where † denotes the adjoint or transpose conjugate. Thus, links where y is a neighbor in a negative direction from x can be expressed in terms of a matrix based at y. That is, U(x, y = x − μ̂) = U†(y, x) = U†(y, y + μ̂) = Uμ†(y). Thus, only the matrices corresponding to links oriented in the positive directions are stored. To summarize, at each grid point, four 3 × 3 complex matrices that describe the gluon or gauge fields of QCD are stored. The quarks are described by 3- or 3 × 4-component complex fields. In Nature, there are six different types of quarks labeled u, d, c, s, t, and b. This label has been given the fanciful name flavor. The lightest three quarks u, d, and s are most important in the calculations because they are light enough that they can spontaneously appear and disappear from the vacuum as quantum fluctuations. Quark flavors whose vacuum fluctuations are included in the calculation are said to be dynamical quarks. The t and b quarks are much too heavy for this effect to be significant for them. Calculations including a dynamical c quark are just starting to be explored. A quark or antiquark that appears in a bound state whose properties are studied is called a valence quark. All the quarks except for t have been treated as valence quarks. (The t quark decays too quickly to form bound states.) Special techniques are needed to treat the valence c and b quarks because they are so much heavier than the u, d, and s quarks []. In lattice QCD the partial differential equations of the continuous theory become finite difference equations on the grid of space-time points. The computations are very easy to vectorize or to parallelize. The MILC code is designed for parallelization via domain
decomposition [, ]. Contiguous blocks of grid points are assigned to each node or core of a parallel computer. The MILC code has only one level of hierarchy at present. Below, the word node is used for what is generally each MPI process or core of a job. Each node will be responsible for updating the variables that are stored at all the grid points "owned" by that node. Because of the finite difference equations, updating a variable stored at a site will require knowledge of variables stored at neighboring sites. However, there is one important feature of the gauge symmetry that is essential to QCD. In the finite differences that appear in place of derivatives, one cannot just subtract the values of fields at neighboring grid points; one must first "parallel transport" a field to a neighboring site by multiplying by the gauge matrix U corresponding to the link joining the sites. Thus, χ(x) − χ(y) is not valid, and one would need to compute χ(x) − Uμ(x)χ(y), if y = x + μ̂, in order to maintain the gauge symmetry of the theory. In practice, this means there are many instances of multiplication of 3- or 3 × 4-component vectors by 3 × 3 matrices. This allows for some data reuse in what is otherwise a sparse problem. It also encourages the programmer to develop a library of routines that operate on matrices and vectors. The other issue of dealing with finite differences and domain decomposition is that it is necessary to take into account the possibility of a neighboring value being off-node. In the MILC code, this is dealt with via messaging and setting of pointers. The programmer is shielded from the bother of distinguishing between on-node and off-node neighbors. Three routines are used when accessing values from neighboring grid points. They are start_gather, wait_gather, and cleanup_gather. The initial development of the MILC code was done on the Intel iPSC/860 and Ncube 2. These machines had long latencies compared with some of the other parallel machines available then. Thus, it was impractical to fetch a neighboring value for a single grid point. These calls are designed to fetch all of the off-node values in a particular direction for a needed field. Actually, because of red-black checkerboarding for many operations, there is also a choice of fetching only from red sites, black sites, or all sites. The code uses the notion of parity rather than checkerboard color, and that notation is used below. An even site is one for which the coordinates of the grid point obey (x + y + z + t) mod 2 = 0.
The other sites are odd. Since the same operations are going to be carried out on each site of the grid (or each site of one parity), it is efficient to get all off-node values with a single message from any other node that “owns” (some of) the required data. The start_gather routine checks to see which data is needed from off-node, and which data this node owns that will be required by other nodes. It allocates buffers for each of the messages that are expected from other nodes and posts the receives. It gathers up the data that will be required by each neighboring node and sends it off. The routine also sets up a pointer for each grid point that either points to the normal on-node location of an on-node neighbor value, or to a location in the message buffer that will contain a message with the data from a neighboring node. Before the neighboring values are accessed through the pointer, the programmer must wait for the wait_gather routine to return. If possible, other useful work can be done before the call to wait_gather in order to overlap communication and computation. The routine cleanup_gather frees the buffers allocated for the messages. Early on it was recognized that a significant amount of time was spent setting up the pointers, so restart_gather was added to repeat a previous gather reusing the same buffers thus eliminating the need to recalculate the pointers or allocate space for the buffers. (This call is now obsolete, and the feature is subsumed by another routine that hides this from the application programmer.) Through the use of the routines described above, the application programmer is freed from the burden of dealing with calls to MPI or knowing which grid points have off-node neighbors or which node contains the offnode neighbor. All the information about the domain decomposition is contained in a set of layout routines used in the initial setup routine that allocates the space for the grid of sites. The number of nodes on which a job will run is determined at run time rather than being fixed at compile time. In the original code, a site structure for each grid point contained all of the needed physical variables and some extra information about the layout. The setup routine allocates an array of sites called lattice.
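A minimal sketch of this gather idiom, applied to the covariant difference χ(x) − Uμ(x)χ(x + μ̂) discussed above, is shown below using the data types and macros introduced in the following sections. The field names chi and dchi and the vector routines mult_su3_mat_vec and sub_su3_vector are used here purely for illustration and follow MILC naming conventions; the parity constant EVENANDODD and the exact argument order of the gather call match the plaquette example later in this entry but may differ in detail from the distributed code.

/* dchi(x) = chi(x) - U_mu(x) * chi(x + mu-hat) on every site of this node */
msg_tag *tag;
register int i;
register site *s;
int mu = XUP;              /* gather from the neighbor in the +x direction */
su3_vector tvec;

tag = start_gather( F_OFFSET(chi), sizeof(su3_vector),
                    mu, EVENANDODD, gen_pt[0] );
/* other useful work could be overlapped with the communication here */
wait_gather(tag);

FORALLSITES(i,s){
    /* parallel transport the neighbor value with the link matrix */
    mult_su3_mat_vec( &(s->link[mu]), (su3_vector *)(gen_pt[0][i]), &tvec );
    /* dchi = chi - transported neighbor */
    sub_su3_vector( &(s->chi), &tvec, &(s->dchi) );
}
cleanup_gather(tag);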
Coding Style and Examples A brief introduction to the code with only enough detail to show some snippets of code that loop over lattice sites
and gather values from neighbor sites follows. More complete documentation is available online [] or in the doc subdirectory of the code distribution. Topics to be discussed include: global variables, data types, site structure, moving around on the lattice, macros, and gathers.
Global Variables
A number of variables are used so often that they are declared as globals. Each application has an include file lattice.h, part of which is below.

/* Definition of globals */
#ifdef CONTROL
#define EXTERN
#else
#define EXTERN extern
#endif

/* The following are global scalars */
EXTERN int nx,ny,nz,nt;  /* lattice dimensions */
EXTERN int volume;       /* volume of lattice = nx*ny*nz*nt */

/* Some of these global variables are node dependent */
/* They are set in "make_lattice()" */
EXTERN int sites_on_node;       /* number of sites on this node */
EXTERN int even_sites_on_node;  /* number of even sites on this node */
EXTERN int odd_sites_on_node;   /* number of odd sites on this node */
EXTERN int number_of_nodes;     /* number of nodes in use */
EXTERN int this_node;           /* node number of this node */
In practice, only the file that contains the main function, always called control.c, defines the variable CONTROL, in which case EXTERN is null and the storage for the global variables is allocated there. In all other files, EXTERN becomes extern, and it is possible to access the global variables. The comments should clarify the meaning of these variables.
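A sketch of how this is typically arranged is shown below; whether CONTROL is defined in the source file itself or on the compiler command line is a build detail that may differ in the distributed code.

/* control.c -- the one file that owns the storage for the globals */
#define CONTROL
#include "lattice.h"   /* EXTERN expands to nothing: storage allocated here */

int main(int argc, char **argv)
{
    /* application driver */
    return 0;
}

/* any other file of the application */
#include "lattice.h"   /* EXTERN expands to extern: declarations only */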
Data Types
In order to be able to conveniently compile for either single or double precision, complex variables are defined in include/complex.h as follows:

/* generic precision complex number definition */
/* specific for float complex */
typedef struct {
    float real;
    float imag;
} fcomplex;

/* specific for double complex */
typedef struct {
    double real;
    double imag;
} dcomplex;

#if (PRECISION==1)
#define complex fcomplex
#else
#define complex dcomplex
#endif
In include/su3.h, these definitions are used to define complex matrices and vectors.

typedef struct { fcomplex e[3][3]; } fsu3_matrix;
typedef struct { fcomplex c[3]; } fsu3_vector;
typedef struct { dcomplex e[3][3]; } dsu3_matrix;
typedef struct { dcomplex c[3]; } dsu3_vector;

#if (PRECISION==1)
#define su3_matrix fsu3_matrix
#define su3_vector fsu3_vector
#else
#define su3_matrix dsu3_matrix
#define su3_vector dsu3_vector
#endif
In this way, the application programmer can use complex, su3_matrix, and su3_vector, and at compile time the value of PRECISION will determine whether single or double precision is used.
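For example, a complex multiply written against the generic type compiles for either precision without change (MILC's complex.h supplies macros and functions for such operations; the standalone routine below is only illustrative):

/* c = a * b for the generic-precision complex type defined above */
complex cmul(complex a, complex b)
{
    complex c;
    c.real = a.real * b.real - a.imag * b.imag;
    c.imag = a.real * b.imag + a.imag * b.real;
    return c;
}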
There are additional datatypes to deal with Wilson (4-component) quarks. Details can be found in the documentation. The include files complex.h and su3.h also contain prototypes of many functions that use the datatypes just defined.
Site Structure
The file lattice.h also defines a "site" structure that contains all the physical variables needed at each grid point. This is convenient in that the programmer only has to add another member to the structure, and then all variables are allocated with one call to malloc in a fairly standard setup routine. The section on "Evolution of the Code" describes a disadvantage of this approach. Here is a simple example of a site structure that contains only the gauge matrices, two other matrices, and two su3 vectors.
typedef struct {
    /* The first part is standard to all programs */
    /* coordinates of this site */
    short x,y,z,t;
    /* is it even or odd? */
    char parity;
    /* my index in the array */
    int index;
    /* Now come the physical fields, program dependent */
    /* gauge field */
    su3_matrix link[4];
    su3_matrix tempmat1,staple;
    su3_vector phi1, phi2;
} site;

/* The lattice is a single global variable - (actually this is
   the part of the lattice on this node) */
EXTERN site *lattice;

In the file generic/make_lattice.c there is a function make_lattice that allocates space for the lattice and for pointers that are used by the gather routines. The variables such as coordinates, parity, and index defined in the site structure are also set up there.

void make_lattice(){
    register int i;   /* scratch */
    short x,y,z,t;    /* coordinates */
    /* allocate space for lattice, fill in parity, coordinates and index */
    node0_printf("Mallocing %.1f MBytes per node for lattice\n",
                 (double)sites_on_node * sizeof(site)/1e6);
    lattice = (site *)malloc(sites_on_node * sizeof(site));
    if(lattice==NULL){
        printf("NODE %d: no room for lattice\n",this_node);
        terminate(1);
    }

    /* Allocate address vectors (the gen_pt pointer arrays used by the gathers) */
    for(i=0;i<N_POINTERS;i++)
        gen_pt[i] = (char **)malloc(sites_on_node*sizeof(char *));
}
Moving Around on the Lattice
In the code above, each node is performing a loop over the entire lattice volume, not just over the sites on the node. Two utility functions defined in the layout routines, node_number(x,y,z,t) and node_index(x,y,z,t), return which node a grid point is assigned to and which index it has on the node to which it is assigned. The value of lattice[i].index is this site's position in the entire grid, not just the part of the grid assigned to this node. The code shows how to access any variable that is part of the site structure. An alternative access scheme is based on pointers.

site *s;
s = &(lattice[i]);
/* s->x is now the same as lattice[i].x, whose value is the
   x coordinate of that grid point */
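For example, the site at coordinates (x,y,z,t) can be located and examined as follows; the fragment is purely illustrative and only does something on the node that owns the site:

if( node_number(x,y,z,t) == this_node ){
    site *s = &(lattice[ node_index(x,y,z,t) ]);
    printf("site (%d,%d,%d,%d) has parity %d\n",
           s->x, s->y, s->z, s->t, (int)s->parity);
}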
Macros
Loops should not normally iterate over the whole lattice volume. They should be restricted to the sites on a node, or the sites of one parity on that node. Also, a function that operates on a vector field may sometimes need to use phi1 as input and other times phi2. However, C does not allow the name of a structure member to be passed as an argument. There are macros defined in include/macros.h to simplify the coding. To simplify the loop over the sites on each node, four macros are defined.

/* Standard red-black checkerboard */
/* macros to loop over sites of a given parity.
   Usage:  int i; site *s;
   FOREVENSITES(i,s){
       commands, where s is a pointer to the current site
       and i is the index of the site on the node
   }
*/
#define FOREVENSITES(i,s) \
    for(i=0,s=lattice; \
        i<even_sites_on_node; i++,s++)
#define FORODDSITES(i,s) \
    for(i=even_sites_on_node, s= &(lattice[i]); \
        i<sites_on_node; i++,s++)
#define FORSOMEPARITY(i,s,choice) \
    for(i=((choice)==ODD ? even_sites_on_node : 0), \
        s= &(lattice[i]); \
        i< ((choice)==EVEN ? even_sites_on_node : sites_on_node); \
        i++,s++)
#define FORALLSITES(i,s) \
    for(i=0,s=lattice; i<sites_on_node; i++,s++)
In each case, the loop is restricted to the required sites, and either the integer i or the pointer s can be used to navigate as described above. To deal with the issue of not being able to pass the identifier of a structure member, two new concepts, "field offset" and "field pointer", are defined. The macros to implement them are also in macros.h.

/* "field offset" and "field pointer" */
/* used when fields are arguments to subroutines */
/* Usage:
   fo = F_OFFSET(field), where "field" is the name of a field
       in lattice.
   address = F_PT(&site, fo), where &site is the address of the
       site and fo is a field_offset.  Usually, the result will
       have to be cast to a pointer of the appropriate type.
       (It is naturally a char *).
*/
typedef int field_offset;
#define F_OFFSET(a) \
    ((field_offset)(((char *)&(lattice[0]. a)) - ((char *)&(lattice[0]))))
#define F_PT( site , fo )  ((char *)(site) + (fo))
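As an illustration of how these macros are combined, the sketch below copies one su3_vector member of the site structure into another on every site of the node; the function name copy_site_vector is hypothetical and not part of the distributed code:

/* Copy the su3_vector field at offset src into the field at offset dest */
void copy_site_vector(field_offset src, field_offset dest)
{
    register int i;
    register site *s;
    FORALLSITES(i,s){
        *(su3_vector *)F_PT(s,dest) = *(su3_vector *)F_PT(s,src);
    }
}

/* typical call:  copy_site_vector( F_OFFSET(phi1), F_OFFSET(phi2) ); */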
Gathers
The discussion above, of gathering data from a neighboring site, can now be illuminated with an example. The trace of the product of the four link matrices going around each elementary square of the lattice is a useful quantity. This must be summed over all squares. The two cases of the square being in a purely spatial plane or in a plane with time and a spatial direction are treated separately in the following code. Each square is defined by two perpendicular directions and the corner with the smallest coordinates. If μ̂ and ν̂ are two different unit vectors, the corners of the square are x, x + μ̂, x + ν̂, and x + μ̂ + ν̂. Traversing the edges of the square starting in the μ direction, the expression becomes
Tr(U(x, x + μ̂) U(x + μ̂, x + μ̂ + ν̂) U(x + μ̂ + ν̂, x + ν̂) U(x + ν̂, x)).
This can be reexpressed in our other notation as
Tr(Uμ(x) Uν(x + μ̂) Uμ†(x + ν̂) Uν†(x)),
or, using the cyclic property of the trace, as
Tr(Uν†(x) Uμ(x) Uν(x + μ̂) Uμ†(x + ν̂)).
To complete the calculation based at site x, link matrices based at two nearest neighbor sites are needed. However, from the last expression it is clear that the first matrix multiplication, with data that is local to site x, can be done. That work is done while waiting for the messages to arrive. Below is part of the code generic/plaquette.c to accomplish this task. A detailed explanation follows the code.

void plaquette(Real *ss_plaq, Real *st_plaq) {
    /* su3mat is scratch space of size su3_matrix */
    su3_matrix *su3mat;
    register int i,dir1,dir2;
    register site *s;
    register su3_matrix *m1,*m4;
    su3_matrix mtmp;
    Real ss_sum,st_sum;
    msg_tag *mtag0,*mtag1;

    ss_sum = st_sum = 0.0;
    su3mat = (su3_matrix *)malloc(sizeof(su3_matrix)*sites_on_node);
    if(su3mat == NULL) {
        printf("plaquette: can't malloc su3mat\n");
        fflush(stdout);
        terminate(1);
    }

    for(dir1=YUP;dir1<=TUP;dir1++){      /* dir1 plays role of mu */
        for(dir2=XUP;dir2<dir1;dir2++){
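            /* The body of the inner loop is sketched here from the
               description that follows the code; the parity constant
               EVENANDODD and the exact argument order follow common MILC
               conventions and may differ in detail from the distributed
               generic/plaquette.c. */

            /* Gather link[dir2] from the neighbor in direction dir1 (into
               gen_pt[0]) and link[dir1] from the neighbor in direction
               dir2 (into gen_pt[1]), for sites of both parities. */
            mtag0 = start_gather_site( F_OFFSET(link[dir2]),
                        sizeof(su3_matrix), dir1, EVENANDODD, gen_pt[0] );
            mtag1 = start_gather_site( F_OFFSET(link[dir1]),
                        sizeof(su3_matrix), dir2, EVENANDODD, gen_pt[1] );

            /* While the messages are in flight, form the product of the
               adjoint of link[dir2] with link[dir1] at each site. */
            FORALLSITES(i,s){
                m1 = &(s->link[dir1]);
                m4 = &(s->link[dir2]);
                mult_su3_an(m4, m1, &su3mat[i]);
            }

            wait_gather(mtag0);
            wait_gather(mtag1);

            /* Complete the plaquette with the gathered links and
               accumulate the real trace into the space-space or
               space-time sum. */
            FORALLSITES(i,s){
                mult_su3_nn( &su3mat[i], (su3_matrix *)(gen_pt[0][i]), &mtmp );
                if(dir1==TUP)
                    st_sum += realtrace_su3( (su3_matrix *)(gen_pt[1][i]), &mtmp );
                else
                    ss_sum += realtrace_su3( (su3_matrix *)(gen_pt[1][i]), &mtmp );
            }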
            cleanup_gather(mtag0);
            cleanup_gather(mtag1);
        }
    }

    g_floatsum(&ss_sum);
    g_floatsum(&st_sum);
    *ss_plaq = ss_sum/((Real)(3*nx*ny*nz*nt));
    *st_plaq = st_sum/((Real)(3*nx*ny*nz*nt));
    free(su3mat);
} /* plaquette4 */
Space is allocated for a field-major variable su3mat that can hold one su3_matrix for each site. Then there are nested loops over the two directions that define the plane of the plaquette. The first step is to start the gathers of the two matrices that are based at nearest neighbor points. Note that the call is to start_gather_site. (See discussion under "Evolution of the Code.") In the first gather, F_OFFSET is used to specify that the gathered link corresponds to direction dir2, the number of bytes to gather is sizeof(su3_matrix), and values are gathered from neighbors in the direction dir1. Also, values are gathered from sites with both parities and the pointer array gen_pt[0] will have the address of the neighbors' data when the gather is complete. This call will create a "message tag" that has the information required by wait_gather and cleanup_gather. Its return value is the address of that data structure. In the second call, the roles of dir1 and dir2 are reversed and the pointer array gen_pt[1] is used. In the first FORALLSITES loop, m1 (m4) is set to the address of the link in direction dir1 (dir2), and then mult_su3_an is used to take the product of the adjoint of the link in direction dir2 with the link in direction dir1, and the result is stored in the array su3mat. Note that both the pointer s and the index i are used in this loop. The program then waits for the two gathers to complete. Then the intermediate result su3mat can be multiplied on the right by the link pointing in direction dir2
gathered from neighbor dir1, which can now be found using gen_pt[0]. Since there are no adjoints required, mult_su3_nn is used. Note that a single su3_matrix mtmp is used because the calculation of the trace is completed by the call to realtrace_su3 before proceeding to the next site. In the call to realtrace_su3, the first argument is the link whose address is in gen_pt[1]. This routine calculates just the real part of the trace of the product of the adjoint of the first argument and the second argument. Finally, g_floatsum is called, which is an application-specific global sum, saving the programmer from having to make a call to a specific message passing API, such as MPI. Actually, this code could be improved by doing a single call to g_vecfloatsum, which would be more efficient. (This routine is not called that often.)
Communication Options
The MILC code was developed before the advent of MPI and PVM. The initial platforms were the Intel iPSC/860 and Ncube 2. By having a layer of application-specific calls between the programmer and the native message-passing calls, portability is greatly enhanced. All of the communication routines are contained in a single, target-specific file. In addition to the two targets just mentioned, there was initially code for the Intel iPSC simulator and the "vanilla" version that works with a single thread. Today, only com_vanilla.c, com_mpi.c, and com_qmp.c (which works with the QMP message passing library defined by the USQCD software effort, described below) survive. The vanilla version of the code is remarkably useful for code development. It allows the programmer to develop and debug code in a serial environment that has all the correct syntax for parallel execution.
Assembly Code Options
Many of the floating point operations within a QCD code involve 3 × 3 matrices and vectors with multiples of 3 components. The code includes a library of such basic operations on matrices and vectors. To improve performance, some of the routines that take the most time are written in assembly code. At various times, assembly code has been developed for the Intel i860, DEC Alpha, IBM RS/6000, Cray T3D, Cray T3E, and Intel SSE. The code has been ported to the QCDOC special purpose computer. For the IBM Blue Gene/L and P, assembler
support is available through the USQCD libraries discussed below. Often the performance improvement from the assembly code is notable when the data is in the cache, but somewhat limited when the data must be fetched from memory.
Evolution of the Code The MILC code had its origins nearly years ago and has evolved to include new applications and algorithms. It has also had to evolve to accommodate changes to computer architecture. One important change has been the widening of processor cache lines. The original MILC code was “site-major” in that all the variables for a given grid point or site are stored together. With a narrow cache line, at each stage of the calculation the required variables can be accessed from the site structure. However, as the cache line widens, an inefficiency arises when other variables are brought into the cache when they are not needed because they are stored next to the required data. The solution is to create “field-major” data that stores only a single type of variable, with one grid point’s data following another’s. Then there is the chance that data swept into the cache will be used when the code advances to do the calculation for the next grid point. Thus, the original start_gather has been replaced with start_gather_site and start_gather_field, so the programmer can gather from either type of variable. Around , a second level of hierarchy was accommodated by use of OpenMP []. This was motivated by a computer that had eight processors, but a switch that only could support four MPI processes. This was not a notable success, but multithreading is being tried again as compilers have improved, and new computers such as the Cray XE, Blue Waters and BlueGene/Q have or will have many more cores (or threads) that can be used in SMP mode. A port of the code to the IBM Cell Broadband Engine was done [, ] and work is ongoing to use GPUs as an accelerator [, ].
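The site-major versus field-major distinction described above can be pictured as follows; the start_gather_field call shown is a sketch (taking a pointer to the field array rather than a field_offset) and may differ in detail from the distributed code:

/* Site-major: all variables of a grid point live together in the site
   structure and are reached as lattice[i].link[mu], lattice[i].phi1, ... */

/* Field-major: one array per field, one grid point's value after another */
su3_vector *phi = (su3_vector *)malloc(sites_on_node * sizeof(su3_vector));

/* Gathering from a field-major array uses the _field variant */
msg_tag *tag = start_gather_field( phi, sizeof(su3_vector),
                                   TUP, EVENANDODD, gen_pt[0] );
wait_gather(tag);
/* gathered neighbor values are reached through (su3_vector *)(gen_pt[0][i]) */
cleanup_gather(tag);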
USQCD SciDAC Project
The US Department of Energy, under its Scientific Discovery through Advanced Computing (SciDAC) program [], has been funding development of QCD code through the National Infrastructure for Lattice Gauge Theory project since []. This project unites developers from several of the major QCD groups in the
USA with the aim of sharing code to improve reliability, performance, and portability of QCD codes, and to reduce the barriers to development of new codes. The USQCD libraries [] include QLA for basic linear algebra, QMP for application specific message passing, QIO for lattice-specific parallel input/output, and QDP for data parallel style computing that combines shifts from neighboring grid points and computation. There is also a QOP library that is highly optimized and (possibly) architecture specific. The MILC code has gradually been evolving to take advantage of these capabilities.
Availability of Code: Tutorials
The MILC collaboration has made its code available to others for many years. In the beginning, the code was distributed on the basis of personal requests. The code is now distributed via a web page []. A mailing list has also been established at http://denali.physics.indiana.edu/mailman/milccode. The documentation for the code is available at the site from which the code is distributed. There are also tutorials on the code prepared by Carleton DeTar []. They include information about use of the SciDAC QCD libraries with the MILC code.
Use of MILC Code for Benchmarking
The MILC code has been run on many parallel machines and is used for benchmarking by a number of organizations. When the Intel Paragon was being developed, slight inconsistencies in what should have been identical runs resulted in a redesign of the motherboards. The code became part of the hardware diagnostics. A MILC application, su3_rmd, is part of the NERSC benchmark suite [] used to evaluate potential purchases. It is also part of the SPEC CPU benchmark [] and the SPEC MPI benchmark []. For the NSF Track competition, which was won by the NCSA Blue Waters project [], performance on a QCD problem supplied by members of MILC had to be analyzed by each applicant. The su3_rmd application was used for this and will be part of the system tests.
Public Availability of Gauge Configurations: Gauge Connection, ILDG The MILC collaboration has also long made its gauge configurations available to other scientists.
Configurations including the effects of dynamical quarks are expensive to create, but they can be used for many physics projects once they are archived, so it makes sense to share this resource to make the best use of the public investment in the creation of the configurations. The Gauge Connection hosted by NERSC [] was the initial repository of the configurations. More recently, the International Lattice Data Grid (ILDG) [] has created a new standard format and a distributed collection of repositories. There are portals for the ILDG in Australia, Europe, Japan, the UK, and the USA. Many collaborations contribute configurations to the repository that are created using a variety of lattice actions.
Related Entries
Cell Broadband Engine Processor
Clusters
Cray T3E
Cray XT and Cray XT Series of Supercomputers
Domain Decomposition
Hypercubes and Meshes
IBM Blue Gene Supercomputer
MPI (Message Passing Interface)
NCube
NVIDIA GPU
PVM (Parallel Virtual Machine)
QCD (Quantum Chromodynamics) Computations
QCDSP and QCDOC Computers
Vector Extensions, Instruction-Set Architecture (ISA)
Bibliographic Notes and Further Reading
The MILC code is an application suite for lattice field theory. There is a vast physics literature on the subject. There has been an annual conference in the area for years. For many years the proceedings were part of the Nuclear Physics B (Proceedings Supplement) series. Recently, the proceedings have been published online through Proceedings of Science (http://pos.sissa.it). Several textbooks provide an introduction to lattice field theory and the algorithms that are used. They include: Quantum Fields on a Lattice, I. Montvay and G. Münster, Cambridge, ; Lattice Gauge Theories, An Introduction, Third Edition, H. J. Rothe, World
Scientific, ; Introduction to Quantum Fields on a Lattice, J. Smit, Cambridge, ; Lattice Methods for Quantum Chromodynamics, T. DeGrand and C. DeTar, World Scientific, ; Quantum Chromodynamics on the Lattice, An introductory presentation, C. Gattringer and C. B. Lang, Springer, . The literature discussing the implementation and performance of lattice field theory codes is not as extensive as the physics literature. However, there is generally a Machines and Algorithms session at the annual Lattice conference. There have been a number of plenary reviews of computers built specifically for lattice QCD as well as articles about individual computers. At the USQCD Web site [], there are links to three other code suites, Chroma, CPS, and FermiQCD, that are publicly available. They are written in C++.
Bibliography
http://physics.indiana.edu/~sg/milc.html. There is a link there to the code distribution site: http://www.physics.utah.edu/~detar/milc
http://physics.indiana.edu/~sg/milc/milcpubs.pdf
Quigg C () Gauge theories of the strong, weak and electromagnetic interactions. Addison Wesley, New York
Wilson K () Confinement of quarks. Phys Rev D :
Bazavov A et al () Nonperturbative QCD simulations with + flavors of improved staggered quarks. Rev Mod Phys : arXiv:.
Kogut JB, Susskind L () Hamiltonian formulation of Wilson's lattice gauge theories. Phys Rev D :
Thacker BA, Lepage GP () Heavy-quark bound states in lattice QCD. Phys Rev D :; El-Khadra AX, Kronfeld AS, Mackenzie PB () Massive fermions in lattice gauge theory. Phys Rev D :
Bernard C et al () Studying quarks and gluons on MIMD parallel computers. Int J High Perform Comput Appl :; http://hpc.sagepub.com/cgi/content/abstract///
Bernard C et al () QCD on the iPSC/. In: Herman HJ, Karsch F (eds) Workshop on Fermion algorithms, World Scientific, Singapore
http://www.physics.utah.edu/~detar/milc/milcv.html or http://www.physics.utah.edu/~detar/milcv.pdf
Tamhankar S, Gottlieb S () Benchmarking MILC code with OpenMP and MPI. Nucl Phys B (Proc. Suppl.) :
Shi G et al () Implementation of scientific computing applications on the Cell Broadband Engine. Sci Program :
Shi G, Kindratenko V, Gottlieb S () Cell processor implementation of a MILC lattice QCD application. In: Proceedings of the XXVI international symposium on lattice field theory, Proceedings of Science (LATTICE ), Williamsburg, p
Clark MA () QCD on GPUs: cost effective supercomputing. In: Proceedings of the XXV international symposium on lattice field theory, Proceedings of Science (LAT )
Shi G, Gottlieb S, Torok A, Kindratenko V () Design of MILC lattice QCD application for GPU clusters. In: Proceedings of the th IEEE International Parallel & Distributed Processing Symposium; http://lattice.github.com/quda/
http://www.scidac.gov
http://www.scidac.gov/physics/quarks.html; http://www.usqcd.org; http://usqcd.fnal.gov
http://www.usqcd.org/software
http://www.physics.utah.edu/~detar/kitpc/; http://www.physics.utah.edu/~detar/hacklatt/
http://pdsi.nersc.gov/benchmarks.htm
http://www.spec.org/cpu/Docs/.milc.html
http://www.spec.org/auto/mpi/Docs/.milc.html
http://www.ncsa.illinois.edu/BlueWaters/; http://en.wikipedia.org/wiki/BlueWaters
http://qcd.nersc.gov/
http://ildg.sasr.edu.au/Plone
MIMD (Multiple Instruction, Multiple Data) Machines Rolf Riesen , Arthur B. Maccabe IBM Research, Dublin, Ireland Oak Ridge National Laboratory, Oak Ridge, TN, USA
Definition
Distributed and parallel computing systems exist in many different configurations. We group them into six categories and show that the architectural characteristics of each category determine the type of application that will run most efficiently and scale best on these systems. Drawing the line between distributed and parallel systems is not easy because the definitions overlap and large systems combine aspects of both.
Discussion Introduction Parallel and distributed computing share many traits and distinguishing the two is not easy. Usually, a sense of larger distances is associated with distributed computing, but many topics grouped under this term are relevant for parallel computing as well. Often specific systems are considered distributed computing systems,
e.g., computers working together and connected by the Internet. Instead of trying to define the two terms precisely, we will define classes of parallel computing systems, note their distinctions, and describe what applications and usage models are best suited for each class. We will order these systems according to their characteristics and will then be able to draw a line that separates distributed from what we would consider parallel computing. Nevertheless, since there is some overlap between classes of parallel computing systems, drawing that line of distinction will still leave some room for debate about what a distributed or a parallel computing system is.
Parallel Computing Systems Parallel computing has become pervasive and many different platforms exist to execute parallel programs. These platforms are available in the form of a machine that executes the parallel programs of a user, or more commonly, hidden behind a web interface in the form of a parallel service such as search or database access. This section defines six types of general-purpose parallel computing systems. Specialized parallel machines designed to execute one particular application have been built as well but will not be considered as part of the taxonomy described in this paper. There are architectural differences between these systems that determine what these machines are used for. We will first look at several different parallel computing architectures and then explore the differences between them. After that we look at how these differences dictate usage, availability, performance, and construction (hardware and software).
Volunteer Computing
Within any large collection of computers, there are usually a few machines that are idle or underutilized. In the 1980s, projects such as Condor enabled programs to use the networked computers in an office or a classroom when they were idle. This allowed users to submit programs for execution during the night when these computers were not in use. With the growth of the Internet, the number of compute resources that may be idle and can be accessed remotely is now in the millions.
SETI@Home (Search for Extraterrestrial Intelligence) began in the mid-s to harness idle resources on the Internet. Personal computer owners, who want to volunteer some of their compute cycles, can download a screen-saver from the project home page. Whenever a computer is idle for a while, the screen-saver activates. Instead of just showing a moving graphic or blanking the screen, the SETI@Home screen-saver also contacts a centralized server and requests tasks to execute. The goal of the SETI@Home project is to analyze radio signals from space and detect patterns that are not likely to occur naturally. For each work request, the server sends some data to the client to be analyzed. While the screen-saver is running, it processes the data, sends results back to the server, and requests more work, if it is still idle. Because users may interrupt the screensaver or turn the computer off, the server keeps track of which data sets have been processed. It sends the same data set to multiple computers to ensure that at least one of them will finish the work, and it compares replicated work to make sure the results are valid. The SETI@Home project became so successful in attracting volunteers that it could not keep all their computers busy anymore. A follow-on project called the Berkeley Open Infrastructure for Network Computing (BOINC) allows other projects to use volunteer resources. Currently running examples include protein folding, genetic linkage analysis, predicting the performance of quantum computers, finding weaknesses in cryptographic hash functions, creating a threedimensional model of the Milky Way galaxy, and testing the sensitivity of climate models.
Grid Computing
Grid computing became popular in the late 1990s. The term grid is used because, similar to an electric power grid, there are multiple producers and consumers which share their resources under predefined agreements. While one data center has surplus compute capacity after the workers leave for the day, another user may be able to buy these compute cycles. Certain operations or applications may perform faster on one type of computer, or can be executed more cheaply on another. The goal of grid computing is to bring consumers and producers together in an organized fashion that governs accounting and responsibilities. Most grid
users are both producers and consumers. While there are some similarities to volunteer computing, and the actual machines in use may be of the same types, grid computing is much more suitable for a commercial environment. Agreements to make a certain number of compute cycles, or type of system, or storage capacity available are formal and do not depend on the goodwill of anonymous users. Some of the tools and methods used in volunteer and grid computing are similar and can be used interchangeably. However, grid computing differs from volunteer computing in that some of the systems in the grid are used for long-term storage and some resources, such as leased communication lines or particular systems, are dedicated. In volunteer computing, CPU-scavenging or cycle stealing is the norm, without guarantees when services will be available or a task will complete. In grid computing a certain level of Quality of Service (QoS) is expected. Starting in about , the term cloud computing started to be used. Cloud computing shares many similarities with grid computing, may be used in a grid, but also includes some aspects of volunteer computing and peer-to-peer networks. Whether the term will survive and what its exact definition will be seem to depend on the companies and organizations currently promoting and selling cloud computing systems.
Cluster Computers
Cluster computing began in the late s with systems cobbled together from commercial off-the-shelf (COTS) components. These were typically PC computers with Ethernet connections located under a desk or on a shelving unit. Demand for clusters grew and commercial companies started to sell integrated turnkey systems. The individual computers that make up a cluster are now rack-able servers. On high-end systems Ethernet is replaced by faster and more costly networks such as InfiniBand or Myrinet, and some clusters have redundant power supplies. The defining characteristic is that the individual parts can be bought separately and put together by an integrator in almost any fashion a customer requests. Clusters reside in a single administrative domain. That means all the components that are part of a cluster are owned and managed by a single entity. Decisions
about what application to run on the cluster, who pays the electricity bill, and which individual users get access are made by a single organization. This differs from volunteer and grid computing, which span multiple administrative domains. Frequently, some of the nodes that make up a grid are clusters. Another difference from volunteer and grid computing is that a cluster usually consists of homogeneous components. The type and performance characteristics of CPUs, disks, memory, and network are either identical or compatible inside a single cluster. In a grid the resources are heterogeneous and one task of the system software for grid computing is to match the appropriate resource with the requirements of a given application. For example, if an application needs 64-bit x86 CPUs with at least GB of memory, the system will not schedule that application on systems that lack these features. On the other hand, users of a cluster know its hardware characteristics and select or compile applications accordingly.
Supercomputers The term supercomputer is poorly defined but usually associated with high-end parallel systems. Because of their high degree of parallelism, these systems are sometimes called Massively Parallel Processors (MPP). Some of these machines fill large rooms, cost millions of dollars, and require huge amounts of electrical power and cooling. What distinguishes them from clusters is that they are designed as a single system, not built from individual, independent components. Early supercomputers used custom-built proprietary processors. That changed with the advent of high-performance generalpurpose microprocessors. Today, supercomputers use the same processors that are used in desktop workstations and servers. The network of most of these machines is still custom and much more tightly coupled to the processor and memory subsystem than the COTS network cards that attach to PCs through an I/O bus. Often, multiple networks are present to serve specific functions: high-speed message passing, collective operations, file I/O, diagnostics, and administrative services. Although the line between a cluster and a supercomputer has been blurred, supercomputers stand out due to their higher-performing networks and their higher reliability. Reliability in larger systems is a key factor that determines the throughput of a system. Throughput is how
many applications a system can finish in a unit of time. Supercomputers have dedicated hardware to monitor the whole system, sometimes use a dedicated network for diagnosis and control of components, and have the ability to work around failed components until they can be repaired. This is crucial for systems that exceed tens of thousand processors. With that many components, it is highly likely that at least one of them has failed at any given time, and locating and isolating these components cannot be done efficiently without a dedicated infrastructure and the necessary system software.
Symmetric Multi Processors The systems described so far are distributed-memory systems. The processors on a node have direct access to local memory, but to access data in the memory of a remote node, the data has to travel across a network in the form of a message before it can be used. Symmetric Multi processors (SMPs) are shared-memory machines. Each of the CPUs in an SMP can directly access all of the memory. We will see later that this allows the programming model to change. Instead of moving data to local memory before it can be accessed, SMP programming models can assume that the data is always accessible. Access time to the data may depend on which memory module holds the data. Smaller SMPs use a memory bus that is used by all CPUs to access any of the memory attached to it. The cost, called latency, of accessing any memory location is roughly the same, no matter which CPU accesses which memory block. For larger SMPs, a single bus shared among all CPUs would be a bottleneck and a more sophisticated architecture, such as a crossbar switch, becomes necessary. When memory size and number of CPUs grow large, it becomes impossible to keep memory access times uniform. Non uniform Memory Architectures (NUMA) deal with this problem by keeping some memory local to a CPU. All of the memory is still accessible to all CPUs, but access times for non local memory is higher than the latency to access local memory. When a large number of CPUs share a single memory, cache coherence becomes an issue which limits further scaling. Hardware to keep cache contents synchronized when many CPUs read and write the same memory location grows exponentially in size and cost. At some point the cost of keeping caches coherent becomes prohibitive. The only way to add more CPUs
to the system is by abandoning cache coherence and allowing for non-uniform memory access. That approach is taken by the supercomputers described in the previous section.
Multicore Processors
The last few years have seen the arrival of multicore processors, which combine multiple CPU cores on a single integrated circuit. Each core consists of registers, arithmetic and floating-point units, and all the other control logic that we are used to in individual CPUs. As long as Moore's law continues to hold, each new generation of integrated circuits will contain twice as many transistors as the generation eighteen months ago. Since pipelining and superscalar designs are reaching their limits, multiplying the number of cores on an integrated circuit is a fairly simple way of making use of the ever increasing number of transistors available. Parallel programming is difficult and a single more powerful CPU would be preferable. Physical limits, however, force CPU manufacturers to provide their customers with parallel systems on a chip. The number of cores per processor is increasing and how the cores share memory ports and caches is evolving. With two or four cores it is common to give each core an L1 cache, and share the L2 cache and the ports to memory. With larger numbers of cores, multicore processors face some of the same problems that limit scaling of SMPs. Newer multicore designs will share caches among groups of cores and use a Network on Chip (NoC) to connect the groups to each other and to the available memory ports. In other words, each multicore processor can be an SMP, a distributed memory system, or a combination of the two. It should be clear by now that multicore processors are parallel systems with many properties in common with SMPs and even distributed memory systems when the number of cores grows large enough. Processors with hundreds of cores are predicted and finding ways to efficiently program these devices is currently occupying the research community and shares many traits with programming the other parallel systems discussed in this section.

A Spectrum of Parallel Computing Systems
In this section we briefly introduced several different parallel computing platforms. While there are many similarities between them, there are major differences
that make each platform suitable for specific applications and the way they are used. In the next section we will look at the differences in more detail. In the section after that we will investigate how these differences impact various aspects of parallel computing. We categorized these systems by describing them at a fairly high level of abstraction. In reality, most existing systems span multiple categories and cannot always be assigned to one or another category so easily. For example, volunteer computing makes use of desktop computers, but many of those computers now contain multicore processors. Sometimes they contain two or more multicore processors, which make them multicore SMPs. Similarly, the nodes in a computational grid are often clusters or supercomputers. Programming these systems is difficult and different aspects have to be taken into consideration when matching an application to a class of parallel computing system.
Difference in Architectural Characteristics We have identified six major classes of parallel computing systems. There are many similarities among these systems, but some architectural differences exist that are important indicators for what uses and type of applications a system is most suitable. We will see that the classes of parallel computing systems can be arranged along axes that have a crucial impact on the use of these systems. One example of such an axis is physical distance between processing elements. In volunteer computing the individual processors can be spread out across the whole world and communication delays are measured in milliseconds and seconds. Sometimes the individual computers participating in a volunteer effort are turned on only at certain hours of the day. This increases response delays to hours. At the other end of the spectrum are multicore processors where a signal from one core to another has to travel fractions of a millimeter. We will now look at several such characteristics that differentiate parallel computing systems.
Network Volunteer computing and most grids use the Internet to connect processors into a parallel computer. Geographical distances may be large and the speed-of-light limit for electrical signals comes into play, especially with satellite links. The high overhead of network protocols such as TCP/IP, congestion along links, and slow routers
operations are possible, for example, reductions for finding the largest value among the data each processor contributes. These operations are simple memory-tomemory copies in multicore processors and SMPs. Supercomputers sometimes have specialized hardware and a separate network for these kinds of operations. In clusters these operations become more expensive and on the Internet they require a lot of additional bandwidth and latency. Figure illustrates the network characteristics of the six categories of parallel systems we are considering. We move from left, from the least integrated systems, to the right where we find the most integrated and tightly coupled systems. Doing so, data access latencies decrease, and bandwidth and bisection bandwidths increase. We will later see that the systems to the far right have fewer processors and fewer connections between them. Although the bandwidth keeps increasing, the bisection bandwidth is limited by the number of connections and therefore drops as we move from supercomputers to SMPs and multicore processors.
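As a concrete example of the reduction operation mentioned above, the sketch below uses MPI (the message-passing interface commonly used on clusters and supercomputers) to find the largest value contributed by any process:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, local_value, global_max;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process contributes one value; here simply its own rank. */
    local_value = rank;

    /* Collective reduction: every process receives the largest value. */
    MPI_Allreduce(&local_value, &global_max, 1, MPI_INT, MPI_MAX,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("largest contributed value = %d\n", global_max);

    MPI_Finalize();
    return 0;
}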
Network Interface Above we have described various network characteristics, but we have not discussed how processors are connected to the network. The network interface is a crucial component in a parallel system. In SMP and multicore systems no network interface is used to access data in memory. Instead, the role of the network interface is
limit the bandwidth that can be achieved and impose millisecond latencies or higher. Moving to clusters and supercomputers, latencies drop to microseconds and bandwidth improves because data has to travel shorter distances and higher capacity links are available. The network topology connecting the processing elements also changes: from the seemingly random graph of the Internet to structured topologies such as -D and -D meshes, trees, and hypercubes. These topologies provide higher bisection bandwidth. It is measured by letting half of the processors in a parallel computer communicate simultaneously with the other half. Assignment to each half is done such that as few links as possible must carry all the traffic. The rich topologies of clusters and supercomputers provide a much higher bisection bandwidth than the Internet where often low-speed links have to be shared among many end points. Moving to SMPs and multicore computers, latency, bandwidth, and bisection bandwidth further improve. Data can now be shared in memory and latencies to access that data drop to nanoseconds. How the source and the destination of a data transfer are addressed also changes as we move along the hierarchy. For volunteer and grid computing, each processing element has its own IP address which the routers of the Internet use to direct data packet to the correct destination. Clusters sometimes use the same mechanism, but supercomputers use the location of a node within the structured topology to send data to it. Instead of an address that needs to be decoded and interpreted, supercomputers use directions such as “two up and three to the left” in a -D grid. This makes routers simpler and faster. SMPs and multicore which have shared memory use memory addresses to access data. Very fast hardware decodes an address and accesses the appropriate memory location. CPUs can use memory addresses directly with their load and store instructions. Moving traffic using turnby-turn directions or determining at what IP address a particular computer resides requires hundreds to tens of thousands of CPU cycles. Sometimes it is necessary to distribute data from one processor to all others participating in a parallel computation. This is called a broadcast. In a multicast, data gets distributed to a select subset of the processors participating in the computation. More complex collective
MIMD (Multiple Instruction, Multiple Data) Machines. Fig. Different parallel systems have different network characteristics. No specific values are assigned to each data point since variations in each parallel platform category are large. The data points and lines are meant to show trends
gap between messages limits how many messages can be sent per second. Message rate is determined by latency and the processing capabilities of the system and the network interface. Generally, moving from volunteer computing to multicore CPUs, message rate increases. Simple Ethernet interfaces process one packet at a time and have very few additional capabilities. Sophisticated supercomputer network interfaces are small computers themselves with an embedded processor, memory, and fast data-movement hardware to connect to the network and the local memory. The network interface processor can be programmed to make decisions on where in memory to deliver a data packet, how to deal with errors in the network, and even implement collective operations. While the network processor is doing all that, the main CPU is free to continue its application processing. Figure shows the trends for network interface characteristics for each class of parallel system we are investigating. Fanout means the number of connections between a network interface and the network. The line labeled intelligence describes the processing capabilities of the network interfaces. Integration indicates how close an interface is to local memory or the CPU.
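The interplay of these interface metrics is often summarized with a simple linear cost model; the formulation below, with parameters \(\alpha\), \(\beta\), and \(g\), is a common convention in the literature rather than notation defined in this article:

\[
T(n) \;\approx\; \alpha + \frac{n}{\beta}, \qquad \text{message rate} \;\le\; \frac{1}{g},
\]

where \(\alpha\) is the per-message latency (software overhead plus wire time), \(\beta\) the sustainable bandwidth, \(n\) the message size, and \(g\) the minimum gap between consecutive message injections. Small messages are dominated by \(\alpha\) and by the achievable message rate, large messages by \(n/\beta\); this is the trade-off captured more formally by models such as LogP (see the Related Entries).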
Remote Data Access For SMP and multicore systems, accessing data associated with another processor is a simple load or store instruction that can be executed by the processor directly. For all other parallel architectures, remote data access implies sending and receiving a message
[Figure: qualitative curves for bandwidth, latency, message rate, intelligence, fanout, and integration of the network interface, from volunteer computing and grid systems on the left to clusters, supercomputers, SMPs, and multicore processors on the right.]
played by the memory controller. Both SMP and multicore systems may have, in addition, network interfaces if they are part of a larger parallel computer such as a cluster or a grid. However, in this section we are concerned with the functions provided by the interface between processor and the data that resides on other processors. For SMP and multicore processors, that interface is the memory controller, and for all other platforms it is the network interface. We used bandwidth and latency to characterize the networks described in the previous section. Network interfaces are often described using the same metrics. However, it is important to see that these metrics may be different for network interfaces and the links in the network. In some cases the links in a network have a higher bandwidth capacity to accommodate multiple network interfaces. The name interface suggests two sides: the connection to the network and the connection to the local processors. For the Ethernet cards used in volunteer and grid computing, there is a single connection from the interface to the network. (There can be multiple interfaces connecting a single machine to the network, though.) This is usually true for cluster computing as well, but supercomputers tend to integrate the network interface with a router that has multiple connections to the network. This makes it possible to build the different network topologies seen in these machines. If the interface supports it, data may move through it at multiple times the bandwidth of a single network link, if the routing algorithm makes use of more than a single link into the network. SMP and multicore interfaces, the memory controllers, connect to a memory bus or a switched network that directs data traffic to the appropriate memory bank. The other side of the interface connects directly to the processors in the case of a memory controller for an SMP or multicore CPU. In supercomputers, the network interface is often connected to the memory bus giving it direct and low-latency access to memory. In clusters and systems that use commodity network interfaces, the interfaces are connected to an I/O bus and treated similarly to other I/O devices in the system. In addition to bandwidth and latency, message rate is another important measure for network interfaces. Although latency may be low in some interfaces, it may not be possible to send another message right away. This
MIMD (Multiple Instruction, Multiple Data) Machines. Fig. Different parallel systems have different network interface characteristics
that contains the data. These two access methods dictate which programming model is most suitable for them. The latency and bandwidth at which remote data can be accessed depend on the access method (load/store or data transmission) and, in the case of message passing, on the performance and geographical distance of the network. Bulk data transfers benefit from higher bandwidths and are less affected by the latency of a network.
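A small worked example makes the last point concrete; the latency and bandwidth values below are illustrative assumptions, not measurements from the article. Using the linear cost model sketched earlier, with a per-message latency of \(\alpha = 10\,\mu\text{s}\) and a bandwidth of \(\beta = 1\) GB/s:

\[
T_{\text{bulk}}(1\,\text{MB}) \approx 10\,\mu\text{s} + 1000\,\mu\text{s} \approx 1\,\text{ms}, \qquad
T_{\text{fragmented}} \approx 1024 \times (10\,\mu\text{s} + 1\,\mu\text{s}) \approx 11.3\,\text{ms}.
\]

The single bulk transfer is dominated by bandwidth, while sending the same megabyte as 1,024 messages of 1 KB is dominated by per-message latency, an order-of-magnitude difference.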
Other Differences There are several other differences between these parallel computing systems that are not directly related to their architecture. One of them is size, i.e., the number of processing elements and therefore the amount of available parallelism. In principle, volunteer computing could attract the largest number of processors to solve a particular problem. However, it is not common to gather more than a few thousand computers at a time to work on a single problem. The largest supercomputers have hundreds of thousands of processors, and millions are expected in the next few years. SMPs, and multicore processors configured internally as an SMP, are limited by their architecture to a few tens of processors. In larger multicore processors, the hundreds of cores will use a network-on-chip (NoC) to exchange data. We have already mentioned geographical distance, which shrinks and allows faster data access as we move from volunteer and grid computing to multicore processors. Figure illustrates the trends. Specialization and homogeneity increase moving from volunteer computing to multicore processors. Volunteer computers are of different brands and types, their
[Figure: qualitative trends in available parallelism, geographical distance, specialization/homogeneity, and cost per core, from volunteer computing and grid systems on the left to clusters, supercomputers, SMPs, and multicore processors on the right.]
MIMD (Multiple Instruction, Multiple Data) Machines. Fig. Various differences among parallel systems
peripherals differ, and their main purpose is most likely desktop computing. The compute nodes of a supercomputer are homogeneous and the only peripheral is the network interface. Additional nodes, sometimes of a different design, are responsible for I/O and user interactions. Most clusters are homogeneous, but some are built from whatever parts are available. Multicore processors today are homogeneous; all cores are of the same type. Future processors may include specialized cores mixed with general-purpose cores. Cost per processing element varies. It is not zero for volunteer computing, since someone had to bear the purchase cost of the computer providing the “free” cycles. Since grids and clusters are purchased for a specific purpose and usually contain higher-performing computers and faster networks, their cost per CPU is higher. Supercomputer processors are expensive because of the cost of specialized network interfaces, high-performance network, and the infrastructure that integrates all the parts into a single, reliable system. Cost per processing element drops for multicore processors.
Architecture Impact The six major categories of parallel computing platforms differ in some crucial characteristics that determine how these systems are used and for which type of applications they are most suitable. For example, while purchasing multicore chips with a larger core count may be an economical way to increase parallelism, and allows for fast data access, the achievable level of parallelism is limited by the number of cores which technology permits on a single die. Volunteer computing can incorporate as many processors into a computation as are available, but geographical distance and network technology limits how fast remote data can be accessed. In this section we show how differences between these parallel computing platforms impact usage and available features.
Application Type Master/slave applications are well suited for volunteer computing. One process, designated the master, hands out tasks to computers that become available and ask for work to do. When a slave finishes its task, the master collects the results. It is the master’s role to coordinate tasks, for example, if one task depends on the results of another task. These types of applications are well-suited
for volunteer computing because the master can retry a task until at least one of the slaves finishes it. Note that this programming paradigm also works well on all other architectures discussed in this article. This is the case because the tasks are relatively independent and the limited communication does not tax the network. Systems with more powerful networks and faster data access can easily accommodate this style of computing, albeit at a higher cost. Tightly coupled applications work better on tightly coupled systems such as SMPs and multicore processors. These types of applications work on a single set of data that is distributed across the memory of all participating processors. All of the tasks running on these processors perform the same operations, each on its portion of the data. Depending on the problem being solved, this may involve more or less communication with processors that are working on neighboring data elements. It is these communications, and the number and pattern of processors that need to communicate with each other, that determine how powerful the network must be for these applications to run efficiently.
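To make the master/slave (master/worker) structure concrete, the following C sketch shows one common way to organize it with MPI; the message tags, the fixed task count, and the work_on() placeholder are illustrative assumptions and not part of the article:

#include <mpi.h>

#define TAG_WORK 1
#define TAG_STOP 2
#define NUM_TASKS 100                 /* illustrative task count */

/* Placeholder for the real computation performed on one task. */
static double work_on(int task) { return (double)task * task; }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                  /* master: hand out tasks, collect results */
        int next = 0, active = 0;
        double result;
        MPI_Status st;
        /* Give one initial task to every worker. */
        for (int w = 1; w < size && next < NUM_TASKS; w++, next++, active++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
        /* Collect a result, then immediately reuse that worker. */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            active--;
            if (next < NUM_TASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++; active++;
            }
        }
        /* Tell all workers to stop. */
        for (int w = 1; w < size; w++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
    } else {                          /* worker (slave): loop until told to stop */
        MPI_Status st;
        int task;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double result = work_on(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

On a cluster or supercomputer this program would typically be launched with mpirun across the allocated nodes; on a volunteer-computing platform the same master/worker logic is normally provided by middleware such as BOINC rather than written directly in MPI, and the master must additionally tolerate workers that disappear or return late.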
System Software The system software controlling these systems and the services it provides depend on the type of system. In volunteer and grid computing each computer runs a full-featured operating system (OS). The software that organizes these computers into a parallel system is called middleware and runs on top of the OS, making use of its services. In systems that are dedicated to running parallel applications, such as clusters and supercomputers, the OS is augmented and tailored to provide the services needed, such as message passing, and omits services that are not needed and may be disruptive, e.g., e-mail and print servers. Some supercomputers use specialized lightweight OSs to limit administrative overhead and make resources more directly available to applications. In SMP and multicore computers a single instance of a full-featured OS controls all resources present.
Programming Model The programming model chosen to write an application depends on the problem to be solved as well as the platform that will execute the application. The two main
paradigms currently in use are explicit message passing and shared-memory programming. The former is well suited for applications that need to control where (in which memory) data resides and can coordinate data transfers such that the data is available locally when it is needed. Message passing maps well to the networked architectures, but also executes well on shared memory machines where the explicit mapping of data to memory regions often leads to better cache performance than when a compiler determines data layout. Shared memory programming can only be done on SMPs and multicore processors, unless additional software layers transform memory accesses into data transmissions in distributed memory systems. Several shared-memory programming models provide higher levels of abstraction, but sometimes hide too much of the underlying architecture, such as memory layout, to reach the highest performance possible.
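As a minimal contrast between the two paradigms, the following C fragment expresses a data-parallel loop in the shared-memory style using OpenMP; the array size and the saxpy-like operation are illustrative assumptions:

#include <stdlib.h>

#define N 1000000                      /* illustrative problem size */

/* y[i] = a*x[i] + y[i]; each thread works on a disjoint chunk of the
 * shared arrays, so no explicit messages or data placement are needed. */
void saxpy(double a, const double *x, double *y) {
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
    saxpy(3.0, x, y);
    free(x); free(y);
    return 0;
}

In a message passing formulation of the same computation, x and y would be split explicitly across the address spaces of the processes, and any access to elements owned by another process would require an explicit send/receive; here the compiler and runtime schedule the iterations onto cores automatically, at the cost of hiding the data layout from the programmer, which is exactly the trade-off discussed above.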
Usage The six different parallel computing platforms are used differently and for different purposes. The goal of capacity computing is to complete as many jobs as possible in a given amount of time, i.e., to achieve a high throughput of jobs. This requires that enough processors are available and that scheduling is efficient enough to keep all available resources busy. Supercomputers and some SMPs are used for capability computing. These systems were purchased to solve specific problems in the shortest amount of time possible. These grand-challenge problems cannot be solved on smaller or slower systems, at least not in a reasonable amount of time. Supercomputers and most clusters are space-shared. Once nodes have been allocated to an application, the processors on these nodes only execute that application. In time-sharing, which is common on all other systems, other applications and tasks also run on the same processors. The OS accomplishes that by giving each process a small slice of time in round-robin fashion, creating the illusion that multiple processes run concurrently on the same CPU. Space-sharing lets an application progress faster since the compute resources are dedicated to that application. Time-sharing makes more efficient use of the available resources, but does not provide optimal performance for an application.
Summary The type of parallel platform influences what applications run well on it. In our spectrum from volunteer computing to multicore processors, applications that run efficiently on a network of volunteered compute resources also run as well or better on more integrated and faster systems. Applications that run well on tightly integrated systems may not run well on, or scale to, grid and volunteer computing. In general, cost increases as we move from volunteer computing to supercomputers. The cost of SMPs (depending on their size) and multicore processors is lower than that of supercomputers, but the number of processing elements is limited. In order to find the best system for a given parallel application, all factors, including architecture, performance, cost, and usage model, must be taken into consideration. Few systems are designed to run a single application or serve a single user. Therefore, compromises become necessary. Distributed and parallel computing have many things in common. Volunteer and grid computing are usually considered distributed computing, while clusters and the other systems in the lineup are considered parallel computers. Many systems today are hybrids, such as clusters made up of SMP and multicore computers. This allows for a higher peak floating-point performance while keeping the cost reasonable. However, there are performance impacts, e.g., multiple CPUs/cores sharing a single network interface, and programming these multi-level parallel machines becomes more difficult.
Future Directions Parallel computing has moved from large supercomputers to the desktop. The amount of parallelism will increase with future generations of processors, and new programming models are being researched to make these machines easier to use. Multicore processors will evolve to include specialized cores, for example to accelerate floating-point computation and graphics processing. The number of processing units on these chips will continue to grow and they will begin to resemble SMPs, complete with nonuniform memory access and complex networks connecting all components on the chip. Clusters and supercomputers will continue to grow and eventually cross the exaflop barrier. Systems that
large will be plagued by component failures because of the sheer number of these components. This will require drastic measures to make the applications and the systems themselves more resilient to failures. System software will also evolve and will play a crucial part in improving resiliency. Running a full-fledged OS on each of a million cores seems wasteful and may limit scalability. Research into lightweight OSs is progressing and may offer a way to use these compute resources efficiently.
Related Entries
Bandwidth-Latency Models (BSP, LoGP)
Cache-Only Memory Architecture (COMA)
Clusters
Distributed-Memory Multiprocessor
Flynn’s Taxonomy
Locality of Reference and Parallel Processing
Metrics
Moore’s Law
Nonuniform Memory Access (NUMA) Machines
Operating System Strategies
Shared-Memory Multiprocessors
Single System Image
SoC (System on Chip)
Bibliographic Notes and Further Reading Condor was an early software system that allowed harnessing idle cycles on a collection of workstations [, ]. A detailed description of Condor and its two-decade history can be found in []. An interesting early paper describes “worms” as a mechanism to distribute compute tasks across a set of available computers []. Just before the widespread use of cluster computers, a team at the University of California at Berkeley made the case for networks of workstations []. They built on the experience of Condor and other such projects, but felt that the larger amount of memory then available, the larger capacity and reliability of redundant arrays of workstation disks, and the dropping cost of CPUs enabled more widespread use of such systems. Soon after, the first Beowulf clusters came into use [, ]. The late s saw many cluster projects.
Several of them distinguished themselves from networks of workstations by creating system software that mimicked that of large supercomputers []. SETI@home is described in [] and in more detail in [] and []. The infrastructure to run SETI@home evolved into BOINC [], which now runs many other volunteer computing projects in addition to SETI@home. The various aspects of grid computing are described by the articles in []. Cluster computing has been many people’s first contact with parallel computing and system design. In [], Pfister discusses many of the issues and concepts only briefly touched on in this introductory article.
Bibliography
Anderson DP () Boinc: a system for public-resource computing and storage. In: GRID ’, Proceedings of the th IEEE/ACM international workshop on grid computing, IEEE Computer Society, Washington, DC, pp –
Anderson DP, Cobb J, Korpela E, Lebofsky M, Werthimer D () SETI@home: an experiment in public-resource computing. Commun ACM ():–
Anderson TE, Culler DE, Patterson DA, and the NOW team () A case for NOW (networks of workstations). IEEE Micro ():–
Brightwell R, Ann Fisk L, Greenberg DS, Hudson T, Levenhagen M, Maccabe AB, Riesen R () Massively parallel computing using commodity components. Parallel Comput (–):–
Foster I, Kesselman C (eds) () The grid: blueprint for a new computing infrastructure. Morgan Kaufmann, San Francisco
Korpela E, Werthimer D, Anderson D, Cobb J, Lebofsky M () SETI@home – massively distributed computing for SETI. Comput Sci Eng ():–
Litzkow M () Remote Unix – turning idle workstations into cycle servers. In: Usenix summer conference, Phoenix, pp –
Litzkow M, Livny M, Mutka M () Condor – a hunter of idle workstations. In: Proceedings of the th international conference of distributed computing systems, San Jose
Pfister GF () In search of clusters: the ongoing battle in lowly parallel computing, nd edn. Prentice-Hall, Upper Saddle River
Riesen R, Brightwell R, Bridges PG, Hudson T, Maccabe AB, Widener PM, Ferreira K () Designing and implementing lightweight kernels for capability computing. Concur Comput Pract Exp ():–
Shoch JF, Hupp JA () The “worm” programs – early experience with a distributed computation. Commun ACM ():–
Sterling T, Savarese D, Becker DJ, Dorband JE, Ranawake UA, Packer CV () Beowulf: a parallel workstation for scientific computation. In: Banerjee P (ed) Proceedings of the international conference on parallel processing, Urbana-Champaign, pp I:–I:
Sterling T, Savarese D, Becker DJ, Fryxell B, Olson K () Communication overhead for space science applications on the Beowulf parallel workstation. In: HPDC ’, Proceedings of the th IEEE international symposium on high performance distributed computing, IEEE Computer Society, Washington, DC, p
Thain D, Tannenbaum T, Livny M () Distributed computing in practice: the Condor experience. Concurr Pract Exp (–):–
Werthimer D, Cobb J, Lebofsky M, Anderson D, Korpela E () SETI@home – massively distributed computing for SETI. Comput Sci Eng ():–
MIMD Lattice Computation: see MILC
MIN: see Networks, Multistage
ML: see Concurrent ML
MMX: see Vector Extensions, Instruction-Set Architecture (ISA)
Model Coupling Toolkit (MCT): see Community Climate System Model
Models for Algorithm Design and Analysis: see Models of Computation, Theoretical
Models of Computation, Theoretical
Gianfranco Bilardi, Andrea Pietracaprina
University of Padova, Padova, Italy
Synonyms Computational models; Models for algorithm design and analysis
Definition A model of computation is a framework for the specification and the analysis of algorithms/programs. Typically, a model of computation includes four components: an architectural component, described as an interconnection of modules of various functionalities; a specification component, determining what is a (syntactically) valid algorithm/program; an execution component, defining which sequences of states of the architectural modules constitute valid executions of a program/algorithm on a given input; and a cost component, defining one or more cost metrics for each execution. A model of computation can be evaluated with respect to three conflicting requirements: the ease of algorithm/program design and analysis (usability); the ability of estimating, from the cost metrics provided by the model, the actual performance of a program on a specific real platform (effectiveness) and on a class of platforms (portability). An ample variety of models of parallel computation are known in the literature, which achieve different trade-offs among the above requirements.
Discussion Overview A major objective of Computer Science and Engineering is the development of automatic solutions to computational problems. Formally, a computational problem Π is a binary relation between a set of problem instances and a set of problem solutions. The intended meaning of (x, y) ∈ Π is that y is a solution to instance x. Both instances and solutions are strings from suitable finite alphabets, typically encoding some kind of mathematical objects, such as numbers, matrices, sets, or
graphs. Informally, an algorithm is a procedure that specifies how to obtain an output string from an input string by the repeated application of primitive operations (functions) from some chosen set. What constitutes an execution of the algorithm on a given input must be unambiguously specified, and the execution process must terminate within a finite number of steps, generally a function of the given instance. Algorithm A solves problem Π if, for any input instance x of Π, it outputs a solution y such that (x, y) ∈ Π. In the outlined context, a model of computation is a formalism for the specification of algorithms. Such a specification can be referred to as a program. A program whose execution terminates on every input specifies an algorithm. In principle, an algorithm is rigorously defined only by means of a program, within a chosen model of computation. In practice, however, when an algorithm is described in the literature, the main ideas are typically presented at a high level, while several (“straightforward, yet tedious”) details are left implicit. The details can be provided in different ways, within the chosen model of computation, yielding different programs for what is informally considered to be the same algorithm. Sometimes, an approach to solving a problem is described at an even higher level of abstraction, where even the model of computation is left at least in part unspecified, providing what will be referred to as an algorithmic strategy. For example, the quicksort algorithmic strategy to sort a set of records consists in separating the records into two subsets, to be sorted recursively, the keys of the records in the first subset being all smaller than those in the second subset. This strategy can be made specific assuming serial or parallel execution, assuming sequential or random memory access, rearranging the records in situ or using a proportional amount of auxiliary storage, etc. Depending on the model of computation, further elements may be specified by an algorithmic description, such as the order of execution of actions (e.g., the order of the two recursive calls in quicksort), which processor is assigned which action (when processors form an explicit component of the model), which data must be stored in which memory region (particularly relevant if memory access time in not uniform), and so on. A variety of different algorithms can then be produced, all referred to as quicksort, in recognition of the common underlying algorithmic strategy.
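To make the formal definition of a computational problem concrete, sorting can be written as such a binary relation; the notation below is introduced only for this example and is not used elsewhere in the entry:

\[
\Pi_{\text{sort}} = \{ (x, y) : y \text{ is a permutation of the input sequence } x \text{ with } y_1 \le y_2 \le \cdots \le y_n \}.
\]

An algorithm A solves \(\Pi_{\text{sort}}\) if, for every instance \(x\), it outputs some \(y\) with \((x, y) \in \Pi_{\text{sort}}\). For sorting the relation happens to be a function, but the relational form also covers problems with many acceptable solutions, such as outputting any spanning tree of an input graph.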
In the s, several models of computation were proposed including the recursive functions of Gödel, the Turing machines of Turing, and the λ-calculus of Church. These models were soon recognized to be equivalent to each other, in the sense that there are automatic translations of programs from one model to another that preserve the input-output behavior: On a given input, the translated program terminates if and only if the original program terminates and, if so, it produces the same output. Turing machines can be physically realized (taking the infinitude of the storage tape with a grain of salt), for example, as interconnections of Boolean gates and flip flops (taking noise with another grain of salt). Therefore, the effectiveness of the model is not under question. The Church–Turing thesis states that Turing machines (as well as any equivalent model) provide a complete model of computation, in the sense that any procedure that is intuitively reckoned as executable admits a specification in the model. The thesis has been debated but it is generally accepted, as no counterexamples have been discovered yet. The halting theorem of Turing and related results show that no complete model of computation exists if all programs within the model are guaranteed to terminate. In summary, Turing machines or equivalent models determine the boundary between the problems that can be solved algorithmically and those that cannot. Parallel computing does not alter this boundary, although it can greatly enhance the performance of real machines. An algorithm automates the solution to a problem at the logical level: A procedure with a finite description can be applied to solve all of the (typically) infinitely many instances of the problem. But it is also of great practical interest to automate the solution at the physical level, by realizing its execution as a physical process. Physical processes utilize resources, which raises the question of efficiency, that is, of minimizing, or at least reducing, the cost due to resources consumed by the execution process, such as space, time, and energy. Thus, in addition to enabling the specification of an algorithmic strategy, models of computation can provide a framework to specify aspects of the physical execution of that strategy, to analyze the corresponding cost, and ultimately to design efficient algorithms. In summary, a model of computation typically includes the following four components (which may be trivial in some cases):
● An architectural component, described as an interconnection of modules of various functionalities. The state sets of each module and the rules that determine the possible state transitions are also provided.
● A specification component, determining what is a (syntactically) valid program.
● An execution component, defining which sequences of states of the architectural modules constitute valid executions of a given program on a given input.
● A cost component, defining one or more cost metrics for each execution.
Several properties are desirable in a model of computation. For example, the following three have been identified in [] as key properties for a model M:
● Usability, which refers to the ease with which an algorithm can be specified and its cost analyzed in the model.
● Effectiveness, with reference to a target platform P, reflects how well the actual cost of execution of programs for P, obtained by suitably translating programs specified on M, can be derived from the cost metrics of M.
● Portability, with reference to a (possibly wide) class C of target platforms, characterizes the effectiveness of M with respect to all of the platforms of C.
While usability, which is inherent to the model’s definition, is a somewhat subjective property and cannot be easily captured by a quantitative metric, effectiveness and portability are related to the target physical machines and, to some extent, they are amenable to quantitative analysis []. Unfortunately, the above three properties are often in conflict with one another. This is perhaps the main reason behind the considerable proliferation of models proposed over the years, resulting from different trade-offs among the various requirements, as briefly discussed below. The desirable properties of models of computation and their trade-offs have played a role even for sequential computing, a paradigm broadly characterized by the constraint that operations are executed one at a time. Usability considerations have led to the success of the Random Access Machine (RAM) model. As commercial platforms have evolved deep memory hierarchies with half a dozen levels, the uniform memory-access
time of the RAM has made the model less and less effective with respect to these platforms. Several variants of the cost component of the model have been proposed [], to capture the variability of access time, block transfer, pipelined transfer, and other features of current memory organizations. The scenario has become only more complex with the advent of parallel computers. With a given execution of a given algorithm one can associate a functional trace, an acyclic directed graph whose nodes correspond to the primitive operations specified by the algorithm on input x, and where an arc from node u to node v indicates that the result of operation u is an operand for operation v. The functional traces of an algorithm represent what may be called the logical aspect of the computation. All models must provide means to specify the functional traces (as a function of the problem input). However, several other aspects arise when the computation is to be realized as a physical process, the evolution of the state of a computer. There is ample freedom in how the logical structure of an algorithm can be physically realized and there is ample freedom in choosing which aspects of the physical realization to capture explicitly within the model of computation. Some reflection on these degrees of freedom can provide some orientation when navigating among the multitude of models. The physical execution of a computation requires various types of unit: control units, which coordinate the construction of the trace from the program and the input; operation units, which perform the primitive operations defined by the trace; memory units (cells), which store values from the time they are generated to the time(s) they are used; and communication units (channels), which connect different units. Parallel computers have several units of each type whose role in the computation has to be specified. The management of these resources is of central importance for parallel computing, and models of computation differ widely in the mechanisms they provide for specifying the use of machine resources. A general question to be addressed is whether a given resource is better handled automatically or better left under the control of the algorithm designer and programmer. Unfortunately, this question has no simple, universal answer. Automatic resource management typically leads to better usability, and sometimes even
to greater effectiveness (e.g., when management decisions can benefit from dynamic information available only at execution time). However, there are cases where understanding the mathematical structure of the target computation is crucial to the efficient use of resources. In these cases, the algorithm designer can potentially provide a better resource management. However, the required level of competence may be high and, therefore, the amount of programming effort should be budgeted. Moreover, the performance attained through a poor resource management explicitly coded in the algorithm could turn out worse than the one ensured by a suboptimal automatic management. The amount of resource control left to the algorithm designer is an interesting criterion for comparing the relative effectiveness of models of computation and can be used to organize some of them in a hierarchy where, proceeding from top to bottom, an increasing number of resource-management aspects is explicitly represented.
Representative Models At the top level of the hierarchy of models one can place those at the heart of functional [] and dataflow languages [], in which programs specify only the functional traces of an algorithm. Important cost metrics identifiable at this level are the number of operations, often referred to as work and denoted W, and the length of a critical path L, that is the length of a longest path in the functional trace graph. The ratio W/L provides an indication of the amount of parallelism implicit in the algorithm. In fact, an insightful result due to Brent [] shows that, off-line, the operations of a trace can be scheduled in at most L stages each including at most W/L operations. However, to actually make this parallelism explicit when dynamically executing a program is not necessarily trivial. In physical systems, data have to be stored from the time they are produced to the time they are used, leading to the notion of storage, a set of cells that can represent different values at different times. Most models of computation explicitly introduce the notion of storage and let the computation be defined directly in terms of transformations to be applied to the storage content. The Turing machine, the RAM, imperative programming languages such as FORTRAN or C, and most parallel models
reviewed below are examples of storage-based models. In shared-memory models, a computation is specified as a sequence of transformations applied to a global shared memory. Parallelism can be expressed by parallel loop constructs, whose iterations can be executed simultaneously, or by parallel function calls. Synchronization mechanisms are also provided, to constrain the time sequence of the updates, since different sequences will generally produce different results. The work metric of a shared-memory program is the same as that of a functional program encoding the same algorithmic strategy. However, while for a functional program any topological ordering of the trace is legal, for a shared-memory program synchronization instructions may introduce further constraints. The Work-Depth model introduced in [] is an example of a shared-memory model which has been widely used in the literature for the design and analysis of parallel algorithms. An algorithm in the Work-Depth model is described as a standard RAM algorithm enriched with parallel instructions, for example, a parallel loop, which make it possible to specify sets of parallel tasks (single elementary operations as well as complex procedures). Operands and results of the operations reside in a shared memory. The relative order of execution of instructions that belong to different parallel tasks is unconstrained. However, all tasks generated by a given instruction must complete before the next instruction can be executed. This constraint can be modeled by adding arcs to the trace graph. The length of the critical path in this augmented graph is called the depth of the shared-memory computation and denoted by D. Clearly, D ≥ L. In general, implementing the same algorithm within less space may result in a greater depth, hence in a smaller amount of parallelism. Formulating an algorithm within a shared-memory model, as opposed to a functional language, enables the algorithm designer and programmer to take control of a simple, but important aspect of resource management: the reuse of memory. In a system that executes a functional program, memory is automatically allocated to hold intermediate values, with no burden for the programmer. However, since determining when values will no longer be reused is computationally hard (being in general an undecidable property), an automatic system will often be unable to reuse memory space, in situations where a programmer would know
(by an analysis of the algorithm at hand) that the space could be safely reused. On the other hand, poor programmer’s choices regarding memory reuse may inhibit precious parallelism (D ≫ L), as the proponents of functional programming have often pointed out. In the hierarchy of models, one level below parallel shared-memory models one can place the Parallel Random Access Machines (PRAMs) [, ]. A PRAM comprises a set of P RAM processors; each processor has a unique identification number (id), is equipped with a local memory, and has also direct access to a shared memory. Whereas algorithms developed on shared-memory models specify only which variables are involved in each operation, PRAM algorithms also specify which processor is to carry out the operation. In fact, a PRAM program is syntactically very similar to a RAM program where instructions can make reference to the processor’s id, thus enabling the same program to result in different executions on different processors. Instruction mechanisms to dynamically activate and deactivate processors are also provided. Under plausible scenarios, a PRAM could automatically execute a shared-memory algorithm incurring a slowdown logarithmic in the number of processors, to systematically handle the assignment of operations to processors. In a nutshell, a parallel data structure could be employed where operations to be executed are inserted and then deleted when an available processor takes them up for execution. For most algorithms of practical interests, however, this overhead can be avoided by a direct specification of the algorithm within the PRAM model. Since such a specification is often rather straightforward, the distinction between shared-memory and PRAM algorithms is often blurred in the literature, although the distinction ought to be retained, for conceptual clarity. Shared-memory and PRAM models assume that memory cells form a uniform and global address space, where any subset of addresses can be simultaneously accessed at a cost independent of the subset. Physical machines depart significantly from this idealized assumption. First, memory is partitioned into banks, and while accesses to cells in different banks can be concurrent, accesses to cells in the same bank are serialized. Second, processing elements and memory banks are organized into a network, the topology of which can be a degree of freedom. The result is that the cost of accessing
a subset of memory locations is highly dependent upon the distribution of such subset across the physical memory. To investigate the impact of these physical constraints, a class of network-of-processors models has been widely studied. Typically, one such model consists of a network of P RAM processors where each processor is directly connected to one memory bank. The network specifies point-to-point connections among the processors according to a certain topology. Each processor can execute standard RAM instructions, accessing data in its local memory bank, as well as specific instructions to send and receive messages from the immediate neighbors determined by the network’s topology. In terms of cost, it is assumed that a word can be exchanged between direct neighbors in constant time. A variety of interconnection topologies have been considered, such as the complete graph, arrays of various dimensions, hypercubes and hypercubic networks, trees and fat-trees, and expander-based networks. A model based on the complete graph, referred to as Module Parallel Computer, was introduced in [] to investigate the impact of memory granularity decoupled from the further constraint imposed by a limited connectivity. However, due to physical and technological constraints, a small degree, preferably independent of the number of processors, is more realistic. Particularly suitable for writing algorithms for networks of processors is the message passing programming paradigm, in which parallel computation results from the cooperation of a number of sequential processes that exchange messages. A message passing language can be obtained by augmenting a sequential language with send and receive instructions. At execution time, a set of processes is activated, each executing the given program. Processes have a unique identification number, which serves as an address for send and receive instructions and provides a mechanism to differentiate the actions of different processes. Each process has its own memory space, accessible only to that process. A distinct copy of a statically defined variable is available to each process; thus, an instruction involving that variable has different effects for different processes. Message passing has become the main programming paradigm in parallel high performance computing, particularly in the form provided by the Message Passing Interface (MPI) library []. An MPI program can be
written assuming a specific mapping of processes to the nodes of a given topology and constraining message exchange to immediate neighbors. Strictly adhering to the network-of-processors model presents the unwelcome consequence that the transfer of messages between arbitrary nodes has to be managed in software, a solution that is typically neither convenient nor efficient. To circumvent this obstacle, the design of efficient routing algorithms for various topologies has been widely studied. Today, virtually all commercial systems provide a hardware router that can automatically deliver messages from any node to any other node. The design and programming of algorithms is then relieved of the burden of managing the details of message transport. However, as discussed in the following paragraphs, careful data mapping and operation scheduling remain crucial to performance, resulting in communication patterns that the router can deliver faster. In the outlined context, it is natural to wonder to what extent algorithm design should be aware of memory granularity and network topology. This question has been investigated, in part, by considering automatic ways of emulating a PRAM computation on a network []. Roughly speaking, a rather sophisticated analysis has shown that one PRAM step can be reduced to a constant number of arbitrary data permutations, on an amortized average, or to logarithmically many permutations, in the worst case. The average and the worst-case approaches are substantially different, but share an unwelcome property: In order to limit contention at individual memory banks, the PRAM memory is distributed among the available banks through hash functions or sophisticated schemes based on expander graphs, so that the physical layout of data turns out to be a highly scrambled version of the original PRAM layout, thus forfeiting any locality structure captured by the latter. For most networks of practical interest, an arbitrary permutation is rather costly. In contrast, many computations of interest can be mapped into such networks in a way that most memory accesses by a given processor involve its own memory bank or that of nearby processors. Hence, at least asymptotically, there is much to be gained in explicitly optimizing an algorithm for a given topology, by understanding the locality structure and exploiting it by suitable scheduling and mapping of data and operations. For most past and current machines,
this conclusion has been tempered by the fact that the actual data-transfer time is often negligible when compared to the high local latency associated with initiating an interprocessor communication (currently, in the order of microseconds). In the future, however, one can expect that local overheads will decrease (being mostly of a technological and economical nature) while transfer time is destined to increase as machines become larger (being rooted in the more fundamental constraint arising from the finiteness of the speed of light). Substantial effort has been devoted to the development of efficient network algorithms together with lower-bound techniques to assess their performance []. The variety of possible interconnections opens a fertile field for theoretical investigations, but at the same time brings formidable challenges. The prospect of rethinking and reprogramming each algorithm for a number of different topologies is not very attractive in practice. A few avenues have been explored to mitigate this portability issue. One approach is based on efficient simulations of a guest network by a host network, to enable the automatic porting to the host of algorithms optimized for the guest. A wealth of useful and often nontrivial simulation results have been obtained. Unfortunately, even a (worst-case) optimal general-purpose simulation does not guarantee that an optimal guest algorithm is transformed into an optimal host algorithm, although weaker performance guarantees can often be made, in the presence of some structural similarity between the host and the guest. A second approach is based on the introduction of computational models whose cost component includes parameters aimed at capturing some key performance-related properties of networks. Algorithms on these models can be parameter aware. The hope is that networks of interest can be well approximated by the model, with a suitable choice of the parameter values. The Bulk Synchronous Parallel (BSP) model [] and the LoGP model [] are the most popular representatives of this class, but numerous variants and refinements have been also proposed in the literature. In the BSP model, the architectural component consists of a set of RAM processors that can exchange messages via a communication infrastructure. The execution of an algorithm consists of a sequence of rounds,
called supersteps: In each superstep a processor can execute operations on local data and send messages to other processors, which are received only at the beginning of the next superstep. The cost of a superstep is expressed as w + gh + ℓ, where g and ℓ are model parameters (respectively a measure of inverse bandwidth and of global synchronization time), w is the maximum number of operations executed by any processor, and h is the maximum number of messages sent or received by any given node. A shared-memory variant of BSP, called the Queuing Shared-Memory (QSM) model, was introduced in [], where in the cost of a superstep the maximum number of reads/writes issued by a processor takes the place of the quantity h, the latency parameter is dropped, and the maximum contention for an individual shared-memory location is explicitly accounted for. A significant generalization of BSP is the Decomposable BSP (D-BSP) model [, Chap. ], whose architectural component is based on a binary-tree conceptual structure where each leaf corresponds to a RAM processor and each internal node identifies a cluster of processors residing at the leaves of its subtree. By letting the computation proceed independently within clusters of a given level and charging bandwidth and synchronization costs increasing with the cluster size, the model encourages the algorithm designer to expose a form of data locality called submachine locality. The availability of different parameters at different tree levels affords a better approximation of specific processor networks. As is well known, data locality is all-important in sequential computing, owing to the hierarchical structure of storage. A strong relation between submachine locality and temporal locality of reference was revealed by efficiently simulating D-BSP algorithms on standard sequential memory hierarchies [, Chap. ]. An integration of the memory and network hierarchies arises when accounting for communication delays due to the finite speed of messages, both in the network and within the memory []. In recent years, parametric models that capture hierarchical features of both memory and network have been motivated by the increasing popularity and widespread adoption of platforms based on multicore architectures. These models typically feature a hierarchical memory system comprising one large shared memory connected to the processors through several levels of nested smaller memory modules (or
caches) which are shared by progressively smaller subsets of processors. Examples of these models are the Hierarchical Multi-level caching (HM) model by [] and the Multi-BSP model by []. Models of computation which account for the memory and network hierarchies typically provide parameters that capture salient features of each level of these hierarchies (e.g., bandwidth and latency characteristics, memory or cluster sizes, etc.). Algorithms developed on these models are typically parameter aware, in the sense that they adapt their strategies based on the model’s parameters in order to maximize data locality. An ambitious goal is the design of algorithms which exhibit (near-)optimal performance for a large range of the model’s parameters without using these parameters to influence the course of their computation. The development of algorithms of this kind has been explored for the D-BSP model [] and for the HM model [], and the terms network oblivious and multicore oblivious algorithms have been respectively employed (in analogy with cache-oblivious algorithms which pursue a similar goal in the context of sequential memory hierarchies). It is remarkable that some problems (although not all) do admit oblivious algorithms. A somewhat radical avenue to solve the challenge of portability, or rather to avoid it altogether, would be enabled by the convergence to just one type of machine architecture and organization for parallel computers. Can such a convergence be reasonably expected? One reason to build different machines is that they may be better suited to different computational tasks. Therefore, an interesting theoretical question with important practical ramifications concerns the existence of universal machines. More specifically, the question of universality can be cast as follows. Consider the class of all machines with a given value of a cost metric, chosen as a mathematical abstraction of the actual cost of realizing the machine. A machine U is universal for this class if it can simulate any of its members. The quality of U can be characterized by two metrics: (a) the machine cost blowup, that is, the ratio between the cost of U and the cost of the machines in the class, and (b) the slowdown, that is, the worst-case ratio between the time taken by U and the time taken by some machine in the class to perform some computation. The existence of universal machines with blowup
and slowdown both very close to one would imply that little performance would be given up by building only machines of the universal type. For the class of Bounded Degree Networks (BDNs), that is, processor networks whose degree is constant with respect to the number P of nodes, if P is taken as the cost, then it is known that there are universal BDNs with blowup equal to one and with slowdown O(log P). Examples are the shuffle-exchange and the cube-connected-cycles BDNs, both of small constant degree. More generally, a slowdown S ≤ log P can be achieved by BDNs with a blowup of P^Θ(1/S) []. An alternate measure of cost, which better reflects the scenario of VLSI technology, is the layout area. The area of BDNs has been investigated in the VLSI model of computation [], which enables the study of the space layout. A number of area-universal architectures have been proposed, for example, with polylogarithmic blowup and slowdown. These networks exhibit what has been dubbed the fat-tree structure, with processors at the leaves and switching subnetworks at the internal nodes. It is instructive to wonder why universal BDNs are not area-universal networks. Mathematical analysis shows that the bisection bandwidth of universal BDNs grows proportionally to P/log P with the number of processors, which in turn forces the layout area to grow proportionally to P^2/log^2 P. Therefore, the layout of a BDN is scarcely populated with processors, hence it is not effective on algorithms that are more limited by computation than by communication requirements. An example is the multiplication of two square matrices, for which a two-dimensional mesh is a more effective architecture than a shuffle-exchange or a cube-connected-cycles network of the same area. (Technically, the bisection bandwidth of a network is the minimum bandwidth across any cut that splits the network into two subnetworks of (nearly) the same number of nodes.) The universality results mentioned above, both for the processor and for the area measures of cost, are valid assuming constant delays for communications between direct neighbors in the BDN. It is provable that, in any layout of a network that is universal in this model, the average geometric distance between direct neighbors grows as some function of P. Therefore, the assumption of constant delays becomes physically unrealistic for large enough machines. A VLSI-like model where
communication delays are assumed to be proportional to the distance between source and destination has been considered in []. In this type of model, a mesh topology of the same number of dimensions as the hosting physical space achieves universality with constant blowup and slowdown. Thus, the mesh topology, already adopted in several supercomputers, appears as a promising candidate to which computer architecture could converge, at least on a large scale. On balance, it is safe to assume that in the short and medium term a strong convergence among commercial platforms is not to be expected. The design of effective algorithms for such platforms will therefore require different models of computation or at least models that are suitably parametrized. However, even if the objective were to develop efficient algorithms for just one type of machine, a multiplicity of models would likely prove quite useful. Developing an algorithmic strategy at the functional or the shared-memory level, without being concerned at first with too many resource optimization issues, and subsequently refining the algorithm and adapting it to more constraining models may prove a more manageable task than designing directly for the target machine. Of course, as in most design processes that proceed in stages, decisions made in the early stages, based on partial cost metrics, may lead to an overall suboptimal design. However, such a design can often provide a valuable starting point for further efforts. Furthermore, a retrospective on parallel algorithm design would show that, when pursued with a grain of salt, the stage-by-stage approach is generally quite effective.
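To tie together the cost metrics that recur in this discussion, the following standard work/critical-path (Brent-style) bound and the numbers plugged into it are given purely as an illustration; the specific values are assumptions, not results from the entry:

\[
\max\!\left(\frac{W}{p},\, L\right) \;\le\; T_p \;\le\; \frac{W}{p} + L .
\]

For a trace with \(W = 10^{6}\) operations and critical-path length \(L = 10^{3}\), the available parallelism is \(W/L = 10^{3}\). On \(p = 100\) processors a greedy schedule needs at most \(10^{4} + 10^{3} = 1.1 \times 10^{4}\) steps, within 10% of the lower bound \(W/p = 10^{4}\); on \(p = 10^{4}\) processors the critical path dominates and no schedule can run faster than \(L = 10^{3}\) steps.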
Related Entries
Bandwidth-Latency Models (BSP, LogP)
Brent’s Theorem
BSP (Bulk Synchronous Parallelism)
Data Flow Graphs
Distributed-Memory Multiprocessor
Flynn’s Taxonomy
Interconnection Networks
Locality of Reference and Parallel Processing
Massive-Scale Analytics
MPI (Message Passing Interface)
Network Obliviousness
PRAM (Parallel Random Access Machines)
PVM (Parallel Virtual Machine)
SPMD Computational Model
Universality in VLSI Computation
VLSI Computation
Bibliographic Notes and Further Reading In [, ], many relevant models for parallel computation are surveyed and critically discussed. The PRAM and network-of-processors models are the topics of two excellent reference books, respectively by J. Jájá [] and F.T. Leighton [], which illustrate these models through an ample spectrum of algorithms and theoretical results. The book by J. Savage [] proposes a comprehensive survey of models of computation, for both sequential and parallel processing, discussing their relative power and several related complexity issues. Finally, the first section of the recent Handbook of Parallel Computing [], edited by S. Rajasekaran and J. Reif, comprises chapters devoted to models of computation.
Bibliography . Backus JW () Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Commun ACM ():– . Bilardi G, Pietracaprina A, Pucci G () A quantitative measure of portability with application to bandwidth-latency models for parallel computing. In: Proceedings of EUROPAR . LNCS, vol . Toulouse, pp – . Bilardi G, Pietracaprina A, Pucci G, Silvestri F () Networkoblivious algorithms. In: Proceedings of the st IEEE Int. Parallel and Distributed Processing Symposium, Long Beach, CA, pp – . Bilardi G, Preparata FP () Horizons of parallel computing. J Parallel Distr Com :–. Also appeared in Proceedings of the INRIA th Anniversary Symposium, and as Tech. Rep. CS-- Brown University . Bilardi G, Preparata FP () Processor-time tradeoffs under bounded-speed message propagation: Part I, upper bounds. Theor Comput Syst :– . Brent RP () The parallel evaluation of general arithmetic expressions. J ACM ():– . Chowdhury RA, Silvestri F, Blakeley B, Ramachandran V () Oblivious algorithms for multicores and network of processors. In: Proceedngs of the th IEEE Int. Parallel and Distributed Processing Symposium, Atlanta, pp –. Best Paper Award (Algorithms Track) . Culler DE, Karp R, Patterson D, Sahay A, Santos E, Schauser KE, Subramonian R, Eicken TV () LogP: A practical model of parallel computation. Commun ACM ():–
. Dennis JB () First version of a data flow procedure language. In: Proceedings of the Symposium on Programming. LNCS, vol . New York, pp – . Fortune S, Wyllie J () Parallelism in random access machines. In: Proceedings of the th ACM Symp. on Theory of Computing, San Diego, pp – . Gibbons PB, Matias Y, Ramachandran V () Can a sharedmemory model serve as a bridging-model for parallel computation? Theor Comput Syst ():– . Goldschlager LM () A unified approach to models of synchronous parallel machines. In: Proceedings. of the th ACM Symp. on Theory of Computing, San Diego, pp – . Goodrich MT () Parallel algorithms column : Models of computation. ACM SIGACT News ():– . Harris TJ () A survey of PRAM simulation techniques. ACM Comput Surv ():– . JáJá J () An introduction to parallel algorithms. Addison Wesley, Reading . Leighton FT () Introduction to parallel algorithms and architectures: arrays, trees, hypercubes. Morgan Kaufmann, San Mateo . Maggs B, Matheson LR, Tarjan RE () Models of parallel computation: a survey and synthesis. In: Proceedings of the th Hawaii International Conference on System Sciences (HICSS), vol . Honolulu, pp – . Mehlhorn K, Vishkin U () Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memories. Acta Inform :– . Message Passing Interface Forum. http://www.mpi-forum.org/ . Meyer auf der Heide F () Efficient simulations among several models of parallel computation. SIAM J Comput ():– . Reif J, Rajasekaran S (ed) () Handbook of parallel computing: models, algorithms and applications. CRC Press, Boca Raton, FL . Savage JE () Models of computation – exploring the power of computing. Addison Wesley, Reading . Shiloach Y, Vishkin U () O(n log n) parallel MAX-FLOW algorithm. J Algorithm ():– . Valiant LG () A bridging model for parallel computation. Commun ACM ():– . Valiant LG () A bridging model for multi-core computing. J Comput Syst Sci :–
Modulo Scheduling and Loop Pipelining
Arun Kejariwal, Alexandru Nicolau
Yahoo! Inc., Sunnyvale, CA, USA
University of California Irvine, Irvine, CA, USA
Loop parallelization is the most critical aspect of any parallelizing compiler, since most of the time spent in
executing a given program is spent in executing loops. In particular, the innermost loops of a program (i.e., the loops most deeply nested in the structure of the program) are typically the most heavily executed. It is therefore critical for any parallelizing (Instruction Level Parallelism (ILP)) compiler to try to expose and exploit as much parallelism as possible from these loops. Historically, loop unrolling, a standard non-local transformation, is used to expose parallelism beyond iteration boundaries. When a loop is unrolled, the loop body is replicated to create a new loop. Compaction of the unrolled loop body helps exploit parallelism across iterations as the operations in the unrolled loop body come from previously separate iterations. However, in general, loops cannot be fully unrolled, both because of space considerations, and because of the fact that loop bounds are often unknown at compiletime. Furthermore, the extent of parallelism exposed by loop unrolling is restricted by the amount of unrolling. In order to extract higher levels of parallelism (i.e., beyond unrolling) in software, the various iterations of a loop are pipelined (akin to pipelining of operations in hardware [, ]) subject to dependences and resource constraints. This technique is known as loop pipelining. An example of loop pipelining is shown in Fig. . Figure b shows the pipelined execution of the different iterations of the loop shown in Fig. a, where LBi represents the loop body of the i-th iteration. Loop pipelining has its roots in the hand-coding practices for microcode compaction [, ]. Modulo scheduling [, ] pipelines (or overlaps) the execution of successive iterations of a loop such that successive iterations are initiated at regular intervals. The interval between successive iterations is determined based on data dependences and resource constraints. This constant interval between the start of successive iterations is called the initiation interval (II). Modulo Scheduling was the first method proposed for loop pipelining or software pipelining (the term “software pipelining”, originally coined to describe a specific transformation [], is nowadays commonly used interchangeably with loop pipelining to describe any and all transformations that deal with extracting parallelism along the lines discussed in this entry). The objective of loop pipelining is to find a recurring pattern (kernel) across iterations, whether themselves compacted or
Modulo Scheduling and Loop Pipelining. Fig. Loop pipelining ((a) a loop do i = 1, N ... end do with body LB; (b) pipelined execution of the loop bodies LB1, LB2, LB3, ... of successive iterations)
Modulo Scheduling and Loop Pipelining. Fig. Intra and loop carried dependences ((a) the loop do i = 1, N; 1: q = p mod i; 2: A[i] = q + r; 3: B[i] = A[i−1] + 1; end do; (b) its DDG, with the loop-carried dependence of distance <1> shown dashed)
not previously, as they get pipelined, and to replace the original loop with this pattern, plus prologue and epilogue code. The problem of loop pipelining then reduces to how to determine the (best) kernel of this transformed loop.
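To make the prologue/kernel/epilogue structure concrete, the following sketch (in Java, not taken from the entry) contrasts plain unrolling with a pipelined loop whose body consists of three dependent stages a(i), b(i), c(i); the stage methods are placeholders.

```java
// Minimal sketch (not from the entry): contrasts unrolling with the
// prologue/kernel/epilogue shape of a software-pipelined loop.
// a(i), b(i), c(i) stand for three dependent operations of one iteration.
class PipeliningShape {
    static void a(int i) { /* stage 1 of iteration i */ }
    static void b(int i) { /* stage 2, depends on a(i) */ }
    static void c(int i) { /* stage 3, depends on b(i) */ }

    // Unrolling by 2: exposes parallelism inside the unrolled body only.
    static void unrolled(int n) {
        int i = 1;
        for (; i + 1 <= n; i += 2) {
            a(i); b(i); c(i);               // iteration i
            a(i + 1); b(i + 1); c(i + 1);   // iteration i+1
        }
        for (; i <= n; i++) { a(i); b(i); c(i); }   // leftover iterations
    }

    // Software-pipelined shape: the prologue fills the pipe, the kernel is
    // the repeating pattern, the epilogue drains the pipe.
    static void pipelined(int n) {
        if (n < 3) { for (int i = 1; i <= n; i++) { a(i); b(i); c(i); } return; }
        a(1);                                 // prologue
        b(1); a(2);
        for (int i = 1; i <= n - 2; i++) {    // kernel: one line per control step
            c(i); b(i + 1); a(i + 2);         // three iterations in flight
        }
        c(n - 1); b(n);                       // epilogue
        c(n);
    }
}
```

The kernel here issues one operation from each of three iterations per step; the prologue and epilogue handle the first and last iterations that cannot be fully overlapped.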
Preliminaries In general, an operation in a loop body is dependent on one or more operations. These dependences may either be data dependences (flow, anti or output) [] or control dependences [, ]. Data dependence between operations of the same iteration is termed intraiteration dependence; data dependence between operations from different iterations is termed inter-iteration or loop-carried dependence. Subsequently, a dependence refers to a data dependence unless stated otherwise explicitly. For example, in Fig. , the dependence between operation v and operation v is an intra-iteration dependence, whereas the dependence between v and v , shown by a dashed arrow in the data dependence graph (DDG) shown in Fig. b, is a loop carried dependence. The dependence distance (in number of iterations) between v and v is shown in angled brackets. Next, some terms used in the rest of the entry are defined.
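As a concrete illustration (a Java rendering of the figure's loop; the arrays are assumed to have equal length and their initialization is left to the caller), the two kinds of data dependence can be annotated directly in code:

```java
// Illustrative only: the loop mirrors the figure's example.
// v1 writes q, v2 reads q          -> intra-iteration (same i) dependence
// v2 writes A[i], v3 reads A[i-1]  -> loop-carried dependence, distance <1>
class DependenceKinds {
    static void run(int[] A, int[] B, int p, int r) {
        for (int i = 1; i < A.length; i++) {
            int q = p % i;           // v1
            A[i] = q + r;            // v2: depends on v1 within the same iteration
            B[i] = A[i - 1] + 1;     // v3: depends on v2 of the *previous* iteration
        }
    }
}
```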
Modulo Scheduling and Loop Pipelining. Fig. Loop recurrences ((a) the loop do i = 1, N; 1: B[i] = A[i−1] + r; 2: A[i] = B[i] + 1; end do; (b) its DDG, in which the intra-iteration dependence and the loop-carried dependence of distance <1> form a cycle)
Definition Given a dependence graph G(V, E), a path from an operation v_1 to an operation v_k is a sequence of operations ⟨v_1, . . . , v_k⟩ such that (v_{i−1}, v_i) ∈ E for i = 2, . . . , k.
Definition A cycle in a dependence graph is a path ⟨v_1, . . . , v_k⟩ such that v_1 = v_k and the path contains at least one edge. A cycle is simple if, in addition, v_1, . . . , v_{k−1} are distinct.
A loop-carried dependence in conjunction with intra-iteration dependences may form a simple cycle in the dependence graph. For example, in Fig. , the intra-iteration dependence and the loop-carried dependence between the operations v1 and v2 form a simple cycle. A loop has a recurrence if an operation in one iteration of the loop has a direct or indirect dependence upon the same operation from a previous iteration; e.g., in Fig. , operations v_i^k and v_i^{k−1} constitute a recurrence, where v_i^k represents the i-th operation of the k-th iteration. In general, a recurrence may span several iterations, i.e., an operation v_i^k may depend on an operation v_i^{k−j}, where j > 1. The existence of a recurrence manifests itself as a simple cycle in the dependence graph. Subsequently, a cycle refers to a simple cycle in a dependence graph. The length of a cycle c, denoted by len(c), is defined as the sum of the latencies of the operations in c. The delay of a cycle c, denoted by del(c), is defined as the time interval between the start of operation v_1 and the start of operation v_{k−1} in a given schedule plus one, where v_1, v_{k−1} ∈ c, see Definition . Let dist(c) denote the sum of the distances (corresponding to all the loop-carried dependences) along c. For example, in Fig. , the path ⟨v1, v2, v3, v4, v1⟩ constitutes a simple cycle with dist(c) = 1 + 2 = 3; with unit-latency operations, its length is 4. Note that the delay of a cycle c may not be equal to its length, as the delay of a cycle is dependent on
Modulo Scheduling and Loop Pipelining. Fig. A simple cycle (operations v1, v2, v3, v4, with loop-carried dependence distances <1> and <2>)
the actual scheduling of operations constituting the cycle. In general, for any cycle c in a dependence graph, len(c) ≤ del(c).
Modulo Scheduling Algorithm The objective of modulo scheduling is to optimize the performance (execution time) of the entire loop, i.e., beyond iteration boundaries (unlike basic-block scheduling techniques, which optimize the performance of a single iteration of the loop). Intuitively, modulo scheduling finds an initiation interval that is at least as large as what is required by both resource availability and data dependences. Specifically, modulo scheduling takes a loop without conditionals as input, looks at the loop body, and computes an initiation interval based on resource utilization, termed the resource-constrained initiation interval (ResII). Then, it looks at the recurrences in the dependence graph and evaluates the length of each recurrence. Subsequently, it computes an initiation interval based on recurrences, termed the recurrence-constrained initiation interval (RecII); RecII can be computed in polynomial time via, for instance, the Bellman-Ford algorithm [, ]. The minimum initiation interval (MII) is defined as follows:
MII = max(ResII, RecII)    ()
The minimum initiation interval is a lower bound on the initiation interval of any correct modulo schedule (as discussed under “Remarks: Infeasibility of MII” below, a valid schedule at MII itself is not guaranteed). Finally, the algorithm generates a schedule, termed a modulo schedule, by pipelining (or overlapping) successive iterations of the loop at intervals of MII control steps. The computation of MII based on resource utilization and recurrences ensures that no intra- or inter-iteration dependence is violated, and that no resource usage conflict arises between operations of either the same or
distinct iterations. An illustration of modulo scheduling is shown in Fig. . Consider the loop and its data dependence graph shown in Figs. a and b, respectively. Assuming the resource availability shown in Fig. d and a single-cycle latency for each resource, the corresponding modulo schedule is shown in Fig. c. The reader is advised to verify the schedule with respect to the dependences and resource usage. A detailed description of the modulo scheduling algorithm is presented later in this section. Let R represent the library of resources available for scheduling loop L. Resources are grouped into classes, each of which consists of identical resources. Let R(i) denote the i-th resource class, and let the number of resources in the i-th class be denoted by num(i). For example, in Fig. d, R(1) represents the set of adders, where num(1) = 2. Similarly, R(2) represents the set of multipliers, where num(2) = 1. For simplicity of exposition, each resource is assumed to have single-cycle latency. The techniques discussed in this entry can be easily extended to deal with multi-cycle resources []. Let V denote the set of operations in the given loop body, and let r : V → {1, 2, . . .} be a function such that, for each operation v ∈ V, r(v) gives the number of the resource class used by v. For example, in Fig. d, r(v1) = 1 and r(v2) = 2. Modulo scheduling, as originally described in [], makes the following assumptions: (1) the iteration count is known a priori (this, in practice, is not really necessary for the functioning of the algorithm), and (2) the loop body does not contain conditionals or unconditional jumps. It constrains each iteration to have an identical schedule. Modulo scheduling of an innermost loop consists of a number of pre-processing steps, as described below.
1. [Generation of a conditional-free loop body] In general, a loop body contains multiple control flow paths, of which some are executed more frequently than others. In such cases, the most frequently executed control path is selected based on either profiling information or heuristics [, ]. This defines the region to be modulo scheduled. Alternatively, a control dependence can be converted to a data dependence via predication [, ]. As a result, the resulting loop body looks like a single
(Figure panels: (a) the loop do i = 1, N; 1: q = p + 2; 2: A[i] = q * r; 3: B[i] = A[i] + 4; end do; (b) its DDG; (c) the modulo schedule of successive iterations; (d) the resource library: 2 adders, 1 multiplier)
Modulo Scheduling and Loop Pipelining. Fig. An overview of modulo scheduling
basic block, which is amenable to modulo scheduling. In the presence of multi-level branches, hierarchical predication [–] or predicate matrices [] may be used to convert multi-level control dependences to data dependences. Early-exit branches and multiple branches back to the loop header are handled in a similar fashion.
2. [Minimization of anti- and output dependences] Anti- and output dependences can set up recurrence(s) in the data dependence graph, which can potentially increase the minimum initiation interval (MII). The dynamic single assignment form [–] may be used to minimize the anti- and output dependences, which enhances performance by reducing the minimum initiation interval.
Let an integer-valued function PCOUNT and a set-valued function SUCC be defined such that, for each v ∈ V, PCOUNT(v) is the number of immediate predecessors of v not yet scheduled and SUCC(v) is the set of all immediate successors of v. The algorithm for modulo scheduling is presented next. The algorithm first computes the lower bound on MII; for this, it computes ResII and RecII. Since the loop body is yet to be scheduled, del(c) cannot be computed; therefore, as a first approximation, RecII is computed based on the length of each cycle. Step 4(a) of the algorithm schedules a single iteration of the loop such that the data dependences are preserved and, for each resource class i, the number of operations using class i in control steps that are congruent modulo MII never exceeds num(i). Step 4(b) of the algorithm checks whether MII is at least the recurrence-constrained interval implied by the schedule obtained in Step 4(a). If the check fails, then MII
Algorithm (Modulo Scheduling). The input to this algorithm is a dependence graph G(V, E) of the body of a loop L. The output is a modulo schedule.
1. [Compute resource usage] Examine the loop body to determine the usage, usage(i), of each resource class R(i) by the loop body.
for each resource class R(i) do
    usage(R(i)) ← 0
endfor
for each v ∈ V do
    usage(r(v)) ← usage(r(v)) + 1
endfor
2. [Determine recurrences] Enumerate all the recurrences in the dependence graph using, for example, any of the algorithms in [, ]. Let C be the set of all recurrences in the dependence graph. Compute len(c) ∀ c ∈ C.
3. [Compute the lower bound on the minimum initiation interval]
a) [Compute the resource-constrained initiation interval]
    ResII = max over all R(i) ∈ R of ⌈ usage(R(i)) / num(R(i)) ⌉    ()
b) [Compute the recurrence-constrained initiation interval]
    RecII = max over all c ∈ C of ⌈ len(c) / dist(c) ⌉    ()
c) [Compute the minimum initiation interval]
    MII = max(ResII, RecII)    ()
4. [Modulo schedule the loop]
a) [Schedule operations in G(V, E) taking only intra-iteration dependences into account]
Let U(i, j) denote the usage of the i-th resource class in control step j
for each v ∈ V do
    compute PCOUNT(v) and SUCC(v)
    /* Assign a control step number to v */
    ℓ(v) ← 1
endfor
while V ≠ ∅ do
    for each node v ∈ V do
        /* Check whether data dependences are satisfied */
        if PCOUNT(v) = 0 then
            /* Check that no more than num(r(v)) operations are scheduled on the
               resources corresponding to R(r(v)) at the same time modulo MII */
            Let U ← Σ over i = 0, . . . , ⌊ℓ(v)/MII⌋ of U(r(v), ℓ(v) − i ∗ MII)
            while U > num(r(v)) do
                ℓ(v) ← ℓ(v) + 1
                Update U
            endwhile
            for each w ∈ SUCC(v) do
                PCOUNT(w) ← PCOUNT(w) − 1
                ℓ(w) ← max(ℓ(w), ℓ(v) + 1)
            endfor
            /* Schedule the operation */
            V ← V − {v}
        endif
    endfor
endwhile
b) [Check the schedule obtained in Step 4(a) w.r.t. recurrences]
II′ ← max over all c ∈ C of del(c) / dist(c)
if MII < II′ then
    MII ← MII + 1
    Goto Step 4(a)
endif
c) [Pipeline the iterations]
Schedule the successive iterations MII control steps apart, with each iteration having the same schedule (obtained in Step 4(a)).
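The following sketch (illustrative Java, not the entry's code; the array-based representation of resource classes and cycles is an assumption) shows how the lower bounds of Step 3 and the modulo resource check of Step 4(a) can be computed:

```java
import java.util.List;

// Sketch (names illustrative): ResII from resource usage, plus the modulo
// resource check of Step 4(a): an operation placed at control step t occupies
// its resource class at every step congruent to t mod II.
class ModuloBounds {
    // num[c]  = number of resources in class c (assumed >= 1)
    // opClass = resource class r(v) of each operation v in the loop body
    static int resII(int[] num, int[] opClass) {
        int[] usage = new int[num.length];
        for (int c : opClass) usage[c]++;                        // usage(R(c))
        int ii = 1;
        for (int c = 0; c < num.length; c++)
            ii = Math.max(ii, (usage[c] + num[c] - 1) / num[c]); // ceiling
        return ii;
    }

    // RecII as the maximum over cycles of ceil(len(c) / dist(c)).
    static int recII(List<int[]> cycles) {          // each cycle = {len, dist}
        int ii = 1;
        for (int[] c : cycles) ii = Math.max(ii, (c[0] + c[1] - 1) / c[1]);
        return ii;
    }

    static int mii(int resII, int recII) { return Math.max(resII, recII); }

    // Modulo check: can an operation of class cls be placed at step t?
    // table[c][s] counts operations of class c already placed at steps
    // congruent to s modulo ii.
    static boolean fits(int[][] table, int[] num, int cls, int t, int ii) {
        return table[cls][t % ii] < num[cls];
    }
}
```

For instance, for the later no-recurrence example (one addition and two multiplications, with one adder and one multiplier), resII returns max(⌈1/1⌉, ⌈2/1⌉) = 2.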
(Figure panels: (a) the loop do i = 1, N; 1: x[i] = a + b; 2: y[i] = x[i] * c; 3: z[i] = x[i] * y[i]; end do; (b) the resource library: 1 adder, 1 multiplier; (c) the DDG; (d) the modulo schedule with prologue and kernel; (e) an infeasible schedule with further overlap)
Modulo Scheduling and Loop Pipelining. Fig. Modulo scheduling in the absence of recurrences
is incremented by one, and the algorithm iterates through Steps 4(a) and 4(b) until the check is satisfied. Subsequently, successive iterations of the loop are scheduled at an interval of MII. The following example illustrates modulo scheduling in the absence of recurrences. Example Consider the loop shown in Fig. a and its DDG as shown in Fig. c. From the dependence graph one observes that there are no loop-carried dependences in the given loop. Therefore, in this case MII = ResII. The number of available resources (adder, etc.) per control step is given in Fig. b. From Eq. , ResII for the given resource constraints is 2. The modulo scheduled loop is shown in Fig. d, with MII = 2 as the initiation interval. In Fig. d, control steps 1 and 2 correspond to the prologue of the transformed loop, and control steps 3 and 4 constitute the kernel of the transformed loop. From the schedule one observes that one cycle is saved per iteration (except the first iteration) in the modulo schedule as compared to the sequential execution of the entire loop. Further overlap of iterations is not feasible, as it would violate the resource constraints; e.g., in Fig. e, operation v3 of the first iteration and
(Figure panels: (a) the loop do i = 1, N; 1: p = A[i] + 2; 2: w = p + A[i−1]; 3: q = B[i] * p; 4: r = q + 4; 5: s = w * r; 6: t = s * B[i]; 7: A[i] = t * r; end do; (b) its DDG, with a loop-carried dependence of distance <1> from v7 to v2; (c) the schedule of one iteration at the initial MII; (d) the final modulo schedule)
Modulo Scheduling and Loop Pipelining. Fig. Modulo scheduling in the presence of recurrences
operation v2 of the second iteration cannot be scheduled in the same control step, as there is only one multiplier available.
In the presence of multiple recurrences, RecII is given by Eq. . Next, an example of modulo scheduling in the presence of recurrences is presented.
A subtlety of Step 4(a) of Algorithm bears further discussion. The termination of the inner while loop stems from the fact that MII is greater than or equal to ResII. The constraint MII ≥ ResII guarantees that there exists an ℓ(v) for which U is less than num(r(v)). In Step 4(a) of Algorithm , at any control step, let the ready set be the set of operations v with PCOUNT(v) = 0. Modulo scheduling involves selecting an operation among the operations in the ready set. Thus, like list scheduling, Algorithm represents a family of algorithms, since the method of selecting an operation from the ready set is not specified. In the presence of multi-function resources (a multi-function resource can execute more than one type of operation), determining ResII becomes non-trivial. In such cases, the exact ResII can be computed by performing an optimal bin-packing of the reservation tables for all the operations. However, the drawback of such an approach is its exponential complexity. A cycle c in the dependence graph imposes the constraint that the time interval between an operation in the current iteration and the same operation dist(c) iterations later must be at least del(c). Consequently, one has the following recurrence constraint:
del(c) − RecII ⋅ dist(c) ≤ 0    ()
Example Consider the loop shown in Fig. a and its DDG as shown in Fig. b. From Fig. b one observes that the dependence graph contains one cycle, c = ⟨v2, v5, v6, v7, v2⟩. The length of the cycle (len(c)) is 4 and its distance (dist(c)) is 1. First, one computes the lower bound on the minimum initiation interval. Assuming the number of resources given in Fig. b, ResII for the DDG given in Fig. b is 4. Since the loop body is yet to be scheduled, del(c) cannot be computed; therefore, as a first approximation, Algorithm computes RecII based on the length of the cycle. In this case, RecII is 4. From Eq. , one gets MII = 4. The corresponding schedule of the loop body is shown in Fig. c. From Fig. c one notes that the delay of the cycle c is 5 (= del(c)). Since del(c) > len(c), an MII of four will lead to a recurrence violation. Therefore, the value of MII is incremented by one and the loop is rescheduled. The final modulo schedule is shown in Fig. d. As an exercise, the reader can verify that MII = 5 does not lead to any recurrence violation.
Remarks: Infeasibility of MII Modulo scheduling a loop requires computing the recurrence constrained initiation interval (RecII) which
(Figure panels: (a) the loop do i = 1, N; 1: q = p + 6; 2: A[i] = B[i] + 2; 3: r = q * 2; 4: C[i] = A[i] + r; end do; (b) its DDG; (c) the modulo schedule (non-optimal); (d) an optimal schedule in which the first two iterations are scheduled differently; (e) the resource library: 2 adders, 1 multiplier)
Modulo Scheduling and Loop Pipelining. Fig. Limitations of modulo scheduling
in itself requires computing the delay of each cycle in the dependence graph of the loop. However, the delay of a cycle is dependent on the schedule for which RecII is known. Several approaches have been proposed to resolve the circular dependence. Rau et al. [] used linear search to find MII where MII is increased by one iteratively. The FPS compiler [] used binary search to determine MII where the lower bound is computed using Step of Algorithm , and the upper bound is the length of the locally compacted loop body. One could employ an enumerative branch-and-bound search of all possible schedules or use a trial-and-error approach guided by some heuristics. However, the former is computationally very expensive, while the latter does not always guarantee compact schedules. Huff [] modeled the problem as a minimal cost-to-time ratio problem []. Feautrier [], Govindarajan et al. [], Eichenberger et al. [], Altman et al. [] model modulo scheduling as an integer linear programming [] problem. However, the exponential complexity of integer linear programming algorithms render such techniques impractical for large loops. Similarly, techniques such as simulated annealing, Boltzmann machine algorithm, and genetic algorithms [] have been proposed but can become very time expensive for large loops.
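The linear-search strategy can be sketched as follows (illustrative Java; tryII stands for a hypothetical scheduler that reports whether a valid modulo schedule exists at a candidate II):

```java
import java.util.OptionalInt;
import java.util.function.IntPredicate;

// Sketch of the linear-search strategy: start at the lower bound MII and
// increase the candidate II until the (hypothetical) scheduler succeeds.
class IterativeSearch {
    // tryII.test(ii) stands for "a valid modulo schedule exists at ii".
    static OptionalInt findII(int mii, int maxII, IntPredicate tryII) {
        for (int ii = mii; ii <= maxII; ii++) {
            if (tryII.test(ii)) return OptionalInt.of(ii);
        }
        return OptionalInt.empty();   // fall back to a non-modulo list schedule
    }
}
```

The binary-search variant used by the FPS compiler would instead probe between the lower bound and the length of the locally compacted loop body.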
Limitations Modulo scheduling constrains each iteration to have an identical schedule and schedules successive iterations
at a fixed minimum initiation interval (MII). However, in general, an optimal schedule for many loops cannot be achieved by duplicating the schedule of the first iteration at fixed intervals. The constraint that each iteration has the same schedule is ultimately an arbitrary choice, but a fundamental component of modulo scheduling, necessary to ensure convergence in the sense of finding a fixed MII. For example, consider the loop and its data dependence graph shown in Fig. a and b, respectively. Let v_i^k denote the i-th operation of the k-th iteration. The modulo schedule of the data dependence graph in Fig. b is shown in Fig. c. Given the resource library shown in Fig. e, ResII for the dependence graph is 2. Since there are no loop-carried dependences, MII = ResII. Note that operation v4 cannot be scheduled in control step 3, as it would lead to a resource conflict (modulo MII) with operations v1, v2 scheduled in control step 1. Thus, a fixed initiation interval in conjunction with the modulo constraint can potentially create “gaps” in the schedule. In order to extract higher levels of parallelism, one must forgo the constraints imposed by modulo scheduling, i.e., one has to
● Allow successive iterations to be scheduled at different initiation intervals.
● Allow iterations to have different schedules.
An optimal schedule is shown in Fig. d. Observe that the first two iterations have different schedules.
Subsequently, iteration 3 is scheduled like iteration 1, iteration 4 is scheduled like iteration 2, and so on. Further, the second and the third iterations are initiated with different initiation intervals. The schedule of the second iteration in Fig. d helps close the gap in the first iteration in Fig. c, as v4 can now be scheduled in control step 3 without any resource conflicts. Thus, relaxing the constraints (of modulo scheduling) facilitates compaction, which yields better performance. As scheduling of iterations progresses, the execution of operations tends to form a repeating pattern (also known as a kernel), as shown in Fig. d by the ‘boxed’ set of operations. While such kernels cannot be detected by modulo scheduling, the technique discussed in section “OPT: Optimal Loop Pipelining of Innermost Loops” can detect such kernels.
Modulo Scheduling with Conditionals In the previous section modulo scheduling of loops without conditionals was discussed. We now describe how to modulo schedule loops with conditionals.
Hierarchical Reduction Hierarchical reduction [] was proposed to make loops with conditionals amenable to modulo scheduling. The hierarchical reduction technique, derived from G. Wood [], reduces a conditional to a pseudo-operation whose scheduling constraints (both intra-iteration and loop-carried dependences) represent the union of the scheduling constraints of the two branches. The significance of hierarchical reduction is:
● Conditionals do not limit the overlapping of different iterations.
● It enables migration of operations outside a conditional around the conditional.
Branches of different conditionals can be overlapped via hierarchical percolation scheduling, a technique known as trailblazing []. The algorithm for hierarchical reduction can be summarized as follows: First, the THEN and ELSE branches of a conditional statement are scheduled independently. Next, the entire conditional is reduced to a pseudo-operation whose scheduling constraints represent the union of the scheduling constraints of the two branches. The latency of the pseudo-operation is the maximum of the lengths of the two branches; the
resource usage of the pseudo-operation is the maximum of the resource usage of the two branches. Dependences between operations inside the branches and those outside are replaced by dependences between the pseudo-operation representing the conditional and those outside. In the presence of nested control flow, each conditional is scheduled hierarchically, starting with the innermost conditional. The dependence graph thus obtained is free of conditionals and hence amenable to modulo scheduling. The dependence graph is then modulo scheduled using Algorithm . During code generation, two sets of code, corresponding to the two branches, are generated. Any code scheduled in parallel with the conditional is duplicated in both branches. Example Consider the loop shown in Fig. a. The corresponding control flow graph is shown in Fig. b, where BBi denotes the i-th basic block. From Fig. a, one notes that the operations v2, v3, and v4 are dependent on operation v1. Further, even though v4 is in basic block BB3, it can be scheduled in parallel with the conditional. As discussed above, the conditional, i.e., basic blocks BB1 and BB2, is replaced with the pseudo-operation v′. The dependence graph of the transformed loop is shown in Fig. c. The set of resources used by v′ consists of an adder and a multiplier. Assuming the number of resources given in Fig. e, the dependence graph is then modulo scheduled using Algorithm ; the resulting schedule is shown in Fig. d.
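A minimal sketch of the reduction step (illustrative Java; the BranchSummary representation of a scheduled branch is an assumption, not the entry's data structure):

```java
// Sketch: a scheduled branch summarized by its length (in control steps)
// and its per-step resource usage per class; the pseudo-operation for an
// if-then-else takes the union (elementwise maximum) of both branches.
class HierarchicalReduction {
    static final class BranchSummary {
        final int length;        // schedule length of the branch
        final int[][] usage;     // usage[step][resourceClass]
        BranchSummary(int length, int[][] usage) { this.length = length; this.usage = usage; }
    }

    static BranchSummary reduce(BranchSummary thenBr, BranchSummary elseBr, int classes) {
        int len = Math.max(thenBr.length, elseBr.length);   // latency = max of lengths
        int[][] u = new int[len][classes];
        for (int s = 0; s < len; s++)
            for (int c = 0; c < classes; c++)
                u[s][c] = Math.max(use(thenBr, s, c), use(elseBr, s, c));
        return new BranchSummary(len, u);
    }

    private static int use(BranchSummary b, int s, int c) {
        return s < b.length ? b.usage[s][c] : 0;
    }
}
```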
Enhanced Modulo Scheduling Predication (also known as if-conversion) [] is another technique to convert loops with conditionals into straight line code. Predication removes conditionals by computing a condition for the execution of each operation. Predicated execution assumes architectural support for predicate registers which hold the condition for execution of the operations. A predicate register is specified for each operation and predicate define operations are used to set the predicate registers based on the appropriate condition. Predication-based modulo scheduling techniques [, ] schedule operations along all execution paths together. Thus, the MII for predicated execution is constrained by the resource usage of all the loop operations rather than those along
(Figure panels: (a) the loop do i = 1, N; 1: A[i] = B[i] + 6; if i < N/2 then 2: p = A[i] + q else 3: p = A[i] * q; 4: r = A[i] + 2; 5: C[i] = p + r; end do, with basic blocks BB0 = {v1}, BB1 = {v2}, BB2 = {v3}, BB3 = {v4, v5}; (b) the control flow graph; (c) the DDG after replacing BB1 and BB2 by the pseudo-operation v′; (d) the modulo schedule; (e) the resource library: 2 adders, 1 multiplier)
Modulo Scheduling and Loop Pipelining. Fig. Modulo scheduling with conditionals
the execution path with maximum resource usage of all paths. Enhanced Modulo Scheduling (EMS) [] integrates hierarchical reduction and predicated execution to alleviate their limitations. Predication eliminates the need for pre-scheduling the conditional constructs (required in hierarchical reduction); regeneration of conditionals eliminates the need for additional hardware (required by predicated execution), thus EMS can be used on processors without support for predicated execution. An overview of the algorithm is now presented. First, the loop body (with conditionals) is converted into a straight line predicated code using the RK algorithm []. The predicated loop body is then modulo scheduled using Algorithm in conjunction with modulo variable expansion [] to rename registers with overlapping lifetimes. Finally, the predicate define operations are replaced with conditionals.
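To illustrate if-conversion on the loop of the previous figure, the following Java sketch mimics predicated execution with booleans and a select; real predicated hardware would instead guard each operation with a predicate register (the helper names are illustrative):

```java
// Sketch of if-conversion: the conditional of the figure's loop becomes
// straight-line code in which every operation carries a predicate.
// Java has no predicate registers, so booleans and ?: stand in for them.
class IfConversion {
    static void original(int[] A, int[] B, int[] C, int q, int N) {
        for (int i = 0; i < N; i++) {
            A[i] = B[i] + 6;                  // v1
            int p;
            if (i < N / 2) p = A[i] + q;      // v2 (then branch)
            else           p = A[i] * q;      // v3 (else branch)
            int r = A[i] + 2;                 // v4
            C[i] = p + r;                     // v5
        }
    }

    static void predicated(int[] A, int[] B, int[] C, int q, int N) {
        for (int i = 0; i < N; i++) {
            A[i] = B[i] + 6;                  // predicated on p0 (always true)
            boolean p1 = i < N / 2;           // predicate define
            boolean p2 = !p1;                 // predicate of the else path
            int tThen = A[i] + q;             // computed, kept only if p1
            int tElse = A[i] * q;             // computed, kept only if p2
            int p = p1 ? tThen : tElse;       // select squashes the dead result
            int r = A[i] + 2;
            C[i] = p + r;
        }
    }
}
```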
Modulo Scheduling with Multiple Initiation Intervals Hierarchical reduction transforms a loop with conditionals such that the basic algorithm can be used for modulo scheduling the transformed loop. However, in
general, the two paths of a conditional may imply two different sets of dependences (both intra-iteration and loop-carried dependences) and thus may have different initiation intervals. The initiation interval of the transformed loop is thus constrained by the worst path, i.e.,
MII = max over all i of II_path(i)    ()
where II_path(i) is the initiation interval of path i. From Eq. one observes that a path with a shorter MII is penalized due to the other paths. Similarly, in the case of predicated execution, MII is determined by the sum of the operations from all paths. Thus, all paths are penalized due to the fixed MII. As an illustration, consider the example in Fig. a. Assuming an availability of one adder and one multiplier, the MII of basic block BB1 is 1, whereas the MII of basic block BB2 is 2. A fixed-MII-based modulo scheduling would have assigned an MII of 2 to both paths, which slows down the execution of path 1. In [], modulo scheduling with multiple initiation intervals was proposed for architectures with support for predicated execution. In Fig. b, the predicate for each basic block is shown next to each basic block. Predicate p0 is the true predicate. After if-conversion, path 1
(Figure: (a) the loop
J = 2
do I = 1, N
  if J > K then
    a[I] = I + 2
    J = J * 2
  else
    b[I] = I + 2
    J = J + 2
end do
(b) its control flow graph: path 1 goes from BB0 through BB1 to BB3, path 2 goes from BB0 through BB2 to BB3; the basic blocks are predicated on p0, p1, p2, and p0, respectively)
Modulo Scheduling and Loop Pipelining. Fig. A loop with control flow
(Figure: the two initiation intervals II1 and II2 and the predicates p0, p1, p2)
Modulo Scheduling and Loop Pipelining. Fig. Multiple initiation intervals
corresponds to the operations predicated on p0 and p1, and path 2 to the operations predicated on p0 and p2. The initiation intervals for the two paths are shown in Fig. . From Fig. one notes that during the execution of path 1, only operations from path 1 are fetched and executed. However, during the execution of path 2, operations from both paths are fetched, and the operations predicated on p1 are squashed. Path 2 corresponds to a longer initiation interval II2 due to the larger number of operations. Notice that path 1 is never penalized. The execution paths of a loop body often have different execution frequencies. In order to minimize the penalty during the scheduling process, the algorithm assigns higher priorities to the more frequently executed paths, i.e., smaller initiation intervals are assigned to the more frequently executed paths. A detailed description of the algorithm can be found in [].
Discussion Several techniques [, , ] have been proposed in the past to find the global minimum of the search
space of feasible schedules. However, the high complexity of such approaches makes them computationally very expensive for practical purposes. Iterative modulo scheduling [] algorithm provides a heuristic to control both the search for a valid schedule and the choice of a next state in the search space when the search arrives at a dead end, i.e., when the current schedule is found infeasible. The algorithm is iterative in the sense that it schedules and re-schedules operations to find a “fixedpoint” solution which simultaneously satisfies all the scheduling constraints. When the scheduler finds that there is no slot available for scheduling the current operation, it displaces one or more previously scheduled, conflicting operations. These are in turn re-scheduled as the search continues. If the search fails to yield a valid schedule even after a large number of steps, then the initiation interval is increased by a small amount and the entire process is repeated. From Eq. , a larger MII delays the execution of a recurrence operation. In the worst case, an MII equal to the schedule length transforms into a list schedule (non-modulo) which is always feasible. A large number of techniques have been proposed over the last three decades for loop pipelining. In [], Gasperoni and Schwiegelshohn proposed an algorithm for scheduling loops on parallel processors. In [], Wang and Eisenbeis proposed a technique called Decomposed Software Pipelining wherein software pipelining is considered as an instruction level transformation from a vector of one dimension to a matrix of two dimensions. In particular, the software pipelining problem is decomposed into two subproblems – one is to determine the row numbers of operations in the matrix and another is to determine the column numbers. In [], Ruttenberg et al. compared optimal and heuristic methods for modulo scheduling. Allan et al. surveyed various techniques for software pipelining in [].
Kernel Recognition All techniques described in this entry can be perceived as kernel recognition techniques. The basic idea behind kernel recognition is to unroll the loop and compact operations from different iterations “as much as possible” (In the presence of vectorizable operations in the loop body the compaction of operations must be constrained in some way to force the emergence of a
repeating pattern.) while looking for a repeating block of instructions. Once a repeating pattern is found in the unrolled, compacted, overlapping bodies of the iterations, the repeating pattern (also referred to as the kernel) constitutes the loop body of the new loop, with a prologue and an epilogue. The URPR (Unrolling, Pipelining and Rerolling) algorithm [] (an extension of the microcode loop compaction algorithm URCR []) was one of the early algorithms proposed for loop pipelining based on kernel recognition. URPR assumes that the iteration count is known a priori and that the loop body has no conditionals, subroutine calls, or I/O statements. Some of the other early works include [, , ]. The techniques differ in the type of their underlying assumptions. Broadly speaking, resource constraints (functional units, registers) and conditionals have been the basic distinguishing features. In [], a loop pipelining technique called OPT was proposed to achieve the effect of unbounded unrolling and compaction of a loop. The schedule obtained using OPT is time optimal, i.e., no transformation of the loop based on the given data dependences can yield a shorter running time for innermost loops without conditionals. Additional hardware assists specific to loop pipelining, such as predicate matrices, were proposed in []. In [], perfect pipelining was proposed for loop pipelining of loops with conditionals. Akin to OPT, the kernel recognized by perfect pipelining corresponds to unbounded unrolling and compaction; furthermore, perfect pipelining finds a kernel for all paths despite arbitrary flow of control within the loop body. Next, the OPT algorithm, geared toward optimal loop pipelining of innermost loops, is discussed.
OPT: Optimal Loop Pipelining of Innermost Loops Techniques such as URPR exploit parallelism across the iterations of a loop by unrolling the loop before compaction. If the loop is unrolled (say) k times then parallelism can be exploited in this unrolled loop body by compacting the k iterations, but the new loop still imposes sequentiality between every group of k iterations. Although additional unrolling and compaction can be done to expose higher levels of parallelism, this becomes expensive very rapidly. OPT [] overcomes this problem by achieving the effect of unbounded unrolling and compaction of a loop. The
schedule obtained using OPT is time optimal, i.e., no transformation of the loop based on the given data dependences can yield a shorter running time for the loop. The basic technique examines a partially unrolled loop, say the first i iterations, and schedules the operations of those iterations as soon as possible. However, the greedy approach may introduce “gaps” (control steps with no operations) between the operations of an iteration due to loop-carried dependences; the presence of gaps may in turn forbid the formation of a kernel. In such cases, the gaps are shrunk, i.e., the operations are rescheduled, without affecting the critical path length, in order to force the occurrence of a kernel. A kernel emerges naturally after scheduling a large enough portion of an unrolled loop; the kernel arises as a result of the combination of (fixed) dependences and the finite number of operations in the loop body. A kernel can be inferred from a greedy schedule in which gaps, if any, are bounded by a constant, in at most O(n ) iterations []. Next, some terms are defined and the algorithm for optimal parallelization of innermost loops is presented. Let v_i^k denote the i-th operation of the k-th iteration.
Definition Let num(c) denote the number of loop-carried dependences in a cycle c. The slope of cycle c is defined as the ratio len(c)/num(c) [].
The slope of a cycle establishes a bound on the rate at which operations in the cycle can be executed. Let slope(v) be the maximum slope of any cycle on which v depends. If v is not dependent on any cycle, then slope(v) = 0. When an iteration is scheduled, it is spread across some interval of control steps t_1, . . . , t_k. Operations from an iteration tend to cluster into groups of mutually dependent operations (each referred to as a region) with gaps between the regions. For example, in Fig. b there are two maximal regions.
Definition Let A_1, . . . , A_j be the maximal regions of an iteration, where A_i = t_i, . . . , t_i′ and t_i′ < t_{i+1} for all i. Then gap(A_i, A_{i+1}) = t_{i+1} − t_i′.
In Fig. b, iterations and have the same maximal regions; the only difference between the iterations is the size of the gap. Two iterations i and i + c are said to be alike if they have the same maximal regions and the gaps in iteration i + c are as large as or larger than in
(Figure: (a) a DDG with operations v1–v17 and several loop-carried dependences of distance <1>; (b) the greedy schedule of the first five iterations)
Modulo Scheduling and Loop Pipelining. Fig. A greedy schedule
iteration i. The gaps in iteration i + c can be shrunk to the size of the gaps in iteration i if there is no dependence path from an operation in A_k of iteration i to an operation in A_j (k < j) of iteration i + c, where iterations i and i + c are alike with j maximal regions. For example, in Fig. b iterations and are alike and there is no dependence path from the first region of iteration to the second region of iteration . Thus, it is safe to shrink the gap in iteration to the size of the gap in iteration . Though in Fig. b the gaps are completely closed, it is not always safe to completely close the gaps. Gaps in an iteration may be completely closed subject to the following condition []:
Condition Let iterations i and i + c be alike. Assume that all iterations i through i + c have the same number of maximal regions. Let A_j^k be the j-th maximal region of the k-th iteration, where i ≤ k ≤ i + c. If
gap(A_j^k, A_{j+1}^k) ≤ gap(A_j^{k+1}, A_{j+1}^{k+1}) for all j and all k with i ≤ k < i + c,
then the gaps in iterations i + c and i may be completely closed. Next, the OPT algorithm is presented. The function ShrinkGaps(L_k, L_i) shrinks the gaps in iteration L_i as discussed earlier.
Algorithm The input to this algorithm is a singly nested loop L with a dependence graph G(V, E). The output is an optimal schedule of L. Let L_k represent the k-th iteration of L.
kernel found ← false
i ← 1
while ¬kernel found do
    Unroll the loop body once
    i ← i + 1
    Schedule operations in the unrolled loop body G′(V′, E′) as soon as possible
    for each L_k ∈ L, where 1 ≤ k ≤ i − 1 do
        ShrinkGaps(L_k, L_i)
    endfor
    /* Look for kernel */
    Let s_j be the set of operations at control step j in the schedule of G′(V′, E′) and L denote the length of the schedule.
    j ← 1
    S ← ∅
    while S ≠ V′ ∨ j ≠ L do
        if s_j contains multiple copies of an operation v then
            S ← ∅
        else if s_j ∩ S ≠ ∅ then
            S ← s_j
        else
            S ← S ∪ s_j
        endif
        j ← j + 1
    endwhile
    if S = V then
        kernel found ← true
    endif
endwhile
In theory, there are loops for which the strategy of making iterations look alike cannot succeed in polynomial time. Let the denominator of slope(v) be the period of v. The length of a kernel based on this approach is at least the least common multiple of the operation periods. If many operations in a loop body have large and relatively prime periods, then the complexity of the algorithm is potentially exponential in n. The efficiency of the algorithm is dependent on the cost of computing a greedy schedule. This can easily be done using a modified topological sort of the dependence graph. The cost is proportional to the number of operations scheduled. Example Consider the data dependence graph shown in Fig. a, originally proposed as an example in []. The corresponding greedy schedule (i.e., operations are scheduled as early as possible subject to data dependences) is shown in Fig. b. Even though the greedy schedule corresponds to maximum parallelization, one observes from Fig. b that a kernel does not emerge as successive iterations are scheduled.
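The kernel-search step can be sketched as follows (illustrative Java; the per-step sets drop iteration indices, so the "multiple copies in one step" case of the algorithm is not modeled):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the kernel test of an OPT-style search: scan the control steps
// of the (unrolled, compacted) schedule and look for a window that contains
// every operation of the loop body exactly once; that window is the kernel.
class KernelScan {
    // steps.get(t) = names of loop-body operations issued at control step t.
    static int kernelStart(List<Set<String>> steps, Set<String> body) {
        Set<String> seen = new HashSet<>();
        int start = 0;
        for (int t = 0; t < steps.size(); t++) {
            Set<String> s = steps.get(t);
            boolean overlap = false;
            for (String op : s) if (!seen.add(op)) overlap = true;
            if (overlap) {                 // window no longer disjoint: restart here
                seen.clear();
                seen.addAll(s);
                start = t;
            }
            if (seen.equals(body)) return start;   // steps [start..t] form a kernel
        }
        return -1;                          // no kernel in the scheduled prefix
    }
}
```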
(Figure: the schedule of iterations 1–7 after gap elimination; the boxed three control steps form the kernel, which repeats as further iterations are scheduled)
Modulo Scheduling and Loop Pipelining. Fig. Optimal schedule corresponding to the dependence graph in Fig. a
In Fig. a, the cycle containing the operations v , v , v has the greatest slope (three). Operations v , v , and v are dependent on this cycle and thus have the same slope. All other operations have a slope of zero. From Fig. b one notes that the schedule eventually splits into two groups that repeat every iteration, one with a slope of three, the other with a slope of
zero. In an attempt to generate a kernel, the algorithm reschedules the operations not on the critical path so that they have the same slope as the operations on the critical path. In Fig. b, operations with a slope of zero in iterations and could be delayed without affecting the length of the schedule. Eliminating the “gaps” in the iterations produces the (optimal) schedule
shown in Fig. . The boxed area is the kernel for the loop; scheduling additional iterations (without gaps) reproduces these three control steps. Note that no operation on the critical path has been delayed as a result of re-scheduling.
Related Entries
Trace Scheduling
Bibliography . Kogge P () The microprogramming of pipelined processors. In: Proceedings of the fourth international symposium on computer architecture, ACM, New York, pp – . Kogge PM () The architecture of pipelined computers. Hemisphere Publishing Corporation, Washington . Tokoro M, Takizuka T, Tamura E, Yamaura I () A technique of global optimization of microprograms. In: Proceedings of the th annual workshop on microprogramming, Pacific Grove, pp – . Charlesworth A () An approach to scientific array processing: the architectural design of the AP-B/FPS- family. IEEE Comput ():– . Rau BR, Glaeser CD () Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In: Proceedings of the th annual workshop on microprogramming, Chatham, pp – . Hsu PS () Highly Concurrent Scalar Processing. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign . Lam M () Software pipelining: an effective scheduing technique for VLIW machines. In: Proceedings of the SIGPLAN ’ conference on programming language design and implementation, ACM, Atlanta . Kuck D, Kuhn R, Padua D, Leasure B, Wolfe MJ () Dependence graphs and compiler optimizations. In: Conference record of the eighth annual ACM symposium on the principles of programming languages, Williamsburg . Ferrante J, Ottenstein K, Warren J () The program dependence graph and its use in optimization. ACM T Progr Lang Sys ():– . Zima H, Chapman B () Supercompilers for parallel and vector computers. Addison-Wesley, New York . Bellman R () On a routing problem. Q Appl Math ():– . Cormen TH, Leiserson CE, Rivest RL () Introduction to algorithms. The MIT Press, Cambridge . Rau BR () Iterative modulo scheduling. Technical Report HPL--, Hewlett Packard Laboratories, Palo Alto . Mahlke SA, Lin DC, Chen WY, Hank RE, Bringmann RA () Effective compiler support for predicated execution using the hyperblock. In: Proceedings of the th international symposium of microarchitecture, Portland, pp –
. Mahlke SA, Hank RE, Bringmann RA, Gyllenhaal JC, Gallagher DM, Hwu WW () Characterizing the impact of predicated execution on branch prediction. In: Proceedings of the th international symposium of microarchitecture, San Jose, pp – . Allen JR, Kennedy K, Porterfield C, Warren J () Conversion of control dependence to data dependence. In: Conference record of the tenth annual ACM symposium on the principles of programming languages, Austin . Park J, Schlansker M () On predicated execution. Technical Report –, Hewlett Packard Laboratories, Palo Alto . Tirumalai P, Lee M, Schlansker M () Parallelization of loops with exits on pipelined architectures. In: Proceedings of the conference on supercomputing, IEEE Computer Society / ACM, New York, pp – . Mahlke SA, Chen WY, Hwu WW, Rau BR, Schlansker MS () Sentinel scheduling for VLIW and superscalar processors. ACM SIGPLAN Notices, ():– . Mahlke SA, Chen WY, Bringmann RA, Hank RE, Hwu WW, Rau BR, Schlansker MS () Sentinel scheduling: a model for compiler-controlled speculative execution. ACM T Comput Syst ():– . Milicev D, Jovanovic Z () Predicated software pipelining technique for loops with conditions. In: Proceedings of the th international parallel processing symposium, Orlando . Feautrier P () Array expansion. In: Proceedings of the nd international conference on supercomputing, ACM, St. Malo . Feautrier P () Dataflow analysis of scalar and array references. Int J Parallel Prog ():– . Rau BR () Data flow and dependence analysis for instruction level parallelism. In: Proceedings of the fourth workshop on languages and compilers for parallel computing, Springer-Verlag, London, pp – . Tiernan JC () An efficient search algorithm to find the elementary circuits of a graph. Commun ACM (): – . Mateti P, Deo N () On algorithms for enumerating all circuits of a graph. SIAM J Comput ():– . Touzeau RF () A fortran compiler for the fps- scientific computer. In: Proceedings of the SIGPLAN symposium on compiler construction, ACM, New York, pp – . Huff RA () Lifetime-sensitive modulo scheduling. In: Proceedings of the SIGPLAN ’ conference on programming language design and implementation, Albuquerque, pp – . Lawler EL () Combinatorial optimization: networks and matroids. Holt, Rinehart and Winston, New York . Feautrier P () Fine-grain scheduling under resource constraints. In: Proceedings of the seventh annual workshop on languages, compilers and compilers for parallel computers, Ithaca . Govindarajan R, Altman ER, Gao GR () Minimizing register requirements under resource-constrained rate-optimal software pipelining. In: Proceedings of the th international symposium of microarchitecture, ACM/IEEE, San Jose, pp – . Eichenberger AE, Davidson ES, Abraham SG () Optimum modulo schedules for minimum register requirements. In:
Proceedings of the th international conference on supercomputing, ICS ’, ACM, Barcelona, pp – Altman ER, Govindrajan R, Rao GR () Scheduling and mapping: software pipelining in the presence of hazards. In: Proceedings of the SIGPLAN ’ conference on programming language design and implementation, La Jolla de Dinechin BD () Simplex scheduling: more than lifetimesensitive instruction scheduling. In: Proceedings of the international conference on parallel architectures and compilation techniques (PACT), Montreal Reeves CR () Modern heuristic techniques for combinatorial problems. Wiley, New York Wood G () Global optimization of microprograms through modular control constructs. In: Proceedings of the th workshop on microprogramming, IEEE Press, Piscataway, pp – Novack S, Nicolau A () Trailblazing: a hierarchical approach to percolation scheduling. Int J Parallel Prog () Rau BR, Yen WL, Yen W, Towle A () The Cydra departmental supercomputer: design philosophies, decisions and trade-offs. IEEE Comput ():– Dehnert JC, Hsu PYT, Bratt JP () Overlapped loop support in the Cydra . In: Proceedings of the third international conference on architectural support for programming languages and operating systems (ASPLOS-III), ACM, Boston, pp – Warter NJ, Bockhaus JW, Haab GE, Subramaniam K () Enhanced modulo scheduling for loops with conditional branches. In: Proceedings of the th international symposium of microarchitecture, ACM/IEEE, Portland Warter NJ, Partamian N () Modulo scheduling with multiple mutiple initiation intervals. In: Proceedings of the th international symposium of microarchitecture, ACM/IEEE, Ann Arbor Jacobs D, Prins J, Siegel P, Wilson K () Monte carlo techniques in code optimization. In: Proceedings of the th workshop on microprogramming, IEEE Press, Piscataway, pp – De Gloria A, Faraboschi P, Olivieri M () A non-deterministic scheduler for a software pipelining compiler. In: Proceedings of the th international symposium of microarchitecture, ACM/IEEE, Portland, pp – Gasperoni F, Schwiegelshohn U () Scheduling loops on parallel processors: a simple algorithm with close to optimum performance. In: Proceedings of the second joint international conference on vector and parallel processing: parallel processing, pp – Wang J, Eisenbeis C () Decomposed software pipelining. Technical Report RR-, INRIA-Rocquencourt, France Ruttenberg J, Gao GR, Stoutchinin A, Lichtenstein W () Software pipelining showdown: Optimal vs. heuristic methods in a production compiler. In: Proceedings of the SIGPLAN ’ conference on programming language design and implementation, Philadelphia, pp – Allan VH, Jones RB, Lee RM, Allan SJ () Software pipelining. ACM Comput Surv ():– Su B, Ding S, Xia J () URPR - an extension of urcr for software pipelining. In: Proceedings of the th workshop on microprogramming, ACM, New York
. Su B, Ding S, Xia J () An improvement of trace scheduling for global microcode compaction. In: Proceedings of the th workshop on microprogramming, New Orleans . Ebcio˘glu K () A compilation technique for software pipelining of loops with conditional jumps. In: Proceedings of the th workshop on microprogramming, Colarado Springs . Aiken A, Nicolau A () Optimal loop parallelization. In: Proceedings of the SIGPLAN ’ conference on programming language design and implementation, Atlanta . Aiken A, Nicolau A () Perfect pipelining: a new loop parallelization technique. Technical Report -, Dept. of Computer Science, Cornell University . Callahan D, Cocke J, Kennedy K () Estimating interlock and improving balance for pipelined machines. J Parallel Distr Com ():– . Aiken AS () Compaction-based parallelization. PhD thesis, Dept. of Computer Science, Cornell University . Cytron R () Compile-time Scheduling and Optimization for Asynchronous Machines. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign
Molecular Evolution Phylogenetics
Monitors Monitors, Axiomatic Verification of Synchronization Transactional Memory
Monitors, Axiomatic Verification of
Christian Lengauer
University of Passau, Passau, Germany
Definition A monitor is a shared abstract data structure that can be used by several processes in parallel. To keep its data in a consistent state, calls to its operations (or methods) are executed under mutual exclusion, and there is the possibility of suspending and resuming the caller, depending on the state of the monitor data.
Discussion Overview For a brief introduction to axiomatic verification with Hoare logic, see the related entry, Owicki-Gries Method of Axiomatic Verification. Here, the first issue is the verification of sequentially accessed abstract data structures (frequently called classes), and the second the measures that have to be taken to synchronize concurrent calls to monitor operations and make them mutually exclusive.
Hoare Rules for Classes The Hoare rules for classes are simple. The main requirement is that the data satisfy an invariant I. I must hold after initialization of the data (e.g., in Java, after each constructor call), possibly under some assumptions; and it must hold at the end of each class method op provided it holds at its start:
Constructor methods: {pre} op {I}
Other methods: {pre(op) ∧ I} op {post(op) ∧ I}
pre(op) and post(op) are the pre- and postcondition for method op.
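For illustration, here is a small Java class (not from the entry) with its invariant and the resulting proof obligations written as comments:

```java
// Illustrative only: the Hoare obligations for a class, written as comments.
// Invariant I: 0 <= count <= capacity
class Counter {
    private final int capacity;
    private int count;

    // {capacity >= 0}  constructor  {I}
    Counter(int capacity) {
        this.capacity = capacity;
        this.count = 0;           // establishes 0 <= count <= capacity
    }

    // {count < capacity /\ I}  increment  {count > 0 /\ I}
    void increment() {
        count++;                  // I is preserved because count < capacity held
    }

    // {count > 0 /\ I}  decrement  {count < capacity /\ I}
    void decrement() {
        count--;                  // I is preserved because count > 0 held
    }
}
```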
Hoare Rules for Monitors To protect concurrent calls properly, two levels of synchronization are added.
Short-Term Scheduling Short-term scheduling ensures the mutual exclusion of the execution of concurrent method calls. It is a problem-independent requirement and can be installed by the compiler, that is, the programmer need not worry about it. This is so, for example, in Concurrent Pascal and Modula- but not in Java, where the programmer has to add the modifier synchronized. No additional Hoare axioms are necessary. Under the assumption of mutual exclusion, it can be assumed that the monitor data are not being corrupted in parallel while a monitor method is being executed. Mid-Term Scheduling Mid-term scheduling ensures that the monitor data are in the state expected and required by the method being executed. For example, when removing an item from a shared buffer, the buffer must not be empty. If it is empty, the method must be suspended and resumed after an item has been inserted.
Mid-term scheduling is a problem-dependent issue, that is, the programmer must be able to specify it. The usual device is a built-in abstract data structure that holds a FIFO queue (named cond) of identifiers of processes that have entered the monitor and have been suspended. The queue is initially empty and comes with two methods:
wait suspends the process which calls it, and appends it to the queue; the next process waiting for monitor entry, if any, is granted access.
signal removes a process from the queue and resumes it after suspending the calling process.
There are several signaling strategies which differ in how the process that calls signal is treated and when the resumed process is given access to the monitor; these are sketched later in this entry. Hoare axioms are needed for calls to wait and signal. Let B(cond) be the condition to be enforced. The rules proposed originally by Hoare [] are:
Suspension: {I} cond.wait {I ∧ B(cond)}
Resumption: {I ∧ B(cond)} cond.signal {I}
That is, one can presume the validity of condition B(cond) at the point of resumption if one proves it to hold at the point of suspension. The invariant must be part of both pre- and postconditions because the monitor changes hands. The waiting condition can be tuned a little bit to avoid unnecessary waiting []. Let E be such that I ∧ E ⇒ ¬B(cond):
Suspension: {I ∧ E} cond.wait {I ∧ B(cond)}
Resumption: {I ∧ B(cond)} cond.signal {I ∧ E}
Auxiliary and Private Variables The verification of monitors is akin to the restricted setting of Owicki and Gries (Owicki-Gries Method of Axiomatic Verification): the conditional critical region of Owicki and Gries [] corresponds to the body of a monitor operation. That is, auxiliary variables are necessary for the relative completeness of the proof method. For monitors and other shared abstract data structures, there exists a special kind of auxiliary variable, called a private variable. Of a private variable, one instance exists for each process that accesses the monitor [] and, in a proof, it is indexed by the identifier of the process (see the example below).
The auxiliary variable axiom of Owicki and Gries ensures that auxiliary variables need not be implemented.
Example Consider the implementation of an integer semaphore as a monitor, using a syntax close to Concurrent Pascal []. Say the semaphore is employed to govern the use of m instances of a resource by n parallel processes (assuming n ≥ m) in an infinite loop. The claim to be proved is that the resources are managed correctly and efficiently []. In Concurrent Pascal, a monitor is defined by giving it a name and stating its parameters, then declaring the monitor data and operations, and finally providing an initializing statement.
Semaphore implementation:
type semaphore = monitor (m : integer);
  var s : integer; q : condition;
  proc P;
  begin
    if s = 0 then q.wait fi;
    s := s−1
  end;
  proc V;
  begin
    s := s+1;
    q.signal
  end;
begin s := m end
Main program:
resource r(sem : semaphore(m)) ::
  cobegin S1 // ⋯ // Sn coend
Si :: while true do
        noncritical section;
        sem.P;
        critical section;
        sem.V;
        noncritical section
      od
Note that the critical section is not mutually exclusive but can be entered in parallel – by up to m processes.
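The following sketch is not part of the original example; it shows how such a semaphore monitor might be rendered in C with POSIX threads, where the mutex plays the role of short-term scheduling and the condition variable that of mid-term scheduling. Because pthread condition variables have signal-and-continue semantics (and allow spurious wakeups), the waiting condition is re-checked in a loop, whereas Hoare's signal-and-urgent-wait semantics would justify a simple if.

#include <pthread.h>

/* A counting semaphore in the monitor style (illustrative sketch only). */
typedef struct {
    pthread_mutex_t lock;   /* short-term scheduling: mutual exclusion  */
    pthread_cond_t  q;      /* mid-term scheduling: the condition queue */
    int             s;      /* monitor data: number of free resources   */
} semaphore;

void sem_init_monitor(semaphore *sem, int m) {
    pthread_mutex_init(&sem->lock, NULL);
    pthread_cond_init(&sem->q, NULL);
    sem->s = m;                       /* corresponds to "begin s := m end" */
}

void sem_P(semaphore *sem) {
    pthread_mutex_lock(&sem->lock);
    while (sem->s == 0)               /* re-check under signal-and-continue */
        pthread_cond_wait(&sem->q, &sem->lock);
    sem->s = sem->s - 1;
    pthread_mutex_unlock(&sem->lock);
}

void sem_V(semaphore *sem) {
    pthread_mutex_lock(&sem->lock);
    sem->s = sem->s + 1;
    pthread_cond_signal(&sem->q);     /* resume one waiter, if any */
    pthread_mutex_unlock(&sem->lock);
}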
Auxiliary variables: Inside the monitor, a two-valued private variable is declared:
var INC : 0..1
which is indexed implicitly with the identifier caller of the process currently accessing the monitor. Thus, in the proof, INC is an array of n elements; each element is private to one of the processes accessing the monitor. In the initialization, all elements of the array are set to 0.
Monitor invariant: The invariant states that the number of current decrements of the semaphore corresponds to the number of elements of array INC currently set to 1:
I : 0 ≤ s = m − ∑_{caller=1}^{n} INC[caller]
Proof outline for the main program: The main program iterates indefinitely, which is specified by the false postcondition. The program manipulates array INC such that the number of elements currently set to 1 corresponds to the number of processes currently executing the critical section:
{m ≥ 0}
resource r(sem : semaphore(m)) :: cobegin S1 // ⋯ // Sn coend
{false}
Si :: {I ∧ INC[i] = 0}
  while true do
    {I ∧ INC[i] = 0} noncritical section;
    {I ∧ INC[i] = 0} sem.P;
    {I ∧ INC[i] = 1} critical section;
    {I ∧ INC[i] = 1} sem.V;
    {I ∧ INC[i] = 0} noncritical section
    {I ∧ INC[i] = 0}
  od
{false}
The proofs for P and V have to certify their respective manipulations of INC as well as the establishment of the invariant, provided it holds before their calls.
Proof outline for P:
proc P;
begin
  {I ∧ INC[caller] = 0}
  if s = 0 then
    {I ∧ INC[caller] = 0 ∧ s = 0} q.wait {I ∧ INC[caller] = 0 ∧ s ≠ 0}
  fi;
  {I ∧ INC[caller] = 0 ∧ s > 0}
  s := s−1;
  {INC[caller] = 0 ∧ s ≥ 0}
  INC := 1
  {I ∧ INC[caller] = 1}
end
Proof outline for V:
proc V;
begin
  {I ∧ INC[caller] = 1}
  s := s+1;
  {INC[caller] = 1 ∧ s > 0}
  INC := 0;
  {I ∧ INC[caller] = 0 ∧ s ≠ 0}
  q.signal
  {I ∧ INC[caller] = 0}
end
Proof outline for the monitor initialization:
{m ≥ 0}
s := m;
for i := 1 to n do INC[i] := 0 od
{I ∧ s = m ∧ (⋀_{i=1}^{n} INC[i] = 0)}
Claim: A process is suspended if and only if m other processes are executing the critical section. Proof sketch: Note that the invariant holds at all points at which the monitor can change hands, that is, where processes may enter or leave the critical section or are suspended or resumed. The only points at which the invariant does not hold are in the postconditions of the semaphore decrement (in P) and increment (in V) but, at both these points, it is reestablished immediately under the mutual exclusion provided by the monitor’s short-term scheduling. In the following arguments, note that B(q) = s > 0:
⇒: That processes are being suspended when m processes are executing the critical section follows
directly from the invariant, which guarantees that, in this case, s = 0. Note that s = 0 ⇒ ¬B(q). ⇐: Processes are suspended only when necessary, since pre(q.wait) ⇒ ¬B(q).
Other Signaling Strategies The problem of signaling received further attention almost immediately. The question is when exactly the signaler should leave the monitor and when the signalee should be resumed. The different signaling strategies make use of, in all, three waiting queues:
The condition queue: the monitor’s FIFO queue for mid-term scheduling.
The entry queue: the monitor’s FIFO queue for short-term scheduling.
The urgent queue: an additional FIFO queue for short-term scheduling that takes priority over the entry queue (only used in strategy SU).
Five signaling strategies have been proposed []:
signal + continue (SC): Java. Signaler continues. Signalee moves from the condition queue to the entry queue.
signal + exit (SX): Concurrent Pascal. Signaler leaves the monitor for good. Signalee is removed from the condition queue and resumed.
signal + wait (SW): Howard [], Modula, Euclid. Signaler is added to the entry queue. Signalee is removed from the condition queue and resumed.
signal + urgent wait (SU): Hoare [], Pascal-Plus. Signaler is added to the urgent queue. Signalee is removed from the condition queue and resumed.
automatic (AS): Owicki-Gries general setting [] (the await statement). Implicit SC.
Hoare axioms, which are by nature more complicated than the ones mentioned previously in this entry, were proposed early on []. A drawback of the SC strategy is that a process, upon resumption, may encounter the monitor data not in the expected state – some other process may have interfered! Andrews [] states that the advantages of the SC
strategy are that it is simpler – it suffices to provide only a single condition queue per monitor object, not one queue per condition to be waited on – and that the proofs of wait and signal are less context-dependent and thus easier to do in isolation. He also conjectures that the SX strategy fosters operation fragmentation.
Related Entries Verification of Parallel Shared-Memory Programs, Owicki-Gries Method of Axiomatic Synchronization
Bibliographic Notes and Further Reading The verification rules for abstract data structures were first proposed by Tony Hoare [] soon after his seminal paper on axiomatic program verification []. The monitor concept was invented in the early s by Tony Hoare [] and Per Brinch-Hansen [], and it was quickly included in several programming languages of the time. Some problems with monitors were quickly discovered and discussed []. One consequence is to avoid call hierarchies, that is, calling one monitor operation while executing an operation of another monitor []. Over the years, the monitor concept has been reviewed repeatedly and its variants have been classified [, ]. An extended monitor concept for Java with the option of several signaling schemes and with prioritized resumption has been proposed []. Ole-Johan Dahl reviewed the axiomatic verification of monitors at the honorary symposium for Tony Hoare in on the occasion of his retirement from Oxford University [] and, years earlier, at a similar function for Tony Hoare’s th birthday []. Dahl seemed unaware of Howard’s early work []. In , when electricity failed in the middle of the talk due to the flooding of a transformer in bad weather, Dahl completed his presentation with clarity and composure without visual aids in the windowless room, which was illuminated only by the exit signs – and took questions afterward!
Bibliography
. Andrews GR () Concurrent programming – principles and practice. Benjamin/Cummings, Redwood City
. Brinch-Hansen P () The programming language Concurrent Pascal. IEEE Trans Softw Eng SE-():–
. Buhr PA, Fortier M, Coffin MH () Monitor classification. ACM Comput Surv ():–
. Chiao H-T, Wu C-H, Yuan S-M () A more expressive monitor for concurrent Java programming. In: Bode A, Ludwig T, Karl W, Wismüller R (eds) Euro-Par: parallel processing. Lecture notes in computer science. Springer-Verlag, Heidelberg, pp –
. Dahl O-J () Monitors revisited. In: Roscoe AW (ed) A classical mind. Series in computer science, chap. Prentice Hall International, Hemel Hempstead, pp –
. Dahl O-J () A note on monitor versions. In: Davies J, Roscoe AW, Woodcock J (eds) Millennial perspectives in computer science. Cornerstones of computing. Palgrave, Basingstoke, pp –
. Hoare CAR () An axiomatic basis for computer programming. Commun ACM ():–
. Hoare CAR () Proof of correctness of data representations. Acta Inform ():–
. Hoare CAR () Monitors: an operating system structuring concept. Commun ACM ():–. Corrigendum: ():
. Howard JH () Proving monitors. Commun ACM :–
. Howard JH () Signaling in monitors. In: Proceedings of the 2nd international conference on software engineering (ICSE’), San Francisco, pp –
. Keedy JL () On structuring operating systems with monitors. ACM SIGOPS Oper Syst Rev ():–
. Lengauer C () On the axiomatic verification of concurrent algorithms. Master’s thesis, Department of Computer Sciences, University of Toronto, August. Technical Report CSRG-
. Lister AM () The problem of nested monitor calls. ACM SIGOPS Oper Syst Rev ():–
. Owicki SS, Gries D () An axiomatic proof technique for parallel programs. Acta Inform ():–
. Owicki SS, Gries D () Verifying properties of parallel programs. Commun ACM ():–
Moore’s Law John L. Gustafson Intel Corporation, Santa Clara, CA, USA
Synonyms There are no exact synonyms for Moore’s law with respect to chip technology. Kryder’s law for hard disk storage, Butter’s law for fiber optic transmission speeds, and Nielsen’s law for consumer home network speeds are examples of very similar formulations for other technologies. These may be regarded as derivatives of Moore’s law, and they are often described as “A Moore’s law for mass storage” etc. Technologists often
cite Moore’s law very broadly to refer to any technology that changes as an exponential function of time.
Definition Moore’s law states that the number of transistors on a chip doubles every 24 months. More precisely, the law is an empirical observation that the density of semiconductor integrated circuits one can most economically manufacture doubles about every 2 years. Thus, the cost of manufacturing a transistor drops by half about every 2 years. Some definitions use 18 months as the doubling period instead of 2 years; see the Discussion section below for clarification. Caltech professor and chip pioneer Carver Mead coined the term “Moore’s law” in 1975.
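Stated as a formula (a paraphrase for illustration, not taken from Moore’s papers), the law says that the transistor count N available at constant cost grows exponentially with time t from a starting count N₀, with a doubling period T of roughly 2 years:
N(t) = N₀ · 2^(t/T), with T ≈ 2 years.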
Discussion
Moore’s Law. Fig. Moore’s original graph estimating the trend of device density (log2 of the number of devices per integrated function, plotted against the years 1959–1975)
History The idea of putting complete circuits on a monolithic block of semiconductor material originated with Jack Kilby (Texas Instruments, 1958) and, independently, Robert Noyce (Fairchild Semiconductor, 1959). Gordon E. Moore’s 1965 paper on the progress in integrated electronics [] noted the rapid rate of density improvements in just those years. He made the bold prediction that “with unit cost falling as the number of components per circuit rises, by 1975 economics may dictate squeezing as many as 65,000 components on a single silicon chip.” Figure shows Moore’s original graph []. Only four data points were used to create the original graph, and Moore’s original estimate was twice as optimistic as the rate of improvement given in later revisions: “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.” Notice that these are the transistor counts for individual functions, not for processors, since the first complete microprocessors on a chip would not be possible until several years later. At the time, the driving motivation for integrated circuitry was to satisfy the needs of the military and the US space program, both of which had a high tolerance for the cost of parts needed to do a particular job. Moore observed, even in 1965, that the decrease in cost would eventually usher in an era of personal computing and “personal portable communications equipment,” that is,
cellular phones. A cartoon accompanying Moore’s article showed a set of store kiosks for cosmetics, notions, and “handy home computers,” intended as a joke to overstate Moore’s prediction. The humor and intended exaggeration of the cartoon is almost completely lost on the modern reader, since it depicts a scene not unlike that found in current shopping malls. Around 1975, Caltech Professor Carver Mead, a pioneer of very large-scale integration (VLSI) technology, coined the term “Moore’s law” for the density doubling rule of thumb. Moore updated his paper in 1975 []. That paper used a more complete set of data to revise the rate of improvement to a density doubling every 24 months. The more recent graph in Fig. begins where Moore’s original graph left off, and encompasses four decades of Intel family processors. It makes clear that for processor chips, a doubling every 2 years is an excellent match to the empirical data.
A Self-fulfilling Prophecy What started as an empirical observation about the integrated circuit business became a self-fulfilling prophecy that pervades computing. Moore himself was perhaps the first to apply that phrase to his law, in a interview. The law is as much an economic guideline as a technical one []. Moore’s paper began as a forecast but became an industry guideline. As a well-known trend
Moore’s Law. Fig. Moore’s law for CPU device density (transistors per processor chip versus date introduced, 1970–2010, from the 4004, 8008, 8080, and 8088 through the 80286, i386, i486, Pentium, Pentium II, Pentium III, Pentium 4, Core 2 Duo, Core 2 Quad, and Xeon 5500; the trend line corresponds to a doubling every 24 months)
believed by an entire industry, there are risks to introducing products that are either above or below the trend line, since the entire supply chain from the suppliers to chip fabrication facilities to the consumers of the electronic systems built from those chips is calibrated to the expectations of the forecast. If a commercial entity aims below Moore’s trend line, the result will be a product that has insufficient performance to compete in the marketplace. If it aims above Moore’s trend line, it will find the infrastructure unavailable to produce the parts except at extremely high costs, and they will not be in balance with the performance of chips available from other vendors. Thus, the entire semiconductor industry marches to the same beat as the safest business practice.
Moore’s Law and Software Niklaus Wirth observed that software demands for hardware speed and memory capacity grow faster than Moore’s law, resulting in a net decrease in system performance []. He quotes M. Reiser’s remark “Software is getting slower more rapidly than hardware becomes faster,” and a variant on C. Northcote Parkinson’s cynical observation (Parkinson’s law) that work expands to fill the time available for its completion: “Software expands to fill the available memory.”
Bill Gates made a similar observation, but said that the net result is to keep system speed about the same, not to get worse. Gates speculates that this is for more than one reason: programmers add more and more features but also code less and less efficiently as hardware grows in capability. Worrying less about the efficiency of software makes the software less costly to produce. This is an example of how Moore’s law can enable trade-offs that save costs in areas of computing other than chip technology. Many applications and operating systems now contain tens of millions of lines of software as a side effect of Moore’s law.
The cost of software was trivial relative to the cost of hardware in the early days of electronic computing. The operation cost of a 1960s mainframe computer was over a hundred times as much as the cost of a programmer on a per-hour basis. Thus, it was common for programmers to program at the lowest level of machine language to achieve sufficiently high performance. Moore’s law has dropped the price of hardware so dramatically that when one seeks a new application, programming the application is often the most expensive aspect of using the computing system. Therefore, there is considerable incentive to use the cost savings of Moore’s law improvements to reduce programming cost by putting more of the performance burden on the hardware and less on the programmer.
Misconceptions About Moore’s Law Perhaps the most common misconception about Moore’s law is that it predicts performance will double periodically, as measured by clock speed or any other performance metric. This is partially supported by device-level physics, since decreasing the size of a transistor allows it to switch faster. However, because not all performance features scale with the transistor performance (such as the speed of signals on the chip), performance improvements generally require redesign as well as the shrinking of features. Moore did not make any predictions about performance nor did he ever cite a doubling interval of 18 months. The “18 month” period of doubling is the result of an attempt by former Intel executive David House to estimate and publicize a growth rate for the performance improvement in microprocessors. While transistor counts are easy to quantify, it is much more difficult to find a rigorous definition of absolute computer performance.
Implications for Parallel Computing In recent years, Moore’s law has become a major reason that interest in parallel computing methods has become prominent and intense. The most recent and obvious consequence of the increased density of VLSI is that a single processing chip usually contains more than one processor. This has brought the issues of parallel computing to developers of even very small and inexpensive devices. A more profound effect of Moore’s law that drives parallel computing has been in operation for decades: not all technology features and requirements scale at the same rate as Moore’s law, forcing designers to make architectural changes. For example, whereas processor capability and memory capacity have improved about a million-fold with Moore’s law in the last years, the time latency between the processor and the (offchip) memory has improved only very slightly over
that same period. The simple “von Neumann architecture,” in which a single monolithic memory supplies data and instructions to a single monolithic processor, is now impractical because of this disparate scaling. Yet, programmers have created a vast body of serial software to fit that increasingly untenable model. Modifications to the von Neumann architecture (such as automatic caching of memory or the automatic overlapping of operations and memory transfers) have helped to convert Moore’s law device density improvements into higher performance without changes to that traditional serial software, but only up to a point. Beyond that point, the programmer faces the need for parallel computing techniques to exploit the higher device density of VLSI for continued improvements in performance. A simple way to understand this is to consider cost. Imagine that doubling the performance of a computer system while maintaining the illusion of a single memory space and a single thread of execution will increase the cost by a factor of four. The alternative is to buy two computers without changing the design, which nominally doubles the cost of the processing hardware. Added to this is the cost of connecting them and the cost of reprogramming so they can operate in parallel. Depending on the application, the connection and reprogramming costs might be quite a bit less than another factor of two, so the parallel approach has lower total cost. Moore’s law predicts a steady state in the drop of the cost of a megabyte of memory. The data for the first introduction of each new generation of dynamic random-access memory (DRAM) chip actually shows a doubling approximately every 21 months, not every 24, as shown in Fig. . While memory parts became cheaper for computer makers, the price per megabyte of memory charged by vendors diverged sharply in the 1990s between makers of distributed memory parallel computers and those making monolithic designs [], as shown in the Table below. The first four entries reflect the commodity pricing of DRAM for personal computers and workstations. The last three entries reflect the price of proprietary memory designs from vector supercomputer makers of the same period, and are higher by almost two orders of magnitude. The nCUBE and Intel Paragon, being distributed memory parallel systems, were able to
Moore’s Law. Fig. Device density for dynamic random-access memory (DRAM) chips (transistors per memory chip versus date introduced, 1970–2010, from the 1K generation through 4G; the trend line corresponds to a doubling every 21 months)
Moore’s Law. Table Price for distributed memory versus monolithic memory, ca.
Computer              List price to add 1 megabyte of memory
Apple Mac IIsi        $
PC                    $
Sun SPARCstation      $
SGI Indigo            $
nCUBE                 $
Intel Paragon         $
NEC SX-               $,
Hitachi EX Series     $,
CRAY Y-MP             $,
price memory much closer to what was being offered for individual workstations since they did not attempt to preserve the programming view of a serial (von Neumann) architecture. By the late 1990s, “Beowulf ”
clusters for supercomputing were built, literally out of consumer-grade personal computers and so had the same cost per megabyte of DRAM. This illustrates how Moore’s law creates upheavals in prevailing computer architectures, and in particular led to the nowubiquitous use of distributed memory and parallel processing to create the fastest computing systems.
The ASCI Program and Other “Above Trend” Efforts The Accelerated Strategic Computing Initiative was a late-1990s US Department of Energy program intended to replace nuclear weapon testing with computer simulation. In recognizing the vast improvement in computing power required to achieve this, the proponents included “accelerated” in the name to indicate the desire to exceed Moore’s law []. While hundreds of millions of dollars were spent on the very large parallel supercomputing systems developed for the program and placed
at the national weapons laboratories, this was still such a small fraction of the overall computing market that there was no impact to accelerate Moore’s law. The density and price of the systems did not deviate from the predicted trend. The systems were somewhat more powerful than the trend of supercomputer speed versus year, simply because they were more expensive per system on an inflation-adjusted basis. The Japanese Earth Simulator was another striking national effort at stretching beyond the performance trend that succeeded through more massive expenditure. The cost for the Earth Simulator was an estimated $ million []. Since parallel supercomputer developers do not have sufficient influence in the total computing market to bend Moore’s law, they simply spend more money to build larger facilities. The largest computing facilities such as those built by Google and Microsoft contain several acres of computing equipment with attendant power, cooling, and interconnect.
The Future of Moore’s Law Early Pessimism In , Carver Mead stated in a classroom lecture at the California Institute of Technology that Moore’s law would come to a physics-based “day of reckoning” around , when transistors would be so small that a single electron would determine their on-off state. Part of the inaccuracy of his prediction may have been the result of using a 1-year period of doubling instead of the 2-year period Moore cited in his revision in 1975. Another error may have been to assume that transistor geometries would scale down uniformly in all three dimensions; current devices are tall and thin on the chip relative to their horizontal dimensions, like a very high hedge maze, which has delayed some of the impending limits of scaling semiconductor physics.
Optical Limits Similar pessimistic predictions have appeared, often based more on engineering limitations than on fundamental physical laws. For example, the primary manufacturing method has been photolithography, in which light passing through a mask creates chemical changes in a polymer (photoresist) causing it to harden in precise patterns. Optical theory says one cannot create images
more accurate than about one wavelength of light, because of diffraction effects. This predicts that Moore’s law must end when device features became smaller than a single wavelength of visible light. That limitation would have meant that device features could not be smaller than about . μm. Yet, mass-manufactured chips, at the time of this writing, have features as small as . μm. Engineers overcame the barrier using much smaller wavelength light in the ultraviolet, in combination with sophisticated mathematical corrections to the diffraction effects so that features smaller than a single wavelength could image reliably. What appeared to be an absolute physical limit was actually simply an engineering problem to overcome.
Heat and Power Limits At present, heat and power dissipation are engineering challenges for the future of Moore’s law. Chips that use more than about W are difficult to cool economically, so designers are using changes to device parameters as well as changes to the architecture. Lowering the operating voltage is one technique, but equally helpful is to reduce the clock speed. While reducing the clock speed slows serial processing, it can allow designers to pack more processors onto a single chip because multiple processors still fit into the power budget. This is one reason the chip industry embraced “multicore” parallel architectures. Efforts to raise clock speed were at the point of diminishing returns, but the use of parallel computing allowed the performance per chip to stay on the Moore’s law trend line.
Design Effort Limits Chip designers laid out the earliest chips manually. They made masks with black tape, by hand, hence the term “taping out” a chip. As the number of devices grew to the hundreds, computers and automated equipment displaced that part of the design effort. However, even as the ability of computer software tracked the need to manage millions and then billions of individual devices, some speculated that the process would become prohibitively arduous and hence human-limited. While some devices (such as memory chips) involve many repetitions of a single small design “cell,” the more complex chips (such as processors used in servers) involve the custom design and management of many different cell types. Placing the devices and routing the connections
between them in an optimal way is an NP-hard problem that taxes both human designers and the software tools. The production of a new state-of-the-art processor involves thousands of engineers and about different software applications for various stages of the design and manufacture. While this level of investment has tended to reduce the number of industrial companies that are able to create competitive semiconductor devices, it has not resulted in a slowing of Moore’s law.
Reliability at the Atomic Scale If the feature sizes of chips continue to shrink as predicted by Moore’s law, chips will soon contain features that are only a few atoms across. This limit occurs around –. At such sizes, quantum effects and temperature noise will force us to fundamentally rethink the way circuits accomplish computation. Note that an individual atom has multiple quantum states that can store information, so Moore’s law does not necessarily end when device features are at the individual atom level of detail.
Lack of Market Need Some have even speculated that Moore’s law will end not for technological reasons but because all of our electronics will be “good enough,” and thus manufacturers will not seek further improvements to size and cost. While this argument might hold for any given device type, like a four-function calculator, many areas of human endeavor (such as high-performance computing to simulate physical behavior) have seemingly boundless appetites for improvements and have no ceiling on what they demand for speed and storage.
Manufacturing Facility Cost Another economic argument against continued progress is the exponentially increasing cost of the manufacturing facilities for semiconductor chips. According to Dan Hutcheson, CEO of VLSI Research, the cost of building a fabrication facility doubles every 4 years. This forecast is one he called “Rock’s law,” after early Intel venture capitalist Arthur Rock []. While larger facilities create economies of scale, Rock’s law seemingly collides with the cost-reduction part of Moore’s law. However, the cost per transistor may continue to drop even as the plants that produce them become increasingly expensive to build. A closely related metric is the cost per unit
area of building a chip. Like the facility cost, this metric is increasing, even though the cost per transistor is decreasing.
Perspective The rate of technology advance in chip technology is a rate of improvement without precedent in human history. Some have observed that if automobile technology followed Moore’s law, a car would now cost less than a penny, would use drops of gasoline instead of gallons, and would safely go hundreds of thousands of miles per hour. The rate of change has been so staggering, and so different from that of other industries, that many computing businesses fail if their management tries to apply the kind of product marketing used for non-technology companies. A recurring situation is that the parts cost to a computer maker drops by a factor of two, and the company’s executives face the choice of passing this savings to their customers (which they fear will cause their revenue to drop with Moore’s law) or maintaining their prices with the aim of increasing their profits. The consequence of the latter choice is often that another company with little to lose and everything to gain by introducing a competitive product undercuts them by passing on the savings of Moore’s law to customers. Even in an economy that has moderate inflation, consumers of technology have come to expect inexorable improvement in the price and capabilities of any electronic device. For many years, Moore’s law served as a damper on parallel computing efforts. Why rewrite software when in less than years you can simply buy a serial computer that is twice as fast? Now, in contrast, the shrinking of transistor size and cost means the issues of parallel computing are everywhere.
Related Entries
Exascale Computing
Fault Tolerance
Intel Core Microarchitecture, x86 Processor Family
Memory Wall
Metrics
Power Wall
SoC (System on Chip)
VLSI Computation
Bibliographic Entries and Further Reading Ray Kurzweil has extended Moore’s law to many other areas of technological change that seem to be advancing exponentially. The Singularity is Near: When Humans Transcend Biology (Viking ; Penguin ) []. In his “The Law of Accelerating Returns,” http://www.kurzweilai.net/articles/art.html?printable=, Kurzweil points out that each apparent limit to the feasibility of progress for some technology is overcome by innovations that jump to a slightly different technology that avoids the limit, and Moore’s law is an example of this phenomenon. Niklaus Wirth lectured and wrote often about how software “bloat” seemed to negate and even overwhelm the advances provided by Moore’s law. A representative example is “A Plea for Lean Software” []. The September issue of IEEE Solid State Circuits is a special issue on Moore’s law. It includes the article by David E. Liddle referenced below that points out many of the subtle and economic implications of Moore’s law. See http://www.ieee.org/portal/site/sscs/menuitem.cfabaffbbacc/index.jsp?&pName=sscs_enews_home&month=&year=. One of the best in-depth articles on the chip manufacturing technology underlying Moore’s law is by Jon Stokes, in Ars Technica (last updated September). http://arstechnica.com/hardware/news///moore.ars.
Bibliography
. Moore GE () Cramming more components onto integrated circuits. Electronics (). ftp://download.intel.com/museum/Moores_Law/Articles-Press_Releases/Gordon_Moore_ _Article.pdf
. Moore GE () Progress in digital integrated electronics. In: Proceedings of the IEEE electron devices meeting, vol . San Francisco, CA, pp –
. Liddle DE () The wider impact of Moore’s law. IEEE Solid State Circuits
. Wirth N () A plea for lean software. Computer ():–. http://cr.yp.to/bib//wirth.pdf
. Gustafson J () The vector gravy train. Supercomputing Review. http://www.scl.ameslab.gov/Publications/Gus/Gravy.html
. Gustafson J () Computational verifiability and feasibility of the ASCI program. IEEE Computing (). doi:http://doi.ieeecomputersociety.org/./.
. Lohr S () Technology; IBM decides to market a blue streak of a computer. The New York Times, June
. Cnet news () http://news.cnet.com/Semi-survival/-_-.html
. Kurzweil R () The singularity is near: when humans transcend biology. Viking Press, New York
MPI (Message Passing Interface) William Gropp University of Illinois at Urbana-Champaign, Urbana, IL, USA
Synonyms Message passing
Definition MPI is an ad hoc standard for writing parallel programs that defines an application programmer interface (API) implementing the message-passing programming model. MPI is very successful and is the dominant programming model for highly scalable programs in computational science. The fastest parallel computers in the world, with more than , cores, run programs written in MPI, and the fastest applications (achieving over PetaFLOPs in ) use MPI. MPI was created in two stages. The original MPI standard [], developed between 1992 and 1994, specified a library interface for point-to-point and collective communication, along with various support routines and other features. The second stage, from 1995 to 1997, extended MPI to include parallel I/O, remote memory access, and creation of additional processes, among other features.
Discussion MPI-1 The MPI standard provides a way to make use of the Communicating Sequential Processes (CSP) programming model. In MPI, conventional processes make explicit calls to library routines defined by the MPI standard to communicate data between two or more processes. In the message-passing model, processes communicate by sending messages. This is a two-sided operation: the sending process describes the data to be sent
and the receiving process describes how to receive the message. More specifically, the sender and the receiver each must describe the following: What data is sent? To whom is it sent? Where (in memory) is it received at the destination process? In addition, the receiver may need to know who sent the data and how much data was sent. In addition, it is common to provide one additional nonnegative integer of information to be communicated from the sender to the receiver. This is called the message tag. It can be thought of as a way to identify a specific kind or type of message to the receiving process, and the receiving process can specify that a particular receive operation will only accept a message with a specific tag (in earlier message-passing systems, this tag was sometimes called the message type). MPI generalizes the notion of what data is sent and to whom the data is sent. The most common specification for data is a pointer to data and a length (e.g., in bytes). For example, the Unix write routine describes the data to write this way. In MPI, this is generalized in two ways. First, the type of data is described in terms of an MPI_Datatype. The basic MPI Datatypes match the basic datatypes in the programming language being used with MPI. Thus for an “int” in C, there is an MPI_INT datatype; for an “INTEGER” in Fortran, there is an MPI_INTEGER datatype. This allows users to describe the amount of data to be sent in natural terms (a count of ints) rather than as a specific number of bytes, which may depend on the particular platform (e.g., is an int 4 or 8 bytes). In addition, this allows an MPI implementation to convert the data being sent to match the receiving process. For example, if the sender uses 4-byte ints in big-endian format and the receiver uses 4-byte ints in little-endian format, the MPI implementation can ensure that the data is correctly received. Thus, to specify the data that is to be sent (or the location into which data is to be received), MPI specifies a 3-tuple: pointer, count, and MPI Datatype. The MPI Datatype parameter allows an additional generalization. In MPI, there are routines to define new, user-defined datatypes in terms of any defined (basic predefined or user-defined) datatype. These datatypes may describe multiple data items and need not be contiguous. For example, MPI provides a routine to define a “vector” datatype that describes blocks of a datatype, separated by a stride in memory. By using such a datatype (or one of the other MPI datatype
constructors) in the 3-tuple describing the data to send (or to receive), the programmer can describe an arbitrary set of data to send or receive. MPI also generalizes how the source and destination processes are described. Perhaps the simplest way to identify the source and destination is to number all of the processes, starting from zero (this works to identify MPI processes belonging to the same executing MPI program; it does not and is not intended to identify the MPI processes outside of the MPI program as a hostname and process id (pid) pair would on a typical cluster). The number between zero and the number of processes minus 1 is called the rank of the process. In early message-passing systems, this rank was all that was necessary to identify the source and destination of a process. In MPI, the rank of a process is relative to a collection of processes. These collections are called MPI_Groups, and the programmer may construct new groups from any existing MPI_Group. When an MPI program starts, there is a group that contains all processes in the MPI program. This group is combined with a communication context (more on that below) in an MPI object, called a Communicator. In MPI-1, all communication is relative to a communicator. When an MPI program starts, the communicator MPI_COMM_WORLD is defined, and each process can determine which rank it is relative to this communicator and how many processes are part of this communicator. Allowing the creation of new communicators containing different numbers of processes makes it much easier to build applications and libraries that can work with subsets of processes. Thus, in MPI, the destination process for a send (or the source process in a receive) is specified with the 2-tuple of rank and communicator. In addition to sending and receiving data, MPI provides routines to initialize and end an MPI program; these are the routines MPI_Init and MPI_Finalize, and routines to determine the number of processes in a communicator (and in the MPI program if the communicator is MPI_COMM_WORLD) and the rank of a process in a communicator. These routines are MPI_Comm_size and MPI_Comm_rank, respectively. With these routines, and with MPI_Send to send a message and MPI_Recv to receive a message, we can show a simple MPI program.
The following example of an MPI program illustrates these basic MPI routines. In this complete C program, the process with rank 0 in MPI_COMM_WORLD sends two integer values to the process with rank 1, which then prints out the values received.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, sbuf[2], rbuf[2];
    int dest, source, tag;
    MPI_Status mystat;

    MPI_Init(&argc, &argv);      /* Initialize MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        sbuf[0] = 42;            /* Any data process zero wants to send */
        sbuf[1] = size;
        dest = 1;                /* destination rank */
        tag  = 0;                /* message tag */
        MPI_Send(sbuf, 2, MPI_INT, dest, tag, MPI_COMM_WORLD);
    }
    else if (rank == 1) {
        source = 0;              /* source rank */
        tag = MPI_ANY_TAG;       /* receive the next message from source,
                                    independent of the tag */
        MPI_Recv(rbuf, 2, MPI_INT, source, tag, MPI_COMM_WORLD, &mystat);
        printf("Received %d %d from sender\n", rbuf[0], rbuf[1]);
        /* mystat.MPI_SOURCE contains the source rank;
           mystat.MPI_TAG contains the tag of the message */
    }
    printf("Process %d is exiting\n", rank);
    MPI_Finalize();              /* No MPI calls after MPI_Finalize executes */
    return 0;
}

Most of these features have already been discussed; the one MPI item that is new here is the MPI_Status object. This is used to contain information about the
source, tag, and size of the message. It is also important to understand how this parallel program executes. When receiving, a specific source and tag may be specified, or MPI_ANY_SOURCE or MPI_ANY_TAG used, respectively. In that case, the MPI_Status object may be used to determine the source and tag, respectively. These also allow the specification of nondeterministic programs. In MPI, each process executes with its own address space. There are no shared variables or direct access to memory in another process (in MPI-1). If the above program is started with two processes (which may be on two processors, two cores of the same processor chip, or even two processes executing on the same core), each process has a full copy of all of the variables. For example, in the above example, there is an “int sbuf[2]” in each process. In addition, MPI programs execute independently; it is only when an MPI routine is called that two or more processes may communicate. In particular, language features or routines that are not part of MPI will execute independently in each process. In this example, the printf right before the call to MPI_Finalize is executed by each process and in whatever order occurs as the program executes.
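For instance, a receiver can accept a message from any sender and then inspect the status object. The following fragment is an illustrative sketch, not part of the original text; it assumes rbuf is declared as in the program above and uses MPI_Get_count to find out how much data actually arrived:

MPI_Status status;
int count;

/* Wildcard receive: accept the next message regardless of source or tag */
MPI_Recv(rbuf, 2, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);

/* The status object reveals who sent it, with which tag, and how much data */
MPI_Get_count(&status, MPI_INT, &count);
printf("Got %d ints from rank %d with tag %d\n",
       count, status.MPI_SOURCE, status.MPI_TAG);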
Major Features of MPI-1 MPI provides a rich set of routines. They are described in separate chapters in the MPI standard, and include both point-to-point (two party) and collective (many party) communication. The core of MPI-1 is the point-to-point communication routines. We have already seen the two most basic routines – MPI_Send and MPI_Recv. MPI provides a number of variations on each of these. One of the most important is the nonblocking versions of these routines. In this case, the corresponding MPI routine only initiates the communication – a separate MPI routine is required to ensure that the communication has completed. This permits the MPI implementation to overlap the execution of these operations with other computation or communication. Another variation is different send modes. In addition to the basic send mode, there is a synchronous send mode, where the send operation completes only when the matching receive begins, and the ready send mode, which requires that the matching receive has already been initiated on the destination
process (permitting optimization of the message delivery). MPI also provides for persistent communication for operations that are repeated, such as communication within a loop. MPI also provides routines to test for the presence of a message and to cancel nonblocking communication. MPI Datatypes have already been mentioned; MPI provides routines to create datatypes that represent regularly strided data (vector), blocks of different size (similar to the Unix IOV structure), and (in MPI-2) blocks of the same size. By specifying a data layout with an MPI datatype, rather than by having the user first gather the data into a contiguous buffer and then send it, the MPI library can eliminate a memory copy and, in some cases, exploit special instructions to more efficiently move data. In addition to point-to-point communication, MPI provides a rich set of collective communication and computation routines. For example, the routine MPI_Bcast allows one process to broadcast the same data to every other process in an MPI communicator. Other routines provide one-to-many (MPI_Scatter), many-to-one (MPI_Gather), and many-to-many (MPI_Alltoall) communication. These routines are provided because efficient, scalable implementations require great care and depend upon the specifics of the underlying network hardware. Collective computation is also very important, and MPI provides routines to perform parallel prefix (MPI_Scan), reduction (MPI_Reduce), reduction to all (MPI_Allreduce), and a distributed reduction (MPI_Reduce_scatter). Like the point-to-point communication routines, data is described by the 3-tuple of pointer, count, and datatype. An innovative feature of the MPI API is the profiling interface. Each MPI routine is available in two versions: one with the MPI prefix and one with a PMPI prefix. The PMPI version performs exactly the same operations as the MPI version. The user is permitted to override the MPI versions. Thus, to count the number of times that MPI_Send is used in a program, all an MPI programmer needs to do is to compile the following code and link it with their application:

#include "mpi.h"

static int sendCount = 0;

int MPI_Send(void *buf, int count, MPI_Datatype dtype,
             int rank, int tag, MPI_Comm comm)
{
    sendCount++;
    return PMPI_Send(buf, count, dtype, rank, tag, comm);
}

int MPI_Finalize(void)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Process %d called MPI_Send %d times\n", rank, sendCount);
    return PMPI_Finalize();
}

This works on any system and does not rely on operating system-specific features. A number of tools take advantage of this feature to provide performance and correctness debugging information about an MPI program. As mentioned above, communication in MPI is relative to communicators that contain a group of processes and a communication context. This communication context is not accessible to the user but is a hidden property of the communicator. The reason for this is to support modular programming, where different suppliers provide software libraries that use MPI. These libraries must be certain that their communication cannot be intercepted by other routines in the application. Without private communication contexts, it isn’t possible to make this guarantee. Thus, each communicator in MPI contains a group of processes and a private communication context. A new communication context is created every time a new communicator is created. The routine MPI_Comm_dup creates a new communicator with the same group of processes but a new communication context; it is recommended that libraries using MPI call this routine to get a private communicator and use that communicator for all communication. While the communication context is not directly accessible to the user, the group of processes is, and may be manipulated by the user. However, for many users, the best way to create a new communicator is by “splitting” an existing communicator into new ones with MPI_Comm_split. MPI also provides a set of routines to create and use virtual process topologies; these are arrangements of
processes so that each process has a natural set of neighbors. MPI_Cart_create, for example, takes as input an MPI communicator and the description of a Cartesian mesh and returns a new communicator, where the neighbors of each process in the mesh of processes can be determined and where the processes may be mapped onto the physical network in such a way that communication with neighbor processes is especially efficient. Other features in MPI-1 allow data to be cached on a communicator and allow the programmer to define the behavior when a recoverable error is encountered. The rich set of features of MPI, along with the focus on performance and completeness, helped MPI rapidly become the preferred way to use the message-passing programming model on parallel computers. But within two years of the release of MPI, users were asking for additional features. To respond to that need, the MPI Forum reconvened and began defining a set of extensions to MPI.
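As a small illustration of the communicator and collective features described above (a sketch, not taken from the original entry; the four-ranks-per-row layout and the file-free sum are arbitrary choices), the following program splits MPI_COMM_WORLD into row communicators and performs a reduction within each row:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, row, rowrank, value, rowsum;
    MPI_Comm rowcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    row = rank / 4;      /* processes with the same color share a communicator */
    MPI_Comm_split(MPI_COMM_WORLD, row, rank, &rowcomm);
    MPI_Comm_rank(rowcomm, &rowrank);

    value = rank;
    MPI_Allreduce(&value, &rowsum, 1, MPI_INT, MPI_SUM, rowcomm);
    printf("World rank %d (row %d, row rank %d): row sum = %d\n",
           rank, row, rowrank, rowsum);

    MPI_Comm_free(&rowcomm);
    MPI_Finalize();
    return 0;
}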
MPI-2 MPI-2 [] defines a set of extensions to MPI-1: any valid MPI-1 program is a valid MPI-2 program. While MPI-2 roughly doubled the number of routines in MPI, most of the additions were in four areas: remote memory access, dynamic processes, parallel I/O, and C++ and Fortran language bindings.
Major Features of MPI-2 Remote memory access (RMA) provides a “one-sided” model of communication, where one process specifies both the source and the destination of the communication. Sometimes called “put/get” programming, MPI provides routines to put data into a remote process (MPI_Put), get data from a remote process (MPI_Get), and update data in a remote process (MPI_Accumulate). This resembles more conventional shared memory programming, however, with several important differences. First, all data transfers still happen as a result of a call to an MPI routine. The process performing the transfer is called the origin; the process that is the target of the transfer is called the target. Second, the local and remote completion of that transfer may require additional MPI calls. While the details are
quite lengthy and subtle, there are two modes in MPI. The first, called active target, requires each process to call a routine to begin and end both use of RMA routines (the access epoch) and access by other processes (the exposure epoch). The Bulk Synchronous Programming model has a very similar form. The second mode is called passive target. In this mode, the origin process initiates and completes the RMA operation without requiring that any MPI routine be called on the target process. This provides something similar to a conventional shared memory model. The reason for these two modes is that there are implementation issues and opportunities for optimization with each; the programmer can pick the one that is most appropriate for their needs. The following code fragment is an example of the MPI one-sided interface. It implements the same communication of data as the point-to-point example above. This example uses the active target synchronization with the routine MPI_Win_fence. MPI also provides routines to specify the region of memory that is exposed to access or updates from remote processes. The MPI_Win object contains this information and can be considered as the one-sided version of the MPI Communicator.

MPI_Win_create(rbuf, 2*sizeof(int), sizeof(int), …, &win);
MPI_Win_fence(0, win);
if (rank == 0) {
    MPI_Put(sbuf, 2, MPI_INT, 1, 0, 2, MPI_INT, win);
}
MPI_Win_fence(0, win);
if (rank == 1) {
    printf("Received %d %d from sender\n", rbuf[0], rbuf[1]);
}
MPI_Win_free(&win);

In the MPI-1 model, the number of processes is fixed when the program begins. In MPI-2, routines are provided to create new MPI processes as part of the same running MPI program (e.g., MPI_Comm_spawn) and to connect two MPI programs that are already running (e.g., MPI_Comm_connect and MPI_Comm_accept). These routines return a special kind of
communicator (called an intercommunicator) that contains two groups of processes. In the case of MPI_Comm_spawn, the two groups are the group of processes that called MPI_Comm_spawn and the group of processes created by MPI_Comm_spawn. Intercommunicators are part of MPI-1, but until MPI-2, they were rarely used. MPI-2 also provides a rich set of parallel I/O routines. The viewpoint taken is that writing to a file is like sending a message and reading from a file is like receiving a message. Following that model, MPI provides routines to read and write data, using the same 3-tuple to describe the data in the MPI program. By using user-defined MPI datatypes to set a “file view” (describing which parts of a file each of the MPI processes can access), very general I/O patterns can be described in a single MPI call. And in addition to independent I/O routines, MPI provides collective I/O routines, where all processes (in the communicator used to open the file with MPI_File_open) perform coordinated I/O operations. These collective I/O routines can provide far more scalable performance than the independent I/O routines. MPI-1 specified bindings to C and Fortran 77, and MPI-2 added a binding to C++ and defined Fortran 90 bindings. These bindings are relatively simple; there was little attempt, for example, to provide a high-level C++ binding. In the original MPI model, a collection of processes is created (by some mechanism not defined by the MPI standard) to begin the execution of an MPI program. However, many implementations provided a program, mpirun, to start an MPI program. To provide a more standard interface, the MPI-2 standard recommends (but does not require) a program mpiexec to start MPI programs. MPI-2 has been a mixed success. The MPI I/O routines are widely available and used by software libraries such as HDF and parallel netCDF. The one-sided (RMA) routines are available but with uneven implementation quality; in addition, while the MPI RMA model permits an interesting class of applications (including client-server), it is a poor match to other one-sided applications. The absence of a read–modify–write operation makes operations such as fetch-and-increment very difficult. Despite these problems, MPI
remains the dominant parallel programming model in computational science.
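As an illustration of the parallel I/O model described above (a sketch, not from the original entry; the file name, block size, and the choice of a collective write at explicit offsets are assumptions made for this example), each process writes its own disjoint block of a shared file:

/* Fragment: assumes it runs inside main after MPI_Init, with <mpi.h> included. */
#define NUM 4
int rank, i, buf[NUM];
MPI_File fh;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
for (i = 0; i < NUM; i++)
    buf[i] = rank * NUM + i;                 /* data private to this process */

MPI_File_open(MPI_COMM_WORLD, "out.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

/* Collective write: every process writes NUM ints at a disjoint byte offset */
MPI_File_write_at_all(fh, (MPI_Offset)rank * NUM * sizeof(int),
                      buf, NUM, MPI_INT, MPI_STATUS_IGNORE);

MPI_File_close(&fh);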
The Future of MPI MPI continues to evolve. The MPI Forum has begun discussions on the next version of MPI, which will be called MPI-3, and will most likely be a further extension of the current MPI-2 API. Among the features under discussion for MPI-3 are the following. Blocking collective communication operations can limit the scalability of algorithms. During the development of MPI-2, the MPI Forum felt that users could implement nonblocking versions of the collectives simply by creating a thread and calling a blocking MPI collective routine from within the thread. In practice, this was not sufficient. In some cases, the overhead of thread support was too costly, particularly for collectives involving small amounts of data. In addition, some systems, such as the IBM Blue Gene/L (one of the most powerful parallel computers), allowed only one thread per core, making this approach inefficient. MPI-3 is likely to offer nonblocking collectives, similar to the nonblocking point-to-point and nonblocking parallel I/O routines in MPI. The remote-memory access interface defined in MPI-2 was designed to support certain types of operations, such as halo exchanges used in numerical simulations, and it works well for those. But it is ill suited to more dynamic uses of RMA, and it is not an effective API for the operations needed by implementations of Partitioned Global Address Space (PGAS) languages. Other APIs, such as ARMCI and GASNET, offer a more flexible interface (while requiring either more hardware support or less precise semantics than MPI). The MPI Forum is exploring ways to extend the MPI RMA model to make it more general purpose. One strength of MPI has been its support for components and multilingual programming. The MPI Forum is looking at ways to ensure that MPI works well with a variety of programming models, from SMP parallelism to coexistence with PGAS languages such as UPC and CAF. Large-scale systems with millions of processor cores, on which MPI is likely to be used, are likely to suffer frequent faults. MPI (starting with the original MPI-1) does provide a flexible error-handling model. But after
an error, the behavior of MPI is undefined. The MPI Forum is attempting to define what faults and situations the MPI implementation should continue through (while indicating an error) to enable the development of portable, fault-tolerant programs. These efforts show that MPI will evolve to address new capabilities and requirements in large-scale parallel computing systems. See www.mpi-forum.org for information on the current status of MPI and the MPI standardization efforts.
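The sketch below (not part of the original entry) shows the error-handling hooks mentioned above; the deliberately invalid rank used to trigger an error is an assumption of this example, and, as the text notes, the standard does not define the state of MPI after an error has been reported.

/* Minimal sketch (not from the original entry) of MPI's error-handling
 * hooks.  By default, errors on a communicator are fatal
 * (MPI_ERRORS_ARE_FATAL); switching to MPI_ERRORS_RETURN lets the
 * application examine the error code itself. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int err, len;
    char msg[MPI_MAX_ERROR_STRING];
    double buf = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Deliberately invalid destination rank, so the call fails and
     * returns an error code instead of aborting. */
    err = MPI_Send(&buf, 1, MPI_DOUBLE, -99, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "MPI_Send failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}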
More on History of MPI MPI really began at the Workshop on Standards for Message Passing in a Distributed Memory Environment in Williamsburg, Virginia, held in April 1992. The meeting was called by Ken Kennedy, who had the foresight to see both the need and the opportunity for a specification of a standard library for message passing. A preliminary proposal, developed by Dongarra, Hempel, Hey, and Walker, led the community to come together to develop a standard. An organizational meeting held at the Supercomputing conference in Minneapolis in November 1992 laid out the goals, along with a commitment by the author of this entry to maintain an implementation of the current draft of the standard. This draft and reference implementation helped both to refine the standard and to provide initial implementations on a variety of platforms, aiding in the adoption of the new standard. Today, there are many high-quality implementations, both open source (such as MPICH and Open MPI) and proprietary implementations from the major parallel computer vendors. MPI was developed by a group of researchers, application specialists, and vendor representatives. An open process was used, with regular meetings every few weeks. Email (and later wiki) discussions were used to engage the broader community. Votes, apportioned as one per participating institution, were used to choose the standard; voting at two consecutive meetings was required, allowing for "buyer's remorse." More details on the process are in [].
Conclusion MPI is possibly the most successful parallel programming model ever, having been in use for nearly two decades and
continuing to go strong. Built upon the sound basis of communicating sequential processes and using a truly open process to define the interface, MPI, particularly for the highly scalable systems, has become the parallel programming model to beat.
Related Entries
CSP (Communicating Sequential Processes)
HDF
MPI-IO
NetCDF I/O Library, Parallel
Bibliographic Notes and Further Reading Annotated versions of the MPI standards are available in [, ]. Official versions of the documents are available at www.mpi-forum.org. A tutorial introduction to MPI-1 and MPI-2 is available in [, ]. There are many other tutorial and research publications on MPI; there is one annual conference devoted to MPI: the EuroMPI (formerly EuroPVM/MPI) meetings produce proceedings every year in Springer's Lecture Notes in Computer Science (e.g., see []).
Bibliography
. Gropp W, Huss-Lederman S, Lumsdaine A, Lusk E, Nitzberg B, Saphir W, Snir M () MPI – The complete reference, vol , The MPI- extensions. MIT Press, Cambridge, MA
. Gropp W, Lusk E, Skjellum A () Using MPI: portable parallel programming with the message passing interface, nd edn. MIT Press, Cambridge, MA
. Gropp W, Lusk E, Thakur R () Using MPI-: advanced features of the message-passing interface. MIT Press, Cambridge, MA
. Hempel R, Walker DW () The emergence of the MPI message passing standard for parallel computing. Comput Stand Interfaces :–
. Keller R, Gabriel E, Resch MM, Dongarra J () Recent advances in the message passing interface – th European MPI users' group meeting. Springer, Stuttgart
. Message Passing Interface Forum () MPI: A message passing interface standard. High Perform Comput Appl :–
. Message Passing Interface Forum () MPI: a message passing interface standard. Int J Supercomput Appl :–
. Snir M, Otto S, Huss-Lederman S, Walker D, Dongarra J () MPI – the complete reference, vol , The MPI Core. MIT Press, Cambridge, MA
MPI-2 I/O
MPI-IO
MPI-IO
Jean-Pierre Prost
Morteau, France
Synonyms MPI-2 I/O
Definition MPI-IO is a portable interface defined by the Message Passing Interface (MPI) Forum for performing parallel I/O operations within distributed memory programs, leveraging key MPI concepts such as communicators, datatypes, and collective operations. It was first introduced as the I/O chapter of the second specification of the Message Passing Interface, referred to as MPI-2.
Discussion Introduction In June 1994, the MPI Forum released their first draft, MPI 1.0, defining point-to-point and collective communication operations between tasks (i.e., virtual processes) within a given context, called a communicator. Point-to-point communication operations could be either blocking or nonblocking. This draft also introduced the concept of a datatype describing a virtual data layout in the memory of the sending or the receiving process(es). In March 1995, a second draft, MPI 1.1, was released to correct errors and make clarifications to the first draft. Until July 1997, implementations of the MPI 1.1 draft were released by major computer companies and by open source initiatives, and practical experience with the "standard" was gathered by numerous distributed memory application developers. Among the shortcomings of the draft specification that this experimentation helped identify was the lack
of support for parallel I/O in a portable way across existing and emerging distributed and parallel file systems, limiting the scope of distributed memory applications to in-memory processing or to coding using either proprietary I/O interfaces or the POSIX interface that these file systems supported. Therefore, in July 1997, the MPI Forum published the second version of MPI, referred to as MPI 2.0, which contained an I/O chapter specifying a portable interface for parallel I/O aimed at MPI applications. MPI-IO was born. The key MPI concepts are leveraged by the MPI-IO specification. The basic idea behind MPI-IO is to consider reading a file like receiving data from the file into virtual memory and writing a file like sending data from virtual memory into the file. The file can therefore be viewed as either a sending task in read operations or a receiving task in write operations. With this analogy in mind, independent I/O read and write operations are like point-to-point receive and send operations, respectively, and collective I/O operations are like collective communication operations between the group of initiating tasks in the context of a communicator and the file. These read and write operations, just like communication operations, can be blocking or nonblocking. Describing where to read data from within the file, or where to write data into it, is then only a matter of defining a file datatype, referred to as a file view, after opening the file. In order to take better advantage of parallel file system specifics and reach optimal performance without sacrificing application portability across parallel systems, file hints are introduced. Each target runtime environment is free to support a given file hint or not, and setting an unsupported file hint on an open file simply has no (positive or negative) effect.
Access Pattern Expression In order to specify the access pattern associated with an open file, a file view must be defined by each task accessing the file, unless the default view is to be used and each task sees the file as a linear byte stream. A file view essentially specifies:
● An absolute displacement into the file (in bytes), specifying where the file pointer ought to be placed in the file before the next access takes place by the calling task;
● An elementary datatype (etype), representing the MPI datatype of each element within the file to be accessed by the calling task;
● A file datatype (filetype), representing the layout of the elements to be accessed by the calling task; this layout must be an MPI datatype derived from the etype.
Typically, all tasks participating in the collaborative access of a given file set complementary views of the file, in order to partition the file data among themselves. Figure shows how a file can be partitioned in a cyclic manner across three tasks by chunks of two elements each. A file view is associated with an open file by modifying the file handle returned when the file was opened. At any given time, only one file view can be set by a given task on a given open file. Setting a new file view changes the current file view. Note: The default file view corresponds to a displacement of zero bytes, and to the etype and the filetype set to MPI_BYTE.
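The following sketch (not part of the original entry) shows one way the cyclic partitioning in the figure could be expressed with a file view; the choice of MPI_INT as the etype and the file name "datafile" are assumptions made for this illustration.

/* Illustrative sketch (not from the original entry): task `rank` of
 * three tasks sets a view that selects every third chunk of two
 * integers, matching the cyclic partitioning in the figure.
 * MPI_INT as the etype and the file name "datafile" are assumptions. */
#include <mpi.h>

void set_cyclic_view(MPI_Comm comm, MPI_File *fh)
{
    int rank;
    MPI_Datatype contig, filetype;

    MPI_Comm_rank(comm, &rank);
    MPI_File_open(comm, "datafile", MPI_MODE_RDWR, MPI_INFO_NULL, fh);

    /* A chunk of two etype elements ... */
    MPI_Type_contiguous(2, MPI_INT, &contig);
    /* ... resized so that consecutive chunks of this filetype are
     * 6 elements (3 tasks x 2 elements) apart in the file. */
    MPI_Type_create_resized(contig, 0, (MPI_Aint)(6 * sizeof(int)), &filetype);
    MPI_Type_commit(&filetype);

    /* The displacement places each task on its own first chunk. */
    MPI_File_set_view(*fh, (MPI_Offset)(2 * rank * sizeof(int)),
                      MPI_INT, filetype, "native", MPI_INFO_NULL);

    MPI_Type_free(&contig);
    MPI_Type_free(&filetype);
}

With such a view in place, a subsequent read by a task delivers only the elements selected by its own filetype.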
Data Access Functions File data can be accessed either independently by each task, which has opened the file, or collectively by all tasks having opened the file. The area of the task’s virtual memory involved in the data access is specified by a memory buffer’s address, an MPI datatype, and a count. The region of the file to be accessed is specified by the file view associated with the open file and a file
position, which represents the location within the file where the access is to begin. The file position is expressed either as an explicit offset within the file view, i.e., as a number of etype elements to be skipped within the file view before reading or writing file elements, or as the current position of the task's file pointer. Each open file has two types of file pointers maintained by the MPI environment: an individual file pointer for each task and a shared file pointer shared by all tasks having opened the file. The file pointer to be used for a given data access is determined by the name of the data access function called. After opening a file, each task's file pointer and the shared file pointer are set to zero and point to the byte within the file associated with the displacement specified in the file view, or to the beginning of the file if the active file view is the default view. Note that the same file view must be set by all tasks having opened the file when the shared file pointer is to be used. All data accesses can be either blocking or nonblocking. In the case of blocking calls, the call returns to the task when the data has been read from (or written to) the file. In the case of nonblocking calls, the call returns immediately to the task, which may later either test whether the data has been read from (or written to) the file (by calling MPI_Test or MPI_Testall) or wait until it has (by calling MPI_Wait or MPI_Waitall). Note, however, that nonblocking collective data access functions are called split collective, since they are made of two suboperations: a (nonblocking) begin operation that initiates the access and a (blocking) end operation that completes the access for all tasks synchronously.
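As a hedged illustration of these categories (not part of the original entry), the sketch below issues one independent nonblocking read at an explicit offset and one split collective read on the individual file pointer; the buffer sizes and the offset are arbitrary assumptions.

/* Illustrative sketch (not from the original entry) of the two
 * nonblocking flavors described above.  `fh` is assumed to be an open
 * file whose view uses MPI_INT as the etype. */
#include <mpi.h>

void read_examples(MPI_File fh)
{
    int a[4], b[4];
    MPI_Request req;
    MPI_Status status;

    /* Independent, nonblocking, explicit offset: start the read,
     * overlap other work, then wait for completion. */
    MPI_File_iread_at(fh, /*offset=*/0, a, 4, MPI_INT, &req);
    /* ... unrelated computation could go here ... */
    MPI_Wait(&req, &status);

    /* Collective, split collective, individual file pointer: every
     * task that opened the file must call both the begin and end parts. */
    MPI_File_read_all_begin(fh, b, 4, MPI_INT);
    /* ... unrelated computation could go here ... */
    MPI_File_read_all_end(fh, b, &status);
}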
MPI-IO. Fig. Partitioning an open file across three tasks (the figure shows a file of etype elements 1–22 distributed cyclically in chunks of two: filetype1 selects Task 1's chunks, filetype2 selects Task 2's, and filetype3 selects Task 3's)
MPI-IO. Table Data access function characteristics

Data access function name                                    Independent/collective   Blocking/nonblocking   File position
MPI_File_read_at                                             Independent              Blocking               Explicit offset
MPI_File_write_at                                            Independent              Blocking               Explicit offset
MPI_File_iread_at                                            Independent              Nonblocking            Explicit offset
MPI_File_iwrite_at                                           Independent              Nonblocking            Explicit offset
MPI_File_read_at_all                                         Collective               Blocking               Explicit offset
MPI_File_write_at_all                                        Collective               Blocking               Explicit offset
MPI_File_read_at_all_begin / MPI_File_read_at_all_end        Collective               Split collective       Explicit offset
MPI_File_write_at_all_begin / MPI_File_write_at_all_end      Collective               Split collective       Explicit offset
MPI_File_read                                                Independent              Blocking               Individual fp
MPI_File_write                                               Independent              Blocking               Individual fp
MPI_File_iread                                               Independent              Nonblocking            Individual fp
MPI_File_iwrite                                              Independent              Nonblocking            Individual fp
MPI_File_read_all                                            Collective               Blocking               Individual fp
MPI_File_write_all                                           Collective               Blocking               Individual fp
MPI_File_read_all_begin / MPI_File_read_all_end              Collective               Split collective       Individual fp
MPI_File_write_all_begin / MPI_File_write_all_end            Collective               Split collective       Individual fp
MPI_File_read_shared                                         Independent              Blocking               Shared fp
MPI_File_write_shared                                        Independent              Blocking               Shared fp
MPI_File_iread_shared                                        Independent              Nonblocking            Shared fp
MPI_File_iwrite_shared                                       Independent              Nonblocking            Shared fp
MPI_File_read_ordered                                        Collective               Blocking               Shared fp
MPI_File_write_ordered                                       Collective               Blocking               Shared fp
MPI_File_read_ordered_begin / MPI_File_read_ordered_end      Collective               Split collective       Shared fp
MPI_File_write_ordered_begin / MPI_File_write_ordered_end    Collective               Split collective       Shared fp
Table specifies, for each data access function, whether the access is independent or collective, whether it is blocking or nonblocking, and whether it uses an explicit offset or the individual (or shared) file pointer (fp) to express the file position. Assuming fh refers to an open file whose file view has just been set to the view described in Fig. , Table above identifies the calling task and the etype elements read or written by each of the data access function calls listed. It is assumed that buf points to a valid task memory area that contains the appropriate number of etype elements for a write or can hold the appropriate number of etype elements for a read.
MPI-IO. Table Examples of data access function calls Data access function call and calling task(s)
Etype elements read or written
Task : MPI_File_read_at(fh, , buf, , etype, status);
Task : elements , ,
Task : MPI_File_write_at_all(fh, , buf, , etype, status);
Task : , , , – Task : , , , – Task : , , ,
Task : MPI_File_write_at_all(fh, , buf, , etype, status); Task : MPI_File_write_at_all(fh, , buf, , etype, status); Task : MPI_File_read(fh, buf, , etype, status);
Task : , , , ,
Task :MPI_File_read(fh, buf, , etype, status);
Task : ,
Task : MPI_File_iwrite(fh, buf, , etype, status);
Task : , ,
Task : MPI_File_read_all_begin(fh, buf, , etype);
Task : , , – Task : , , – Task : , ,
Task : MPI_File_read_all_begin(fh, buf, , etype); Task : MPI_File_read_all_begin(fh, buf, , etype); Task : MPI_File_read_all_end(fh, buf, status); Task : MPI_File_read_all_end(fh, buf, status); Task : MPI_File_read_all_end(fh, buf, status);
MPI-IO. Fig. Matrix partitioning across four computing tasks (the figure shows FileA * FileB = FileC, with each of Tasks 0–3 responsible for one slice of FileC)
File Consistency File consistency determines the outcome of concurrent and conflicting accesses to the same file by multiple tasks. Two accesses are concurrent if the respective data spans associated with the accesses overlap. They conflict
if they are concurrent and at least one access is a write operation. Sequential consistency of concurrent and conflicting accesses to the same file guarantees each access is atomic and the outcome of the accesses is the same as
#include "sys/types.h"
#include "stdlib.h"
#include "stdio.h"
#include "string.h"
#include "mpi.h"

#define P  4             /* number of tasks */
#define Nb 64            /* slice size */
#define Ng ( Nb * P )    /* global size of matrix */
static char fileA[] = "fileA"; static char fileB[] = "fileB"; static char fileC[] = "fileC"; void main( int argc, char *argv[] ) { int i, j, k; int iv; int myrank, commsize; int extent; int colext; int slicelen; int sliceext; int length[ 3 ]; MPI_Aint disp[ 3 ]; MPI_Datatype type[ 3 ]; MPI_Datatype ftypeA; MPI_Datatype ftypeB; MPI_Datatype ftypeC; MPI_Datatype slice_typeB; MPI_Datatype block_typeC; MPI_Datatype slice_typeC; int mode; MPI_File fh_A, fh_B, fh_C; MPI_Offset offsetA; MPI_Offset offsetB; MPI_Offset offsetC; MPI_Status status; int val_A[ Nb ][ Ng ], val_B[ Ng ][ Nb ], val_C[ Nb ][ Nb ]; /* initialize MPI */ MPI_Init( &argc, &argv ); MPI_Comm_rank( MPI_COMM_WORLD, &myrank ); MPI_Comm_size( MPI_COMM_WORLD, &commsize ); MPI-IO. Fig. Sample code for an out-of-core D matrix multiplication
that which would have been achieved had the accesses been performed in some serial order, although that order is unspecified. By default, sequential consistency is only ensured by MPI-IO for concurrent accesses to the same file by tasks that opened it in the same collective operation.
Sequential consistency for conflicting accesses can only be ensured across a set of tasks if they enabled the atomic mode when they collectively opened the file. For all other concurrent and conflicting accesses, sequential consistency cannot be ensured unless the
MPI_Type_extent( MPI_INT, &extent ); colext = Nb * extent; slicelen = Ng * Nb; sliceext = slicelen * extent; /* create filetype for matrix A */ length[ 0 ] = 1; length[ 1 ] = slicelen; length[ 2 ] = 1; disp[ 0 ] = 0; disp[ 1 ] = sliceext * myrank; disp[ 2 ] = sliceext * P; type[ 0 ] = MPI_LB; type[ 1 ] = MPI_INT; type[ 2 ] = MPI_UB; MPI_Type_struct( 3, length, disp, type, &ftypeA ); MPI_Type_commit( &ftypeA ); /* create filetype for matrix B */ MPI_Type_vector( Ng, Nb, Ng, MPI_INT, &slice_typeB ); MPI_Type_hvector( P, 1, colext, slice_typeB, &ftypeB ); MPI_Type_commit( &ftypeB ); /* create filetype for matrix C */ MPI_Type_vector( Nb, Nb, Ng, MPI_INT, &block_typeC ); MPI_Type_hvector( P, 1, colext, block_typeC, &slice_typeC ); length[ 0 ] = 1; length[ 1 ] = 1; length[ 2 ] = 1; disp[ 0 ] = 0; disp[ 1 ] = sliceext * myrank; disp[ 2 ] = sliceext * P; type[ 0 ] = MPI_LB; type[ 1 ] = slice_typeC; type[ 2 ] = MPI_UB; MPI_Type_struct( 3, length, disp, type, &ftypeC ); MPI_Type_commit( &ftypeC ); /* open files */ mode = MPI_MODE_RDONLY; MPI_File_open( MPI_COMM_WORLD, fileA, mode, MPI_INFO_NULL, &fh_A ); mode = MPI_MODE_RDONLY; MPI_File_open( MPI_COMM_WORLD, fileB, mode, MPI_INFO_NULL, &fh_B ); mode = MPI_MODE_CREATE | MPI_MODE_WRONLY; MPI_File_open( MPI_COMM_WORLD, fileC, mode, MPI_INFO_NULL, &fh_C ); MPI-IO. Fig. (Continued)
tasks accessing the same file explicitly perform appropriate file sync operations collectively among themselves.
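The following sketch (not part of the original entry) illustrates the two mechanisms just described, atomic mode and explicit sync operations; showing both in one routine is purely for illustration, and the buffer, count, and offset parameters are assumptions.

/* Illustrative sketch (not from the original entry) of the two ways to
 * obtain sequential consistency for conflicting accesses.  `fh` is
 * assumed to be a file opened collectively on `comm`. */
#include <mpi.h>

void make_writes_visible(MPI_Comm comm, MPI_File fh, int *block, int count,
                         MPI_Offset my_offset)
{
    /* Option 1: atomic mode. Every access becomes atomic and
     * sequentially consistent among the tasks that opened the file. */
    MPI_File_set_atomicity(fh, 1);
    MPI_File_write_at(fh, my_offset, block, count, MPI_INT,
                      MPI_STATUS_IGNORE);
    MPI_File_set_atomicity(fh, 0);

    /* Option 2: nonatomic mode. A sync/barrier/sync sequence makes data
     * written by one task visible to subsequent reads by the others. */
    MPI_File_write_at(fh, my_offset, block, count, MPI_INT,
                      MPI_STATUS_IGNORE);
    MPI_File_sync(fh);
    MPI_Barrier(comm);
    MPI_File_sync(fh);
}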
File Hints for Performance File hints can be specified when a task opens a file in order to take advantage of parallel file system specifics or minimize the use of system resources,
leading to optimized parallel I/O when accessing that file. Reserved hints are defined by MPI-IO to set the access style to the open file (random, sequential, reverse sequential, read mostly, write mostly, read once, write once), to enable collective buffering (allowing the same buffer to be shared across accessing tasks), to specify an
/* set file error-handlers to MPI_ERRORS_ARE_FATAL in order to shorten sample code (default file error handler is MPI_ERRORS_RETURN) */ MPI_File_set_errhandler( fh_A, MPI_ERRORS_ARE_FATAL ); MPI_File_set_errhandler( fh_B, MPI_ERRORS_ARE_FATAL ); MPI_File_set_errhandler( fh_C, MPI_ERRORS_ARE_FATAL ); /* set views */ MPI_File_set_view ( fh_A, MPI_OFFSET_ZERO, MPI_INT, ftypeA, "native", MPI_INFO_NULL ); MPI_File_set_view( fh_B, MPI_OFFSET_ZERO, MPI_INT, ftypeB, "native", MPI_INFO_NULL ); MPI_File_set_view( fh_C, MPI_OFFSET_ZERO, MPI_INT, ftypeC, "native", MPI_INFO_NULL ); offsetA = MPI_OFFSET_ZERO; offsetB = MPI_OFFSET_ZERO; offsetC = MPI_OFFSET_ZERO; /* read horizontal slice of A */ MPI_File_read_at_all( fh_A, offsetA, val_A, slicelen, MPI_INT, &status ); /* loop on vertical slices */ for ( iv = 0; iv < P; iv++ ) { /* read next vertical slice of B */ MPI_File_read_at_all( fh_B, offsetB, val_B, slicelen, MPI_INT, &status ); offsetB += slicelen; /* compute block product */ for ( i = 0; i < Nb; i++ ) { for ( j = 0; j < Nb; j++ ) { val_C[ i ][ j ] = 0; for ( k = 0; k < Ng; k++ ) { val_C[ i ][ j ] += val_A[ i ][ k ] * val_B[ k ][ j ]; } } } /* write block of C */ MPI_File_write_at_all( fh_C, offsetC, val_C, Nb * Nb, MPI_INT, &status ); offsetC += Nb * Nb; } /* close files */ MPI_File_close( &fh_A ); MPI_File_close( &fh_B ); MPI_File_close( &fh_C ); /* free filetypes */ MPI_Type_free( &ftypeA ); MPI_Type_free( &slice_typeB ); MPI_Type_free( &ftypeB ); MPI_Type_free( &block_typeC ); MPI_Type_free( &slice_typeC ); MPI_Type_free( &ftypeC ); /* finalize MPI */ MPI_Finalize(); }
MPI-IO. Fig. (Continued)
I/O node list and a striping pattern across these nodes. Each implementation is free to support or not these reserved hints. If they are not supported, they are simply ignored. If they are supported, the implementation
should comply with the semantics defined for them in the MPI-IO specification. Additional implementation specific hints can be supported by an implementation. The naming of these
hints should be chosen in such a way that conflicting names across implementations be prevented. For instance, IBM defined additional hints in their MPI-IO implementation on top of the IBM General Parallel File System (GPFS) in order to specify a sparse access style to the file, to control GPFS data prefetching, or to benefit from GPFS data shipping feature. This feature partitions a file across a specific number of I/O agents and ships all data accesses from the MPI tasks to these I/O agents over the interconnecting network, hence limiting the number of concurrent tasks accessing the file and grouping several small accesses into fewer larger data blocks to be accessed. It is clear that file hints are to be used cautiously and only when relevant to the I/O patterns exhibited by the application on a per file basis. Applying inappropriate file hints, when supported by a given implementation, may lead to decreased performance.
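A hedged sketch of how hints are passed (not part of the original entry) is shown below; "access_style" and "striping_factor" are reserved MPI-IO hints, while the particular values and the file name are assumptions made for this example. An implementation that does not support a hint simply ignores it.

/* Illustrative sketch (not from the original entry): passing file hints
 * through an MPI_Info object at open time. */
#include <mpi.h>

MPI_File open_with_hints(MPI_Comm comm)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "access_style", "write_once,sequential");
    MPI_Info_set(info, "striping_factor", "16");  /* stripe across 16 I/O devices */

    MPI_File_open(comm, "results.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    MPI_Info_free(&info);
    return fh;
}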
Sample Code: Out-of-Core Matrix Multiplication In order to illustrate the main MPI-IO concepts, Fig. above depicts the matrix partitioning scheme that is used in order to perform the out-of-core multiplication of two square matrices A and B. Each task computes a slice of the resulting matrix C. Figure presents the corresponding code. Matrix A is initially stored in file "fileA," matrix B in file "fileB," and the result of the multiplication, matrix C, is stored in file "fileC." All matrices are 256 × 256 square matrices of integers (Ng × Ng, with Ng = Nb × P = 64 × 4 in the code), stored in row-major order.
Related Entries
Benchmarks
Collective Communication
File Systems
I/O
Metrics
MPI (Message Passing Interface)
Bibliographic Notes and Further Reading The MPI-IO interface [] was defined in the late 1990s by the Message Passing Interface Forum [] in an attempt to provide an interface for portable and efficient parallel I/O operations in distributed memory programming environments.
Implementations have been developed over the years on top of various parallel file systems, such as the General Parallel File System on IBM Scalable Power systems [] and on the Blue Gene/L supercomputer [], the Lustre file system [], the Parallel Virtual File System Version [], or the NEC SX Global File System []. An open source implementation of MPI-IO, referred to as ROMIO [], is available. It provides portability onto various back-end parallel file systems through an abstract-device I/O interface []. Performance studies [, ] and benchmarks [] have been carried out to demonstrate the efficiency that can be achieved for read and write operations through specific I/O strategies [, ] and the use of MPI-IO’s file hints [].
Bibliography
. Allsopp NK, Hague JF, Prost JP () Experiences in using MPI-IO on top of GPFS for the IFS weather forecast code. In: Proc Euro-Par, Manchester, UK
. Corbett P, Feitelson D, Fineberg S, Hsu Y, Nitzberg B, Prost JP, Snir M, Traversat B, Wong P () Overview of the MPI-IO parallel I/O interface. Chapter in: High performance mass storage and parallel I/O: technologies and applications, Jin H, Cortes T, Buyya R (eds). Wiley/IEEE Press, ISBN: ---
. Dickens P, Logan J () Towards a high performance implementation of MPI-IO on the Lustre file system. In: Proc GADA': grid computing, high-performance and distributed applications, Monterrey, Mexico
. Latham R, Ross R, Thakur R () The impact of file systems on MPI-IO scalability. In: Proc EuroPVM/MPI, Budapest
. Message Passing Interface Forum () MPI-: a message passing interface standard. HPCA (–)
. Prost JP, Treumann R, Hedges R, Jia B, Koniges A () MPI-IO/GPFS, an optimized implementation on top of GPFS. In: Proc Supercomputing, Denver, CO
. Prost JP, Treumann R, Hedges R, Koniges A, White A () Towards a high-performance implementation of MPI-IO on top of GPFS. In: Proc Euro-Par, Munich, Germany, pp –
. Rabenseifner R, Koniges A, Prost JP, Hedges R () The parallel effective I/O bandwidth benchmark: b_eff_io. Chapter in: Parallel I/O for cluster computing, Cérin C, Jin H (eds). Kogan Page Science, London, ISBN ---, pp –
. Thakur R, Gropp W, Lusk E () An abstract-device interface for implementing portable parallel-I/O interfaces. In: Proc th Symposium on the frontiers of massively parallel computation, Annapolis, MD, pp –
. Thakur R, Gropp W, Lusk E () On implementing MPI-IO portably and with high performance. In: Proc th Workshop on Input/Output in parallel and distributed systems, Atlanta, GA, pp –
. Worringen J, Träff JL, Ritzdorf H () Improving generic non-contiguous file access for MPI-IO. In: Proc th European PVM/MPI User's Group Meeting, Venice, Italy, Lecture Notes in Computer Science, vol , Springer-Verlag, pp –
. Yu H, Sahoo RK, Howson C, Almasi G, Castanos JG, Gupta M, Moreira JE, Parker JJ, Engelsiepen TE, Ross R, Thakur R, Latham R, Gropp W () High performance file I/O for the BlueGene/L supercomputer. In: Proc th International symposium on high-performance computer architecture (HPCA-), Austin, TX
MPP

Synonyms
SIMD (Single Instruction, Multiple Data) Machines

The Goodyear Massively Parallel Processor (MPP) [] was an SIMD machine built by Goodyear Aerospace Corporation, which was a subsidiary of the Goodyear Tire and Rubber Company. The machine was delivered in .

Related Entries
Cray XT and Cray XT Series of Supercomputers
Flynn's Taxonomy

Bibliography
. Batcher KE () Design of a massively parallel processor. IEEE Trans Comput C:–

Mul-T
Multilisp

Multicomputers
Clusters
Distributed-Memory Multiprocessor
Hypercubes and Meshes

Multicore Networks
Interconnection Networks

Multiflow Computer
Geoff Lowney
Intel Corporation, Hudson, MA, USA

Definition
The Multiflow Trace was a VLIW computer system, designed and manufactured by Multiflow Computer, Inc. of Branford, CT, USA. The Multiflow Trace was distinguished by its ability to issue as many as 7, 14, or 28 operations in each instruction, whose width could be up to 1,024 bits, depending upon the model. Because the machine was designed as a target for a trace scheduling compiler, users were not required to divide the code into parallel routines. The compiler, sometimes with guidance from the programmer, found instruction-level parallelism in ordinary code and constructed the very long instructions prior to the running of the program. Design of the Multiflow Trace began in , with the first systems delivered in . The last of the approximately systems sold was delivered in .

Discussion

Introduction
In the late s, instruction-level parallelism (ILP) was a research topic. In fact, Joseph (Josh) Fisher coined the term in , while a professor at Yale University. Dynamic ILP, where the parallelism is discovered by the hardware as the program runs, had been explored in supercomputing. Static ILP, where the microcoder, assembly language programmer, or compiler scheduled multiple operations into a single instruction, existed in microcoded machines and in some specialized accelerators. Several researchers had demonstrated that only limited amounts of instruction-level parallelism could be discovered in a basic block, which seemed a natural boundary for compile-time scheduling. In his dissertation research at New York University, Fisher demonstrated that extensive instruction-level parallelism could be discovered by scheduling across multiple basic blocks. His compiler technique, called trace scheduling, identified paths, or traces, through a program at compile time, unrolling loops if needed. Operations on the trace were scheduled across basic block boundaries and packed into long instruction words.
Fisher's ideas made it practical to build general-purpose computers designed to exploit instruction-level parallelism. No expensive dynamic hardware would be required. Fisher called these machines VLIW (Very Long Instruction Word) computers, and his research team at Yale developed the concepts behind the VLIW architecture and trace scheduling compilers. Fisher and two of his colleagues from Yale (John Ruttenberg and John O'Donnell) founded Multiflow in 1984. Compiler and hardware design started immediately, with the compiler written at Yale serving as an initial guide to the hardware design. Three years later, Multiflow delivered the Trace /, followed quickly by the /, and the /, /, and /, all running the Unix operating system, with code compiled by Multiflow's Trace Scheduling Compiler. During the years of their delivery, the Trace systems were consistently at or near the top in performance/price among scientific computers, despite their relatively modest cycle time.
Design Principles The Trace machines were designed as a target of a trace scheduling compiler. As much as possible, all scheduling decisions and resource allocation checks were made by the compiler, not the hardware. At compile-time, the compiler modeled how the program would behave on a cycle-by-cycle basis, created a schedule of its execution, and then at run-time, the hardware executed the schedule. For events that were not modeled by the compiler, such as an instruction cache miss, the entire execution pipeline was stalled until execution resumes. Compile-time modeling of program behavior was the essence of the trace scheduling design. This idea is also applicable to modern processors that support more dynamic behavior in hardware, and it has been used successfully in many compilers derived from the Multiflow compiler.
The Trace Machines Implementation. There were two generations of Multiflow computers. The series, first shipped in January, with a ns cycle time, was implemented in CMOS gate arrays and TTL logic, using Weitek CMOS floating point chips. The series followed in July, , using the Bipolar Integrated Technologies (BIT) ECL
floating point parts. The cycle time remained at ns. A third generation of Multiflow machines was under design when the company closed in March . It was an ECL semi-custom implementation targeting a ns cycle time. Basic architecture. An instruction on the Multiflow Trace machine consisted of several RISC-level operations packed together into a single wide instruction. The wide instructions directly controlled multiple register files, pipelined memory units, branch units, integer ALUs, and pipelined floating point units. Execution was fully synchronous, with a single program counter directing instruction fetch. All operations took a statically predictable number of machine cycles to complete. Pipelining was exposed at the instruction level. The processor was not scoreboarded, and machine resources could be oversubscribed. The memory system was interleaved. The compiler needed to avoid register conflicts, schedule the machine resources, and manage the memory system. The machines came in three widths: a seven-wide, which had a 256-bit instruction issuing seven operations; a 14-wide with a 512-bit instruction; and a 28-wide with a 1,024-bit instruction. The wider processors were organized as multiple copies of the seven-wide functional units; the seven-wide group of functional units was called a cluster. A block diagram of the / is below (Fig. ). There were four functional units per cluster: two integer units and two floating units. In addition, each cluster could compute a branch target. An instruction was divided into two beats. The integer ALUs could issue in the early and late beats of an instruction. The floating ALUs and branch unit could issue only in the early beat. Altogether, the cluster had the resources to issue seven operations for each instruction. Most integer ALU operations were complete in a single beat. The load pipeline was seven beats. On the series, the floating point pipelines were four beats. Branches issued in the early beat and the branch target was reached on the following instruction, effectively a two-beat pipeline. There were nine register files per cluster. Data going to memory was first moved to a store file. Branch banks were used to control conditional branches and the select operation.
Multiflow Computer. Fig. Block diagram of the / (the figure shows a 256-bit instruction word feeding, per cluster, a branch unit with 6 × 1 branch banks, two integer ALUs with 32 × 32 integer register files, two FALU/FMUL units with 16 × 64 floating register files, and a 16 × 64 store register file, all connected to an interleaved memory of eight cards of eight banks each, 64 banks and 512 MB in total)
The instruction cache held K instructions. Each cluster had a slice of the cache, holding the instruction fields which control the cluster. There was no data cache. The memory system supported MB of physical memory with up to way interleaving. The virtual address space was GB. There were two IO processors. Each supported a MB/s DMA channel to main memory and two MB/s VME busses. Code sample. The code sample below contains two instructions of / code, extracted from the compiled inner loop of the × LINPACK benchmark. Each operation is listed on a separate line. The first two fields identify the cluster and the functional unit
to perform the operation, the remainder of the line describes the operation. Note the destination address is qualified with a register-bank name (e.g., sb.r); the ALUs could target any register bank in the machine (with some restrictions). There was extra latency in reaching a remote bank.

instr
  cl0 ialu0e st.64    sb1.r0,r2,17#144
  cl0 ialu1e cgt.s32  li1bb.r4,r34,6#31
  cl0 falu0e add.f64  lsb.r4,r8,r0
  cl0 falu1e add.f64  lsb.r6,r40,r32
  cl0 ialu0l dld.64   fb1.r4,r2,17#208
  cl1 ialu0e dld.64   fb1.r34,r1,17#216
  cl1 ialu1e cgt.s32  li1bb.r3,r32,zero
  cl1 falu0e add.f64  lsb.r4,r8,r6
  cl1 falu1e add.f64  lsb.r6,r40,r38
  cl1 ialu0l st.64    sb1.r2,r1,17#152
  cl1 ialu1L add.u32  lib.r32,r36,6#32
  cl1 br true and r3 L23?3
  cl0 br false or r4 L24?3;

instr
  cl0 ialu0e dld.64   fb0.r0,r2,17#224
  cl0 ialu1e cgt.s32  li1bb.r3,r34,6#30
  cl0 falu0e mpy.f64  lfb.r10,r2,r10
  cl0 falu1e mpy.f64  lfb.r42,r34,r42
  cl0 ialu0l st.64    sb0.r4,r2,17#160
  cl1 ialu0e dld.64   fb0.r32,r1,17#232
  cl1 ialu1e cgt.s32  li1bb.r4,r35,6#29
  cl1 falu0e mpy.f64  lfb.r10,r0,r10
  cl1 falu1e mpy.f64  lfb.r42,r32,r42
  cl1 ialu0l st.64    sb0.r6,r1,17#168
  cl1 ialu1l bor.32   ib0.r32,zero,r32
  cl1 br false or r4 L25?3
  cl0 br true and r3 L26?3;
Data types. The natural data types of the machine were 32-bit signed and unsigned integers, 32-bit pointers, 32-bit IEEE-format single precision, and 64-bit IEEE-format double precision. 16-bit integers and 8-bit characters were supported with extract and merge operations; bit strings with shifts and bitwise logicals; long integers by add with carry; and booleans with normalized logicals. There was no hardware support for extended IEEE precision, denormalized numbers, or gradual underflow. Loads and stores were either 32 or 64 bits. Natural alignment was required for high performance; misaligned references were supported through trap code, with a substantial performance penalty. Memory was byte addressed. Byte addressing eased the porting of C programs from byte-addressed processors such as the VAX and the Motorola 68000. The low bits of the address were ignored in a load or a store, but were read by extract and merge operations. Memory system and data paths. The Multiflow Trace had a two-level interleaved memory hierarchy exposed
to the compiler. All memory references went directly to main memory; there was no data cache. There were eight memory cards, each of which contained eight banks. Each bank could hold MB, for a total capacity of MB. Memory was interleaved across the cards and then the banks. The low byte of an address determined its bank: bits – were ignored, bits – selected a card, and bits – selected a bank. Data was returned from memory on a set of global busses. These busses were shared with moves of data between register files on different clusters; to maintain full memory bandwidth on a /, the number and placement of data moves needed to be planned carefully. Each level of interleaving had a potential conflict. ●
Card/bus conflict. Within a single beat, all references were required to be distinct cards and to use distinct busses. If two references conflicted on a card or a bus, the result was an undefined program error. ● Bank conflicts. A memory bank was busy for four beats from the time it was accessed. If another reference touched the same bank within the four-beat window, the entire machine stalled. To achieve maximum performance, the compiler was required to schedule successive references to distinct banks. A / could generate four references per beat, and, if properly scheduled, the full memory bandwidth of the machine could be sustained without stalling. Global resources. In addition to functional units and register banks, the Trace machines had a number of global shared resources that needed to be managed by the compiler. ●
Register file write-ports. Each integer and floating register file could accept at most two writes per beat, one of which could come from a local ALU. Each branch bank could accept one write per beat. Each store file could accept two writes per beat. ● Global busses. There were ten global busses; each could hold a distinct -bit value per beat. The hardware contained routing logic; the compiler needed only to guarantee that the number of busses was not oversubscribed in a beat. ● Global control. There was one set of global controller resources, which controlled access to link registers used for subroutine calls and indirect branches.
Integer units. The integer units executed a set of traditional RISC operations. There were a number of features added to support trace scheduling. ●
The most common code optimization in trace scheduling was to move operations above a conditional branch and execute them speculatively, before the branch condition was known. To support speculative execution, there was a set of dismissible load operations. These operations performed a normal load and also set a flag to enable the exception handler to know the load was being performed speculatively.
● All operations that computed booleans were invertible. Invertible branches enabled the trace scheduler to lay out a trace as straight-line code by inverting branch conditions as necessary.
● A three-input, one-output select operation (a = b ? c : d) enabled many short forward branches to be mapped into straight-line code.
A conditional branch was a two-operation sequence. An operation targeted a branch bank register, and then the branch read the register. Separate register files for the branch units relieved pressure on the integer register files and provided additional operand bandwidth to support the simultaneous branch operations. The two integer ALUs per cluster were asymmetric; only one could issue memory references. Due to limits on the size of the gate arrays, each integer ALU had its own register file. This fact, coupled with the low latency of integer operations, made it difficult for the instruction scheduler to exploit parallelism in integer code. The cost of moving data between register files often offset the gains of parallelism. Floating units. Each floating unit had a register file of 64-bit registers available to the compiler. To keep the pipelines full, nine distinct destination registers were required for operations in flight. This left only six registers to hold variables, common sub-expressions, and the results of operations that were not immediately consumed. There was no pipelined floating move. A move between floating registers took one beat and consumed a register write-port resource. This could prevent another floating point operation issued earlier from
using the same write-port; in some situations, a floating move could lock out two floating point operations. Instruction encoding. The instruction encodings were large for code that did not use all of the functional units in each instruction. A no-op operation (NOP) indicated an unused functional unit. To save space in memory and on disk, object code was stored with NOP operations eliminated. Instructions were grouped in blocks of four, and the non-NOP operations were stored with a preceding mask word that indicated which NOPs have been eliminated. When an instruction was loaded into the instruction cache, it was expanded into its full width. To save space in the instruction cache, an instruction could encode a multi-beat NOP, which instructed the processor to stall for the specified number of beats before executing the following instruction. The large instruction format provided a generous number of immediate operands. The compiler used immediates heavily. Constants were never loaded from memory; they were constructed from the instruction word. Double precision constants were pieced together out of two immediate fields. The global pointer scheme used by most RISC machines was not required. Trap code. Trap hardware and trap code supported virtual memory by trapping on references to unmapped pages and on stores to write-protected pages. To prevent unwarranted memory faults on speculative loads, the compiler used the dismissible load operation. If a dismissible load trapped, the trap code would not signal an exception. If required, a translation buffer miss or a page fault was serviced, and, if successful, the memory reference was replayed by the trap code and the value was returned. Otherwise, a NAN or integer zero was returned. In all cases, computation continued. NANs were propagated by the floating units, and checked only when they were written to memory or converted to integers or booleans. Correct programs would run correctly with speculative execution, but an incorrect program may have missed an exception that it would have signaled if compiled without speculative execution. The hardware optionally supported precise floating exceptions, but the compiler could not move floating operations above conditional branches if this mode was in use. This mode was used when compiling for debugging.
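The following sketch (not part of the original entry, and not Multiflow's actual instruction format) illustrates the NOP-compression idea described above; the block size, slot count, mask layout, and NOP encoding are assumptions made for this example.

/* Hypothetical sketch of NOP compression: a block of four instructions,
 * each with OPS_PER_INSTR operation slots, is stored as a mask word plus
 * only the non-NOP operations; expansion re-creates the missing NOPs. */
#include <stdint.h>

#define INSTRS_PER_BLOCK 4
#define OPS_PER_INSTR    7      /* e.g., a seven-wide Trace cluster */
#define NOP_ENCODING     0u     /* assumed encoding of a no-op */

void expand_block(uint32_t mask,           /* bit i set: slot i holds a real op */
                  const uint32_t *packed,  /* the non-NOP operations, in slot order */
                  uint32_t full[INSTRS_PER_BLOCK][OPS_PER_INSTR])
{
    int slot = 0;   /* index into the packed operation stream */
    for (int i = 0; i < INSTRS_PER_BLOCK; i++) {
        for (int j = 0; j < OPS_PER_INSTR; j++) {
            int bit = i * OPS_PER_INSTR + j;
            if (mask & (1u << bit))
                full[i][j] = packed[slot++];  /* real operation */
            else
                full[i][j] = NOP_ENCODING;    /* re-create the eliminated NOP */
        }
    }
}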
The trap code supported access to misaligned data by piecing together the referenced datum from multiple 32-bit words. In doing so, it placed 8- and 16-bit quantities in the correct place in a 32-bit word so that extracts and merges would work correctly.
Trace Scheduling Compiler The Multiflow compiler supported standard C and FORTRAN and discovered instruction-level parallelism in ordinary code. The compiler created a schedule of execution for the program being compiled, modeling the behavior of the program as much as possible at compile-time. It addressed several challenges posed by the Multiflow Trace machines. ●
Find Parallelism. There were a large number of pipelined functional units, requiring from – data independent operations in flight to fill the machine. This large amount of instruction-level parallelism required scheduling beyond basic blocks and also a strategy for finding parallelism across loop iterations. ● Model the machine. Machine resources could be oversubscribed, and the pipelines used resources in every cycle. The compiler needed to model precisely the cycle-by-cycle state of the machine even in situations where such a model was not critical for performance. ● Schedule the memory system. There was an interleaved memory system that was managed by the compiler. Card conflicts caused program error; bank conflicts stalled the program. ● Schedule data motion. Each functional unit had its own register file. It cost an operation to move a value between registers, although remote register files could be targeted directly at the cost of an extra beat of latency. The compiler had a traditional three-phase design: front end, optimizer, and back end. The optimizer focused both on reducing computation and exposing instruction-level parallelism. The back end was organized around the trace scheduling algorithm. Front ends. The compiler supported user-level directives to enable the programmer to pass additional information to the compiler. Loop unrolling directives specified how a loop should be unrolled. Inline directives selected
functions to be inlined. Memory-reference directives asserted facts about addresses to aid memory alias analysis. Trace-picking directives specified branch probabilities and loop trip counts. In addition, the front end could instrument a program to count basic block executions. The instrumentation was saved in a database which could be read back on subsequent compilations. This information was used to guide trace selection. The compiler supported the Berkeley Unix run-time environment. Care was given to the data structure layout rules to ease porting from the VAX and the Motorola 68000. Despite the unusual architecture of the Multiflow Trace, it was easier to port programs from BSD VAX or Motorola systems to the Trace than to many contemporary RISC-based systems. The optimizer. The goal of the optimizer was to reduce the amount of computation the program would perform at run-time and to increase the amount of parallelism for the trace scheduler to exploit. Computation was reduced by removing redundant operations or rewriting expensive ones. Parallelism was increased by removing unnecessary control and data dependencies and by unrolling loops to expose parallelism across loop iterations. The Multiflow compiler accomplished these goals with standard Dragon-book-style optimization technology enhanced with a powerful memory-reference disambiguator. The optimizer was designed as a set of independent, cooperating optimizations that shared a common set of data structures and analysis routines. The analysis routines computed control flow (dominators, loops) and data flow (reaching defs and uses, live variables, reaching copies). In addition, the disambiguator computed symbolic derivations of address expressions. Each optimization recorded what analysis information it needed, and what information it destroyed. The order of optimizations was quite flexible; in fact, it was controlled by a small interpreter. The order of optimization was organized around two invocations of loop unrolling, to expose the maximum amount of parallelism. The order used for full optimization was as follows.
Basic optimizations
  Expand function entries and returns
  Find register variables
  Expand memory operations
  Eliminate common sub-expressions
  Propagate copies
  Remove dead code
  Rename temps
  Transform ifs into selects
Prepare for first loop unrolling
  Generate automatic assertions
  Move loop invariant
  Find register memory references
  Eliminate common sub-expressions
  Transform ifs into selects
  Find register expressions
  Remove dead code
First unroll
  Unroll and optimize loops
  Rename temps
  Propagate copies
  Simplify induction variables
  Eliminate common sub-expressions
  Propagate copies
  Remove dead code
Second unroll
  Unroll and optimize loops
  Rename temps
Prepare for Phase 3
  Expand function calls
  Walk graph and allocate storage
  Analyze for dead code removal
  Remove assertions
  Remove dead code
  Expand remaining IL operations
  Propagate copies
  Remove dead code
  Rename temporaries
Loops were unrolled by copying the loop body including the exit test. Unlike most machines, the Trace had a large amount of branch resource, and there was no advantage to removing exit branches. By leaving the branches in, the compiler avoided the short-trip-count penalty caused by preconditioning. In addition, loops with data-dependent loop exits that could not be preconditioned (e.g., while loops) were unrolled and optimized across iterations. For loops with constant trip count, all but one exit test was removed, and short-trip-count loops were unrolled completely.
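As a hedged illustration (not part of the original entry) of unrolling with the exit tests left in place, a simple loop unrolled by four might look as follows; the loop body itself is an arbitrary assumption.

/* Illustrative sketch: unrolling by four while keeping every exit test,
 * so no prologue ("preconditioning") loop is needed for short or
 * unknown trip counts. */
void saxpy_unrolled(int n, float a, const float *x, float *y)
{
    int i = 0;
    for (;;) {
        if (i >= n) break;  y[i] += a * x[i];  i++;
        if (i >= n) break;  y[i] += a * x[i];  i++;
        if (i >= n) break;  y[i] += a * x[i];  i++;
        if (i >= n) break;  y[i] += a * x[i];  i++;
    }
}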
Loops were unrolled heavily. A loop needed to be unrolled enough to expose sufficient parallelism within the loop body to enable the instruction scheduler to fully utilize the machine. Unrolling was controlled by a set of heuristics which measured the number of operations in the loop, the number of internal branches, and the number of function calls. A loop was unrolled until either the desired unroll amount was reached or one of the heuristic limits was exceeded. The default unrolling in FORTRAN for a Trace -wide was ; at the highest level of optimization, the compiler unrolled by . The compiler did not perform software pipelining, which would have reduced the need for such large unrollings. The large size of the Trace instruction cache limited the benefits of software pipelining. The back end. The output of optimizer was a flow graph of operations lowered to machine level. The back end performed functional-unit assignment, instruction scheduling, and register allocation. The work was divided into four modules: the trace scheduler, which managed the flow graph and assured inter-trace correctness; the instruction scheduler, which scheduled each trace and assured intra-trace correctness; the machine model, which provided a detailed description of the machine resources; and a disambiguator, which performed memory-reference analysis. Trace scheduler. The trace scheduler performed the following steps: Step . The trace scheduler estimated how many times each basic block in the flow graph would be executed. The execution estimates were calculated from loop trip counts and the probabilities of conditional branches. The estimates were obtained from either a database collected during previous executions of the program, user directives, or heuristics. Step . The trace scheduler performed the following loop until the entire flow graph has been scheduled. ●
Using the execution estimates as a guide, the trace scheduler picked a trace (a sequence of basic blocks) from the flow graph. The trace scheduler first selected a seed, the yet-to-be-scheduled block with the highest execution estimate. The trace was grown forward (in the direction of the flow graph) and then backward from the seed; in each step, the trace scheduler selected a basic block that satisfied the
M
M
Multiflow Computer
current trace-picking heuristic. If no basic block satisfied the current heuristic, the trace ended. Traces always ended when the trace scheduler reached an operation that was already scheduled or an operation that was already on the trace. Also, traces never crossed the back edge of a loop. ● The trace scheduler passed the trace to the instruction scheduler. The instruction scheduler returned a machine language schedule. The instruction scheduler is discussed in more detail below. ● The trace scheduler replaced the trace with the schedule in the flow graph and, if necessary, added copies of operations to compensate for code motions past basic block boundaries. Compensation code was necessary if an operation moved below a branch out of the trace or above the target of a branch into the trace. Speculative execution, moving an operation above a branch, did not produce compensation code. This was the most common code motion in the Multiflow compiler. High priority operations from late in the trace were moved above branches and scheduled early in the trace. The instruction scheduler would perform such a move only if it was safe: An operation could not move above a branch if it wrote memory or if it set a variable that was live on the off-trace path. The hardware provided special dismissible operations and support for suppressing or deferring the exceptions generated by speculative operations. Step . The trace scheduler emitted the machine language schedules in depth first order. This ordering gave the effect of profile-guided code layout, since traces were selected along the most frequently traveled paths in the program. The trace scheduler also provided the instruction scheduler the global context for a trace. The instruction scheduler scheduled one trace at a time, but it required information about neighboring traces for correctness and to optimize the performance on intertrace transitions. The functional-unit pipelines on the Multiflow computer used machine resources in every cycle; the trace scheduler communicated to the instruction scheduler the state of the machine pipeline on entry to the trace. Similarly, it passed information about memory references in flight, to avoid bank stalls.
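The following sketch (not part of the original entry) illustrates the greedy trace-picking step in a simplified form; the data structures, the restriction to forward growth, and the fixed trace limit are assumptions made for this illustration.

/* Hypothetical sketch of trace picking: start from the unscheduled block
 * with the highest execution estimate, grow forward along the most
 * likely successor, and stop at scheduled blocks, blocks already on the
 * trace, or loop back edges.  (The on_trace flags would be reset once
 * the trace has been scheduled.) */
#include <stddef.h>

#define MAX_TRACE 64

struct block {
    double estimate;            /* execution estimate from profile or heuristics */
    int scheduled;              /* already part of an emitted schedule? */
    int on_trace;               /* already on the trace being grown? */
    struct block *likely_succ;  /* most probable successor, NULL at exits */
    int succ_is_back_edge;      /* successor reached via a loop back edge? */
};

static struct block *pick_seed(struct block *blocks, int n)
{
    struct block *seed = NULL;
    for (int i = 0; i < n; i++)
        if (!blocks[i].scheduled &&
            (seed == NULL || blocks[i].estimate > seed->estimate))
            seed = &blocks[i];
    return seed;
}

int pick_trace(struct block *blocks, int n, struct block *trace[MAX_TRACE])
{
    int len = 0;
    struct block *b = pick_seed(blocks, n);

    /* Grow forward from the seed; growing backward is analogous. */
    while (b != NULL && len < MAX_TRACE && !b->scheduled && !b->on_trace) {
        b->on_trace = 1;
        trace[len++] = b;
        if (b->succ_is_back_edge)   /* never cross a loop back edge */
            break;
        b = b->likely_succ;
    }
    return len;                     /* number of blocks on the trace */
}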
The trace scheduler and the instruction scheduler also exchanged the register bindings at each trace entry and exit point, enabling incremental register allocation with a global view. Instruction scheduler. The instruction scheduler transformed a trace of operations into a schedule of wide instructions. For each operation, the instruction scheduler assigned registers for the operands, assigned a functional unit for the operation, and placed the operation in a wide instruction. It used a three-step algorithm. Step . The instruction scheduler built a data precedence graph (DPG) from the trace. Scheduling a trace rather than a basic block introduced very little complexity in the instruction scheduler. When building the DPG, edges were added to constrain some operations from moving above branches. After scheduling, the trace scheduler compensated for any global code motions performed by the instruction scheduler. Step . The instruction scheduler walked the DPG and assigned operations to functional units and values to register banks. The scheduler made a tradeoff between parallelism and the cost of global data motion. To implement this tradeoff, the scheduler divided the DPG into components that contained a relatively small amount of parallelism and a large amount of shared data. The scheduler did a bottom-up traversal of the DPG, assigning functional units and register banks. The scheduler imposed a high cost for the use of more than one identical resource (e.g., a floating point unit) for a particular component in the DPG. This heuristic kept the members of the same component together in the same subset of the machine and achieved parallelism by spreading the components around the machine. Step . The instruction scheduler performed list scheduling, creating the schedule and allocating registers. The scheduler created the schedule in forward order, starting with the first instruction. The scheduler maintained a list of data-ready operations and filled an instruction with the highest priority operations from the data-ready list, and then advanced to the next instruction. On the Multiflow Trace machines, long pipelines and multiple operations per instruction made floating point registers a critical resource for generating high performance code. The instruction scheduler
Multiflow Computer
performed register allocation, enabling the registers to be scheduled similarly to the other resources. List scheduling was well suited to integration with register allocation. At any point during the scheduling process, each of the registers was either occupied or available. When checking for resources to schedule a particular operation, the scheduler also included a check for a free register for operation’s result. If all the resources were available to schedule the operation, the register, along with the other resources, was reserved. When there was no free register, the scheduler would either free a register or delay the current operation. Registers could be freed by spilling, by rematerialization, or by removing an earlier operation from the schedule. Heuristics for this decision were very important to performance. The scheduler also scheduled the two-level interleaved memory system. The scheduler organized the memory references in a trace into equivalence classes based on the compiler’s knowledge of their offsets relative to each other; the classes were formed by querying the disambiguator. Within an equivalence class, the scheduler understood the potential bank and card conflicts; between two different classes, nothing was known. The scheduler would batch memory references by equivalence class to avoid potential stalls. Batching groups of references increased register pressure, and heuristics were important to control this tradeoff. Machine model. The Trace machines had limited hardware resource management and relied on the compiler to allocate machine resources and check for oversubscription. The compiler contained a detailed model of the machine. The model contained the set of operations and the mapping of each operation to functional units, register banks, and immediate fields. The latencies and resources required by each stage of an operation were represented. The model contained a graph of connections between machine elements and the latencies and resources required to traverse the graph. The model was organized to provide efficient access to the scheduler about the resources needed to schedule an operation. Disambiguator. The compiler looked for fine-grained parallelism along a trace and within the body of an unrolled loop. The scheduler typically wanted to know if a load in the current iteration could move above a store in the previous iteration. This is a different question
than the one asked by a parallelizing compiler, which is interested in knowing if there exists a conflict between a load and a store across any of the loop iterations. Memory analysis was performed by a disambiguator, which tried to determine whether two memory references referred to the same location or to the same memory bank. The disambiguator derived a symbolic equation of the address used in a memory reference; the terminals in the equation were constants and definitions (or sets of definitions) of variables in the program. Each derivation was normalized to a sum of products. To answer a question about two memory references, the disambiguator subtracted the two derivations and performed a GCD test. An assertion facility allowed the programmer to pass additional information to improve accuracy. The disambiguator also knew basic facts about how a variable was allocated and declared, and it used this information in its analysis as well. For example, a pointer cannot reference a variable whose address is never captured by the program. Calling sequence. Multiflow used a pure caller-saves register partition; the values of registers were not preserved across a procedure call. Compiling for parallelism requires the use of many more registers than are used in sequential compilation, and typically register lifetimes do not cross function calls. On a wide machine, the saving and restoring of caller-saved registers could be intermixed with other operations before and after a call, and were frequently scheduled with minimal execution cost. Trace scheduling selected traces in priority order, enabling the saving and restoring of registers to be moved out of the most frequently executed paths in the program. Multiflow used all of the registers in the first cluster for passing arguments and returning values. A fixed number of integer values and double-precision floating-point values could be passed or returned in registers. Aggregates (e.g., C structures) could be passed and returned in registers. The large number of registers available for procedure linkage enabled a fast linkage to the -and -at-a-time math functions. For C programs, the compiler passed arguments in both memory and registers; the called procedure read the arguments from registers, or from memory if required. The extra cost of passing arguments in both registers and memory was surprisingly small. The access
to arguments on procedure entry was often on the critical path, and the arguments could be read directly from registers. The storing of arguments at a call site was often not on the critical path, and it could be scheduled alongside other operations at little cost.
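To make the GCD test concrete, the following is a minimal Python sketch of the kind of check the disambiguator performed. The data representation and helper names are illustrative only, not Multiflow's actual data structures, and the real disambiguator also folded in programmer assertions and allocation information.

from math import gcd
from functools import reduce

# A derived address is kept as a normalized "sum of products": a constant
# term plus integer coefficients on symbolic terminals (variable definitions
# the disambiguator cannot evaluate). Representation: (const, {terminal: coeff}).

def subtract(d1, d2):
    """Symbolic difference of two address derivations."""
    const = d1[0] - d2[0]
    coeffs = dict(d1[1])
    for t, c in d2[1].items():
        coeffs[t] = coeffs.get(t, 0) - c
    return const, {t: c for t, c in coeffs.items() if c != 0}

def may_alias(d1, d2):
    """GCD test: can the two addresses ever be equal, assuming the
    symbolic terminals take arbitrary integer values?"""
    const, coeffs = subtract(d1, d2)
    if not coeffs:               # purely constant difference
        return const == 0        # alias iff the offsets are identical
    g = reduce(gcd, (abs(c) for c in coeffs.values()))
    # const + sum(c_i * v_i) == 0 has an integer solution only if g divides const.
    return const % g == 0

# Example: A[4*i] vs A[4*j + 2] -> 4*i - 4*j = 2 has no integer solution,
# so the two references can never touch the same location.
a1 = (0, {"i": 4})
a2 = (2, {"j": 4})
print(may_alias(a1, a2))   # False: the scheduler may reorder them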
Multiflow’s Influence on the Industry The Multiflow compiler technology was purchased by Intel, Hewlett-Packard, Digital Equipment Corporation, Fujitsu, NEC, Hughes, HAL Computer, and Silicon Graphics, as well as by several research institutions, including MIT and the University of Washington. Some purchased it for its highly capable compiler framework, which could be adapted to generate code for non-VLIW systems. Others bought it to aid in the development of VLIW architectures or to generate production compilers to accompany VLIW products. In the past two decades, several notable compiler product lines have been derived from the Multiflow compiler technology, in particular the compiler products from Intel, NEC, and Fujitsu. These compilers have received significant ongoing investment, and the original Multiflow implementation has been largely rewritten. Multiflow was an important proof-of-concept of VLIW technology. The successful implementation of Multiflow’s systems demonstrated that VLIW technology was viable, something that was not widely believed before Multiflow’s engineering success. Texas Instruments had written of their skepticism before Multiflow; eventually, they were to develop VLIW processors that have had a strong position in cellular communications. VLIWs from various manufacturers are now commonplace in digital cameras, multifunction printers, network communication devices, computer graphics accelerators, televisions, digital video recorders, and other set-top boxes. The Trace machines were a direct influence on the STMicroelectronics ST-’s architecture, developed by Fisher’s team at HP. By , nearly million ST- parts were reported to be deployed in printers and video devices.
Related Entries Instruction-Level Parallelism Trace Scheduling VLIW Processors
Bibliographic Notes and Further Reading Multiflow technology was derived from Josh Fisher’s research. As a graduate student at New York University, he developed the idea of trace scheduling [, ]. As a professor at Yale, he extended the trace scheduling idea, designing a VLIW architecture [] and a practical compiler []. The dissertations of John Ellis [] and Alex Nicolau [] describe the compiler in detail. Ellis’s dissertation won the ACM Doctoral Dissertation Award, and his research compiler was a starting point for the work at Multiflow. The first paper on Multiflow was published at ASPLOS in [], the same year the first Trace machines shipped. A retrospective on the never-completed series is presented in []. A thorough description of the Multiflow compiler was published in a special issue of the Journal of Supercomputing devoted to instruction-level parallelism []. An experimental evaluation of the Multiflow compiler is presented in [].
Bibliography . Colwell RP, Nix RP, O’Donnell JJ, Papworth DB, Rodman PK () A VLIW architecture for a trace scheduling compiler, ACM SIGARCH Computer Architecture News (), October , pp –. A revised version was published in IEEE Trans Comput ():– . Colwell RP, Hall WE, Joshi CS, Papworth DB, Rodman PK () Architecture and implementation of a VLIW supercomputer. In: Proceedings, IEEE/ACM supercomputing , October . IEEE Computer Society Press, Los Alamitos, pp – . Ellis JR () Bulldog: a compiler for VLIW architectures. MIT Press, Cambridge, MA. Also Ph.D. Thesis, Yale University, New Haven, Conn., February . Fisher JA () The optimization of horizontal microcode within and beyond basic blocks: An application of processor scheduling with resources. Ph.D. Thesis, New York University, New York . Fisher JA () Trace scheduling: a technique for global microcode compaction. IEEE Trans Comput C-():– . Fisher JA () Very long instruction word architectures and the ELI-. In: Proceedings of the th annual international symposium on computer architecture, Stockholm, Sweden. ACM, New York, pp – . Fisher JA, Ellis JR, Ruttenberg JC, Nicolau A () Parallel processing: a smart compiler and a dumb machine. In: Proceedings of ACM SIGPLAN ’ symposium on compiler construction, Montreal, June . ACM, New York, pp – . Fisher JA, Faraboschi P, Young C () Embedded computing: a VLIW approach to architecture, compilers, and tools. Morgan Kaufmann, San Francisco
. Freudenberger S, Ruttenberg J () Phase ordering of register allocation and instruction scheduling. In: Giegerich R, Graham SL (eds) Code generation – concepts, tools, techniques. Springer, London, pp – . Freudenberger SM, Gross TR, Lowney PG () Avoidance and suppression of compensation code in a trace scheduling compiler. ACM Trans Prog Lang Syst ():– . Lowney PG, Freudenberger S, Karzes T, Lichtenstein W, Nix R, O’Donnell J, Ruttenberg J () The Multiflow trace scheduling compiler. J Supercomput (–):– . Nicolau A, Fisher J () Using an oracle to measure parallelism in single instruction stream programs. In: The th annual microprogramming workshop, October . IEEE, New York, pp – . Nicolau A () Parallelism, memory anti-aliasing and correctness for trace scheduling compilers. Ph.D. Thesis, Yale University, New Haven, CT
Multifrontal Method
Patrick Amestoy, Alfredo Buttari, Iain Duff, Abdou Guermouche, Jean-Yves L’Excellent, Bora Uçar
Université de Toulouse ENSEEIHT-IRIT, Toulouse cedex, France; Science & Technology Facilities Council, Didcot, Oxfordshire, UK; Université de Bordeaux, Talence, France; ENS Lyon, Lyon, France
Synonyms
Linear equations solvers
Definition
The multifrontal method is a direct method for solving systems of linear equations Ax = b, when A is a sparse matrix and x and b are vectors or matrices. The multifrontal method organizes the operations that take place during the factorization of sparse matrices in such a way that the entire factorization is performed through partial factorizations of a sequence of dense and small submatrices. It is guided by a tree that represents the dependencies between those partial factorizations. In the following, the multifrontal method is formulated first for finite-element analysis and later generalized to assembled sparse matrices.
The Multifrontal Method The multifrontal method is a direct method so that if the aim is to solve the equations Ax = b, () where A is sparse, then this is done by performing the matrix factorization PAQ = LU, () where P and Q are permutation matrices, L is a lower triangular matrix, and U is an upper triangular matrix. The solution to Eq. is then obtained through a forward elimination step Ly = Pb, followed by the backsubstitution UQT x = y. If the matrix A is symmetric, then the factorization PAPT = LDLT () is used, where D is a block diagonal matrix with blocks of order 1 or 2 on the diagonal. The blocks of order 2 are required for stability when factorizing symmetric indefinite matrices. In the positive definite case, D can be diagonal and the Cholesky factorization PAPT = LLT can be used, where the diagonal entries of L are the square roots of the entries in D. However, the Cholesky factorization is usually not used even in the positive definite case. This is more to avoid problems when the matrix is close to indefinite than to avoid taking square roots. As mentioned later, the multifrontal method can also be used to perform a matrix factorization as a product of an orthogonal and an upper triangular matrix.
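As a concrete illustration of the solve phases just described, the following dense NumPy/SciPy sketch performs the forward elimination Ly = Pb and the backsubstitution Ux = y (taking Q = I). A real multifrontal code obtains P, Q, L, and U from a sparse factorization guided by the assembly tree; only the algebra of the solve step is shown here.

import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)

p, L, U = lu(A)          # SciPy returns A = p @ L @ U, so the P above is p.T
Pb = p.T @ b
y = solve_triangular(L, Pb, lower=True)   # forward elimination: L y = P b
x = solve_triangular(U, y)                # backsubstitution:    U x = y
assert np.allclose(A @ x, b)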
Finite-Element Analysis-Based Formulation The multifrontal method can be regarded as a generalization of the frontal method. The frontal method was developed for the solution of sparse linear systems arising from finite-element problems. In such problems, the system matrix is defined as the sum A = ∑ℓ A[ℓ] () of elemental matrices A[ℓ] that have nonzeros only at the intersection of the rows and columns corresponding to the variables associated with the element ℓ. It is normal
to store the matrices A[ℓ] as small dense matrices whose rows and columns are indexed by the corresponding rows and columns of the matrix A. Thus, the summation in Eq. should be thought of as an “expanded sum.” The operation in () is commonly referred to as the “assembly” and involves elementary summations aij = aij + aij[ℓ] ().
A value aij is said to be fully summed (or fully assembled) if all the operations () have been performed, and a variable is said to be fully summed if all the values in its row and its column are fully summed. The basic Gaussian elimination step aij = aij − (aik akj)/akk ()
only requires that variable k is fully summed. The frontal method proceeds by successive steps where, at each step, a new element is assembled and variables are eliminated as soon as they become fully assembled. The main characteristic of the frontal method is that, at each step, all the operations are performed on a dense matrix, called a frontal matrix, whose rows and columns correspond to variables in the elements that have been assembled (commonly referred to as active variables). There are several benefits from using this approach. First, the elimination operations on a frontal matrix can be performed using efficient dense matrix kernels (Level-3 BLAS subroutines). Second, the stability of the factorization can be improved by applying pivoting techniques within the fully assembled variables of a frontal matrix. Third, the memory required for each step of the frontal method need only be enough to hold the frontal matrix, making the method well suited to limited-memory environments. Fourth, it is relatively easy to use out-of-core techniques so that neither the factors nor the matrix need be held in memory. Assuming that A is decomposed using an LU factorization (as in Eq. ), Fig. illustrates one step of the frontal method, as element k is assembled. The dark-shaded area represents the frontal matrix whose variables are split into assembled (in Ak) and nonassembled variables. As the fully assembled variables are eliminated, the corresponding rows and columns end up in the U and the L factors, respectively, to form Uk+1 and Lk+1. The next element, k+1, is then assembled with Sk (commonly referred to as a contribution block) which
Multifrontal Method. Fig. One step of the frontal method
corresponds to the Schur complement resulting from the elimination of the fully assembled variables. The name “frontal” stems from the observation that the set of active variables forms a front that sweeps the problem domain element by element. Figure b shows the assembly and elimination process of the frontal method when applied to the regular finite-element problem depicted in Fig. a, with the element ordering a, b, c, d. The A[ℓ] submatrix associated with an element ℓ may contain fully assembled variables before being summed as in Eq. ; this corresponds to the case where a variable only appears in one element, for example, a variable corresponding to an interior node of a finite element. In this case, it is possible to eliminate the fully assembled variables before assembling the element; the A[ℓ] term in Eq. can then be replaced by the resulting Schur complement. This is known as static condensation. The multifrontal method takes this observation to a higher level in order to achieve a better exploitation of the sparsity structure of the matrix. Building upon the idea that variables that are internal to different subdomains (by which we mean elements or groups of elements) can be assembled and eliminated independently, this method basically amounts to having multiple fronts so that the set of variables that are fully assembled within a front does not overlap with the set
of active variables in any other front. In other words, if the frontal method corresponds to the left-to-right evaluation of the summation (), where the order of the elements is critical for the performance, the multifrontal method corresponds to any valid bracketing of the same summation. The multifrontal method is guided by a tree structure commonly referred to as an assembly tree, which expresses the way elements are assembled into the fronts. Considering the simple finite-element problem in Fig. a, the assembly tree corresponding to the bracketing (((A[a] + A[b] ) + A[c] ) + A[d] ) is shown in Fig. a; note that in this case the multifrontal method is equivalent to a frontal method (as shown in Fig. b) with static condensation. However, thanks to the higher flexibility of the multifrontal method, different bracketing schemes can be used to reduce the memory required and the execution time. For example, if a nested dissection ordering is used on the simple problem in Fig. a, the bracketing ((A[a] + A[b] ) + (A[c] + A[d] ))
would result, and thus the assembly tree in Fig. b would be obtained.
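The core operation at each front can be sketched in a few lines of NumPy: eliminate the fully summed block of a dense frontal matrix and return the Schur complement that becomes the contribution block for the parent. This simplified sketch assumes a symmetric positive definite front and performs no pivoting; the function name and layout are illustrative only.

import numpy as np

def partial_factorize(front, nelim):
    """Eliminate the first `nelim` (fully summed) variables of a dense
    frontal matrix; return the factor pieces and the contribution block
    (Schur complement) to be assembled at the parent node."""
    F11 = front[:nelim, :nelim]    # fully summed block
    F12 = front[:nelim, nelim:]
    F22 = front[nelim:, nelim:]

    L11 = np.linalg.cholesky(F11)            # F11 = L11 L11^T (SPD assumed)
    L21 = np.linalg.solve(L11, F12).T        # rows of the factor below F11
    contribution = F22 - L21 @ L21.T         # Schur complement
    return L11, L21, contribution

# Toy use: a 4x4 SPD front with 2 fully summed variables.
A = np.array([[4., 1., 1., 0.],
              [1., 5., 0., 1.],
              [1., 0., 6., 2.],
              [0., 1., 2., 7.]])
L11, L21, S = partial_factorize(A, 2)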
Assembled Matrix Point of View The multifrontal method can be used for factorizing quite general matrices, but first consider the case when A is symmetric and numerical pivoting is not required (e.g., when A is positive definite). Assume A is of order n and a sparsity-preserving ordering is known. Consider a tree data structure on n nodes, each node corresponding to a column of L (say from the factorization ()), built by setting a parent pointer for each node j as follows: p is the parent of j if and only if p = min{i > j : ℓij ≠ 0}. If no such p exists for a node j, then j is a root. The resulting tree structure is called the elimination tree. If A is irreducible, then the data structure is indeed a tree; otherwise, it is a forest. A frontal matrix can be associated with each node of the elimination tree. As in the finite-element analysis case, the frontal matrices are permuted into a 2 × 2 block structure where all variables from the (1,1)-blocks can be eliminated (i.e., they are fully summed) but the Schur complements (formed by the elimination of these fully summed entries on the
Multifrontal Method. Fig. The assembly and elimination process in the frontal method on a regular grid with four elements
(2,2)-block) cannot be eliminated until later in the factorization. The Schur complement computed at a node corresponds to the contribution block introduced in the finite-element analysis section. The variable at a node j becomes fully summed after all of the descendants of the node in the elimination tree have been eliminated. By definition of the elimination tree, however, there is only one fully assembled variable in each frontal matrix. This prevents the use of efficient Level-3 BLAS routines for performing the elimination and update of the Schur complement and may result in fronts whose size is too small to achieve a good exploitation of the memory hierarchy available in modern processors. It is thus advantageous to combine or amalgamate nodes of the elimination tree. In the case where two or more columns of the L factor have the same structure below the diagonal (modulo a diagonal block), the corresponding variables can be amalgamated and thus eliminated within a single frontal matrix without introducing any additional fill-in. In the general case where columns of the L factor do not have the same structure, it is possible to perform amalgamation at the cost of additional fill-in; threshold-based strategies can be used in this case to perform amalgamation so that the cost of the introduced fill-in is overcome by the advantages of more efficient computations. As a result of the amalgamation procedure, the elimination tree is transformed into an assembly tree whose nodes define the steps of the multifrontal method. At each step, values of the problem matrix are assembled together with the contribution blocks from the children nodes into a frontal matrix, the fully summed variables are eliminated, and the corresponding Schur complement is formed in order to be assembled at the parent node (if any). The “Achilles heel” of sparse direct methods applied to general systems is that indirect addressing is required. The multifrontal method cannot avoid this, but the indirect addressing only occurs in the assemblies, while all the arithmetic is performed using direct addressing, as in the dense case and as in frontal methods.
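The parent rule given above translates directly into code. The sketch below computes the elimination tree from an explicit lower triangular factor, purely for illustration; production codes derive the tree symbolically from the pattern of A (e.g., with Liu's almost-linear-time algorithm) without forming L first.

import numpy as np

def elimination_tree(L):
    """parent(j) = min{ i > j : L[i, j] != 0 }; -1 marks a root."""
    n = L.shape[0]
    parent = [-1] * n
    for j in range(n):
        below = np.nonzero(L[j + 1:, j])[0]
        if below.size:
            parent[j] = j + 1 + int(below[0])   # smallest i > j with L[i, j] != 0
    return parent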
Discussion Phases of Solution Similar to other sparse direct methods, the solution of a linear system by means of a multifrontal method is
commonly achieved through three main phases: analysis, factorization, and solve. This subdivision enables the analysis and factorization phases to be performed only once when several linear systems with the same coefficients and differing right-hand side vectors are to be solved successively; similarly if several matrices with the same sparsity pattern are to be factorized, the analysis phase can be performed only once.
Analysis This step mostly consists of symbolic preprocessing operations prior to the actual factorization. These operations include the computation of a fill reducing pivotal order and a symbolic elimination process to determine the size of the factors and the frontal matrices. The ordering and matrix pattern will define a unique elimination tree but the important feature of the tree is that it only defines a partial ordering for the subsequent matrix factorization and this freedom can be used to great effect to reduce memory costs and to enable the efficient exploitation of parallelism. Factorization The actual numerical factorization takes place in this step. From the memory point of view, implementations of the multifrontal factorization commonly use three “logical” areas of storage: one for the factors, one to stack the contribution blocks, and one for the current frontal matrix. The idea of using a stack stems very naturally from the fact that, in a sequential environment, the nodes of the elimination tree can be processed using a postorder (all nodes within any given subtree are numbered consecutively). During the tree traversal, the storage required by the factors always grows while the stack memory (containing the contribution blocks) can both increase and decrease. When the partial factorization of a frontal matrix is completed, a contribution block is pushed on to the stack, increasing the stack size; on the other hand, when a frontal matrix is formed and assembled, the contribution blocks of its children are popped out of the stack, decreasing the stack size. This pattern of memory use permits a natural extension to out-of-core execution. First, the computed factors of the frontal matrices can be moved to secondary storage after they are computed (they are not needed until the solve phase). Second, the stack memory has a natural locality, thanks to the postorder.
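A toy model of this stack discipline is sketched below: the tree is traversed in postorder, the children's contribution blocks are popped when a front is assembled, and the new contribution block is pushed for the parent. Sizes are abstract units and the live frontal matrix itself is not modeled; the point is only that the peak stack size depends on the traversal order, which is why siblings are reordered.

def peak_stack(children, cb_size, root):
    """Return the peak number of contribution-block units held on the
    stack during a postorder traversal (illustrative model only)."""
    stack_now = peak = 0

    def visit(node):
        nonlocal stack_now, peak
        for c in children.get(node, []):
            visit(c)
        # Assemble: the children's contribution blocks are popped ...
        stack_now -= sum(cb_size[c] for c in children.get(node, []))
        # ... and the node's own contribution block is pushed for its parent.
        stack_now += cb_size[node]
        peak = max(peak, stack_now)

    visit(root)
    return peak

# Example: different child lists (sibling orders) give different peaks.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
cb_size = {i: 1 for i in range(7)}
print(peak_stack(children, cb_size, 0))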
Multifrontal Method. Fig. Finite-element problem and examples of associated assembly trees. Fully assembled variables are shown with a dark-shaded area within each frontal matrix
Because the order in which the tree is traversed can have a significant impact on both the number of contribution blocks stored simultaneously on the stack and the working memory, great care must be taken to sort the siblings at each level of the tree. The rule of thumb followed in performing this operation can be illustrated in Fig. a (where the memory associated with factors that have already been computed is not taken into account). If the nodes of the tree are processed in the order ------, at most two contribution blocks and one frontal matrix must be stored. If, on the other hand, the nodes are processed in the order ------, then the peak memory requirement is for four contribution blocks and one frontal matrix. During the factorization phase, pivoting may be performed to improve numerical stability. The normal numerical control for pivoting is the same for multifrontal methods as for most sparse direct methods, that is, threshold pivoting. In the general case, the entry aij can be used as a pivot only if |aij| ≥ u maxk |akj|, where u is a threshold parameter (0 < u ≤ 1). Clearly, in a frontal matrix, the pivot can only be chosen from the fully summed block, so it is possible that not all
the fully summed block (the (1,1) block of the frontal matrix) can be eliminated at a particular node because of relatively large entries in the (2,1) block. The rows and columns corresponding to the fully summed variables that remain uneliminated are then added to the contribution block and assembled into the frontal matrix of the parent node. This gives rise to delayed pivoting. The elimination of delayed pivots may become possible at the parent node because the large entry could now be in the (1,1) block of the new frontal matrix, for example. It must be noted, however, that in the case of delayed pivots, the size of the frontal matrix at the parent node is increased, as is that of its contribution block. For this reason, the use of delayed pivoting requires complex dynamic data structures. In parallel implementations, dynamic scheduling strategies may be necessary to adapt automatically to such situations, and this is done in the code MUMPS. A more recent alternative is to avoid delayed pivoting by using static pivoting, where a pivot is still chosen even if it does not satisfy the threshold criterion but its value may be augmented to avoid instability. In this case, which is available in MUMPS as well as in the other codes (SuperLU and PARDISO) described in this encyclopedia, iterative refinement or perhaps more sophisticated iterative techniques must be used to obtain an accurate solution.
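A simplified sketch of the threshold test is shown below. It considers only diagonal (1 × 1) pivots, does not retest the column after each elimination, and uses an illustrative function name, so it is far from what a production code such as MUMPS does, but it shows how fully summed variables either pass the test or are delayed to the parent.

import numpy as np

def select_pivots(front, n_fully_summed, u=0.1):
    """Return which fully summed columns pass |a_jj| >= u * max_k |a_kj|
    over the whole column of the frontal matrix, and which are delayed."""
    eliminated, delayed = [], []
    for j in range(n_fully_summed):
        col_max = np.abs(front[:, j]).max()
        if abs(front[j, j]) >= u * col_max:
            eliminated.append(j)
        else:
            delayed.append(j)      # stays uneliminated and grows the parent front
    return eliminated, delayed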
Solve During this step, the triangular factors computed at factorization time are used together with the right-hand side vector or matrix to compute the solution of the problem by means of forward and backward substitutions. More precisely, the forward step uses a bottom-up traversal of the tree similar to the factorization, and the backward step uses a top-down traversal of the tree: starting from the root, each node computes the components of the solution corresponding to its fully summed variables and informs its children of the already computed components of the solution.
Unsymmetric Matrices The multifrontal method can be applied to unsymmetric matrices as well. If the matrix A is pattern symmetric, or nearly so (in which case the pattern of M = ∣A∣ + ∣A∣T can be used), then A can be treated as before. By generalizing the partial factorization of the frontal matrices to the rectangular case, special methods for unsymmetric matrices can be developed. This approach necessitates the use of directed acyclic graphs to represent the computational dependencies. Because of asymmetry, it can be shown that in general the compact representation of this acyclic graph is not a tree. It can still be beneficial to use the simplicity of the elimination tree structure (instead of the directed acyclic graphs) in the unsymmetric case, with special attention paid to detecting and making use of asymmetry in the frontal factors. Another approach is to generalize the notion of the elimination tree to unsymmetric matrices. This generalization provides a general framework to describe all unsymmetric multifrontal approaches. The multifrontal method also extends to sparse QR factorization, typically used for least-squares problems where the objective is to find an x that minimizes ∥Ax − b∥.
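For reference, the dense analogue of this least-squares use of QR is a two-liner; the multifrontal QR method computes an equivalent factorization for sparse A through a sequence of dense frontal factorizations. The example below is only a dense NumPy/SciPy illustration and assumes A has full column rank.

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)

Q, R = np.linalg.qr(A)              # reduced QR: R is upper triangular
x = solve_triangular(R, Q.T @ b)    # x minimizes ||Ax - b||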
Parallel Execution The multifrontal method presents many features that make it suitable for multiprocessor environments. Parallelism can be exploited at two different levels. First, two disjoint subtrees of the elimination tree can be processed in parallel, as there is no computational dependency between such subtrees. Second, the frontal matrices that are large enough can be factorized in
parallel using algorithms for dense linear algebra operations – such frontal matrices usually occur toward the end of the factorization process where nodes of the elimination tree that are close to the root are processed. Effective scheduling and mapping strategies are necessary for exploiting the two sources of parallelism mentioned above. In a distributed-memory environment (see MUMPS), much attention has been paid to static mapping methods to map the tasks associated with the nodes of the tree to the processes. The subtree-to-subcube mapping is used to reduce the communication overhead in parallel sparse Cholesky factorization on hypercubes. This mapping mostly addresses parallelization of the factorization of matrices arising from a nested dissection-based ordering of regular meshes. The basic idea of the algorithm is to build a layer of p subtrees (where p is the number of processors) and then to assign to each processor pi a subtree Ti. The nodes of the upper parts of the tree are then mapped using a simple rule: when two subtrees merge together into their parent subtree, their sets of processors are merged and assigned to the root of that parent subtree. The method reduces the communication cost but can lead to load imbalance on irregular assembly trees. A bin-packing-based approach was then introduced to improve the load balance on irregular trees. Given an arbitrary tree, the improvement is achieved by finding the smallest set of subtrees which can be partitioned among the processors while attaining a given load balance threshold. In this scheme, starting from the root node, the assembly tree is explored until a layer of subtrees, whose corresponding mapping on the processors produces a “good” balance, is obtained (the balance criterion is computed using a first-fit decreasing bin-packing heuristic). Once the mapping of the layer of subtrees is done, all the processors are assigned to each remaining node (i.e., the nodes of the upper parts of the tree). This method takes into account the workload to distribute processors to subtrees. However, it does not minimize communication in the sense that all processors are assigned to all the nodes of the upper parts of the tree. The proportional mapping algorithm improves upon the two algorithms above. It uses the same “local” mapping of the processors as that produced by the subtree-to-subcube algorithm with the difference that it takes into account workload information to build the layer of subtrees. To be more precise, starting from the root node to which all the processors
are assigned, the algorithm splits the set of processors recursively among the branches of the tree according to their relative workload until all the processors are assigned to at least one subtree. This algorithm is characterized by both a good workload balance and a good reduction in communication. In a parallel computing context, memory use is much more difficult to control and limit for two related reasons. First, it may not be possible to guarantee a postorder traversal of the tree while exploiting the tree parallelism since processes may work on different branches of the tree at the same time. Thus the use of a simple stack to hold the contribution blocks as in the sequential case is not possible. Second, as many processes are active during the factorization, the global memory requirement may increase compared to a sequential execution. The memory per process should decrease but the overall memory requirement may increase. The solve step can also be executed in parallel by following the same computational pattern of the factorization phase, that is, the pattern induced by the elimination tree.
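The proportional mapping idea can be sketched compactly; the tree, workloads, and remainder handling below are illustrative simplifications of what a code like MUMPS actually implements.

def proportional_map(children, work, procs, node, assignment=None):
    """Recursively split the processor list among a node's children in
    proportion to the workload of their subtrees (simplified sketch)."""
    if assignment is None:
        assignment = {}
    assignment[node] = list(procs)
    kids = children.get(node, [])
    if not kids or len(procs) <= 1:
        return assignment
    total = sum(work[c] for c in kids)
    start = 0
    for idx, c in enumerate(kids):
        if idx == len(kids) - 1:
            share = len(procs) - start                       # last child takes the remainder
        else:
            share = max(1, round(len(procs) * work[c] / total))
            share = min(share, len(procs) - start - (len(kids) - 1 - idx))
        proportional_map(children, work, procs[start:start + share], c, assignment)
        start += share
    return assignment

children = {0: [1, 2], 1: [3, 4]}
work = {1: 60, 2: 40, 3: 30, 4: 30}
print(proportional_map(children, work, list(range(8)), 0))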
Related Entries BLAS (Basic Linear Algebra Subprograms) Dense Linear System Solvers Mumps Reordering
Bibliographic Notes and Further Reading Based on techniques developed for finite-element analysis by Irons [] (and a technical report by Speelpenning dated ), Duff and Reid proposed the multifrontal method for symmetric [] and unsymmetric [] systems of equations. Eisenstat et al. [] proposed another method with the same basic features as the multifrontal method. More recently, Eisenstat and Liu [] have provided a new generalization of elimination trees for unsymmetric matrices using the notion of paths instead of edges to define a parent node in the tree. This generalized elimination tree leads to a general framework to describe how all existing approaches take the asymmetry into account. This includes the work of: Duff and Reid [] and Amestoy, Duff, and L’Excellent [], who apply the multifrontal method to unsymmetric matrices with symmetrized pattern; Davis and Duff [] and Gupta [], who do not use the assembly tree but use a directed acyclic graph to represent the computational dependencies; and Amestoy and Puglisi [], who work on a “symmetrized” assembly tree by making use of the structural asymmetry of the frontal matrices during the factorization phase. Liu [] provides an overview of the multifrontal method for symmetric positive definite matrices. Liu [] and Guermouche et al. [] have explored the impact of the tree traversal on the memory behavior and proposed tree traversals that minimize the storage requirements of the method. The mapping algorithms discussed in Sect. Parallel Execution are by George et al. [], Geist and Ng [], and Pothen and Sun []. Amestoy et al. [] generalize and improve the heuristics in [, ] by taking memory scalability issues into account and by incorporating dynamic load balancing decisions for which some preprocessing is done in the analysis phase. Memory management issues in the implementation of the multifrontal method are discussed by Amestoy and Duff []. Matstoms [] and Amestoy, Duff, and Puglisi [] discuss the multifrontal QR method in a multiprocessor environment.
Bibliography . Amestoy PR, Duff IS () Memory management issues in sparse multifrontal methods on multiprocessors. Int J Supercomput Appl :– . Amestoy PR, Duff IS, L’Excellent J-Y () Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput Methods Appl Mech Eng :– . Amestoy PR, Duff IS, Puglisi C () Multifrontal QR factorization in a multiprocessor environment. Numer Linear Algebra Appl ():– . Amestoy PR, Guermouche A, L’Excellent J-Y, Pralet S () Hybrid scheduling for the parallel solution of linear systems. Parallel Comput :– . Amestoy PR, Puglisi C () An unsymmetrized multifrontal LU factorization. SIAM J Matrix Anal Appl :– . Davis TA, Duff IS () An unsymmetric-pattern multifrontal method for sparse LU factorization. SIAM J Matrix Anal Appl :– . Duff IS, Reid JK () The multifrontal solution of indefinite sparse symmetric linear equations. ACM Trans Math Softw :– . Duff IS, Reid JK () The multifrontal solution of unsymmetric sets of linear equations. SIAM J Sci Stat Comput :– . Eisenstat SC, Liu JWH () The theory of elimination trees for sparse unsymmetric matrices. SIAM J Matrix Anal Appl : –
. Eisenstat SC, Schultz MH, Sherman AH () Applications of an element model for Gaussian elimination. In: Bunch JR, Rose DJ (eds) Sparse matrix computations. Academic, New York/London, pp – . Geist GA, Ng EG () Task scheduling for parallel sparse Cholesky factorization. Int J Parallel Program :– . George A, Liu JWH, Ng E () Communication results for parallel sparse Cholesky factorization on a hypercube. Parallel Comput :– . Guermouche A, L’Excellent J-Y () Constructing memory-minimizing schedules for multifrontal methods. ACM Trans Math Softw :– . Gupta A () Recent advances in direct methods for solving unsymmetric sparse systems of linear equations. ACM Trans Math Softw :– . Irons BM () A frontal solution program for finite-element analysis. Int J Numer Methods Eng :– . Liu JWH () On the storage requirement in the out-of-core multifrontal method for sparse factorization. ACM Trans Math Softw :– . Liu JWH () The multifrontal method for sparse matrix solution: theory and practice. SIAM Rev :– . Matstoms P () Parallel sparse QR factorization on shared memory architectures. Parallel Comput :– . Pothen A, Sun C () A mapping algorithm for parallel sparse Cholesky factorization. SIAM J Sci Comput :–
Multi-Level Transactions Transactions, Nested
Multilisp Robert H. Halstead, Jr Curl Inc., Cambridge, MA, USA
Synonyms Mul-T; MultiScheme
Definition Multilisp is a parallel programming language, focused particularly on parallel symbolic computing, developed as a research project at MIT during the ’s. The Multilisp language design and implementation projects served as a vehicle for exploring various ideas related to concurrency constructs, scheduling, parallel garbage
collection, speculative computation, and debugging tools for parallel programs.
Discussion Introduction The s saw much experimentation with parallel programming models, as supercomputer performance and performance/price ratios both reached new heights amid a growing variety of SIMD and MIMD parallel machines, using both shared- and distributed-memory architectures, from various vendors. It became increasingly apparent that scaling laws for hardware would eventually favor parallel computing as the most cost-effective way to achieve high performance. A growing recognition of the challenges of programming parallel machines also led to many parallel programming research projects targeting different hardware architectures and different application areas. Two key characteristics of an application that influence the choice of a parallel computing approach are the amount of parallelism available in the application (often related to the size of the data set) and the regularity of the available parallelism: whether the set of operations to be performed, and the dependences between them, have a simple structure largely independent of the data values, or are complex and data-dependent. “Regular” applications include various signal-processing problems as well as partial differential equation solutions over regular grids, while “irregular” applications include sparse matrix computations, mesh generation, computer algebra, discrete event simulation, and various heuristic search and optimization applications. Many irregular applications are termed “symbolic” because they emphasize navigating and creating complex data structures over performing arithmetic. Some parallel programming projects focused on massively parallel computing for large numeric problems with a fairly regular structure, often a good match for vector or SIMD architectures. Multilisp was among a group of projects that focused on exploiting parallelism in problems with a less regular structure. Multilisp itself was aimed at shared-memory MIMD architectures and oriented especially toward symbolic applications. The Multilisp team believed that the irregularity of these applications made it too difficult to
preschedule operations or predetermine a good distribution of application data across the nodes of a distributed-memory machine, so a shared-memory MIMD architecture would be the best implementation target. Most applications that were investigated for Multilisp offered moderate opportunities for parallel speedup, more in the tens than in the thousands. Such speedups were often not exciting enough in the era of the s and s, when sequential processor speeds were doubling every couple of years. In the present environment, clock speeds are increasing only slowly and multicore processors are the new vehicle for increasing computation throughput, so there may be a renaissance of serious interest in concepts first developed for Multilisp and its contemporary projects. Multicore processors are precisely the shared-memory MIMD model at which Multilisp and similar projects were targeted.
Multilisp Semantics The design of Multilisp is based on several major hypotheses:
1. The best route to a high degree of parallelism is to identify a very large number of concurrent computations and then inexpensively and efficiently schedule them across the available processors.
2. Functional programming is good for parallel computing because it eliminates bugs caused by unintended nondeterminacy, but controlled nondeterminacy is desirable in some applications.
3. For symbolic computations that work by traversing and building linked data structures, synchronization closely tied to the data structures is often the most natural.
4. Parallel programming is hard enough, so a parallel programming language implementation should automate clerical tasks that we know how to automate, notably heap storage allocation and garbage collection.
Point (1) creates a strong bias in favor of simple, lightweight concurrency and synchronization constructs. Point (2) drove the Multilisp team toward a design strongly influenced by the Scheme language [], which includes side-effecting primitives but can be used very naturally in a “mostly functional” programming style, where side-effecting operations are used only at
the comparatively infrequent points where an imperative state change on an object is exactly what is needed. Point (4) seems uncontroversial from today’s vantage point, after the rise of Java in the s popularized the notion that serious software can be written on a platform with garbage-collected heap storage, but in the s, there was still great debate about whether garbage-collected systems like Lisp and Scheme could be used for “real work” like C and C++ could. Scheme, a member of the Lisp family of languages, was an attractive starting point not only because it is consistent with point (2) above, but also because it is built on garbage-collected heap storage (point 4) and is widely used for building symbolic applications. Multilisp implementations differed somewhat in how faithfully they adhered to the details of the Scheme definition, but all of them were philosophically quite close to Scheme.
Basic Concurrency Constructs On top of the Scheme foundation, Multilisp adds two basic concurrency constructs: pcall and future, described in detail by Halstead []. pcall provides classic fork-join concurrency while future provides a form of data-flow concurrency. A Multilisp expression such as (pcall F A B C ...) allows concurrent evaluation of the expressions F, A, B, C, …. After waiting for all of these evaluations to return a value, the procedure value of the expression F is applied to the argument values resulting from evaluating A, B, C, …. The result of evaluating the pcall expression is thus equivalent to that of the sequential procedure call (F A B C ...) except that the expressions F, A, B, C, … are evaluated concurrently. future is a more radical construct that permits concurrency structures not possible with simple fork-join concurrency. A Multilisp expression (future X) permits a task to be spawned to evaluate the expression X, but immediately returns a placeholder for the result of that evaluation. When the evaluation of X yields a value, the future object essentially becomes that value. The future is said to resolve to the value of X.
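Multilisp is a Lisp dialect, so the following is only a loose analogy in Python rather than Multilisp code: concurrent.futures gives the same "spawn now, touch later" flavor, except that the touch is an explicit result() call instead of being implicit in every strict operation.

from concurrent.futures import ThreadPoolExecutor

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

with ThreadPoolExecutor() as pool:
    f = pool.submit(fib, 25)      # like (future (fib 25)): returns a placeholder
    other = fib(20)               # useful work proceeds before touching f
    total = other + f.result()    # the "touch": blocks until the future resolves
print(total)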
The evaluation of X can proceed concurrently with the use of the placeholder value, until the point where the user actually needs some information about the value. An operation that actually needs information about a value is said to touch the value. Such an operation will have to block until the future resolves to a value, but especially in symbolic applications, often quite some time passes while a value is passed into or out of procedure calls, or is built into data structures and then retrieved, before the value is actually touched. Since the point where a value is touched can be remote in the program code, as well as in time, from the point where an evaluation was spawned, the future construct enables rich and complex concurrency structures to be easily specified that could not be specified using fork-join concurrency alone, fulfilling point (3) in the above list of design goals (closely linking synchronization with data structures). The future concept has its origins in the Actor systems of Baker and Hewitt []. Both pcall and future have the property that in a fully functional program (no side effects), inserting these constructs does not change the meaning of the program, so a programmer can feel free to insert them in whatever way best balances the exposure of concurrency with the concurrency management overhead costs. In addition to the pcall and future constructs, which combine task spawning and synchronization, Multilisp also includes replace and replace-if-eq operations, similar to the classic atomic swap and compare-and-swap instructions, as building blocks for higher-level synchronization operators [].
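To show how such primitives serve as building blocks, here is a small Python model (names hypothetical, atomicity simulated with a lock) of a replace-if-eq-style operation and a higher-level operator built on top of it with the classic retry loop.

import threading

class Cell:
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def replace_if_eq(self, old, new):
        """Atomically set value to `new` iff it currently equals `old`."""
        with self._lock:
            if self.value == old:
                self.value = new
                return True
            return False

def atomic_add(cell, delta):
    while True:                       # classic compare-and-swap retry loop
        cur = cell.value
        if cell.replace_if_eq(cur, cur + delta):
            return

c = Cell(0)
atomic_add(c, 5)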
Speculative Computing and Sponsors The simplest use of the Multilisp constructs is just to use the concurrency primitives to relax the precedence constraints in a sequential algorithm, enabling the exact same set of operations to be performed with some parallelism on a parallel machine. A more aggressive approach is to use parallelism to speculatively launch new lines of inquiry, whose results may or may not be used depending on the results of other computations. Opportunities for such speculative computing are especially prevalent in search problems. Speculative computing brings with it some new challenges. Usually, some speculative tasks are
considered more valuable than others, and it is important to ensure that low-value speculative tasks do not consume resources to the extent of starving out higher-value speculative tasks or “mandatory tasks” that are not speculative. Also, estimates of the value of speculative tasks can change as a computation progresses and more information is learned about the problem being solved. An important special case of this is when the computation performed by a speculative task becomes moot because of information developed by other ongoing computations. Such irrelevant speculative tasks should be terminated gracefully so they will no longer consume resources. Conversely, a task initially thought to be speculative may become recognized as mandatory and deserving of the highest priority. Managing speculation in a language like Multilisp is complex, however, because future can induce dependence relationships within a computation that are much more varied and tangled than the dependences that result from simple fork/join concurrency. This challenge can be addressed by introducing a higher-level construct to represent the relationships between various speculatively launched subcomputations of an overall computation. Osborne [] investigated a sponsor model inspired by earlier work by Kornfeld and Hewitt []. In this model, each computational task is associated with one or more sponsors that represent the reason, or goal, for performing the computation and provide a place to store and update information about the urgency or importance of that goal. Sponsors provide a natural framework for solving various problems with priority management. For example, a task that is blocked waiting for a future can add its own sponsor to the sponsorship of the task that is computing a value for the future. Similarly, a task waiting to enter a critical region can temporarily add its sponsor to the sponsorship of a task currently in the critical region. Both of these strategies prevent priority inversions (where higher-priority tasks are blocked by lower-priority tasks) by ensuring that any tasks whose results are awaited get the same claim on resources as the tasks that are waiting. Speculative subcomputations within a computation that is itself speculative can be represented using a tree structure of sponsors, where some sponsors are subordinated to other sponsors. Finally, a sponsor can be declared irrelevant if its associated speculative task has become moot, and this provides a natural way
to know which tasks should no longer receive execution resources. While considerable progress was made in developing the capabilities of the sponsor model, the efficient implementation of sponsor-based speculative scheduling on a parallel machine remains an underexplored subject. However, the sponsor model seems useful not just for speculative computing but also as a model for “system programming” on a parallel machine, where multiple computational threads that share the machine’s resources execute on behalf of different users or different computations initiated by the same user, who may change his/her mind from time to time about the priority of the different computations or may even decide to abandon some of them completely.
Futures and Continuations The Multilisp project provided a framework for investigating various higher-level questions about how to carry familiar sequential programming capabilities into the world of parallel programming. One challenging area of investigation was how to reconcile the future construct with the “first-class continuation” mechanism embodied in Scheme’s call/cc construct []. The name of this construct is a contraction of “call with current continuation.” An expression such as (call/cc f) captures the continuation to which the value of this expression will be returned, reifies it as a procedure, and passes the resulting procedure as an argument to f. Subsequent computations, within f or elsewhere, can then cause a value V to be returned from the call/cc expression by passing V as an argument to the continuation procedure. Many useful program structures can be built using call/cc, but it can also be used to produce various paradoxical behaviors, such as procedure calls that return more than once, possibly returning different values each time. Such behaviors conflict with the restriction that a future can only resolve once, to one value. Halstead [] reviews various ideas for dealing with these conflicts, but in the end there was no definitive resolution. Errors and Exceptions The handling of errors and exceptions is another challenge: what should happen if the computation X in (future X) ultimately attempts a division by zero, an
out-of-bounds array access, or even explicitly raises an exception? Modern languages for sequential programming provide mechanisms for catching such exceptions so execution can proceed down another path that deals with the condition. Typically these mechanisms work by unwinding the call stack back to a caller that catches the exception and specifies an action to take in response. With future, this idea is problematic because a caller that contained the (future X) expression may already have returned before the exception was raised. Two solutions to this problem were investigated. One solution, described by Halstead and Loaiza [], is for the exception in X to cause the future to resolve to an exception value. When an operation touches an exception value, the exception is raised again in that operation. Thus, exceptions propagate back through the data-flow structure of the program rather than through the control structure. An alternative possibility, outlined by Halstead [] but never fully worked out, would use continuations (for example via a specified “error continuation” which would be called to signal an exception) in combination with the sponsor model (for abandoning or suspending computations rendered moot by the raising of the exception).
Implementations Over the course of the Multilisp project, three implementations were built. The first implementation, Concert Multilisp [], ran on the Concert multiprocessor at MIT. While Concert Multilisp was a versatile testbed for many implementation ideas, it was based on a bytecode interpreter and therefore was not fast in absolute terms. The second implementation, MultiScheme [], was based on the MIT Scheme system and ran on the BBN Butterfly multiprocessor. It also used a bytecode interpreter. The third implementation, Mul-T [], was based on the T system from Yale and used T’s ORBIT compiler to generate native machine code. Mul-T ran on an Encore Multimax multiprocessor.
Storage Management Concert Multilisp included a copying, incremental garbage collector intended to allow as much concurrency as possible between garbage collection operations and computational operations []. To minimize synchronization and improve locality, each processor had
its own heap in which to allocate objects. Each processor’s heap was divided into two semispaces: oldspace and newspace. During each garbage-collection cycle, all live objects were copied from oldspace to some processor’s newspace. At the beginning of each cycle, all processors needed to synchronize to swap the roles of the two semispaces, but after a short period to copy objects in the root set, all processors could proceed independently, interleaving computational operations with housekeeping intervals for moving live heap objects from oldspace to newspace. This approach has the advantage of allowing all processors to be fully utilized without requiring the garbage-collection load to be spread evenly across them, but the cost on stock hardware of checking pointers loaded from the heap to ensure that they do not point into oldspace is significant and might be too great to tolerate in an implementation based on compiled code. To avoid the cost of checking pointers loaded from the heap, both MultiScheme and Mul-T employed non-incremental garbage collectors that required all processors to suspend computational activities and synchronize, then focus exclusively on garbage collection until the garbage collection pass is done, at which point they can resume computation. This approach streamlines the execution of computational tasks but puts severe requirements on the evenness with which garbage-collection work can be distributed among the processors. The parallelism achieved during garbage collection in these implementations was not carefully tuned and is not conclusively stated in the published reports, but there was evidence that speedups of or more were difficult to achieve. The garbage collector was also called on to perform other housekeeping tasks. All Multilisp implementations used the garbage collector to splice out resolved futures, replacing pointers to the future object with pointers to the future’s value, eliminating any subsequent costs for indirecting through the future’s placeholder object. MultiScheme also used the garbage collector to kill tasks that are working on resolving futures that have become inaccessible.
Scheduling In programs with large amounts of concurrency, unfolding the task-spawning tree breadth-first can require exponentially more resources than unfolding
the task tree depth-first. Therefore, depth-first scheduling is strongly preferred, once enough concurrency has been exposed to keep all processors busy. In practice, this means that when processing the expression (future X), execution of the parent task should be suspended and the evaluation of X should begin immediately. If there is already plenty of work to keep all processors busy, the parent task will still be suspended when the evaluation of X completes, and the execution of the parent task can be resumed at that point. This was the scheduling policy used in Concert Multilisp []. In the design of Mul-T it was observed that the implementation of this policy could be speeded up by skipping the creation of a child task, allowing the “parent” task to just continue into the evaluation of X. If another processor needs work before the evaluation of X completes, it can break into this task’s stack and steal away the continuation of the computation beyond the future construct. This policy of lazy task creation [] greatly reduces the cost of task management, since most potential task-spawning points never actually result in the creation of a new task. This policy was subsequently adopted, and put on a sound theoretical footing, by the MIT Cilk project []. It is also related to the “evaluate-and-die” mechanism of Glasgow Parallel Haskell []. The informally stated goal of the Mul-T implementation was to make future “no more expensive than a procedure call.” With lazy task creation, this cost was reduced to - machine instructions in most cases [], coming close to this goal.
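The following toy Python sketch illustrates the policy rather than Mul-T's implementation: the worker dives into the child immediately and exposes the parent's continuation (modeled here as a plain thunk) for idle workers to steal; real lazy task creation steals a live stack frame rather than a pre-packaged closure.

from collections import deque
import threading

class Worker:
    def __init__(self):
        self.deque = deque()          # bottom = own end, top = steal end
        self.lock = threading.Lock()

    def spawn(self, child, parent_continuation):
        with self.lock:
            self.deque.append(parent_continuation)   # expose the continuation
        child()                                       # depth-first: run the child now
        with self.lock:
            stolen = not (self.deque and self.deque[-1] is parent_continuation)
            if not stolen:
                self.deque.pop()
        if not stolen:
            parent_continuation()                     # nobody stole it: keep going

    def steal_from(self, victim):
        with victim.lock:
            if victim.deque:
                return victim.deque.popleft()         # steal the oldest continuation
        return None

# Toy use on one worker: the parent continuation runs after the child
# unless some other worker stole it in the meantime.
w = Worker()
w.spawn(lambda: print("child"), lambda: print("parent continuation"))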
Debugging Tools In sequential programming, it is important that programs both be correct and have the needed performance, and this is no less true in parallel programming. Multilisp’s mostly functional nature is intended to assist in developing correct programs by reducing opportunities for nondeterminacy and supporting a development style where testing a program on a sequential machine will go a long way toward assuring its correct behavior on a parallel machine. Beyond that, it is useful to be able to execute programs in parallel and replay the same execution path later on, if a problem is noticed. The Multilisp team’s MulTVision project [] included a log-and-replay technique, implemented in Mul-T, which reduced logging costs substantially by logging only task
creations and side effects on shared data structures, which are not common in mostly functional programs. While performance optimization is often important for sequential programs, it is critical for parallel programs, since performance is the main reason to target a program to a parallel machine in the first place. In a large program, however, it can be quite hard to understand the reasons for the observed performance on a parallel machine. MulTVision also included a logger for events such as task creation, blocking, and unblocking, and a visualizer for viewing the resulting execution traces. The project was disbanded before the value of these tools could be clearly demonstrated, but this is likely to become an urgent problem in software development for modern multicore architectures.
Performance Mul-T is the most significant of the Multilisp implementations from the standpoint of assessing Multilisp’s actual potential to deliver real parallel speedup over the best sequential implementation. The extra costs in the Mul-T implementation, compared with the sequential T implementation, come principally from two sources: synchronization and task scheduling. The major component of the synchronization costs is the checks that must be included in every operation that touches its operands, to determine whether any of the operands is a future and, if so, to take the needed actions. The cost for this depends considerably on the nature of the program, but after some optimizations, an increase of about % in execution time due to touch checks was fairly typical []. It would be interesting to know how this measurement would look on today’s machines. When combined with lazy task creation, Mul-T was measured as delivering performance for a range of benchmarks, on processors, of up to about times the performance of a sequential program running in the unmodified T system [].
Related Work and Influences on Other Projects
Several other parallel Lisp languages and implementations of the same era are reviewed by Halstead [] and are briefly mentioned here. Qlisp was based on Common Lisp augmented with futures as well as several other concurrency constructs and was implemented at
Stanford University on an Alliant FX multiprocessor. Butterfly Lisp (closely related to MultiScheme) and Parallel Portable Standard Lisp were implemented on the BBN Butterfly multiprocessor. Both rely primarily on futures for concurrency. Farther afield, Concurrent Scheme at the University of Utah was targeted to a distributed-memory MIMD multiprocessor, while Connection Machine Lisp was designed for the SIMD architecture of the Connection Machine. In addition to influencing contemporaneous parallel Lisp systems that also used futures, ideas from the Multilisp project, particularly in the areas of synchronization, task scheduling, and garbage collection, have found their way into various other systems. Especially notable among these systems is the work-stealing scheduler of the MIT Cilk system [], which was developed in the mid-1990s and continues in use to this day. Cilk has been used to build very large parallel applications, such as the competition-quality CilkChess program. Most recently, these ideas have been commercialized in Cilk++, a tool for programming present-day multicore processors.
Related Entries
Actors
Cilk
Debugging
Futures
Glasgow Parallel Haskell (GpH)
Processes, Tasks, and Threads
Speculation, Thread-Level
Bibliographic Notes and Further Reading
The basic principles behind Multilisp and Mul-T are outlined in [] and [], respectively. Lazy task creation, the essential idea that has been picked up in subsequent "work-stealing" schedulers, is described in []. Halstead [] gives an extended overview of the entire parallel Lisp field as of that time. Many members of the parallel Lisp and parallel symbolic computing communities gathered at a sequence of three workshops. The proceedings of the first [] and second [] of these workshops include several papers discussing various aspects of the Multilisp project as well as papers about other contemporaneous projects mentioned in
this article. These volumes are thus a good starting point for further exploration of the literature.
Multimedia Extensions Vector Extensions, Instruction-Set Architecture (ISA)
Bibliography . Baker H, Hewitt C () The incremental garbage collection of processes. MIT Artificial Intelligence Laboratory Memo , Cambridge . Blumofe R et al () Cilk: an efficient multithreaded runtime system. In: Proceedings of the fifth ACM symposium on principles and practice of parallel programming (PPoPP). July , pp – . Halstead R () Multilisp: a language for concurrent symbolic computation. ACM Trans Prog Lang Syst ():– . Halstead R () Parallel symbolic computing. IEEE Comput Mag vol ():– . Halstead R () New ideas in parallel lisp: language design, implementation, and programming tools. In: Proceedings of US/Japan workshop on parallel lisp, Lecture Notes in Computer Science . Springer, Heidelberg, pp – . Halstead R, Ito T (eds) () Parallel symbolic computing: languages, systems, and applications. In: Proceedings of US/Japan workshop on parallel symbolic computing, Lecture Notes in Computer Science . Springer, Heidelberg . Halstead R, Kranz D, Sobalvarro P () MulTVision: a tool for visualizing parallel program executions. In: Proceedings of US/Japan workshop on parallel symbolic computing, Lecture Notes in Computer Science . Springer, Heidelberg, pp – . Halstead R, Loaiza J () Exception handling in multilisp. In: Proceedings of the international conference on parallel processing, St. Charles, IL, August , pp – . Ito T, Halstead R (eds) () Parallel lisp: languages and systems. In: Proceedings of US/Japan workshop on parallel lisp, Lecture Notes in Computer Science . Springer, Heidelberg . Kelsey R, Clinger W, Rees J (eds) () Revised report on the algorithmic language scheme. ACM SIGPLAN Not ():– . Kornfeld W, Hewitt C () The scientific community metaphor. IEEE Trans Syst Man Cybern, pp – . Kranz D, Halstead R, Mohr E () Mul-T: a high-performance parallel lisp. In: Proceedings of ACM SIGPLAN ’ Conference on programming language design and implementation. Portland, Oregon, pp – . Miller J () MultiScheme: a parallel processing system based on MIT scheme. MIT Laboratory for Computer Science Technical Report TR-, Cambridge . Mohr E, Kranz D, Halstead R () Lazy task creation: a technique for increasing the granularity of parallel programs. IEEE Trans Parallel Dist Syst ():– . Osborne R () Speculative computation in multilisp. In: Proceedings of US/Japan workshop on parallel lisp, Lecture Notes in Computer Science . Springer, Heidelberg, pp – . Peyton Jones S, Clack C, Salkild J () High-performance parallel graph reduction. In: PARLE ’ Proceedings, Lecture Notes in Computer Science . Springer, Berkley, pp –
Multiple-Instruction Issue Superscalar Processors
Multiprocessor Networks Hypercubes and Meshes Interconnection Networks
Multiprocessor Synchronization Synchronization Transactional Memory
Multiprocessors Distributed-Memory Multiprocessor Shared-Memory Multiprocessors
Multiprocessors, Symmetric Shared-Memory Multiprocessors
MultiScheme Multilisp
Multistage Interconnection Networks Networks, Multistage
Multi-Streamed Processors Multi-Threaded Processors
Multi-Threaded Processors Mario Nemirovsky Barcelona Supercomputer Center, Barcelona, Spain
Synonyms Multi-streamed processors
Definition
In contrast to standard single-threaded processors, in which only one software thread can be present and executed at any given time, multithreaded processors can hold multiple software threads which can be running simultaneously. The amount of information that must be held per software thread is defined as the context. Therefore, a processor that can hold multiple contexts concurrently is defined as a Multithreaded Processor.
Discussion
Introduction
In the continuing search for a higher level of performance, computer architects have consistently pushed the speed of the system and the microarchitecture design. For instance, microprocessor clock speeds have grown by orders of magnitude, while the number of instructions completed per clock has increased from well below one on early Intel processors to several per core on modern Intel Xeons, which also operate on far wider data words than their early predecessors. Yet, the semiconductor technology and microarchitecture innovations that led to these processor performance gains have started to provide diminishing returns. As pulling substantial performance benefits out of these techniques becomes increasingly tedious, higher processor efficiency becomes attainable through the exploitation of higher-level parallelism. It is important to take into account all the different levels at which parallelism can play a significant role; looking beyond Instruction Level Parallelism (ILP), the next level is Thread Level Parallelism (TLP). A thread is a process with its own instructions and data. These processes can be totally independent or have different levels of sharing. A multithreaded processor is able to concurrently execute instructions of different threads within a single pipeline. Multithreading can be applied either to increase the performance of multiprogramming or multithreaded workloads or to increase the performance of a single program. Multithreaded processors interleave the execution of instructions of different user-defined threads in the same pipeline. This requires that multiple program counters and contexts be present, often stored in different register sets. The execution units are multiplexed between the running threads. Latencies that occur in the computation of a single instruction stream (also known as bubbles) are filled by the instructions of another thread. Context switching among the threads held in hardware is handled automatically. A multithreaded processor capable of executing multiple instructions from different instruction streams (i.e., threads running within the hardware) simultaneously is called a simultaneous multithreaded (SMT) processor. SMT is the confluence of multithreading techniques on a wide superscalar processor so that the hardware pipeline is used more efficiently.
Classification of Multithreading Techniques
Multithreaded processors execute different instructions from different threads simultaneously within the same pipeline. Initially, multithreading techniques were developed for use in scalar processors and consisted of two types, commonly known as Fine Grain Multithreading (FGMT) and Coarse Grain Multithreading (CGMT). In fine grain multithreading, every new instruction issue comes from a different thread, so that no two instructions from the same thread are present in the pipeline at any given time. Furthermore, the overall efficiency of the processor is increased over standard scalar processors because the bubbles normally caused by high-latency events get used by other threads. In order to have high efficiency, at least as many threads as pipe stages are needed, but if only a single thread is active, then the performance is 1/n that of a scalar processor (n being the number of pipe stages). The hardware complexity on one
hand increases due to the need to replicate the context registers; on the other hand, the pipeline complexity is simplified because dependencies between instructions in flight are almost nonexistent, since every instruction is from a different thread, so bypass logic, branch prediction, and out-of-order execution are not needed. Coarse grain multithreading (CGMT) can similarly handle various threads in a single pipeline, but every newly issued instruction can come from any thread, independent of which threads' instructions are currently in the pipeline. This implies that if a single thread is active, then the performance of that thread is the same as would have been achieved in a single-threaded processor. Coarse grain multithreading is beneficial in cases that require high single-thread performance or in systems where few software threads are available. Although the complexity of a CGMT processor is larger than that of an FGMT processor, since it includes both multithreading hardware support and the dependency checks and bypasses of a single-threaded processor, a CGMT processor requires fewer threads to remain efficient than the fine grain approach and is more efficient than a single-threaded processor when more than one thread is running. The advent of superscalar processors created the possibility of executing several instructions at a time that could, by applying multithreading techniques, be drawn from several threads, leading to what is commonly known as simultaneous multithreading (SMT). The amount of instruction bubbles that can be covered by SMT is generally greater than with fine or coarse grain multithreading; however, the hardware cost and complexity are also greater.
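The 1/n single-thread throughput of fine grain multithreading noted above can be checked with a toy issue model, sketched below in C. The model simply assumes a barrel-style FGMT pipeline of STAGES stages in which each thread may have at most one instruction in flight; the parameters are illustrative and are not taken from any real machine.

#include <stdio.h>

#define STAGES 8          /* assumed pipeline depth n */
#define CYCLES 100000L    /* simulated cycles */

/* Barrel (FGMT) issue model: each cycle the single issue slot is offered to the
 * threads in round-robin order, and a thread may issue only if its previous
 * instruction has already left the STAGES-deep pipeline. */
static double fgmt_ipc(int threads) {
    long next_free[64] = {0};   /* cycle at which each thread may issue again */
    long issued = 0;
    int start = 0;
    for (long cycle = 0; cycle < CYCLES; cycle++) {
        for (int k = 0; k < threads; k++) {
            int t = (start + k) % threads;
            if (next_free[t] <= cycle) {
                next_free[t] = cycle + STAGES;   /* occupies the pipeline for n cycles */
                issued++;
                start = (t + 1) % threads;
                break;
            }
        }
    }
    return (double)issued / CYCLES;
}

int main(void) {
    /* One thread yields roughly 1/STAGES of peak; STAGES threads fill the pipeline. */
    for (int threads = 1; threads <= STAGES + 2; threads++)
        printf("threads=%2d  IPC=%.3f\n", threads, fgmt_ipc(threads));
    return 0;
}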
Benefit of Multithreaded Processors
Multithreaded Processors (MTPs) achieve higher performance than single-threaded processors by taking advantage of thread level parallelism (TLP) and executing various threads to cover bubbles that get generated in the pipelines. These bubbles are the result of long-latency instructions, data and control dependencies, cache misses, exceptions, etc. By covering these bubbles with instructions from other ready threads, MTPs can achieve much greater efficiency than their single-threaded counterparts. In Fig. , it can be seen how the bubbles that get generated on a single-threaded scalar processor get utilized by a fine grain multithreaded (b) and a coarse grain multithreaded processor (c), whereas the superscalar versions are represented in Fig. .
Multi-Threaded Processors. Fig. (a) Single-threaded scalar, (b) fine grain multithreading scalar, and (c) coarse grain multithreading scalar
Multi-Threaded Processors. Fig. (a) Single-threaded super scalar, (b) fine grain multithreading super scalar, and (c) coarse grain multithreading super scalar
Nowadays, there is a significant number of applications that feature a large quantity of thread level parallelism (TLP), making multithreaded processors a valued practical solution for achieving significant improvements in performance and power. Examples of applications with large TLP include video, gaming,
modeling, simulations, web searching, and web applications. There are two forms of attaining thread level parallelism. One way is to extract TLP from an application by splitting it into separate tasks that can be run in parallel, for example, splitting a video screen into sections and having them processed in parallel (see the sketch below). Additionally, thread level parallelism exists when several independent tasks arrive almost simultaneously; for instance, with network processors, packets arrive very close to each other (i.e., packet bursts) and each packet can be processed in its own thread. Multithreading can also be used to make a single thread run faster. Even if a thread cannot be separated into further parallel tasks, or there are not many threads available, multithreading techniques such as multipath speculative execution and helping threads (explained in section Simultaneous Multithreading) can be used to accelerate the execution of single threads. It is important to differentiate the use of multithreading from chip multiprocessors (CMPs). Multicores are made up of independent processors that normally share only a last-level cache, while a multithreaded processor shares the processor pipeline yet has separate context registers for the threads. Furthermore, multithreading is used to increase processor utilization by filling pipeline bubbles, while multicores simply add processors to the overall chip. In terms of the resources needed to build multithreaded processors compared to multicore processors, the cost of adding a thread is much lower than that of adding a whole separate core. However, there is a saturation point for how many threads can be put into a single core, after which adding more threads can actually cause performance degradation. Therefore, what is commonly used is a blend of multithreading and multicore techniques, as discussed below.
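The first form of TLP mentioned above (splitting one job, such as a video frame, into sections processed in parallel) looks roughly like the following sketch in C with POSIX threads; the frame dimensions, thread count, and per-pixel operation are made up for illustration.

#include <pthread.h>
#include <stdio.h>

#define WIDTH    1920
#define HEIGHT   1080
#define NTHREADS    4     /* one software thread per frame section */

static unsigned char frame[HEIGHT][WIDTH];

typedef struct { int row_begin, row_end; } section_t;

/* Each thread processes one horizontal band of the frame independently. */
static void *process_section(void *arg) {
    section_t *s = (section_t *)arg;
    for (int y = s->row_begin; y < s->row_end; y++)
        for (int x = 0; x < WIDTH; x++)
            frame[y][x] = (unsigned char)(255 - frame[y][x]);  /* illustrative per-pixel work */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    section_t sec[NTHREADS];
    int rows = HEIGHT / NTHREADS;
    for (int i = 0; i < NTHREADS; i++) {
        sec[i].row_begin = i * rows;
        sec[i].row_end   = (i == NTHREADS - 1) ? HEIGHT : (i + 1) * rows;
        pthread_create(&tid[i], NULL, process_section, &sec[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("processed %d rows in %d sections\n", HEIGHT, NTHREADS);
    return 0;
}

Whether these software threads end up on the hardware contexts of a single multithreaded core or on the separate cores of a CMP is invisible to the program; the distinction discussed above is purely one of how the hardware shares pipeline resources.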
Fine Grain Multithreading (FGMT)
In an FGMT processor, in every cycle, instructions from a single thread are issued, and only when those instructions retire can new instructions from that same thread be issued. For example, when a thread accesses memory, the thread does not get scheduled again until the memory transaction has completed. This model requires at least as many threads as pipeline stages in the processor, and
adding a few more threads will allow the processor to tolerate higher latencies.
Multi-Threaded Processors. Fig. Simultaneous multithreading
The first commercial FGMT computer was the Denelcor HEP [, , ]; well before that, however, Control Data Corporation had applied the idea in the Peripheral Processors (PPs) of the CDC 6600 [], with one thread per PP, executing in a round-robin fashion: the execution units would execute one instruction cycle from the first PP, then one instruction cycle from the second PP, and so on. This was done both to reduce costs and because accesses to memory required several PP clock cycles: when a PP accessed memory, the data was available the next time the PP received its time slot. The HEP system shown in Fig. could comprise multiple processors, each holding many threads. It had an eight-stage pipeline that matched the minimum latency delay of memory fetches. Therefore, up to eight threads were in execution at any given time within a single processor. Furthermore, only a single instruction from a given process was permitted to be in the pipeline at any point in time.
Multi-Threaded Processors. Fig. HEP processor pipeline
Delco Electronics, a division of General Motors, introduced the first real-time fine grain multithreaded processor, and perhaps the most successful FGMT processor deployed, known as the Timer Input Output (TIO) []. It was designed as a two-stage pipeline with two levels of fine grain multithreading. The first level used a round-robin method of choosing between the M. Instruction Processor (MIP) and the S. Instruction Processor (SIP). On the second level, the SIP on its own could hold several separate threads, so the issue pattern alternated between the MIP and the SIP threads in turn (MIP, SIP_1, MIP, SIP_2, and so on). Figure shows the SIP scheduler. Other popular FGMT processors were the Horizon [], the Cray MultiThreaded Architecture (MTA) [], the Multilisp Architecture for Symbolic Applications (MASA) [], and the M-Machine [].
Multi-Threaded Processors. Fig. The single instruction processor of the TIO
Coarse Grain Multithreading
In Coarse Grain Multithreading (CGMT), a single thread is executed until it encounters a high-latency event, at which point the hardware automatically performs a context switch to another thread. Examples of long-latency events are page faults and cache misses. In situations in which context switches can be performed very quickly, even the small bubbles generated by events such as branches can be used to trigger a context swap. Even with a small number of threads, high processor utilization can be achieved, while if only
a single thread is being executed, then the processor performance is like that of a single-threaded processor. CGMT is very useful in situations where many different processes need to be run, not all of which have the same priority. A particular example of using CGMT for real-time processing was the DISC []. DISC was a CGMT processor that performed context switching in order to distribute processing power between the active threads. The MSparc [] also targeted real-time systems. Several processors use the CGMT technique: the MIT Sparcle [], the MIT J-Machine [], and the Rhamma processor [].
Simultaneous Multithreading
A superscalar processor in which instructions can be issued from multiple threads in the same cycle is called simultaneous multithreaded (SMT). The figures compare a single-threaded superscalar, a CGMT, an FGMT, and an SMT processor; they make it easy to visualize how these diverse techniques take advantage of the bubbles that get generated in a processor. SMT is the technique with the most flexibility to cover the inefficiencies of single-threaded architectures. An SMT processor has more flexibility than CGMT or FGMT processors because it allows an instruction from any thread to be executed at any time while, in parallel, an instruction from a different thread may also be executed. The first work in which SMT was proposed on top of a standard RISC processor (the IBM RS/6000 POWER) was presented by Yamamoto [, ]. This work extended the concept of fine and coarse grain multithreading (then called multistreaming) to a general-purpose superscalar processor architecture. In that work, an IBM POWER processor was modified to produce an SMT processor capable of running multiple threads simultaneously; an analytical model of the processor was also presented. Another pioneering approach to SMT was the multithreaded processor created in the Media Research Laboratory of Matsushita Electric Ind. (Japan) []. However, it was the SMT processor architecture proposed by D. Tullsen, S. Eggers, and H. Levy [] at the University of Washington, Seattle, WA, which gave the technique its name. Another way to use multithreading is to increase the performance of sequential programs. A processor can spawn threads from a single thread and execute
them speculatively, concurrently with the main thread, in order to accelerate the main program. Several SMT approaches fall in this area, such as simultaneous subordinate microthreading [], multiple path execution [], the multiscalar processor [], and the single-program speculative multithreading architecture []. Another interesting work on SMTs is the Simultaneous Multithreaded Vector (SMV) architecture [], which combines simultaneous multithreaded execution and out-of-order execution with an integrated vector unit and vector instructions.
Multi-Threaded Processors. Fig. Simultaneous multithreaded vector architecture
Simultaneous Multithreading Commercial Processors
Alpha. Compaq unveiled its Alpha EV8 [], a four-threaded, eight-issue SMT processor. The out-of-order execution processor had a large on-chip L2 cache, a direct RAMBUS interface, and an on-chip router for the system interconnect of a directory-based, cache-coherent NUMA (nonuniform memory access) multiprocessor. The chip was planned for introduction a few years later, but the project was abandoned as Compaq sold the Alpha processor technology to Intel. Clearwater Networks unveiled the CNPSP processor [], shown in Fig. . This core incorporates Simultaneous Multithreading (SMT) capabilities, whereby eight threads execute simultaneously, utilizing a variable number of resources on a cycle-by-cycle basis. In each cycle, anywhere from zero to three instructions can be executed from each of the threads, depending on instruction dependencies and the availability of resources. The maximum instructions per cycle (IPC) of the entire core is ten. In each cycle, two threads are selected for fetch and their respective program counters (PCs) are supplied to the dual-ported instruction cache. Each port supplies eight instructions, so there is a maximum fetch bandwidth of 16 instructions. Each of the eight threads has its own instruction queue (IQ). The two threads chosen for fetch in each cycle are the two that have the fewest instructions in their respective IQs. Each of the eight threads also has its own private register file (RF). Since there is no sharing of registers between threads, the only communication between threads occurs through memory. The eight threads are
divided into two clusters of four for ease of implementation. Thus the dispatch logic is split into two groups, where each group dispatches up to six instructions from four different threads. Eight function units are grouped into two sets of four, each set dedicated to a single cluster. There are also two ports to the data cache that are shared by both clusters. A maximum of ten instructions can be dispatched in each cycle. The function units are fully bypassed so that dependent instructions can be dispatched in successive cycles. Clearwater was out of business by the tape-out date.
Multi-Threaded Processors. Fig. The CNPSP processor (Copyright © 2002, Stephen W. Melvin)
The UltraSPARC V, which exploited SMT, was able to switch between two different modes depending on the type of work – one mode for heavy-duty calculations and the other for business transactions such as database operations []. The UltraSPARC T1 [] is a microprocessor that is both multicore and multithreaded. The processor is available with four, six, or eight CPU cores, each core able to handle four threads concurrently; a full configuration is thus capable of processing up to 32 threads concurrently. The UltraSPARC T2 [] is a member of the SPARC family and the successor to the UltraSPARC T1. The processor is available with eight CPU cores, and each core is able to handle eight threads concurrently, so the processor is capable of processing up to 64 concurrent threads. Hyper-Threading Technology [] brought SMT to the Pentium 4-based Intel Xeon processor family, to be used in dual and multiprocessor servers. Hyper-Threading Technology makes a single physical processor appear as two or more logical processors by applying the SMT approach. Each logical processor maintains a complete set of the architecture state, which
consists of the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses. Each logical processor has its own APIC. The interrupts sent to a specific logical processor are handled only by that logical processor.
Multithreading on Power Efficiency
An important issue to discuss is power efficiency on multithreaded computers. As shown in the previous sections, multithreaded processors fill the bubbles that get generated in the pipeline with useful instructions, thus making the power spent per instruction more efficient overall. Speculation, used to achieve high performance on single-threaded machines, consumes power not only at the moment of computation but also at the moment of squashing if a misprediction occurs. Additionally, the hardware complexity of the predictors drains energy, but multithreaded processors can make significant energy-efficiency gains by not needing to implement as many predictors to achieve performance as high as that of single-threaded processors. Furthermore, a significant amount of logic used in superscalar designs, such as out-of-order execution support, can be eliminated; in the case of FGMT, most of this logic can be removed, which in turn represents additional power savings. On the other hand, the additional context state required in multithreaded processors demands more power, but in general this amount of power is not significant. Therefore, the overall power efficiency of multithreaded processors is much better than that of single-threaded processors, which explains why the technique is in use in the top-of-the-line power-efficient processors today.
Pros
The major benefit of multithreading is the improvement of processor and memory utilization. Ideally, as threads are added to the processor, the performance of each running thread remains constant while the total system performance is the sum of the performance of each thread. In real systems, as threads are added, the performance of the running ones degrades. In order to achieve a performance increase through multithreading, it is important that the performance gained by adding a thread be larger than the performance degradation caused to the other threads. In many cases, threads share code and/or data, resulting in positive interference in the caches. In other words, when a thread needs a new piece of code, it finds it in the instruction cache because a co-running thread has already requested it. Additionally, what would normally be a bubble in the execution pipeline can easily be used by a co-running thread to execute instructions without
causing penalties to the main running thread. The hardware complexity of the system can be reduced since it is the total system performance that is sought and not just the single thread performance, making complex microarchitecture designs (e.g., OOO execution, branch predictors, etc.) unnecessary.
Cons
There are also a few issues to take into account before using multithreaded processors. If the performance of a single running thread is what is sought, then the best way to use a multithreaded architecture is to use helping threads or speculative multithreading []. However, in a system where multithreading is used to achieve higher levels of system performance, it is important to note that the individual performance of each thread will, in general, be lower than it would be when running alone. This is due to the fact that interference between threads will be mostly negative, causing thrashing in the caches and TLBs, and therefore increasing the misses and the thread running time. It is essential to manage thread interference by co-running threads with very low negative interference and high positive interference.
Future Directions
Today, multithreaded multicore processors are pervasive in the industry. However, there are many areas that require further study, for instance, how to program these processors, how to map threads to streams, and how to perform runtime debugging. These challenges are the focus of many professionals and academics today, ensuring the extension of multithreading throughout the computing world; it will branch out to cover most areas of processing, such as graphics, DSP, high performance, and low power. In the future, asymmetric multithreaded processors, in which each thread can be specialized for a particular task, will start appearing.
Related Entries
Computer Graphics
Control Data
DEC Alpha
Denelcor HEP
Instruction-Level Parallelism
POSIX Threads (Pthreads)
Parallelization, Automatic
Processes, Tasks, and Threads
Shared-Memory Multiprocessors
Speculation, Thread-Level
Superscalar Processors
Tera MTA
Bibliographic Notes and Further Reading
As stated earlier, multithreaded work began in the 1960s with the Control Data Corporation (CDC) Peripheral Processors; a description can be found in Design of a Computer – The Control Data 6600 []. The first FGMT paper, "A Pipelined, Shared Resource MIMD Computer" by B. J. Smith [], introduced the HEP machine; it was followed a few years later by the Horizon machine, described in "A Processor Architecture for Horizon" by Mark R. Thistle and B. J. Smith []. The DISC architecture, which was a pioneer of SMT, is described in "DISC: Dynamic Instruction Stream Computer," in the proceedings of the annual international symposium on microarchitecture []. There have been several valuable reviews of multithreaded processors. A very complete one can be found in the book Multithreaded Computer Architecture: A Summary of the State of the Art by Robert A. Iannucci, Guang R. Gao, Robert H. Halstead Jr, and Burton Smith []. The book is divided into four parts, covering everything from basic definitions up to compilation techniques. Additionally, Theo Ungerer has given a good examination of multithreaded processors in the survey paper "A Survey of Processors with Explicit Multithreading" [].
Bibliography . Agarwal A et al () Sparcle: an evolutionary processor design for large-scale multiprocessors. IEEE Micro :– . Anderson W () Experiences using the cray multi-thread architecture (MTA ). user group conference (DoD UGC’), Bellevue, ISBN: --- . Chapell R et al () Simultaneous subordinate microthreading (SSMT). In: Proceedings of the th annual international symposiumon computer architecture, Atlanta, GA. IEEE Computer Society Washington, DC, pp – . Dubey P et al () Single-program speculative multithreading (SPSM) architecture: compiler-assisted fine-grain multithreading. Tech. Rep. RC . IBM, Yorktown Heights, NY . Emer J () Simultaneous multithreading: multiplying aspha’s performance. In: Proceedings of the microprocessor forum, San Jose, CA
. Espasa R, Valero M () Exploiting instruction- and data-level parallelism. IEEE MICRO ():– . Grunewald W, Ungerer T () A multithreaded processor designed for distributed shared memory systems. In: Proceedings of the international conference on advances in parallel and distributed computing, APDS-, Shanghai . Halstead Jr RH () MASA: a multithreaded processor architecture for parallel symbolic computing. ACM SIGARCH Computer Architecture News ():– . Hirata H et al () An elementary processor architecture with simultaneous instruction issuing from multiple threads. In: Proceedings of the th international symposium of computer architecture, Gold Coast, Australia. ACM Press, New York, pp – . http://www.opensparc.net . http://www.pcworld.com/article//sun_hints_at_ultrasparc _v_and_beyond.html . http://www.zytek.com/~melvin/clearwater.html . Iannucci R, Gao G, Halstead R Jr, Smith B () Multithreaded computer architecture: a summary of the state of the art. The Springer International Series in Engineering and Computer Science. Kluver, Norwell, ISBN --- . Jordan HF () Performance measurements on HEP – a pipelined MIMD computer. In: Proceedings of the th annual international symposium on computer architecture. IEEE, New York, ISBN --- . Keckler SW, Dally WJ et al () The M-machine multicomputer. In: Proceedings of the th annual international symposium on computer architecture, Ann Arbor, MI. ACM Press, New York, pp – . Marr D et al () Hyper-threading technology architecture and microarchitecture: a hypertext history. Intel Technol J () . Metzner A, Niehaus J () MSparc: multithreading in real-time architectures. J Univ Comput Sci ():– . Nemirovsky et al () Microprogrammed timer processor. US patent . Nemirovsky M, Brewer F, Wood R () DISC: dynamic instruction stream computer. In: Proceedings of the th annual international symposium on microarchitecture. Albuquerque, New Mexico . Noakes MD, Wallach DA, Dally WJ () The J-machine multicomputer: an architectural evaluation. In: Proceedings of the th annual international symposium on computer architecture, San Diego, CA. ACM Press, New York, pp – . Smith B () On parallel MIMD computation: HEP supercomputer and its applications. Massachusetts Institute of Technology, Cambridge, MA, ISBN --- . Smith BJ () A pipelined, shared resource MIMD computer. In: Proceedings of the international conference on parallel processing. IEEE press, Piscataway, pp – . Sohi G, Breach S, Vijaykumar T () Multiscalar processor. In: Proceedings of the nd annual international symposium on computer architecture, Santa Margherita Ligure, Italy, June . Kluwer, Norwell, pp –
. Thistle MR, Smith BJ () A processor architecture for horizon. In: Proceedings of the ACM/IEEE conference on supercomputing, Orlando, FL. IEEE Computer Society Press, Los Alamitos, pp – . Thornton JE () Design of a computer – the control data . Scott, Foresman and Co, Glenview . Tullsen D, Eggers S, Levy H () Simultaneous multithreading: maximizing on chip parallelism. In: Proceedings of the nd annual international symposium on computer architecture, Santa Margherita Ligure, Italy, June . Kluwer, Norwell, pp – . Ungerer T, Bobic B, Silc J () A survey of processor with Explicit Multithreading. ACM Comput Surv (CSUR) ():– . Wallace S, Calder B, Tullsen D () Threaded multiple path execution. In: Proceedings of the th annual international symposiumon computer architecture, Barcelona, Spain. IEEE Computer Society Washington, DC, pp – . Yamamoto W, Nemirovsky M () Increasing superscalar performance through multistreaming. In: Proceedings of the international conference on parallel architectures and compilation techniques, Limassol, Cyprus. IEEE Computer Society Washington, DC, pp – . Yamamoto W, Serrano M, Wood R, Nemirovsky M () Performance estimation of multistreamed superscalar processors. In: Proceedings of the Hawaii conference on systems and science, Hawaii, January . IEEE Computer Society, Los Alamitos, pp –
Mumps
Patrick Amestoy, Alfredo Buttari, Iain Duff, Abdou Guermouche, Jean-Yves L’Excellent, Bora Uçar
Université de Toulouse ENSEEIHT-IRIT, Toulouse, France; Science & Technology Facilities Council, Didcot, Oxfordshire, UK; Université de Bordeaux, Talence, France; ENS Lyon, Lyon, France
Synonyms MUMPS
Definition
MUMPS (MUltifrontal Massively Parallel Solver) is a parallel library for the solution of sparse linear equations. It primarily targets parallel platforms with distributed memory, where the message-passing paradigm MPI is used. MUMPS is a direct code, based on Gaussian elimination. It will solve sparse linear systems with a real unsymmetric, symmetric positive definite, or symmetric indefinite coefficient matrix and will solve complex systems where the matrix is unsymmetric or complex symmetric. MUMPS has a large number of options, some to enhance functionality and some to improve performance or core memory usage. Whereas most direct solvers for distributed-memory environments rely on static approaches where the computational tasks are known and assigned to the processors in advance, one of the main originalities of MUMPS is its ability to perform dynamic pivoting in order to guarantee numerical stability, leading to dynamic data structures and not fully predictable task graphs. In order to handle these dynamic aspects, the management of parallelism is based on a fully asynchronous approach, and the computational tasks are defined and scheduled dynamically at runtime.
Discussion
Introduction
MUMPS (MUltifrontal Massively Parallel Solver) is a direct code, based on Gaussian elimination, for the solution of sparse linear equations on distributed-memory machines, although a serial version is available. With a direct code, the first step in solving a linear system of equations of the form Ax = b is to perform the matrix factorization PAQ = LU, where P and Q are permutation matrices, L is a lower triangular matrix, and U is an upper triangular matrix. The solution is then obtained through a forward elimination step Ly = Pb, followed by the back-substitution UQ^T x = y.
When A is sparse, that is, when there are many zero entries in A, the computation will be organized so that L and U are also sparse, but there are in general significantly more nonzeros in L and U than in A.
If the matrix A is symmetric, then the factorization PAP^T = LDL^T
is used, where D is a block diagonal matrix with blocks of order 1 or 2 on the diagonal. The blocks of order two are required for stability when factorizing symmetric indefinite matrices. MUMPS uses the LU factorization when the matrix is unsymmetric and the LDL^T factorization when the matrix is symmetric. Before discussing the MUMPS code in more detail in the following sections, it is worth recalling how the steps given above are implemented within sparse direct codes. In most sparse direct codes, this is done in three steps, namely:
Analysis: In this step of the computation, the sparsity pattern of A (or, when the matrix is unsymmetric, the pattern of ∣A∣ + ∣A∣^T, where ∣A∣ is the matrix whose entries are the moduli of the corresponding entries in A) is used to determine an ordering to maintain sparsity in the triangular factors, L and U. This is called the ordering phase. The analysis step then continues by organizing data structures so that the subsequent numerical factorization can be computed efficiently. The main part of this phase is often called symbolic factorization.
Factorization: In the next step, the factors L and D (or L and U in the unsymmetric case) are computed using the information from the analysis step. This is the most computationally intensive step and generally takes the most time to execute, although much work has been done to run this efficiently on parallel computers. If the matrix is positive definite, it is usually possible just to follow exactly the pivoting sequence obtained from the ordering phase. If, however, the matrix is unsymmetric or symmetric indefinite, then care must be taken to avoid numerical instability.
Solve: In this last step, the factors from the factorization step are used to solve the system through the forward and back substitutions shown above. While this step is computationally cheap, involving effectively only two arithmetic operations for each entry in the factors, it suffers greatly in performance because of the amount of data movement relative to this small amount of arithmetic.
Sparse direct codes differ in the way these steps are performed. MUMPS uses a multifrontal approach to effect these steps. The details of this approach are
discussed in another entry of this Encyclopedia. Later in this entry, a short history of the multifrontal developments that led to the first version of MUMPS is given at the beginning of section Origins of MUMPS. In the remainder of this section, the way MUMPS realizes the three steps indicated above will be discussed. MUMPS is as much a research project as a code, and this aspect is discussed in section Further Comments.
General Comments
As said in the Definition, MUMPS is a parallel code that uses MPI for message passing. It will solve linear systems with a real unsymmetric, symmetric positive definite, or symmetric indefinite coefficient matrix and will solve complex systems where the matrix is unsymmetric or complex symmetric. Over the years, a large number of options have been developed, some to enhance functionality and some to improve performance. The range of functionalities of MUMPS is far greater than that of any other sparse direct code at the present time. In addition to handling the very wide range of matrices listed above, MUMPS accepts the input matrix as a distributed assembled matrix or in elemental format, that is to say, the matrix is represented as an expanded sum of small dense matrices, each based on a single finite element or on a group of such elements. The parallel model used by MUMPS relies on a dynamic and distributed management of the computational tasks (dynamic scheduling): each process is responsible for providing work to some other processes while at the same time acting as a worker for others. One outcome of such a distributed management is that the work is performed asynchronously, so that computation and communication are very much overlapped. Since the main pivoting algorithms in the symmetric indefinite and unsymmetric cases will alter the order of computation of the factors and will usually change the data structures, MUMPS has from its early days always used dynamic scheduling []. While initially required to accommodate numerical pivoting, the dynamic scheduling approach is also a powerful tool for the control of memory usage and for efficiency in a multiuser environment where the load on the processors can have a significant influence on performance. In fact, much of the research activity related to MUMPS (summarized in section Further Comments) has been concerned with this aspect of the package.
Options exist to run the analysis step in parallel, including a priori scaling of the matrix, and both the factorization and solution phases can be run in parallel although uniprocessor versions for all phases are also available.
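The three steps of the Introduction map directly onto the job values of the MUMPS driver. The following C sketch, modeled on the small example driver distributed with the library, solves a 2 x 2 system with the double-precision interface; the structure-member names (for example nz versus nnz for the number of matrix entries) and the control-parameter settings vary between MUMPS versions, so this should be treated as an assumption to be checked against the c_example shipped with your installation.

#include <stdio.h>
#include <mpi.h>
#include "dmumps_c.h"          /* double-precision MUMPS C interface */

#define JOB_INIT -1
#define JOB_END  -2
#define USE_COMM_WORLD -987654
#define ICNTL(I) icntl[(I)-1]  /* map 1-based Fortran ICNTL indices to C */

int main(int argc, char **argv) {
    DMUMPS_STRUC_C id;
    /* A = [2 1; 1 3], b = [5; 10], in 1-based coordinate (triplet) format. */
    MUMPS_INT n = 2, irn[] = {1, 1, 2, 2}, jcn[] = {1, 2, 1, 2};
    double a[] = {2.0, 1.0, 1.0, 3.0}, rhs[] = {5.0, 10.0};
    int myid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    id.comm_fortran = USE_COMM_WORLD;
    id.par = 1;                /* the host process also takes part in the factorization */
    id.sym = 0;                /* unsymmetric matrix: LU factorization */
    id.job = JOB_INIT;
    dmumps_c(&id);             /* initialize a MUMPS instance */

    if (myid == 0) {           /* centralized input: matrix and right-hand side on the host */
        id.n = n; id.nz = 4;
        id.irn = irn; id.jcn = jcn; id.a = a; id.rhs = rhs;
    }
    id.ICNTL(1) = -1; id.ICNTL(2) = -1; id.ICNTL(3) = -1;   /* suppress diagnostic output */

    id.job = 6;                /* 6 = analysis + factorization + solve in one call */
    dmumps_c(&id);

    if (myid == 0)
        printf("solution: %f %f\n", rhs[0], rhs[1]);   /* rhs is overwritten with x = (1, 3) */

    id.job = JOB_END;          /* release MUMPS internal data */
    dmumps_c(&id);
    MPI_Finalize();
    return 0;
}

Running the analysis (job = 1), factorization (job = 2), and solve (job = 3) as separate calls, instead of the combined job = 6, allows one factorization to be reused for several right-hand sides.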
Analysis Step
As discussed in the Introduction, the analysis step consists of two main phases: the selection of a pivot ordering to maintain sparsity and the creation of data structures for the subsequent factorization and solution steps (the symbolic factorization). In MUMPS, the symbolic factorization phase produces the assembly tree that is used in the multifrontal factorization. In the analysis step, MUMPS performs a static scheduling phase that distributes the nodes of the tree to the processes. This static phase attempts to balance arithmetic and storage over the processors but, for the reasons mentioned in the previous section, will need to be followed by dynamic scheduling in the factorization step. In particular, because of limited tree parallelism near the top of the tree, more than one process is required for each node that appears in that area: the process in charge of such a node will dynamically select other processes that will participate in the computations at that node, possibly among a set of candidate processors [] chosen at analysis time. Because one of the features of MUMPS is that the execution should fit within a memory area allocated at the beginning of the factorization, an accurate estimate of the memory required for the numerical factorization is needed. Such an estimate is computed by each process at the analysis phase through a "symbolic" traversal of the tree, where the algorithm mimics the memory requirements that will occur at the factorization step. This feature is available for both parallel and sequential execution. The accuracy of the estimate degrades depending on the amount of numerical pivoting that may happen during the numerical factorization, and on the dynamic behavior of the parallel factorization (which is by nature difficult to predict). In practice, allowing a small percentage increase over the estimate is sufficient. For the ordering phase, there are several sparsity ordering methods available to MUMPS users, and there is an option for automatically choosing an ordering method based on the size and density of the matrix, the ordering packages available to MUMPS at installation time, and the number of processors being used. The
user can also input his or her own pivotal order. Note that the ordering option can have a significant impact on the shape of the assembly tree and on the memory requirements during the numerical phases [, ]. A range of ordering options is supported natively by MUMPS, including the AMD (Approximate Minimum Degree) method and some variants of it, AMF (Approximate Minimum Fill), QAMD (Approximate Minimum Degree with detection of quasi-dense rows), and PORD (Paderborn ORDering). Graph partitioning-based orderings are also supported by means of external software packages such as METIS or SCOTCH; these tools are usually capable of providing better orderings (based on nested dissection) for large-scale, three-dimensional problems. They are also recommended in the parallel case because they lead to reasonably wide trees with more parallelism than the ordering options natively supported by MUMPS. Note that after the assembly tree is built, there is still room to improve the memory behavior of the solver by computing and using a tree traversal that is optimal in terms of memory usage. This is done within MUMPS by using a variant of the algorithm proposed by Liu []. In the early days of the development of parallel sparse direct codes, little attention was paid to the analysis step because of its low cost relative to the factorization and solve phases. However, for two reasons this has become a much more important concern in recent years. One reason is that the efficient parallelization of the factorization step means that, if the analysis is still run on one processor, then the elapsed time for this can be significant or even dominant if many processors are used for the factorization. The second reason is that on some multiprocessor machines there is insufficient memory on a single processor to hold even the integer information for the whole matrix. Thus MUMPS now offers a parallel option for the analysis step []. After an initial redistribution of the matrix, a parallel, nested-dissection-based tool, such as PT-SCOTCH or ParMETIS, is used to compute a fill-reducing pivotal order. According to the partitioning induced by the separators computed during the ordering step, the subsequent symbolic factorization is executed in parallel through an algorithm based on the use of distributed quotient graphs. This approach has proved to be very effective in reducing the memory requirements and in improving the elapsed time for the analysis phase.
Most of the analysis step requires only the pattern of the matrix. However, scaling is a preprocessing step for improving some of the numerical properties of the matrix, and this also requires the numerical values. Among a number of scaling alternatives available in MUMPS, the following two are the most powerful. The first uses linear programming duality, where the scaling operators are obtained from weighted bipartite matching algorithms. The computation of this scaling alternative is performed sequentially on a central processor. A particular example of this is in the indefinite case, where scaling can be used to identify potential two-by-two pivots, allowing the analysis to be continued on a reduced graph obtained by collapsing the nodes corresponding to those pivots [, ]. The second scaling is an iterative algorithm. At each iteration of this algorithm, each row and column of the current matrix is scaled so that it is normalized according to a predefined norm, where the scaling of the rows and the scaling of the columns are independent. The scaling for this transformation is then combined with the scaling at the previous iteration in order to obtain a scaling for the original matrix. This algorithm is implemented in MUMPS for distributed-memory environments in such a way that the communication and computation requirements of an iteration are equivalent to the sum of those for sparse matrix-vector and sparse matrix transpose-vector multiplication operations with the same matrix.
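A serial, dense-matrix sketch of this second, iterative scaling is shown below; it conveys only the idea (alternately normalizing rows and columns and accumulating the scaling factors), while the norm, the fixed sweep count, and the in-place update are assumptions of the sketch rather than details of the distributed sparse implementation in MUMPS.

#include <math.h>
#include <stdio.h>

#define N 3
#define SWEEPS 10

/* Alternately normalize every row and every column in the infinity norm,
 * accumulating row scalings dr and column scalings dc so that
 * diag(dr) * A_original * diag(dc) is (approximately) equilibrated. */
static void equilibrate(double A[N][N], double dr[N], double dc[N]) {
    for (int i = 0; i < N; i++) dr[i] = dc[i] = 1.0;
    for (int sweep = 0; sweep < SWEEPS; sweep++) {
        for (int i = 0; i < N; i++) {                 /* scale rows */
            double m = 0.0;
            for (int j = 0; j < N; j++) m = fmax(m, fabs(A[i][j]));
            if (m > 0.0) {
                for (int j = 0; j < N; j++) A[i][j] /= m;
                dr[i] /= m;
            }
        }
        for (int j = 0; j < N; j++) {                 /* scale columns */
            double m = 0.0;
            for (int i = 0; i < N; i++) m = fmax(m, fabs(A[i][j]));
            if (m > 0.0) {
                for (int i = 0; i < N; i++) A[i][j] /= m;
                dc[j] /= m;
            }
        }
    }
}

int main(void) {
    double A[N][N] = {{1e6, 2.0, 0.0}, {3.0, 4e-3, 1.0}, {0.0, 5.0, 6e2}};
    double dr[N], dc[N];
    equilibrate(A, dr, dc);                           /* A now holds the scaled matrix */
    for (int i = 0; i < N; i++)
        printf("%12.4e %12.4e %12.4e\n", A[i][0], A[i][1], A[i][2]);
    return 0;
}

In the distributed setting, each sweep requires the communication and computation of one sparse matrix-vector and one sparse matrix transpose-vector product with the same matrix, which is what makes the algorithm attractive in parallel.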
Factorization Step
The factorization step uses the tree and subsidiary information from the analysis to effect the numerical factorization. As mentioned earlier, a main feature of this phase is the dynamic scheduling that is used to balance work and storage at execution time when the initial mapping is not appropriate because of numerical or load-balance considerations [, ]. So that each processor can take appropriate dynamic scheduling decisions, an up-to-date view of the load and current memory usage of all the other processors must be maintained. This is done using asynchronous messages, as described in []. The nodes of the tree can be processed on a single processor or can be split between several processors. In MUMPS, a one-dimensional splitting and mapping can be used within intermediate nodes and a two-dimensional splitting can be used at the root node. The root node is by default factorized using ScaLAPACK.
In the factorization step, before the actual numerical factorization, there is a range of options available to scale the matrix, including a version that runs in parallel. The most important scaling options were described when discussing the analysis step, but it is interesting to perform scaling at the factorization step rather than at the analysis step when successive factorizations are performed on matrices that share the same structure. In addition to using standard threshold pivoting (see the article on the Multifrontal Method in this Encyclopedia) for indefinite and unsymmetric systems, there is an option to use a static pivoting approach where pivots that are too small to satisfy a threshold criterion are augmented to avoid instability. In this case, one can avoid the perturbations to the original analysis ordering caused by numerical pivoting, making the performance closer to that obtained for positive definite matrices. However, the factorization will be that of a perturbed matrix and so an iterative refinement method or other more powerful iterative method must later be used to obtain a satisfactory solution. The use of threshold partial pivoting may lead to a dynamic modification of the size of the tasks that have to be processed. Indeed, pivots which, for reasons of numerical stability, cannot be eliminated from the frontal matrix at a given node of the assembly tree are delayed to the parent node, inducing an increase in its size and cost. This issue requires complex memory management mechanisms dealing with dynamic data structures. In addition, the fact that the cost of the tasks may differ from the predictions made during the analysis phase requires the scheduling mechanisms to dynamically adapt the mapping. This is done within MUMPS by using dynamic schedulers which complete the partial scheduling produced at the analysis phase (see [] for more details). This enables the code to react to load variations due to delayed pivots. In addition to a standard, complete factorization of the matrix, other options are available in MUMPS. It is possible to request a partial factorization of the matrix in order to compute the Schur complement. Consider a matrix ordered and partitioned as

    [ A_11  A_12 ]
    [ A_21  A_22 ]

and assume that the elimination is restricted to choose the pivots from the (1,1) block. Then the factorization
will proceed to compute the LU (or LDL^T) factorization of A_11 and to compute the Schur complement matrix A_22 − A_21 A_11^{-1} A_12, which can then be returned to the user, either centralized on one processor or distributed over the processors. This partial factorization feature is particularly powerful for problems where partitions of the above form are used, for example when using domain decomposition. Options exist to solve the whole system after the Schur complement system has been solved. There is also an option to detect null or almost null pivots and compute the numerical rank of the matrix. In cases where the matrix is rank deficient, vectors from the null space can be returned at the solve step. One of the main issues that commonly arises during the factorization of large-scale problems concerns the amount of main memory required. Although parallelism represents a natural way of reducing the local memory requirements, this may still not be enough. Operating systems provide mechanisms that can transparently handle cases where an application requires more memory than is available on the system. These techniques, commonly referred to as virtual-memory techniques, consist in extending the main memory using disk storage. Portions of data in the main memory, so-called pages, are swapped out of main memory to the disk when they are unlikely to be used soon (for example, based on a least-recently-used policy). They are brought back to main memory if later access to them is required. Although these techniques apply transparently to any application, relying on them normally results in considerable performance loss, since the swapping of pages is done according to general policies that cannot fully exploit the access pattern to data in an application. Instead of this mechanism, MUMPS provides an out-of-core feature [] that greatly reduces main memory consumption by saving the factors on disk. Although not available in the current version of MUMPS, specific strategies for the traversal of the elimination tree could be used in order to limit the amount of data moved to and from the disk []. Also, because of a clever use of asynchronous I/O operations which allows an overlap of computations with disk accesses, the out-of-core feature has only a relatively small impact on the performance of the factorization phase while delivering great benefits in terms of memory consumption. The numerical robustness provided by partial pivoting
used in MUMPS is fully guaranteed even when the out-of-core feature is enabled.
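Where the Schur complement of the partial factorization described earlier in this section comes from can be seen from the standard block-elimination identity (ordinary linear algebra, not something specific to MUMPS), written here in LaTeX:

\[
\begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0\\ A_{21}U_{11}^{-1} & I \end{pmatrix}
\begin{pmatrix} U_{11} & L_{11}^{-1}A_{12}\\ 0 & S \end{pmatrix},
\qquad
S = A_{22} - A_{21}A_{11}^{-1}A_{12},
\]

where A_11 = L_11 U_11 is the factorization restricted to the (1,1) block. MUMPS computes the factors of A_11 and the matrix S, and returns S to the user.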
Solve Step
The solution phase, that is, the forward and back substitution using the matrix factors, also exploits parallelism, relying on the computational pattern defined by the elimination tree. Performance gains can also be achieved when there are multiple right-hand sides. The resulting solution vector(s) can be left as a distributed vector or can be returned as a centralized vector. If the matrix is unsymmetric, the solution of the transpose equations can be obtained using the same factorization. Considerable performance gains can be achieved in MUMPS when sparse (perhaps multiple) right-hand sides are supplied, since the solution phase then only requires parts of the factors. In fact, nonzero values in the right-hand sides induce paths in the elimination tree which identify the portions of the factors involved in the solution phase. In the case of multiple sparse right-hand sides, MUMPS uses grouping strategies in order to optimize the exploitation of such paths. In an out-of-core context, the grouping of sparse right-hand sides aims to maximize the overlap of these paths, which is equivalent to minimizing the amount of data loaded from disk, and thus the number of expensive I/O operations []. As mentioned above, the solve step of MUMPS also provides an option for returning either all or specific vectors from the null-space basis.
Postprocessing Step In common with many of its precursors, MUMPS offers further information on the solution after it has been computed. Options exist to return information on the residuals, the condition number of the matrix, and an estimate of the error. Iterative refinement can be used to improve the solution by solving equations of the form Aδx = r, where r is the residual r = b − Ax, and then correcting the solution by δx.
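The refinement loop just described is simple enough to sketch directly. In the fragment below, sparse_matvec() and solve_with_factors() are hypothetical callbacks standing in for the user's sparse matrix-vector product and for a call back into the solver's solve phase with the factors already computed.
C code
    #include <stdlib.h>

    /* A few steps of iterative refinement for Ax = b, reusing existing factors.
       sparse_matvec(x, y) computes y = A*x; solve_with_factors(v) overwrites v
       with A^{-1} v using the previously computed factorization.              */
    void iterative_refinement(int n, const double *b, double *x, int steps,
                              void (*sparse_matvec)(const double *, double *),
                              void (*solve_with_factors)(double *))
    {
        double *r = malloc(n * sizeof *r);
        for (int s = 0; s < steps; s++) {
            sparse_matvec(x, r);                   /* r = A x          */
            for (int i = 0; i < n; i++)
                r[i] = b[i] - r[i];                /* r = b - A x      */
            solve_with_factors(r);                 /* r = delta x      */
            for (int i = 0; i < n; i++)
                x[i] += r[i];                      /* x = x + delta x  */
        }
        free(r);
    }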
Origins of MUMPS MUMPS started in with the PARASOL Project which was an Esprit IV European project. MUMPS
implements a multifrontal method (see the Related Entry) and was inspired by an experimental prototype of an unsymmetric multifrontal code for distributedmemory machines using PVM []. That experimental prototype was itself inspired by the code MA41, developed by Amestoy during his Ph.D. thesis [] at CERFACS under the supervision of Duff. MA41 exploits vector computers and shared-memory multiprocessor environments to solve unsymmetric linear systems. Going back further in the past, MA41 started from the work of Duff, who designed a parallel version of the unsymmetric code from Duff and Reid (HSL code MA37 [] developed at the Harwell Laboratory). This parallel code was run primarily on the Alliant FX/ []. The PARASOL Project consisted of a consortium of six software providers that included teams at CERFACS, INPT(ENSEEIHT)-IRIT, and RAL, and four application-based users of the software. MUMPS, so named because of the target of massive parallelism using the MPI message passing protocol, was the sole direct code in the project which also included linear solvers based on domain decomposition (including a FETI implementation) and multigrid. The first functionalities of MUMPS were developed to meet the requirements of the industrial partners most of whom worked with finite-element models []. All the software produced by this project was in the public domain although by the end of the project, in September , most were still almost at a prototype level. Since the first prototype of MUMPS at the end of the PARASOL project (), MUMPS has been supported by research institutions (CERFACS, CNRS, INPT-IRIT and INRIA) and projects (academic or industrial), that have helped to develop new functionalities over the years. This development of the MUMPS package (including development of new features, maintenance, integration of research work, support to users and project management) has been led by the teams of Patrick Amestoy (INPT) at IRIT Toulouse and JeanYves L’Excellent (INRIA) at LIP-ENS Lyon. Some developments also involve members of the LaBri laboratory in Bordeaux, France, and of the French research institute CNRS, while CERFACS has contributed by financing research related activities through Ph.D. grants. Both postdoctoral and engineer positions have been directly financed by CNRS and INRIA, and MUMPS
has also been supported by a large number of contracts, projects, and grants, including:
● International projects, such as the France-Berkeley Fund in collaboration with the University of California, Berkeley; the Egide-Aurora project in collaboration with the University of Bergen, Norway; and the Multicomputing project in collaboration with the University of Tel Aviv, Israel
● Projects financed by research institutions such as CNRS, INRIA, or ANR, for example the ANR-SOLSTICE project
● Contracts financed by private institutions and industries such as Samtech
Further Comments MUMPS is both a software platform and a research project that has developed many new and innovative techniques for the parallel solution of large sparse linear systems. Several Ph.D. theses have been obtained by students directly connected to MUMPS [, , , ]. Some of the major current research interests include further work on out-of-core implementation, on memory scalability, on sparse right-hand sides, and on combinatorial aspects related to preprocessing and analysis. One way in which this stimulating symbiosis between code development and research can be seen is in the restricted availability of prototype versions of the code and the use of experimental parameters which both help the research efforts in addition to acting as a beta-test for new features. MUMPS is coded in Fortran and C and uses MPI for message passing. It calls standard codes from the BLAS, BLACS, and ScaLAPACK so that some additional fine multi-threading can be obtained by using appropriate versions of the BLAS. Single and double precision versions are supported for both real and complex systems. MUMPS supports all variants of UNIX systems including Linux, and Microsoft Windows thanks to a port developed by MUMPS users. There are interfaces to MUMPS from Fortran , C, C++, MATLAB and SCILAB. The code will run on -bit architectures and exploits long integers that are becoming increasingly needed as the size of problems being solved by direct solvers continues to increase. The MUMPS software is continually updated both to correct any bugs and also to add new functionality.
The current version (as of November ) is Release .., which is in the public domain. Up-to-date information including the MUMPS Users’ Guide and information on downloading the package can be obtained through the web pages http://graal.ens-lyon.fr/MUMPS or http://mumps.enseeiht.fr/. MUMPS is currently downloaded by approximately a , users per year from these web sites. MUMPS is interfaced or included (and redistributed) within various commercial and academic packages such as Samcef from Samtech, FEMTown from Free Field Technologies, Code_Aster from EDF, IPOPT, PETSc (see the PETSc (Portable, Extensible Toolkit for Scientific Computation) entry in this Encyclopedia).
Related Entries Dense Linear System Solvers Linear Algebra Software Linear Algebra, Numerical Multifrontal Method PARDISO PETSc (Portable, Extensible Toolkit for Scientific Computation) SuperLU
Bibliography . Agullo E () On the out-of-core factorization of large sparse matrices. PhD thesis, École Normale Supérieure de Lyon, France, November . Agullo E, Guermouche A, L’Excellent J-Y () A parallel outof-core multifrontal method: storage of factors on disk and analysis of models for an out-of-core active memory. Parallel Comput (–):– . Agullo E, Guermouche A, L’Excellent J-Y () Reducing the I/O volume in sparse out-of-core multifrontal methods. SIAM J Sci Comput, to appear . Amestoy P, Duff I, Guermouche A, Slavova T () Analysis of the solution phase of a parallel multifrontal approach. Parallel Comput . doi: ./j.parco... . Amestoy PR () Factorization of large sparse matrices based on a multifrontal approach in a multiprocessor environment. INPT PhD thesis TH/PA//, CERFACS, Toulouse, France . Amestoy PR, Buttari A, L’Excellent J-Y () Towards a parallel analysis phase for a multifrontal sparse solver. June . Presentation at the th International workshop on Parallel Matrix Algorithms and Applications (PMAA’) . Amestoy PR, Duff IS, L’Excellent J-Y () Multifrontal solvers within the PARASOL environment. In: Kågström B, Dongarra J, Elmroth E, Wa´sniewski J (eds) Applied parallel computing, PARA’, Lecture Notes in Computer Science, No. , pp –, . Springer, Berlin
. Amestoy PR, Duff IS, L’Excellent J-Y, Koster J () A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J Matrix Anal A ():– . Amestoy PR, Duff IS, Vömel C () Task scheduling in an asynchronous distributed memory multifrontal solver. SIAM J Matrix Anal A :– . Amestoy PR, Guermouche A, L’Excellent J-Y, Pralet S () Hybrid scheduling for the parallel solution of linear systems. Parallel Comput ():– . Duff IS () Parallel implementation of multifrontal schemes. Parallel Comput :– . Duff IS, Pralet S () Strategies for scaling and pivoting for sparse symmetric indefinite problems. SIAM J Matrix Anal A ():– . Duff IS Pralet S () Towards stable mixed pivoting strategies for the sequential and parallel solution of sparse symmetric indefinite systems. SIAM J Matrix Anal A (): – . Duff IS, Reid JK () The multifrontal solution of unsymmetric sets of linear systems. SIAM J Sci Stat Comput :– . Espirat V () Développement d’une approche multifrontale pour machines a mémoire distribuée et réseau hétérogène de stations de travail. Technical Report Master thesis, ENSEEIHT-IRIT . Guermouche A () Études et optimisation du comportement mémoire dans les méthodes parallèles de factorisation de matrices creuses. PhD thesis, ENS Lyon, France . Guermouche A, L’Excellent J-Y () A study of various load information exchange mechanisms for a distributed application using dynamic scheduling. In: th International Parallel and Distributed Processing Symposium (IPDPS’) . Guermouche A, L’Excellent J-Y () Constructing memoryminimizing schedules for multifrontal methods. ACM T Math Software ():– . Guermouche A, L’Excellent J-Y, Utard G () Impact of reordering on the memory of a multifrontal solver. Parallel Comput ():– . Liu JWH () On the storage requirement in the out-of-core multifrontal method for sparse factorization. ACM T Math Software ():– . Pralet S () Constrained orderings and scheduling for parallel sparse linear algebra. Phd thesis, Institut National Polytechnique de Toulouse, September . CERFACS Technical Report, TH/PA// . Slavova Tz () Parallel triangular solution in the out-of-core multifrontal approach for solving large sparse linear system. PhD thesis, Institut National Polytechnique de Toulouse, . CERFACS Technical Report, TH/PA//
MUMPS Mumps
Myrinet
Mutual Exclusion Path Expressions Synchronization
Myri-G Myrinet
Myricom Myrinet
Myrinet Scott Pakin Los Alamos National Laboratory, Los Alamos, NM, USA
Synonyms Interconnection network; LANai; Myri-G; Myricom; Network architecture
Definition Myrinet is a family of high-performance network interface cards (NICs) and network switches produced by Myricom, Inc. (Myrinet and Myricom are registered trademarks of Myricom, Inc.) Some salient features of Myrinet are that it utilizes wormhole switching and emphasizes fully programmable NICs and simple switches.
Discussion History The Myrinet network [] is a direct descendent of Caltech’s Mosaic C multicomputer [] and USC/ISI’s ATOMIC local-area network (LAN) []. The Mosaic C was based on custom processors that integrated a network interface, router, and simple CPU onto a single die. Although the Mosaic C processor was intended to be used as the building block for a massively
parallel processor, the ATOMIC project investigated using Mosaic C processors in a LAN setting as intelligent NICs – protocol processors – as a means for providing high-speed communication to nodes with more powerful CPUs. The project was deemed a success, and members of the Caltech and USC/ISI teams formed Myricom, Inc. to enhance and commercialize the technology. The first Myrinet product was released in and utilized separate NICs and switches, in contrast to the Mosaic C’s integrated system. The NICs initially comprised a -bit microcoded processor called the LANai and KiB of static random-access memory (SRAM). As in ATOMIC, the NICs resided on an I/O bus. The switches were based on - or -port crossbars. Connectivity was via copper cables, and the link data rate was Mb/s (. MiB/s) in each direction. The second-generation Myrinet network () replaced the -bit microcoded LANai . with a bit pipelined LANai , increased the NIC’s on-board SRAM to MiB, made some physical changes to the cabling (supporting both “LAN” and “SAN” configurations and both copper and optical cables) and to the signaling and encoding mechanisms, doubled the link data rate to , Mb/s (. GiB/s) in each direction, and added support for the PCI bus. -port Myrinet switches became available. During Myrinet’s second generation, the LANai processor went through numerous performance enhancements and major revisions, ending with the LANai , whose core is still in use in recent Myrinet products. The third-generation product, Myrinet , appeared in and further increased the data rate to , Mb/s (. GiB/s) in each direction. Within that same generation, Myricom began supplying a “converged” architecture that continued to support the extant Myrinet cables and protocols while adding support for Gigabit Ethernet cables and protocols. This required altering most parts of the LANai (most noteworthily excluding the processor proper) to support multiple interfaces, resulting in the LANai XM, XP, and XP, each with a different tally of host and external interfaces. Support for the PCI-X I/O bus was also added. Finally, the fourth-generation Myrinet product, Myri-G, increased the data rate to , Mb/s (. GiB/s), added support for the PCI Express bus,
Myrinet. Fig. Myrinet's popularity in the HPC community (number of Top500 installations, November 1998 through June 2010)
and upgraded the convergence products from Gigabit Ethernet to Gigabit Ethernet via new LANai ZE, ZS, and ZES implementations. According to the Top list (http://www.top. org/), Myrinet’s popularity in the high-performance computing (HPC) community peaked in November with of the world’s fastest supercomputers using a Myrinet network. However, as the Top data show, Myrinet’s presence in the Top list dropped rapidly starting in (Fig. ). At the time of this writing (June ), InfiniBand and Gigabit Ethernet have largely cornered the HPC interconnect market.
Myrinet. Fig. A fully populated -port Myri-G enclosure (reproduced with permission from http://www.myri.com/Myri-G/switch/configuration_guide.html)
Hardware A Myrinet network consists of NICs and switches. Each computer on the network contains a NIC (possibly more than one). NICs are responsible for transferring data (bidirectionally) between main memory and the nearest switch. Switches are responsible for transferring data from their input ports to their output ports, where a port can be connected to either a NIC or another switch. Within a chassis, switches are connected to each other via a backplane, and between chassis, switches are connected via cables.
Switches Each individual switch contains some routing logic and a crossbar (a -port crossbar in the Myri-G product). In principle, switches support arbitrary network topologies. That is, as far as the switch hardware is concerned, any switch can be connected to any other switch or NIC. This was originally convenient for local area network (LAN) settings, in which desktop computers
could simply be connected to the nearest switch without concern for topology. Today, Myrinet switches come in integrated chassis pre-wired into a regular, folded-Clos topology. In an HPC environment, multiple Myrinet chassis are usually also networked into a folded-Clos topology to provide high bisection bandwidth. Figure shows a fully populated -port Myri-G enclosure, which measures U (approximately in. or cm) in height. This unit contains a total of switches, with exposing half of their ports externally and routing the other half to a row of all-internal “spine” switches. Myrinet switches employ wormhole switching, which is fairly uncommon in cluster networks, although the technique had previously been used in a number of massively parallel processors (MPPs) such as the nCUBE, Intel Paragon, and Cray TD. Wormhole switching means that switches do not need to buffer an entire incoming packet before sending it out. Rather, as
each flit (flow control digit – one -bit byte in Myrinet) arrives at a switch, it can immediately be sent downstream. In fact, a packet’s first flit can reach the destination NIC before the source NIC has transmitted the packet’s final flit.
Network Protocols ANSI/VITA - [], the ANSI standard that defines Myrinet, specifies the network’s behavior at various layers of the Open System Interconnection (OSI) communication model. At the OSI data-link layer, Myrinet employs link-level flow control to prevent upstream NICs or switches from overwhelming a downstream NIC or switch. For example, if two packets arriving at a switch target the same output port, one packet will be allowed to proceed while the other is delayed. The switch sends a Stop symbol to the sender to instruct it to stop sending data. When the target port is free, the switch then sends a Go symbol to allow the sender to resume transmission. The other data-link-layer symbols are Gap and Beat. Gap indicates that no data is being sent and that the link is between packets. The next data byte to arrive after a Gap is therefore the first byte of a new packet. Gap symbols obviate the need to specify a maximum transmission unit (MTU) size in hardware. Unlike most other networks, Myrinet allows packet frames to be arbitrarily large, although ANSI/VITA allows implementations to limit MTU sizes – but to no less than MiB []. Beat is a heartbeat symbol sent every μs to indicate that data are being transmitted correctly (e.g., that the sender is not stuck repeatedly sending the same value). At the OSI network layer, Myrinet packets are defined fairly simply. Each packet contains a number of routing bytes followed by a number of data bytes, followed by an -bit cyclic redundancy check (CRC-) byte. Myrinet is a source-routed network, which means that the NIC that initiates a packet specifies the complete route to the destination NIC. Each packet begins with one routing byte per intermediate switch. Because there can be any number of switches between NICs, the number of routing bytes in a packet varies accordingly. In early versions of Myrinet, each routing byte specified an absolute port number. For example, a routing byte of {} means that the packet should exit the subsequent switch from port . The network-layer protocol was later modified to use relative port numbers, which can be positive or negative. For example, a routing byte of {}
means that if the packet enters a switch on port , it should exit on port . The advantage of relative addressing is that reverse routes are easy to calculate; a NIC needs only to invert the signs and reverse the order of the bytes. For example, if NIC A sends a packet to NIC B through three intermediate switches using routing bytes {+, −, +}, then NIC B can send a packet to NIC A through the same three switches using routing bytes {−, +, −}. Each switch discards the routing byte at the start of the packet and forwards the remaining bytes downstream. When the packet reaches its destination NIC it is not preceded by any routing bytes. The packet’s data payload follows the routing bytes. The data begin with a four-byte packet type (two bytes for the primary type, two bytes for the secondary type) to indicate to the destination NIC what software layer is expected to handle the packet. For example, packet type FH indicates a topology-mapping packet used by the MX messaging layer [], and packet types H – H are used for various purposes by the Illinois Fast Messages messaging layer []. While the use of packet types is technically not required by the hardware, ANSI/VITA - dictates its use to prevent software from mistaking one higher-level protocol’s packets for another’s. Myricom maintains a registry of packet types, and customers can request values dedicated to their own messaging layers. The final byte of a Myrinet packet is a CRC- that covers the entire packet (routing bytes and data payload). It is inserted by the NIC that initiates the packet and is checked by each switch on the routing path and by the destination NIC. Because each switch modifies the packet by removing a header byte, each switch replaces the incoming CRC- byte with a new CRC- byte while sending the packet downstream.
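Since a relative routing byte is just a signed port offset, a NIC can derive the return route with a few lines of code. The sketch below assumes the forward route is stored one signed byte per intermediate switch, in traversal order; the three-hop offsets in the comment are made-up values for illustration.
C code
    #include <stdint.h>

    /* Turn the routing bytes NIC A used to reach NIC B into the routing bytes
       NIC B should use to reach NIC A: reverse the order and invert each sign.
       For example, a forward route {+3, -5, +12} becomes {-12, +5, -3}.       */
    void reverse_route(const int8_t *forward, int8_t *reverse, int hops)
    {
        for (int i = 0; i < hops; i++)
            reverse[i] = (int8_t)(-forward[hops - 1 - i]);
    }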
Network Interface Cards One of Myrinet’s design philosophies is to keep the switches simple and fast and put all of the communication “smarts” into the endpoints. Hence, a Myrinet NIC is not a passive device controlled by the host computer’s operating system, as was typical for local area networks when Myrinet was first introduced, but rather an active entity that can autonomously manage data transfers between the network and host memory. The basic NIC architecture is illustrated in Fig. . The main units shown in that figure are a link interface, which connects the NIC to an external switch; a
Myrinet. Fig. Block diagram of a Myrinet NIC: a link interface, packet interface, processor, and EBUS interface integrated on one chip, with local SRAM attached via the LBUS; the link interface connects to the network switches, and the EBUS interface connects to the host computer and its DRAM
packet interface, which manages the transfer of packets to and from the network; a custom processor, which controls all aspects of the NIC; an EBUS (External BUS) interface, which manages the transfer of data to and from the host’s I/O bus (e.g., PCI Express); and a small quantity of high-speed, multiported SRAM connected directly via an LBUS (Local BUS). The link interface, packet interface, processor, and EBUS interface are integrated onto a single chip called the LANai. Throughout the generations of Myrinet, the link interface adapted to Myrinet’s ever-faster link speeds; the packet interface began including support for network types other than Myrinet (viz. Ethernet and InfiniBand) and in some instances doubled to two packet interfaces per NIC; the processor changed from being -bit microcoded to being -bit pipelined and underwent various modifications to the instruction set and clocking; the EBUS interface tracked each new I/O bus technology; and the SRAM switched from asynchronous
to zero bus turnaround (ZBT) technology for a speed boost and also increased in size from KiB (with bits/byte) to MiB (with bits/byte) with a maximum address space of MiB. All of those implementation changes notwithstanding, the basic NIC architecture represented by Fig. has remained largely unchanged across all versions of Myrinet. The LANai processor utilizes a typical reduced instruction set computer (RISC) instruction set with a few additional features to support its role in managing communication. There is no floating-point unit. Because the processor pipeline is non-interlocked and because all memory is constant-latency SRAM, it is relatively easy to reason about LANai performance: Exactly one instruction is issued every cycle. The software running on the LANai, known as the Myrinet Control Program (MCP), generally works by initiating direct memory access (DMA) transfers between local SRAM and either host DRAM or the
network. LANai SRAM can be mapped into the host’s address space, so programmed I/O – word-by-word writes from the host to LANai SRAM or word-by-word reads from LANai SRAM to the host – is also possible. The LANai provides various features in hardware to perform these tasks with minimal overhead. DMA transfers between host memory and LANai SRAM are performed by writing source and destination addresses into a set of special memory-mapped registers. These registers are also mapped into host memory, so the host can control parts of the NIC hardware independently from the LANai. Dedicated hardware in the EBUS interface can generate CRC- values of data as it is copied between host memory and LANai SRAM. DMA transfers between the network and LANai SRAM are performed by writing starting and ending addresses into special send and receive registers. Messages being sent can comprise multiple DMA requests; a “final ending address” register designates the end of a message. CRC values can be computed on both incoming and outgoing data and can be injected into an outgoing message at a specified location. As a simple example of data transmission, the following code is all that is needed to send a msglen-byte message from buffer message: C code SMP = (void *)message; SMLT = (void *)&message[msglen];
LANai assembly code
    ! Put msglen's value in r7.
    ld    [_msglen],%r7
    ! Put message's address in r6.
    mov   _message,%r6
    ! r6 is the DMA's starting address.
    st    %r6,[0xFFFFFEF0]
    ! Put message's address in r8.
    mov   _message,%r8
    ! Point r6 just past the message.
    add   %r7,%r8,%r6
    ! Initiate the send.
    st    %r6,[0xFFFFFF08]
Address FFFFFEF0H corresponds to special register Send-Message Pointer (SMP), which points to the start of the message. Address FFFFFF08H corresponds to special register Send-Message Limit, with the Tail (SMLT), which points to one byte past the last byte of the message. Storing a value in the SMLT register triggers the DMA from LANai SRAM into the network. Note that the preceding code assumes that the message begins with the requisite routing and packet-type bytes. Waiting for send completion is simply a matter of polling the interrupt status register, ISR, until the SEND_INT bit is set:
C code
    while ((ISR & SEND_INT_BIT) == 0)
        ;
LANai assembly code
    ! Store SEND_INT's bit offset in r13.
    mov   8,%r13
    ! Load ISR into r6.
L3: ld    [0xFFFFFE50],%r6
    ! Delay until r6 is ready.
    nop
    ! Delay until r6 is ready.
    nop
    ! Test r6's SEND_INT bit.
    and.f %r6,%r13,%r0
    ! Try again if not set.
    beq   L3
The nop instructions are necessary because the LANai has a non-interlocked processor pipeline. The LANai supports two execution contexts, “user” and “system,” each with its own program counter and other registers. Context switching is extremely fast and can be performed either synchronously via an explicit machine instruction (punt) or asynchronously from user to system context in response to a hardware interrupt. The LANai signifies DMA completion, message availability, timer wraparound, and various other conditions by setting a bit in the ISR. The MCP can choose to poll the ISR for status changes or request that
modifications to certain bits trigger an interrupt, which the system context can subsequently handle.
Software Communication-Layer Design Issues Because the LANai is fully programmable and development tools are readily available, it is possible to implement any number of low-level communication layers. These can provide different feature sets, different semantics, and make different performance tradeoffs. In fact, there have been a wide variety of low-level, software communication layers developed over the years not only by Myricom – the so-called Myrinet API followed by GM (Glenn’s Messaging) followed by MX (Myrinet Express) – but also by researchers at numerous institutions. Research communication layers targeting Myrinet include Active Messages (AM), the Basic Interface for Parallelism (BIP), the BullDog Myrinet Control Program (BDM), Fast Messages (FM), Hamlyn, Link-level Flow Control (LFC), ParaStation, PM, Trapeze, U-Net, Virtual Memory-Mapped Communication (VMMC), as well as the Message-Passing Interface (MPI) implemented atop various Myrinet-specific MPICH channel devices. All of those are designed to be user-level communication layers, meaning that the operating system has essentially no involvement in the critical path of communication. However, traditional, operating-system-based Internet Protocol (IP) communication can run atop Myricom’s low-level communication layers. Each of the communication layers listed above must address issues of memory pinning, data transfer, reliability, and route discovery, among others. The LANai can transfer data only to and from physical addresses in the host. Consequently, the host operating system must ensure that memory pages that may be accessed by the LANai are “pinned,” making them ineligible for being swapped out to disk and replaced by another page. Some communication layers have the operating system pin a fixed region of addresses at initialization time, let the LANai transfer data only to/from that region, and use host software to copy data to/ from that region into arbitrary (non-pinned) buffers provided by the application. Other communication layers modify the operating system to respond to LANai interrupts, pin pages on demand, and notify the MCP of the physical addresses to use.
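The first strategy above (copying through a region pinned once at initialization) can be sketched as follows. Here pinned_region, STAGING_SIZE, and lanai_dma_from_host_offset() are invented names standing in for whatever a particular communication layer and its driver actually expose, so this illustrates only the data path, not any real API.
C code
    #include <string.h>
    #include <stddef.h>

    #define STAGING_SIZE (1 << 20)       /* size of the pinned region (assumed) */

    /* Region pinned by the operating system at initialization time, so the
       LANai may DMA to and from it at a known, fixed physical address.        */
    extern char pinned_region[STAGING_SIZE];

    /* Hypothetical driver call: ask the MCP to DMA len bytes from the pinned
       region (starting at byte offset off) into NIC SRAM and the network.     */
    extern void lanai_dma_from_host_offset(size_t off, size_t len);

    /* Send len bytes (len <= STAGING_SIZE) from an arbitrary, non-pinned
       application buffer by staging them through the pinned region.           */
    void send_message(const void *app_buf, size_t len)
    {
        memcpy(pinned_region, app_buf, len);   /* host copy into pinned memory */
        lanai_dma_from_host_offset(0, len);    /* NIC-driven DMA out of it     */
    }
The extra memcpy is the per-byte price of this approach; the on-demand pinning alternative removes it at the cost of operating-system involvement the first time each page is used.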
LANai-initiated data transfers to/from the host must be performed using the LANai’s DMA mechanism. However, host-initiated data transfers can use either DMA or programmed I/O (word-by-word reads or writes). DMA-based transfers tend to have a higher startup cost but lower per-byte cost while programmed I/O-based transfers tend to have a lower startup cost but a higher per-byte cost. Programmed I/O-based transfers can address arbitrary host memory and can therefore avoid copying data to/from a pinned region or asking the operating system to pin a region of memory before the transfer begins. Related to data transfers is event notification: Is data ready to be processed? The LANai and the host can each signal the other with interrupts, which are asynchronous but, particularly in the case of the LANai interrupting the host, exhibit high overhead due to operating-system involvement and costly context switches. The alternative is to poll memory for state changes. While a single poll is faster than a single interrupt, polling must be performed frequently to avoid delaying communication progress. Myrinet provides reliable data transmission in the sense that it never intentionally discards data. However, software communication layers must ensure that there is always a place to store incoming data. Most communication layers implement some form of flow control to prevent data buffers from overflowing. However, some simply discard incoming data if there is no place left to store it and let higher layers of software or the application itself detect the data loss and induce a retransmission. Because Myrinet is a source-routed network, an MCP needs to know the precise path to every NIC in the cluster so it can inject the correct routing bytes into the network. Some communication layers simply read a configuration file that enumerates these paths and instruct the MCP accordingly. This works well for smaller networks with fairly static topologies. Other communication layers implement route discovery. In Myrinet, route discovery has to be performed entirely by software, either by the MCP or by the host with the MCP’s assistance. Route discovery works essentially by performing a breadth-first search of the network: First, all possible one-hop destinations are queried for existence, followed by all possible two-hop destinations, and so forth. Because this can be a time-consuming
Myrinet. Fig. Myrinet historical performance: (a) half round-trip latency and (b) bandwidth versus message size, for the Myrinet API (166 MHz x86, 32b/33 MHz PCI, 11NOV1996), GM 1.4 (666 MHz x86, 64b/66 MHz PCI, 8JUL2001), MX 1.0 (1.8 GHz x86-64, 64b/133 MHz PCI-X, 28DEC2005), and MPICH-MX 1.2.7 (3.2 GHz x86-64, x8 PCIe, 21JUN2010)
operation for large networks, route discovery is performed only occasionally.
Performance Figure aggregates historical performance data that have appeared on Myricom’s Web site over a span of years. Figure a presents messaging latency, measured as half the average time for a minimal-sized message to be sent back and forth between a host process on each of two computers. Figure b presents the messaging bandwidth, measured as the average rate at which a message of a given size can be sent back and forth between a host process on each of two computers. The data represent multiple generations of lowlevel Myrinet software, Myrinet NIC hardware, Myrinet switch hardware, host CPUs, and I/O buses. Consequently, performance improvements cannot be attributed to the evolution of a single component. Rather, Fig. should be interpreted as portraying the performance that one could have expected from a Myrinet cluster with typical hardware and typical vendor-supplied software at four points in time from November to June . As Fig. shows, a cluster running a recent version of MPICH-MX on recent Myrinet hardware can expect to see a minimum of . μs one-way message latency and a maximum of , MB/s of communication bandwidth – both a factor of improvement over a state-of-the-art Myrinet cluster from years earlier. The data presented in Fig. represent only Myricom’s software. Myrinet software from third-party research institutions often outperformed the corresponding offering from Myricom, generally by utilizing novel data structures and by removing nonessential, high-overhead features. As a point of comparison, contemporaneously with the Myrinet API observing a minimum latency of μs, Fast Messages, Active Messages, and PM all observed a minimum latency of μs or less. In fact, some of Myrinet’s uniqueness from a software perspective is that the LANai’s programmability makes Myrinet a suitable platform for experimentation with communication-layer design.
Related Entries Cray XT and Seastar -D Torus Interconnect Ethernet Flow Control InfiniBand Interconnection Networks Network Interfaces
Network of Workstations Networks, Direct Networks, Multistage Quadrics SCI (Scalable Coherent Interface) Switch Architecture Switching Techniques
Bibliographic Notes and Further Reading One of Myricom’s few peer-reviewed publications is an overview of Myrinet that appeared in a issue of IEEE Micro []. Although the paper describes the first commercial implementation of Myrinet, much of what it says is still applicable to more recent incarnations of Myrinet. Details of Myrinet’s OSI data-link layer and OSI network layer appear in the Myrinet standardization document, ANSI/VITA - []. Various aspects of the LANai architecture and physical characteristics are covered by Myricom white papers [, ]. The LANai’s origins are in Caltech’s Mosaic C multicomputer []. Both contain a simple processor, a memory interface, and an integrated network interface with instruction-set support. However, the Mosaic C targets large-scale direct networks and therefore additionally integrates a dimension-order router for -D mesh networks onto the die. In contrast, the LANai targets indirect networks of arbitrary topology and therefore delegates routing to an external switching fabric. USC/ISI saw the potential for utilizing Mosaic C components in a LAN setting and constructed their ATOMIC LAN as a proof of the concept []. In a sense, Myrinet is a commercial implementation of the ATOMIC network architecture. An interesting part of Felderman et al.’s ATOMIC paper [] is its exposition of the challenges of route discovery on a network topology that is not guaranteed to be either symmetric or consistent. (Myrinet networks are symmetric but not necessarily consistent.) Myrinet has been the target platform for a wealth of software communication layers such as Fast Messages [], Active Messages II [], and Myrinet Express []. Two papers published in November contrast many of the Myrinet communication layers that
were current at that time. Bhoedjang et al. discuss the design decisions underlying user-level communication layers, step through a simple communication layer for Myrinet, and show how numerous Myrinet communication layers address various implementation alternatives []. Araki et al. take a more performance-oriented approach, presenting a LogP (latency, overhead, gap, processes) analysis of multiple Myrinet communication layers and drawing conclusions about how various communication-layer features impact communication time [].
Bibliography . Araki S, Bilas A, Dubnicki C, Edler J, Konishi K, Philbin J () User-space communication: a quantitative study. In: Proceedings of the IEEE/ACM conference on supercomputing (SC’), Orlando, Florida, – Nov . ISBN: ---X. doi: ./SC.. . Bhoedjang RAF, Rühl T, Bal HE (Novemebr ) User-level network interface protocols. IEEE Comput ():–. ISSN: -. doi: ./. . Boden NJ, Cohen D, Felderman RE, Kulawik AE, Seitz CL, Seizovic JM, Su W-K (Februar ) Myrinet: a gigabit-persecond local area network. IEEE Micro ():–. ISSN: . doi: ./. . Chin G, Parsons E, Dumont JJ, Wright D, Sherfinski H, Robak D, Jaenicke R, Vadasz I, Thompson M, Blake M, Morgan R, Bratton J,
Cohen D, Waggett J, McKee R, Munroe M, Peterson W, Endo D, Kwok J, Rynearson J, Alderman R, Andreas H, Bedard J, Lavely T, Patterson B () Myrinet on VME protocol specification. ANSI Standard ANSI/VITA -, VITA Standards Organization, Scottsdale. November , . A draft is available from http://www.myri.com/open-specs/myri-vme-d.pdf Chun BN, Mainwaring AM, Culler DE (January/February ) Virtual network transport protocols for Myrinet. IEEE Micro ():–. ISSN: -. doi: ./. Felderman R, DeSchon A, Cohen D, Finn G () ATOMIC: a high speed local communication architecture. J High Speed Netw ():–. ISSN: - Myricom, Inc. () LANai . June , . http://www.myri. com/vlsi/LANai.pdf Myricom, Inc. () Lanai X. July , . http://www.myri. com/vlsi/LanaiX.Rev..pdf. Revision . Myricom, Inc. () Myrinet Express (MX): a highperformance, low-level message-passing interface for Myrinet. October , . http://www.myri.com/scs/MX/doc/mx.pdf. Version . Pakin S, Karamcheti V, Chien AA (April–June ) Fast messages: efficient, portable communication for workstation clusters and MPPs. IEEE Concurr ():–. ISSN: -. doi: ./. Seitz CL, Boden NJ, Seizovic J, Su W-K () The design of the Caltech Mosaic C multicomputer. In: Borriello G, Ebeling C (eds) Proceedings of the symposium on research on integrated systems, Seattle, Washington, – Mar . MIT Press, pp –. ISBN: ---
N NAMD (NAnoscale Molecular Dynamics) Laxmikant V. Kalé, Abhinav Bhatele, Eric J. Bohm, James C. Phillips University of Illinois at Urbana-Champaign, Urbana, IL, USA
Definition NAMD is a parallel molecular dynamics software for biomolecular simulations.
Discussion Introduction NAMD (NAnoscale Molecular Dynamics, http://www. ks.uiuc.edu/Research/namd) is a parallel molecular dynamics (MD) code designed for high-performance simulation of large biomolecular systems [–]. Typical NAMD simulations include all-atom models of proteins, lipids, and/or nucleic acids as well as explicit solvent (water and ions) and range in size from , to ,, atoms. NAMD employs the prioritized message-driven execution capabilities of the Charm++/Converse parallel runtime system (http://charm.cs.illinois.edu), allowing excellent parallel scaling on both massively parallel supercomputers and commodity workstation clusters. NAMD is distributed free of charge as both source code and pre-compiled binaries by the Theoretical and Computational Biophysics Group (http://www.ks.uiuc.edu) of the University of Illinois Beckman Institute. NAMD development is primarily funded by the NIH through the Resource for Macromolecular Modeling and Bioinformatics. NAMD has been downloaded by over , registered users, over , of whom have downloaded multiple releases.
Biomolecular Simulation The interactions between atoms in a biomolecular system are modeled with a combination of classical (nonquantum) terms. Each atom is assigned a static partial charge (a fraction of an electronic charge) based on the difference between its nuclear charge and the nearby electron density given its molecular configuration. Standard 1/r electrostatic interactions between these partial charges are added to a Lennard-Jones potential, which combines a 1/r^6 van der Waals attractive potential with a short-range 1/r^12 repulsive term. The effects of covalent bonds are represented by harmonic terms applied to the distance between directly bonded atoms, angles formed by a pair of bonds to a common atom, and dihedral angles formed by chains of four bonded atoms. Dihedral terms may also be sums of cosines, and harmonic terms are applied to enforce planarity of four atoms in the case of so-called improper dihedrals. Parameters for the Lennard-Jones and bonded terms are based on the element type of each atom, as well as its bonds to hydrogen or other heavy atoms. In preparing to simulate a biomolecular system with NAMD, atomic coordinates of the proteins, nucleic acids, and/or lipids of interest are obtained from known crystallographic or other structures. Missing parts of the structure may be modeled based on homologous molecules of known structure. Any aggregates must then be assembled, such as embedding a protein in a membrane or binding a hormone receptor to DNA. A periodic block of water is replicated to cover the entire structure up to the boundaries of the desired periodic cell, water molecules within a few Angstroms of the structure are removed, and the remaining water is merged into the structure. In order to neutralize the electrical charge of the system and obtain a proper salt concentration, some water molecules are replaced with Na+ and Cl− ions in a manner that minimizes electrostatic energy. Using the potential function, the initial atomic coordinates are adjusted to eliminate initial close contacts
and large forces that would destabilize the system. The Newtonian equations of motion, with modifications to control temperature and pressure, are then integrated by symplectic and reversible methods using a time step of femtosecond (or fs if water molecules and bonds to hydrogen atoms are held rigid). A typical simulation may be – ns (– million steps), and long simulations may be a microsecond or more. Although the symplectic property of the standard explicit Verlet integrator allows energy to be well conserved over long trajectories, a thermostat is typically used to initially equilibrate the simulation to a given temperature and then to control heating due to integrator energy drift. Molecular dynamics simulations are chaotic in the sense that the specific trajectory is highly dependent on the initial coordinates and velocities. Therefore, the goal of the simulations is to characterize typical conformations and fluctuations, and to extract thermodynamic properties such as free energy differences.
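As a concrete illustration of the nonbonded terms described above, the function below evaluates the electrostatic plus Lennard-Jones energy of a single atom pair. The parameter names (charges qi and qj, Lennard-Jones epsilon and sigma, and a Coulomb prefactor) follow the usual textbook form rather than NAMD's internal data structures, and the unit constant mentioned in the comment is only indicative.
C code
    #include <math.h>

    /* Energy of one nonbonded atom pair at separation r (r > 0): a 1/r Coulomb
       term between partial charges qi and qj plus a Lennard-Jones term that
       combines a 1/r^12 short-range repulsion with a 1/r^6 van der Waals
       attraction.  coulomb_const absorbs the unit conversion used by the force
       field (for example, about 332 kcal*A/(mol*e^2) in common biomolecular
       unit systems).                                                           */
    double pair_energy(double r, double qi, double qj,
                       double epsilon, double sigma, double coulomb_const)
    {
        double elec = coulomb_const * qi * qj / r;        /* electrostatics    */
        double sr6  = pow(sigma / r, 6.0);                /* (sigma/r)^6       */
        double lj   = 4.0 * epsilon * (sr6 * sr6 - sr6);  /* LJ 12-6 potential */
        return elec + lj;
    }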
NAMD Features NAMD .x has been in production use for over a decade and has been developed in a highly active user environment. As a result of user requirements, NAMD has acquired a wide variety of simulation features. NAMD relies on the capabilities of its sister program VMD (Visual Molecular Dynamics, http://www.ks.uiuc.edu/ Research/vmd) for the setup, visualization, and analysis of simulations. VMD reads and writes a wide variety of file formats and is used extensively for visualizing and analyzing the output of MD codes besides NAMD, as well as other experimental and computational data. VMD can connect to a running NAMD simulation, allowing the user to apply forces to steer the simulation using the mouse or a haptic interface. NAMD supports non-periodic and periodic boundary conditions, including non-orthogonal cells and periodicity in only one or two dimensions. Minimization is performed using the conjugate gradient algorithm. Atoms may be fixed or restrained to an initial position to stabilize a structure during minimization and equilibration. Both three-point and four-point water models (TIPP and TIPP) are supported. Water molecules and bonds to hydrogen atoms may be held rigid to increase the maximum stable time step. Velocity
rescaling and Langevin dynamics constant temperature methods and Berendsen and Langevin piston constant pressure methods may be applied to provide an NVT or NPT ensemble. Free energy of conformational change surfaces or potentials of mean force may be studied along an arbitrary number of user-defined collective variables via a variety of methods, including meta-dynamics, the adaptive biasing force method, umbrella sampling, and steered molecular dynamics. Alchemical transformations, in which the chemical structure of a molecule is modified, may be studied via free energy perturbation or thermodynamic integration methods. Special support is provided for the analysis of cryoelectron-microscopy data via molecular dynamics flexible fitting. In this method, an experimentally derived electron density map is transformed into a potential on a grid. The molecular structure is restrained via additional bonded terms and simulated to allow the atoms to conform to the grid potential. This allows the combination of atomic-resolution x-ray crystallographic structures with lower-resolution cryo-electronmicroscopy data and has been particularly useful in the study of large structures with complex motions such as the ribosome. An important goal of NAMD is to provide a uniform user interface (as far as possible) across all platforms, from laptops and desktops to clusters and massively parallel Cray XT and IBM Blue Gene supercomputers. Within the limits of memory and scalability, any simulation can be run on any number of processors. The program’s spatial decomposition method is adjusted from full-sized to half-sized patches in one, two, or three dimensions as the number of processors increases. Similarly, the particle-mesh Ewald FFT decomposition is switched from slabs to pencils and spanning trees are used for coordinate and force messages. While manual tuning options are available, they are only beneficial at the limits of parallel scalability – most users happily ignore them.
Parallel Design and Implementation The sequential algorithm in most molecular dynamics applications is similar. The simulation time is broken down into a large number of very small time steps (typically or fs each). At each time step, forces on every
atom are calculated due to every other atom. Acceleration and velocities for the particles are obtained from the forces, based on which the new positions of the particles are calculated. This is repeated every time step. Two kinds of forces are to be calculated: () forces due to bonds and () non-bonded forces. If the nonbonded forces are calculated pairwise for all pairs of atoms, it results in a naïve O(n ) implementation. Most MD applications use an optimization where forces are not calculated explicitly beyond a certain cutoff distance. For atoms that lie beyond the cutoff distance, forces are calculated by using the particle-mesh Ewald (PME) or Gaussian spread Ewald (k-GSE) method. In these methods, the electric charge of each atom is transferred to electric potential on a grid that is used to calculate the influence of all atoms on each atom. Parallelization of the algorithm mentioned above has been traditionally done in various ways (a good survey can be seen in Plimpton et al. []). Some of them are outlined here: ●
Atom Decomposition: All atoms in the simulation box are distributed among the processors and each processor is responsible for the force calculations of the atoms it holds. ● Force Decomposition: The force matrix is distributed across the processors for parallelization. ● Spatial Decomposition: The three-dimensional simulation box is spatially divided among the processors and each processor is responsible for the atoms within that area.
Atom and force decomposition have high communication to computation ratios and spatial decomposition has problems of limited parallelism and load imbalance. For NAMD, a hybrid between spatial and force decomposition was developed, which combines the advantages of both []. Similar methods have been used in MD packages such as Blue Matter and Desmond. The simulation box consisting of a spatial distribution of atoms is divided into smaller cells called “patches.” The dimensions of each patch are roughly equal to rc + margin where rc is the cutoff distance and margin is a small constant. This ensures that atoms within a cutoff radius of each other stay within neighboring patches over a few times steps. Patches are assigned to some processors (typically the number of patches is much smaller than the number of processors). Force calculations for each pair of patches is assigned to a different object called a “compute.” The number of computes in a simulation is much larger and they are load balanced across all the processors. Computes responsible for nonbonded force calculations within the cutoff radius account for most of the computation time in NAMD. Computes can be assigned to any processor irrespective of where the associated patches are placed. Figure shows the communication and computation involved in a time step. At the beginning of a time step, patches send their atoms to the respective computes (nonbonded, bonded, and PME). The computes do the force calculations and send the forces back to the patches. The patches calculate velocities and new
NAMD (NAnoscale Molecular Dynamics). Fig. Flow diagram showing the computation of forces in NAMD every step: patch integration, multicast to the bonded, non-bonded, and PME computes, point-to-point force messages, reduction, and patch integration
positions for their particles and then this process is repeated. The load balancing framework in Charm++ is used to adaptively balance work units or computes across processors. The runtime instruments the load of each compute for a few time iterations and then uses this information to move work units around. The first load balancing phase migrates many computes based on a greedy strategy but the subsequent phases do refinement-based balancing. On machines such as IBM Blue Gene/L and Blue Gene/P, the load balancer also uses topology information to optimize communication and minimize contention.
Performance NAMD was designed more than years ago but it has withstood the changes in technology because of its parallel-from-scratch, migratable-objects-based design. The strategy of hybrid decomposition coupled with adaptive overlap of communication and computation, supported by dynamic load balancing has helped NAMD scale to very large machines. NAMD can be run on most parallel machines due to the portability of the CHARM++ runtime. It can be run on any number of processors and has no powers-of-two restrictions
on the number. Another important feature of NAMD is that it performs well for a wide range of molecular system sizes. The first table shows the four molecular systems for which benchmarking results are presented; important configuration parameters, such as the cutoff distance used and the duration of each time step (in femtoseconds), are listed. These systems represent a wide range of sizes, from roughly a hundred thousand atoms to a few million atoms. The left plot in the performance figure shows the performance of NAMD on the Blue Gene/P machine at Argonne National Laboratory (ANL). Four different lines show the strong-scaling performance of NAMD for the chosen molecular systems. For ApoA1, one can simulate several nanoseconds per day when running on thousands of cores of Blue Gene/P, and simulation of the Ribosome, the largest of the four systems, retains good parallel efficiency at the highest core counts relative to the smallest runs. NAMD is used by biophysicists and chemists at most national laboratories for biomolecular research. The second table presents architectural specifications of some of the machines on which NAMD has been run and demonstrated to have good performance. Of these, the Cray XT system (Jaguar) at ORNL is currently the fastest supercomputer in the world (as per the November Top500 list).
NAMD (NAnoscale Molecular Dynamics). Table Molecular systems used for benchmarking NAMD: ApoA1, F1-ATPase, STMV, and Ribosome, giving for each system its atom count, simulation cell dimensions (ų), cutoff distance (Å), and time step (fs)
NAMD (NAnoscale Molecular Dynamics). Fig. Performance (strong scaling) of NAMD, plotted as time per step (ms/step) against number of cores: (left) ApoA1, F1-ATPase, STMV, and Ribosome running on Blue Gene/P; (right) STMV running on Blue Gene/L, Blue Gene/P, Ranger, Cray XT4, and Cray XT5
NAMD (NAnoscale Molecular Dynamics). Table Specifications of the parallel systems used for the runs: Blue Gene/L (IBM; PPC CPUs; 3D torus network), Blue Gene/P (ANL; PPC; 3D torus), Cray XT4 (ORNL; Opteron; SeaStar), Cray XT5 (ORNL; Opteron; SeaStar2+), and Ranger (TACC; Xeon; Infiniband), giving for each system its number of nodes, cores per node, CPU type, clock speed, memory per node, and type of network
Figure (right plot) shows the benchmarking results for NAMD on five supercomputers at four different sites. They are representative of IBM machines, Cray machines, and Infiniband clusters. The molecular system being simulated is the one million atom STMV (Satellite Tobacco Mosaic Virus). NAMD clearly scales well on all these machines. Using , cores of the Cray XT machines, one can simulate around ns per day for the STMV system.
NAMD has a small memory footprint and hence it can simulate a range of molecular systems on machines with limited memory (such as the IBM Blue Gene machines). However, scientists are now planning to simulate molecular systems one to two orders of magnitude larger using NAMD. This will be discussed further in the Future Directions section.
NAMD (NAnoscale Molecular Dynamics). Fig. Structural model of a photosynthetic chromatophore vesicle, a biological light-harvesting machine (shown are the constituent proteins: LH complexes surrounding dimers of LH-reaction center complexes). The system contains approximately , bacteriochlorophylls. Proteins are shown as semitransparent on the right-hand side to display the bacteriochlorophyll network. Image generated with VMD by Melih Sener based on work by himself, Jen Hsin, Chris Harrison, and Klaus Schulten
Conclusion
Molecular dynamics simulations are hard to parallelize because of the relatively small size of molecular systems and because one needs to run for millions of time steps to simulate a small period in the life of a biomolecule. NAMD has been used as an MD package for many years now and has withstood the changes in technology. Its fundamentally parallel design based on migratable CHARM++ objects, supported by dynamic load balancing, has made it possible to scale NAMD to the machines of the petascale era. NAMD is continually being used to simulate larger and larger molecular systems, which pose new challenges for parallelization on large supercomputers. Recent modifications to NAMD such as parallel input/output and hierarchical load balancing have made this possible.
Future Directions Supercomputers are becoming larger each year with very little increase in the computational power of each individual processing core. Biomolecules are of fixed size, which places NAMD’s required performance characteristics in the “strong-scaling” category. This constrains future research for this style of molecular dynamics to either the study of systems sufficiently
large to perform well on future machines, maintaining consistent performance improvements for ever-decreasing computation granularity, or integration with other methods in multi-physics models. Efforts are under way in all of these directions. NSF's Blue Waters project, a sustained Petaflop/s machine, includes as an MD benchmark acceptance test a system of DOPS, DOPC, and curvature-inducing protein BAR domains in water solvent. Support for running this system, already two orders of magnitude larger than is typical, has been added to NAMD, and performance tuning for this system on the Blue Waters architecture is ongoing. Systems of similar size, such as the photosynthetic chromatophore shown in the figure, will leverage these technologies to meet cutting-edge science goals and permit the study of large structures for at least tens of nanoseconds of simulation time. The NAMD and CHARM++ development efforts continue to improve strong-scaling performance through a variety of overhead-reduction schemes and to adapt the software to make use of the network and computation resources idiosyncratic to new supercomputer architectures.
Bibliography . Bhatele A, Kumar S, Mei C, Phillips JC, Zheng G, Kalè LV () Overcoming scaling challenges in biomolecular simulations across multiple platforms. In: Proceedings of IEEE international parallel and distributed processing symposium, Miami, . Phillips JC, Braun R, Wang W, Gumbart J, Tajkhorshid E, Villa E, Chipot C, Skeel RD, Kalè LV, Schulten K () Scalable molecular dynamics with NAMD. J Comput Chem ():– . Phillips JC, Zheng G, Kumar S, Kale LV () NAMD: biomolecular simulation on thousands of processors. In: Proceedings of the ACM/IEEE conference on supercomputing, pp –, Baltimore, . Kalè LV, Skeel R, Bhandarkar M, Brunner R, Gursoy A, Krawetz N, Phillips JC, Shinozaki A, Varadarajan K, Schulten K () NAMD: greater scalability for parallel molecular dynamics. J Comput Phys :– . Plimpton SJ, Hendrickson BA () A new parallel method for molecular-dynamics simulation of macromolecular systems. J Comp Chem :– . Kalè LV, Skeel R, Bhandarkar M, Brunner R, Gursoy A, Krawetz N, Phillips JC, Shinozaki A, Varadarajan K, Schulten K () NAMD: greater scalability for parallel molecular dynamics. J Comput Phys :–, . Snir M () A note on n-body computations with cutoffs. Theory Comput Syst :– . Bhatelè A, Kalè LV, Kumar S () Dynamic topology aware load balancing algorithms for molecular dynamics applications. In: rd ACM international conference on supercomputing, Yorktown Heights,
Related Entries
Anton, a Special-Purpose Molecular Simulation Machine
Charm++
N-Body Computational Methods
Bibliographic Notes and Further Reading
A good survey of parallelization techniques for MD programs can be found in Plimpton et al. []. Snir discusses the communication requirements of force and spatial decomposition and proposes a hybrid algorithm similar to that of NAMD []. Other MD packages similar to NAMD are CHARMM, AMBER, GROMACS, Blue Matter, and Desmond. The paper by Kale et al. [] was awarded the Gordon Bell award at Supercomputing. Detailed performance benchmarking of NAMD and recent algorithmic changes to enable scaling to large machines can be found in [] and [].
NAS Parallel Benchmarks
David H. Bailey
Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Acronyms NAS, NPB
Definition
The NAS Parallel Benchmarks (NPB) are a suite of parallel computer performance benchmarks. They were
originally developed at the NASA Ames Research Center in 1991 to assess high-end parallel supercomputers []. Although they are no longer used as widely as they once were for comparing high-end system performance, they continue to be studied and analyzed a great deal in the high-performance computing community. The acronym "NAS" originally stood for the Numerical Aerodynamic Simulation Program at NASA Ames. The name of this organization was subsequently changed to the Numerical Aerospace Simulation Program, and more recently to the NASA Advanced Supercomputing Center, although the acronym remains "NAS." The developers of the original NPB suite were David H. Bailey, Eric Barszcz, John Barton, David Browning, Russell Carter, Leo Dagum, Rod Fatoohi, Samuel Fineberg, Paul Frederickson, Thomas Lasinski, Rob Schreiber, Horst Simon, V. Venkatakrishnan, and Sisira Weeratunga.
Discussion
The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing. The principal focus was computational aerophysics, although most of these benchmarks have much broader relevance, since in a larger sense they are typical of many real-world scientific computing applications. The NPB suite grew out of the need for a more rational procedure to select new supercomputers for acquisition by NASA. The emergence of commercially available highly parallel computer systems in the late 1980s offered an attractive alternative to the parallel vector supercomputers that had been the mainstay of high-end scientific computing. However, the introduction of highly parallel systems was accompanied by a regrettable level of hype, not only on the part of the commercial vendors but even, in some cases, by scientists using the systems. As a result, it was difficult to discern whether the new systems offered any fundamental performance advantage over vector supercomputers, and, if so, which of the parallel offerings would be most useful in real-world scientific computation. In part to draw attention to some of the performance reporting abuses prevalent at the time, the present author wrote a humorous essay, "Twelve Ways to Fool the Masses," which described in a light-hearted way a number of the questionable ways in which both
vendor marketing people and scientists were inflating and distorting their performance results []. All of this underscored the need for an objective and scientifically defensible measure to compare performance on these systems. At the time, the only widely available high-end benchmark was the scalable Linpack benchmark, which, while useful then and still useful today, was not considered typical of most applications on NASA's supercomputers or at most other sites. One possible solution was to employ an actual large-scale application code, such as one of those being used by scientists using supercomputers at the NAS center. However, due in part to the very large programming and tuning requirements, these were considered too unwieldy to be used to compare a broad range of emerging parallel systems. Compounding these challenges at this time was the lack of a generally accepted parallel programming model – note this was several years before the advent of the Message Passing Interface (MPI) and OpenMP models that are in common usage today. For these reasons, the NAS Benchmarks were initially designed not as a set of computer codes, but instead were specified as "paper and pencil" benchmarks defined in a technical document []. The idea was to specify a set of problems only algorithmically, but in sufficient detail that the document could be used for a full-fledged implementation complying with the requirements. Even the input data, or a scheme to generate it, was specified in the document. Some "reference implementations" were provided, but these were intended only as aids for a serious implementation, not to be the benchmarks themselves. The original rules accompanying the benchmarks required that the benchmarks be implemented in some extension of Fortran or C (see details below), but otherwise implementers were free to utilize language constructs that give the best performance possible on the particular system being studied. The choice of data structures, processor allocation, and memory usage were (and are) generally left open to the discretion of the implementer. The eight problems consist of five "kernels" and three "simulated computational fluid dynamics (CFD) applications." The five kernels are relatively compact problems, each of which emphasizes a particular type of numerical computation. Compared with the simulated
CFD applications, they can be implemented fairly readily and provide insight as to the general levels of performance that can be expected on these specific types of numerical computations. The three simulated CFD applications, on the other hand, usually require more effort to implement, but they are more indicative of the types of actual data movement and computation required in state-of-the-art CFD application codes, and in many other three-dimensional physical simulations as well. For example, in an isolated kernel, a certain data structure may be very efficient on a certain system, and yet this data structure would be inappropriate if incorporated into a larger application. By comparison, the simulated CFD applications require data structures and implementation techniques in three physical dimensions, and thus are more typical of real scientific applications.
Benchmark Rules
Even though the benchmarks are specified in a technical document, and implementers are generally free to code them in any reasonable manner, certain rules were specified for these implementations. The intent here was to limit implementations to what would be regarded as "reasonable" code similar to that used in real scientific applications. In particular, the following rules were presented, and, with a couple of minor changes, are still in effect today (these rules are the current version):
● All floating point operations must be performed using 64-bit floating point arithmetic (at least).
● All benchmarks must be coded in either Fortran (Fortran-77 or Fortran-90), C, or Java, with certain approved extensions. (Java was added in a recent version of the NPB.)
● One of the three languages must be selected for the entire implementation – mixed code is not allowed.
● Any language extension or library routine that is employed in any of the benchmarks must be supported by the system vendor and available to all users.
● Subprograms and library routines not written in Fortran, C, or Java (such as assembly-coded routines) may only perform certain basic functions [a complete list is provided in the full NPB specification].
● All rules apply equally to subroutine calls, language extensions, and compiler directives (i.e., special comments).
The Original Eight Benchmarks
A brief (and necessarily incomplete) description of the eight problems is given here. For full details, see the full NPB specification document [].
EP. As the acronym suggests, this is an "embarrassingly parallel" kernel – in contrast to others in the list, it requires virtually no interprocessor communication, only coordination of pseudorandom number generation at the beginning and collection of results at the end. There is some challenge in computing the required intrinsic functions (which, according to the specified rules, must be done using vendor-supplied library functions) at a rapid rate. The problem is to generate pairs of Gaussian random deviates and tabulate the number of pairs in successive square annuli. The Gaussian deviates are to be calculated by the following well-known scheme. First, generate a sequence of n pairs of uniform (0, 1) pseudorandom deviates (using a specific linear congruential scheme specified in the benchmark document), from which pairs (x_j, y_j) uniformly distributed on (−1, 1) are formed. Then, for each j, check whether t_j = x_j² + y_j² ≤ 1. If so, set X_k = x_j √((−2 log t_j)/t_j) and Y_k = y_j √((−2 log t_j)/t_j), where the index k is incremented with every successful test; if t_j > 1, then reject the pair (x_j, y_j). In this way, the resulting pairs (X_k, Y_k) are random deviates with a Gaussian distribution. The sums S_1 = Σ_k X_k and S_2 = Σ_k Y_k are each accumulated, as are the counts of the number of hits in successive square annuli. The verification test for this problem requires that the sums S_1 and S_2 each agree with reference values to within a specified tolerance, and also that the ten counts of deviates in square annuli exactly agree with reference values.
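A minimal Python sketch of the EP computation is shown below. It is illustrative only: it uses Python's built-in random generator rather than the linear congruential generator mandated by the benchmark, and the problem size, seed, and function name are arbitrary choices.

import math
import random

def ep_kernel(n, seed=271828183):
    """Generate Gaussian pairs by the polar (acceptance-rejection) scheme
    described above and tabulate them in ten square annuli.
    NOTE: the real benchmark mandates a specific linear congruential
    generator; Python's default generator is used here only as a stand-in."""
    rng = random.Random(seed)
    s1 = s2 = 0.0
    counts = [0] * 10                      # hits with l <= max(|X|,|Y|) < l+1
    for _ in range(n):
        x = 2.0 * rng.random() - 1.0       # uniform deviates on (-1, 1)
        y = 2.0 * rng.random() - 1.0
        t = x * x + y * y
        if 0.0 < t <= 1.0:                 # accept the pair
            f = math.sqrt(-2.0 * math.log(t) / t)
            X, Y = x * f, y * f            # Gaussian deviates
            s1 += X
            s2 += Y
            l = int(max(abs(X), abs(Y)))
            if l < 10:
                counts[l] += 1
    return s1, s2, counts

print(ep_kernel(1 << 16))                  # toy problem size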
MG. This is a simplified multigrid calculation. This requires highly structured long-distance communication and tests both short- and long-distance data communication. The problem definition is as follows. Set v = 0 except at the twenty specified points, where v = ±1. Commence an iterative solution with u = 0. Each of the four iterations consists of the following two steps, in which k is the index of the finest grid level:

  r := v − A u     (evaluate residual)
  u := u + M^k r   (apply correction)

Here M^k denotes the following V-cycle multigrid operator: z_k := M^k r_k, where if k > 1 then

  r_{k−1} := P r_k            (restrict residual)
  z_{k−1} := M^{k−1} r_{k−1}  (recursive solve)
  z_k := Q z_{k−1}            (prolongate)
  r_k := r_k − A z_k          (evaluate residual)
  z_k := z_k + S r_k          (apply smoother)

else

  z_1 := S r_1.               (apply smoother)
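The recursion is easy to express in code. The sketch below, in Python with NumPy, reproduces the structure of the V-cycle on a periodic 3-D grid; the stencil operators (a 7-point Laplacian, injection, piecewise-constant prolongation, and a damped smoother) are simple stand-ins, not the coefficients specified in the benchmark document, and the grid size is illustrative.

import numpy as np

def A(u):   # stand-in residual stencil: 7-point Laplacian on a periodic grid
    return (np.roll(u, 1, 0) + np.roll(u, -1, 0) + np.roll(u, 1, 1) +
            np.roll(u, -1, 1) + np.roll(u, 1, 2) + np.roll(u, -1, 2) - 6.0 * u)

def P(r):   # stand-in restriction: injection onto the coarser grid
    return r[::2, ::2, ::2]

def Q(z):   # stand-in prolongation: piecewise-constant interpolation
    return np.repeat(np.repeat(np.repeat(z, 2, 0), 2, 1), 2, 2)

def S(r):   # stand-in smoother: damped residual
    return 0.1 * r

def vcycle(k, r):
    """z = M^k r, following the recursion in the benchmark definition."""
    if k > 1:
        z = Q(vcycle(k - 1, P(r)))    # restrict, solve recursively, prolongate
        r = r - A(z)                  # evaluate residual
        return z + S(r)               # apply smoother
    return S(r)                       # coarsest level: smoother only

# One outer iteration: r := v - A u ; u := u + M^k r  (here the grid size is 2**k)
n, k = 32, 5
v = np.zeros((n, n, n)); v[3, 7, 1] = 1.0; v[9, 9, 9] = -1.0
u = np.zeros_like(v)
r = v - A(u)
u = u + vcycle(k, r)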
The coefficients of the P, M, Q, and S arrays, together with other details, are given in the benchmark document. The benchmark problem definition requires that a specified number of iterations of the above V-cycle be performed, after which the L2 norm of the r array must agree with a reference value to within a specified tolerance.
CG. In this problem, a conjugate gradient method is used to compute an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. This kernel is typical of unstructured grid computations in that it tests irregular long-distance communication, employing unstructured matrix–vector multiplication. The problem statement in this case is fairly straightforward: Perform a specified number of conjugate gradient iterations in approximating the solution z to a certain specified large sparse n×n linear system of equations Az = x. In this problem, the matrix A must be used explicitly. This is because, after the original NPB suite was published, it was noted that by saving the random sparse vectors x used in the specified construction of the problem, it was possible to reformulate the sparse matrix–vector multiply operation in such a way that communication is substantially reduced. Therefore this scheme is specifically disallowed.
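A serial sketch of this computation is given below: a fixed number of unpreconditioned conjugate gradient iterations approximates z = A⁻¹x, and the quantity 1/(x^T z) serves as the eigenvalue estimate (the shift λ used by the benchmark is omitted). The matrix here is an arbitrary dense SPD test matrix, not the benchmark's specified random sparse matrix, and the iteration counts are illustrative.

import numpy as np

def cg_eig_estimate(A, n_outer=15, n_cg=25):
    """Inverse-power iteration driven by conjugate gradient solves, mirroring
    the structure (but not the data) of the CG benchmark."""
    n = A.shape[0]
    x = np.ones(n)
    for _ in range(n_outer):
        z = np.zeros(n)                      # CG solve of A z = x, starting from z = 0
        r = x.copy()
        p = r.copy()
        rho = r @ r
        for _ in range(n_cg):
            q = A @ p
            alpha = rho / (p @ q)
            z += alpha * p
            r -= alpha * q
            rho, rho_old = r @ r, rho
            p = r + (rho / rho_old) * p
        gamma = 1.0 / (x @ z)                # eigenvalue estimate (shift omitted)
        x = z / np.linalg.norm(z)            # normalize for the next outer iteration
    return gamma

M = np.random.rand(200, 200)
A = M @ M.T + 200.0 * np.eye(200)            # arbitrary SPD test matrix
print(cg_eig_estimate(A))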
The verification test is that the value γ = λ + 1/(x^T z) (where λ is a parameter that depends on problem size) must agree with a reference value to a specified tolerance.
FT. Here, a 3-D partial differential equation is solved using FFTs. This performs the essence of many "spectral" codes. It is a rigorous test of heavy long-distance communication performance. The problem is to numerically solve the partial differential equation (PDE)

  ∂u(x, t)/∂t = α ∇²u(x, t),

where x is a position in three-dimensional space. When a Fourier transform is applied to each side, this equation becomes

  ∂v(z, t)/∂t = −4απ² |z|² v(z, t),

where v(z, t) is the Fourier transform of u(x, t). This has the solution

  v(z, t) = e^(−4απ² |z|² t) v(z, 0).

The benchmark problem is to solve a discrete version of the original PDE by computing the forward 3-D discrete Fourier transform (DFT) of the original state array u(x, 0), multiplying the results by certain exponentials, and then performing an inverse 3-D DFT. Of course, the DFTs can be rapidly evaluated by using a 3-D fast Fourier transform (FFT) algorithm. The verification test for this problem is to match the checksum of a certain subset of the final array with reference values.
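The computational core of this benchmark can be sketched in a few lines of NumPy. The grid size, diffusion constant, and time value below are arbitrary, and the wavenumber convention (frequencies in cycles per unit length) matches the 2π-based exponential factor above.

import numpy as np

def evolve(u0, alpha, t, L=1.0):
    """Forward 3-D FFT of the initial state, multiplication by the factors
    exp(-4*alpha*pi^2*|z|^2*t), and an inverse 3-D FFT."""
    n = u0.shape[0]
    z = np.fft.fftfreq(n, d=L / n)                        # wavenumbers along one axis
    zx, zy, zz = np.meshgrid(z, z, z, indexing="ij")
    z2 = zx**2 + zy**2 + zz**2                            # |z|^2 on the 3-D grid
    v0 = np.fft.fftn(u0)                                  # forward 3-D DFT of u(x, 0)
    vt = np.exp(-4.0 * alpha * np.pi**2 * z2 * t) * v0    # v(z, t)
    return np.fft.ifftn(vt).real                          # back to physical space

u = evolve(np.random.rand(64, 64, 64), alpha=1e-3, t=0.5)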
IS. This kernel performs a large integer sort operation that is important in certain "particle method" codes. It tests both integer computation speed and communication performance. The specific problem is to generate a large array by a certain scheme and then to sort it. Any efficient parallel sort scheme may be used. The verification test is to certify that the array is in sorted order (full details are given in the benchmark document).
LU. This performs a synthetic computational fluid dynamics (CFD) calculation by solving regular-sparse, block (5 × 5) lower and upper triangular systems.
SP. This performs a synthetic CFD problem by solving multiple, independent systems of non-diagonally dominant, scalar, pentadiagonal equations.
BT. This performs a synthetic CFD problem by solving multiple, independent systems of non-diagonally dominant, block tridiagonal equations with a (5 × 5) block size.
These last three "simulated CFD benchmarks" together represent the heart of the computationally intensive building blocks of CFD programs in most common use today for the numerical solution of three-dimensional Euler/Navier–Stokes equations using finite-volume, finite-difference discretization on structured grids. LU and SP involve global data dependencies. Although the three benchmarks are similar in many respects, there is a fundamental difference with regard to the communication-to-computation ratio. BT represents the computations associated with the implicit operator of a newer class of implicit CFD algorithms. This kernel exhibits somewhat more limited parallelism compared with the other two. In each of these three benchmarks, the same high-level synthetic problem is solved. This synthetic problem differs from a real CFD problem in the following important aspects:
1. Absence of realistic boundary algorithms
2. Higher than normal dissipative effects
3. Lack of upwind differencing effects
4. Absence of geometric stiffness introduced through boundary-conforming coordinate transformations and highly stretched meshes
5. Lack of evolution of weak solutions found in real CFD applications during the iterative process
6. Absence of turbulence modeling effects
Full details of these three benchmark problems are given in the benchmark document, as are the verification tests.
Evolution of the NAS Parallel Benchmarks
The original NPB suite was accompanied by a set of "reference" implementations, including implementations for a single-processor workstation, an Intel Paragon system (using Intel's message passing library), and a CM- from Thinking Machines Corp. Also, the original release defined three problem sizes: Class W (for the sample workstation implementation), Class A
(for what was at the time a moderate-sized parallel computer), and Class B (for what was at the time a large parallel computer). The Class B problems were roughly four times larger than the Class A problems, both in total operation count and in aggregate memory requirement. After the initial release of the NPB, the NASA team published several sets of performance results, for example, [, ], and in the next few years, numerous teams of scientists and computer vendors submitted results. Within a few years, several of the original NPB team had left NASA Ames Research Center, and, for a while, active support and collection of results lagged. Fortunately, some other NASA Ames scientists, notably William Saphir, Rob Van der Wijngaart, Alex Woo, and Maurice Yarrow, stepped in to continue support and development. These researchers produced a reference implementation of the NPB with MPI constructs, which, together with a few minor changes, was designated NPB 2. A Class C problem size was then defined in a subsequent NPB 2 release. Another enhancement was the addition of the "BT I/O" benchmark. In this benchmark, implementers were required to output some key arrays to an external file system as the BT program is running. Thus, by comparing the performance of the BT and BT I/O benchmarks, one could get some measure of I/O system performance. This change, together with the definition of Class D problem sizes, was released as a later NPB 2 version. NASA Ames researchers Haoqiang Jin, Michael Frumkin, and Jerry Yan subsequently released NPB 3, which included an OpenMP reference implementation of the NPB. Rob van der Wijngaart and Michael Frumkin also released a grid version of the NPB. In the next year or two, several additional minor improvements and bug fixes were made to all reference implementations, and Class E problem sizes were defined. Reference implementations were released in Java and High Performance Fortran, and a "multi-zone" benchmark, reflective of a number of more modern CFD computations, was added. These changes were incorporated in the latest NPB 3 release available on the NASA Web site. Full details of the current version of NPB, as well as the actual documents and reference implementations, are available at http://www.nas.nasa.gov/Resources/Software/npb.html
Related Entries
Benchmarks
HPC Challenge Benchmark
LINPACK Benchmark
Acknowledgment Supported in part by the Director, Office of Computational and Technology Research, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy, under contract number DE-AC-CH.
Bibliography
1. Bailey DH () Twelve ways to fool the masses when giving performance results on parallel computers. Supercomput Rev – (also published in Supercomputer; available at http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf)
2. Bailey DH, Barszcz E, Barton JT, Browning DS, Carter RL, Dagum L, Fatoohi RA, Frederickson PO, Lasinski TA, Schreiber RS, Simon HD, Venkatakrishnan V, Weeratunga SK () The NAS parallel benchmarks. Int J Supercomput Appl ():–
3. Bailey DH, Barszcz E, Barton JT, Browning DS, Carter RL, Dagum L, Fatoohi RA, Frederickson PO, Lasinski TA, Schreiber RS, Simon HD, Venkatakrishnan V, Weeratunga SK () The NAS parallel benchmarks. Technical report RNR--, NASA Ames Research Center, Moffett Field, CA, revised (available at http://www.nas.nasa.gov/News/Techreports//PDF/RNR--.pdf)
4. Bailey DH, Barszcz E, Dagum L, Simon HD () NAS parallel benchmark results. In: Proceedings of Supercomputing, Minneapolis, MN, pp –
5. Bailey DH, Barszcz E, Dagum L, Simon HD () NAS parallel benchmark results. IEEE Parallel Distr Tech –
N-Body Computational Methods
Joseph Fogarty, Hasan Aktulga, Sagar Pandit, Ananth Y. Grama
Purdue University, West Lafayette, IN, USA
University of South Florida, Tampa, FL, USA
Synonyms Particle dynamics; Particle methods
Definition
N-body computation refers to a class of simulation methods that model the behavior of a physical system using a set of discrete entities (e.g., atoms, astrophysical
bodies, etc.) and a set of interactions among them (coupling potentials). These simulations are typically time dependent. In each timestep, attributes of the discrete entities are updated (typically force, acceleration, velocity, and position), and the process is repeated, to study spatiotemporal evolution of the system.
Discussion
Introduction
Many interesting physical problems can be modeled as the time-evolution of a set of interacting, classical objects (assumed to be point masses). The behavior of these particles is governed by Newton's second law. A series of seminal efforts by, among other notables, Newton, Euler, Lagrange, Hamilton, Delaunay, and Sundman demonstrated the difficulty of analytically solving the problem for systems comprised of more than two bodies. Simulation techniques solve the equations for many bodies by discretizing time and integrating particle behavior over discrete timesteps. These particles are treated as point masses, interacting through specified potentials. Systems of interest, though small on macroscopic or astronomical scales, normally comprise a very large number of particles. The tools of statistical mechanics have been developed to analyze the macroscopic (average) properties and behavior of such large systems. While powerful in the context of many problems, statistical mechanics is ill-suited to the study of smaller structures, such as nanoparticles and interactions of individual molecules. N-body simulations are able to overcome these limitations, providing powerful simulation methodologies.
Computational Techniques and Algorithmic Considerations
Astrophysical simulations are classical applications of N-body methods. In these simulations, interacting particles are large-scale stellar structures (stars, stellar clusters, galaxies, galaxy clusters, dark matter, etc. [, ]) interacting through long-range gravitational forces. Range here refers to the distance dependence of the potential function. At the other end of the spatiotemporal spectrum, one may model atomic-scale systems to understand physical properties of materials or behavior of molecular systems. Individual particles in such
systems correspond to atoms, and coupling potentials span long-range electrostatics and short-range Lennard-Jones potentials. In molecular scale systems, properties are modeled using customized coupling potentials. While the overall structure of computation is similar for these diverse N-body simulations, there are significant differences in the nature and computational cost of interactions and required time resolution. The computational challenges in astrophysical simulations stem largely from the long-range nature of the gravitational force, whereas atomistic simulations involve a variety of complex interaction potentials. This article discusses N-body methods primarily in the context of atomistic simulations. Where relevant, it highlights differences with respect to astrophysical simulations and the implications thereof. In molecular dynamics (MD), the system is modeled as a set of point charges (particles) whose evolution in time is governed by classical mechanics. Potentials are used to simulate electrostatics, harmonic bonds, hard sphere repulsion, and other types of interactions. Simulations may also be performed using a Monte Carlo (MC) approach. As its name suggests, the MC technique relies on a randomized process, appropriately sampling an ensemble to infer statistical quantities. The physical environment modeled in a simulation further constrains the computational approach. For example, simulations may be performed under different collections of microstates (ensembles). Under the microcanonical ensemble (NVE), particle number (N), volume (V), and total energy (E) are held constant. The isochoric–isothermal canonical ensemble (NVT) maintains constant temperature (T) by placing the system in contact with a thermal reservoir. This requires the implementation of a thermostat such as Andersen [], Nosé–Hoover [], or Berendsen []. The isobaric–isothermal canonical ensemble (NPT) adds a piston to the canonical ensemble. The volume of the system is allowed to vary in order to maintain a constant pressure (P). Different barostats such as Berendsen [] and Parrinello–Rahman [] may be used for this purpose. In a general N-body problem, forces and potentials governing the motion of particles can be classified as short- or long-range. Short-range potentials decay faster than the growth of the volume element
in three dimensions (i.e., they approach zero faster than 1/r³) []. Consequently, they can be appropriately truncated for a desired level of accuracy. Long-range forces, on the other hand, do not decay as rapidly – requiring either expensive computation or algorithmic techniques for reducing the complexity. Note that a simple all-to-all computation of long-range forces would result in an O(N²) complexity per timestep. Algorithmic techniques typically rely on hierarchical decompositions to aggregate/approximate the impact of distant particle clusters, thereby reducing overall complexity. Methods such as Barnes–Hut (BH) [] and Fast Multipole Methods (FMM) [] rely on these principles to reduce the O(N²) complexity to O(N log N) and O(N), respectively.
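As a point of reference, the sketch below shows the naive all-to-all force evaluation whose O(N²) cost motivates these hierarchical methods. It assumes unit-mass particles interacting through a softened gravitational attraction, with the softening parameter and particle count chosen arbitrarily.

import numpy as np

def direct_forces(pos, eps=1e-3):
    """O(N^2) all-pairs force evaluation for unit-mass particles under a
    softened 1/r^2 attraction (gravitational constant taken as 1)."""
    forces = np.zeros_like(pos)
    for i in range(pos.shape[0]):
        d = pos - pos[i]                        # vectors from particle i to all others
        r2 = np.sum(d * d, axis=1) + eps**2     # softened squared distances
        r2[i] = np.inf                          # exclude the self-interaction
        forces[i] = np.sum(d / r2[:, None]**1.5, axis=0)
    return forces

f = direct_forces(np.random.rand(1000, 3))      # cost grows quadratically with N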
Parallelization Challenges
In parallelizing an N-body simulation, two primary considerations motivate the choice of how the computation should be decomposed into different processes. First, the load on each processor should be approximately equal. Second, the system should be decomposed so as to minimize interprocessor communication, which is a primary consideration for computational efficiency. A number of decomposition techniques satisfy these constraints to varying degrees.
Domain Decomposition: In domain decomposition, the physical system is divided into regions of space. A constituent particle falling within a specified volume is assigned to the corresponding processor. Interprocessor communication is needed for computing forces on boundary particles, as well as to "push" particles at the end of the timestep. Since timesteps are small, the system does not change drastically from one timestep to the next. For this reason, the communication associated with pushing particles at the end of a timestep is typically low. Great care must be exercised to reduce the communication overhead. Load balancing an N-body simulation with domain decomposition is most easily achieved when the system consists of identical particles and uniform density. In this case, an equipartitioning of the space into processor subdomains ensures load balance. Since systems of interest are rarely so
simple, more sophisticated domain decomposition techniques are required. One such technique is oct-tree decomposition. In this method, the simulation box is recursively divided until the desired number of particles is contained within each subdomain. The oct-tree is rebuilt when necessary to achieve the desired load balance. Implementation and maintenance of the oct-tree decomposition method is much more complicated than equipartitioning of the space. Another domain decomposition technique that ensures dynamic load balance is the staggered decomposition method used by GROMACS []. Here the processor subdomains are aligned in one dimension, but staggered in the other two dimensions. Based on the CPU time taken by each processor at any timestep, processor boundaries slide to obtain an even work-load distribution. Due to the sliding boundaries and the staggered arrangement of processors, interprocessor communication is complicated.
Particle Decomposition: In a particle decomposition, particles are assigned to processors for the duration of the simulation. This method yields good load balancing despite nonuniform density and is best suited to systems with minimal translational motion, for example, simulations of solids. Communication overhead, however, may be high when the neighborhood of atoms is rapidly changing.
Interaction Decomposition: Interaction decomposition assigns specific interactions to each processor, resulting in load balance. Communication is minimized since any interacting atoms are handled by the same processor. Despite its desirable characteristics, the implementation of an interaction-based decomposition is often difficult, since it requires prior knowledge of the number and nature of interactions.
Zonal Decomposition: A recently developed algorithm [] greatly reduces communication costs. In this method, spatial decomposition is the primary approach. In addition to local atoms, buffer atoms are communicated so that all interactions are computed locally. Inclusion of buffer atoms is based on interactions rather than spatial location. Note that at present zonal methods are limited to pair-wise interactions.
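The core bookkeeping of a uniform domain decomposition is simple, as the sketch below illustrates: each particle is mapped to the rank owning the spatial cell that contains it. The processor-grid shape and box dimensions are illustrative, and dynamic load balancing and boundary (ghost) exchange are omitted.

import numpy as np

def assign_to_ranks(pos, box, pgrid=(2, 2, 2)):
    """Map each particle to the rank owning the rectangular subdomain that
    contains it, assuming a uniform processor grid over the simulation box."""
    pgrid = np.asarray(pgrid)
    cell = np.floor(pos / box * pgrid).astype(int)        # 3-D subdomain index
    cell = np.clip(cell, 0, pgrid - 1)                    # particles on the upper faces
    return (cell[:, 0] * pgrid[1] + cell[:, 1]) * pgrid[2] + cell[:, 2]

owners = assign_to_ranks(np.random.rand(10_000, 3), box=np.ones(3))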
Boundary Conditions
In typical large-scale simulations, one does not simulate the entire domain, but rather a small "cutout" of it: a small subdomain assumed to be surrounded by infinitely replicated copies of itself in each direction. This assumption (infinitely replicated domains) yields the boundary conditions for the system and is referred to as a periodic boundary condition. Periodic boundary conditions lead to problems with long-range potentials, as all (infinitely many) of the periodic images interact with the system. Specifically, electrostatic interactions, which decay as 1/r, need to be handled with a method that accounts for infinitely many periodic images. Most commonly, the Ewald method [] is used for this purpose. Implementation of this method requires the use of a reciprocal lattice. The complexity of this computation stems from the underlying Fourier transforms. Periodic boundaries also lead to difficulty in defining and computing external pressure. For pair-additive potentials, the virial expansion can be used to calculate pressure despite the implementation of periodic boundary conditions. Polarizable force fields and ab initio molecular dynamics must use a more computationally expensive approach [].
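For short-range terms under periodic boundaries, the usual device is the minimum-image convention, sketched below for a cubic box; the box size and coordinates are illustrative.

import numpy as np

def minimum_image(dr, box):
    """Wrap each component of a displacement vector into [-box/2, box/2),
    so that the nearest periodic image is used for distance evaluation."""
    return dr - box * np.round(dr / box)

box = 10.0
r1, r2 = np.array([0.5, 9.8, 5.0]), np.array([9.7, 0.2, 5.1])
dist = np.linalg.norm(minimum_image(r1 - r2, box))   # ~0.9, not ~13.3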
Parallel Implementation of N-body Methods
A number of important domain considerations influence the choice of algorithms, resulting computational complexity, structure, and parallelization strategy of N-body methods. The main factors that influence parallelization include (i) handling of long-range potentials, (ii) computation of short-range potentials, (iii) implementation of suitable boundary conditions, (iv) time-integration of the interaction, and (v) additional processing associated with particles (analysis routines, checkpointing trajectories, etc.). This section describes in more detail the nature of these computations and their parallelization.
Interactions in N-Body Simulations
A molecular dynamics simulation begins with an initial geometry for the system. Normally, assuming equipartition of energy, each atom is assigned a random velocity based on the Maxwell–Boltzmann distribution.
The interaction potentials between atoms are then calculated. Forces are computed from the potentials using the equation F⃗ = −∇ϕ, where F⃗ is the force and ϕ the interaction potential. The resulting coupled differential equations are then integrated in order to determine the velocities and positions of each atom in the next timestep. The vast majority of processing time in an N-body simulation is spent in the computation of interaction potentials and forces. Parallelization of these computations is the major challenge in large-scale N-body simulations. This subsection focuses on the methods for parallelizing the computation of interatomic interactions. These interactions are broadly classified into two types, viz., bonded and nonbonded. Standard classical MD approaches treat bonds as invariants. Hence the parallelization of the computation of bonded terms is relatively straightforward, depending on the decomposition method. Nonbonded interactions, on the other hand, require more careful treatment. These terms are further classified as short range or long range. By their very nature, long-range interactions are more difficult to compute.
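A minimal velocity-Verlet integration loop, the standard way these coupled equations are advanced, is sketched below; the force routine is passed in as a parameter, and the toy two-particle harmonic example is purely illustrative.

import numpy as np

def velocity_verlet(pos, vel, force, mass, dt, n_steps):
    """Advance positions and velocities with the velocity-Verlet scheme;
    force(pos) may be any routine returning per-particle forces."""
    pos, vel = pos.copy(), vel.copy()
    f = force(pos)
    for _ in range(n_steps):
        vel += 0.5 * dt * f / mass      # half kick
        pos += dt * vel                 # drift
        f = force(pos)                  # forces at the new positions
        vel += 0.5 * dt * f / mass      # second half kick
    return pos, vel

# Toy example: two unit masses coupled by a zero-rest-length spring
spring = lambda x: -1.0 * (x - x[::-1])
p, v = velocity_verlet(np.array([[0.0], [1.5]]), np.zeros((2, 1)), spring, 1.0, 0.01, 1000)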
Computation of Bonded Terms
Bonded interactions in molecular dynamics are typically modeled using a harmonic potential. Much as the compression or elongation of a spring changes its potential energy, bond stretching and shrinking contributes to the overall energy of the system. These two-body terms, determined by the positions of the bonded atoms and their types, must be computed at each timestep. In addition, higher-order bonded terms such as three-body (angle) and four-body (torsion and dihedral) terms must be calculated. To compute bond and higher-order terms, lists of all two-, three-, and four-body interactions are maintained. Using suitable indices, the positions of each atom's bonded neighbors, and thus the corresponding forces, can be computed. Since conventional molecular dynamics approaches do not model the process of bond breaking or formation, these lists are static. For computing these terms, an interaction-based decomposition of the computation is preferred, since the interacting constituents are known and constant.
Computation of Short-Range Terms
Typical coupling terms between bodies decay with distance. For example, the electrostatic coupling potential decays as 1/r, the gravitational interaction decays as 1/r², the Lennard-Jones potential decays as 1/r⁶, etc. Depending on the rate of decay and the required accuracy, it is possible to truncate these coupling terms. For example, Lennard-Jones potentials are often truncated at some multiple of σ, the distance at which the interparticle potential becomes zero; for typical molecular dynamics simulations, this is on the order of 10 Å. Short-range interactions, due to their dynamic nature, typically require spatial decomposition to reduce computational cost. Computing short-range interactions for a particle requires complete knowledge of all particles within the neighborhood defined by the cutoff distance. If one assumes a constant (or bounded) particle density and a constant cutoff distance, it is easy to see that the number of short-range interactions is constant (or bounded) as well. The challenge is to identify all particles within the cutoff distance from a given particle. Most interactions modeled are governed by central forces, depending only on the distance separating two bodies. Determining the distances between particles, which is the most expensive part of a short-range computation, is achieved by the use of neighbor lists. Since a brute-force algorithm for neighbor list generation involves an O(N²) computation, more efficient methods have been developed. For instance, a grid-based search is often utilized to reduce the complexity to O(N). The granularity of the grid has a large effect on the net efficiency of the method. The grid method can be further improved by excluding a large number of computations within the grid based on the geometry. Careful consideration must be given to the design of neighbor lists in order to maximize cache performance. In molecular dynamics, particles typically do not have large displacements. For this reason, neighbor list generation can be delayed with the use of Verlet lists. Communication overhead for neighbor list generation can be further reduced by the use of a buffer zone, known as a skin, around processor boundaries.
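The grid-based (cell-list) search mentioned above can be sketched as follows. The box size, cutoff, and particle count are illustrative, distances use the minimum-image convention, and at least three cells per dimension are assumed.

import numpy as np
from collections import defaultdict

def cell_list_pairs(pos, box, rcut):
    """Bin particles into cells of side >= rcut and gather interacting pairs
    only from the 27 neighboring cells (O(N) work for bounded density)."""
    ncell = int(box // rcut)                 # assumed >= 3 here
    side = box / ncell
    cells = defaultdict(list)
    for idx, p in enumerate(pos):
        cells[tuple((p // side).astype(int) % ncell)].append(idx)
    pairs = []
    for (cx, cy, cz), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    neigh = cells.get(((cx + dx) % ncell, (cy + dy) % ncell, (cz + dz) % ncell), [])
                    for i in members:
                        for j in neigh:
                            if i < j:
                                d = pos[i] - pos[j]
                                d -= box * np.round(d / box)       # minimum image
                                if d @ d < rcut * rcut:
                                    pairs.append((i, j))
    return pairs

pairs = cell_list_pairs(np.random.rand(500, 3) * 10.0, box=10.0, rcut=2.5)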
While typical molecular dynamics simulations are consistent with a bounded particle density assumption, astrophysical N-body problems may not satisfy this assumption. For such simulations, we must generalize the cell structure into an oct-tree decomposition of the domain. With suitably optimized oct-tree structures, we can realize O(N) complexity for short-range interactions for arbitrarily unstructured particle distributions as well. For a detailed discussion of these results, please see the work of Callahan and Kosaraju []. In some atomistic simulations, such as ReaxFF [], bond activity, and hence chemical reactivity, is explicitly modeled through the computation of a distance-dependent bond order. In effect, a bond order takes interatomic distances and computes the likelihood of bonds between pairs of atoms. This bond order is subsequently corrected to account for valency. In such cases, bond lists must be updated at each timestep and bonded interactions must be treated as short-range interactions.
Computation of Long-Range Terms
Coupling potentials that are proportional to 1/r (e.g., electrostatic or gravitational potentials) introduce unacceptable errors when truncated. For this reason, these terms are handled separately. Naively computing long-range potentials on an all-to-all basis results in an O(n²) complexity (since each of the n particles is impacted by every other particle). For large systems, this per-timestep complexity is not tractable. Consequently, a number of approximation techniques have been proposed to reduce complexity while incurring minimal error. The basis for a family of approximation techniques is as follows: when a group of particles is sufficiently distant from an observation point, their impact can be approximated through a single interaction with the center of mass of the distant particle cluster. Indeed, when we compute the gravitational potential at a point on earth due to the moon, we approximate the volume integral (over the moon's volume) by a point mass at the center of the moon. Since the radius of the moon is much less than its distance from earth, this approximation introduces controllable errors. Many fast algorithms use this principle to reduce the computational complexity of N-body methods. These fast algorithms use a hierarchical representation of the domain using a spatial tree data structure. The leaf nodes consist of aggregates of particles. Each node in
the tree contains a series representation of the effect of the particles contained in the subtree rooted at that node. These representations are typically based on Taylor or Legendre polynomials. Interactions between nodes and particles are dictated by a multipole acceptance criterion (MAC). This criterion determines whether the error in a particular interaction is within acceptable bounds or not (can the moon really be treated as a point object with respect to the earth or not?). Different algorithms use different MACs. Selection of an appropriate MAC is critical to controlling the error in the simulation. Methods in this class include those due to Appel [, ], Barnes and Hut [], and Greengard and Rokhlin [, , ]. The Barnes–Hut method is one of the most popular methods due to its simplicity. It works in two phases: the tree-construction phase and the force-computation phase. In the tree-construction phase, a spatial tree representation of the domain is derived. At each step in this phase, if the domain contains more than one particle, it is recursively divided into eight equal parts. This process continues until each oct has a single element in it. The resulting tree is an unstructured oct-tree for arbitrary particle distributions. This tree is now traversed in post-order. Each internal node in the tree computes and stores an approximate representation of the particles contained in the subtree rooted at it. Once the tree has been computed, the force on each particle can be computed as follows: the multipole acceptance criterion is applied to the root of the tree to determine if an interaction can be computed; if not, the node is expanded and the process is repeated for each of the four (or eight) children. The multipole acceptance criterion for the Barnes–Hut method computes the ratio of the dimension of the box to the distance of the point from the center of mass of the box. If this ratio is less than some constant α (denoted a in the figure), an interaction can be computed. The algorithm is illustrated in the figure. For a balanced tree, each of the n particles needs O(log n) interactions. This results in a total computational complexity of O(n log n). However, the tree size can be made arbitrarily large by bringing a pair of particles closer together. The corresponding tree needs a large number of boxes to resolve the pair into separate boxes. Due to this, the worst-case complexity of this technique is unbounded [, ]. However, using box-collapsing techniques (the box is first collapsed to the smallest box
N-Body Computational Methods. Fig. Illustration of the serial Barnes–Hut method. (The original figure shows a square domain of dimension l divided into four subdomains of dimension l1, the centers of mass of the domain and of its subdomains, and a source particle at distances d and d1, d2, d3 from them, with the following test: if l/d < a, compute a direct force interaction with the center of mass of the domain; else if l1/d1 < a, compute a direct force interaction with the center of mass of subdomain 1, else expand subdomain 1 further. Similar criteria are applied to subdomains 2, 3, and 4.)
that contains all the particles in the subdomain), this complexity can be reduced to O(n log n) []. In addition to box-collapsing, it has been shown that using binary decomposition of the domain (instead of quad- or oct-tree decompositions) yields better aspect ratios and a reduced operation count for treecodes. The fast multipole method (FMM) of Greengard and Rokhlin [] is another hierarchical technique for computing n-body interactions. FMM computes the potential due to a cluster of particles at the center of well-separated clusters. This can then be disseminated to individual particle positions to determine the required potentials. FMM therefore uses cluster–cluster interactions in addition to particle–cluster interactions. The computational complexity of FMM can be shown to be O(n) for well-structured particle distributions. Several parallel formulations of hierarchical methods for long-range electrostatics have been proposed [, , ]. These methods generally rely on a spatially oriented domain decomposition for constructing hierarchical representations, as well as for computing the potential at each particle. Hierarchical representations of the domain are constructed in parallel as follows: each processor starts with an initial subdomain assigned (exclusively) to it. Each processor constructs a hierarchical representation (oct-tree) independently using only the particles assigned to it. In each processor's oct-tree, it is possible to identify a minimal set of nodes such that all particles in the corresponding domains are exclusively assigned to a single processor. These nodes are referred
to as branch nodes. It can be seen that the hierarchical representations at and below the branch nodes are accurate. To construct a consistent image of the top part of the tree, branch nodes are broadcast to all processors. These nodes are inserted into each processor's tree, and the top part of the tree is recomputed by all the processors. This results in a data structure at each processor that is accurate at and above the branch nodes, as well as below local branch nodes. Each processor can traverse this hierarchical representation to compute the electrostatic potential at its assigned particles. The tree traversal phase can proceed in two different ways. As a tree is traversed, a processor may encounter nodes that are not present locally. In this case, the node may either be fetched from the remote processor (data shipping), or the particle itself may be shipped to the remote processor that holds the data (function shipping). The schemes of Singh et al. [] and Warren and Salmon [] are based on the data-shipping paradigm, whereas the scheme presented in [] is based on a function-shipping paradigm. Function shipping simplifies the problem of addressing arbitrary nodes in a tree, since only branch nodes are remotely addressed. This is implemented using a hash table to access branch nodes. In data-shipping formulations, a processor may request an arbitrary node from a remote processor. Therefore, data-shipping formulations provide fast access to the entire data set. This is typically accomplished by using the Morton key as a hash function.
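To make the serial Barnes–Hut idea described above concrete, the sketch below computes the acceleration on one target particle by recursive octant subdivision. Unit masses, a simple opening criterion size/distance < θ, and an arbitrary softening are assumed, and no explicit tree is built (the particle set is re-partitioned at each level), so this is a structural illustration rather than an efficient implementation.

import numpy as np

def bh_accel(target, pts, center, size, theta=0.5, eps=1e-3):
    """Barnes-Hut-style acceleration on `target` from unit-mass particles
    `pts` contained in a cube of side `size` centered at `center`."""
    if len(pts) == 0:
        return np.zeros(3)
    com = pts.mean(axis=0)                       # center of mass of the cell
    d = com - target
    dist = np.linalg.norm(d)
    if len(pts) == 1 or size < 1e-9 or (dist > 0 and size / dist < theta):
        return len(pts) * d / (dist**2 + eps**2)**1.5   # one interaction with the cluster
    acc = np.zeros(3)
    for octant in range(8):                      # otherwise, open the cell
        sign = np.array([1 if (octant >> k) & 1 else -1 for k in range(3)])
        mask = np.all((pts >= center) == (sign > 0), axis=1)
        acc += bh_accel(target, pts[mask], center + sign * size / 4, size / 2, theta, eps)
    return acc

pts = np.random.rand(2000, 3)
a = bh_accel(pts[0], pts, center=np.full(3, 0.5), size=1.0)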
In some applications, particle distributions may be highly unstructured. In such cases, a naive domain partitioning potentially results in significant load imbalance or high communication overhead. To address this problem, space-filling curves such as the Peano–Hilbert ordering (costzones) and the Morton ordering (also known as the Z-curve) are used. Particles are ordered in a list according to the space-filling curve and assigned to processors by partitioning the list so that each processor gets an equal load. An estimate of each particle's load can be derived by keeping track of the computation associated with the particle in the previous timestep. The motivation for using space-filling curves comes from spatial proximity and from the fact that both of these orderings number all particles in an oct before proceeding to subsequent octs. Furthermore, in data-shipping message-passing formulations, space-filling curves allow easy location of arbitrary nodes in a tree without expensive pointer chasing. The computation of long-range interactions is complicated by the application of periodic boundary conditions. Since these interactions cannot be easily truncated, naive computation of long-range interactions under periodic boundaries leads to an infinite number of terms. The Ewald sum is a method originally developed to calculate the electrostatic interaction in crystals. Systems with periodic boundary conditions mimic the infinite lattice of a crystal structure. Hence, the Ewald method is easily applied to molecular dynamics simulation. The Particle-Mesh Ewald summation decomposes the long-range potential into a short-range term that sums quickly in real space and a long-range term that sums quickly in Fourier space. The latter term is computed using a Fast Fourier Transform, which contributes primarily to the computational cost of the method. Parallel techniques for computing Fast Fourier Transforms have been well studied in the literature []. The scalability of these methods is critically impacted by the bisection bandwidth of the parallel platform on which they are implemented. For this reason, this part of the algorithm does not easily scale to large numbers of processors. One way of circumventing this scalability bottleneck is to execute the FFT separately on a subset of processors, while other processors execute other parts of the algorithm. Indeed, as we describe, a number of existing codes use this approach.
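The Morton (Z-order) key referred to above is simply the bit-interleaving of the three cell coordinates, as the small sketch below shows; the number of bits per dimension is an arbitrary choice.

def morton_key(ix, iy, iz, bits=10):
    """Interleave the bits of three cell indices into a single Morton key,
    which linearizes an oct-tree along a Z-shaped space-filling curve."""
    key = 0
    for b in range(bits):
        key |= (((ix >> b) & 1) << (3 * b + 2) |
                ((iy >> b) & 1) << (3 * b + 1) |
                ((iz >> b) & 1) << (3 * b))
    return key

print(morton_key(3, 5, 6))   # cells that are close in space tend to get nearby keys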
Since different interactions are ideally parallelized by different decomposition methods, the nature of the interactions within an N-body computation determines the preferred method. Since spatial decomposition is often preferred in situations containing nonbonded interactions, bonded interactions must often be computed under non-ideal circumstances. The communication of neighboring atoms under spatial decomposition is further complicated by the existence of bonded interactions. Since bonded interactions require atoms several bonds away, these atoms must be communicated regardless of their distance from the processor boundary. Therefore, the generation of a list of skin atoms must also incorporate atoms involved in a 2-, 3-, or 4-body interaction across the subdomain boundary, in addition to atoms within the cutoff distance.
Some Major Public-Domain Efforts
GROMACS
GROMACS is a freely available, open source molecular dynamics code in C distributed under the GPL. It was originally developed at the University of Groningen and is currently maintained by research groups in Europe. Parallelization of GROMACS has been significantly improved with the release of GROMACS []. It relies on the eighth-shell domain decomposition technique introduced by the D. E. Shaw group [] to reduce communication bandwidth requirements. Processor domains are aligned along the x dimension, but staggered along the y and z dimensions to facilitate dynamic load balancing.
NAMD
NAMD [] is a parallel molecular dynamics code designed primarily for the simulation of large biomolecular systems. It is written in C++, based on Charm++ parallel objects. NAMD works with the popular biomolecular simulation force fields AMBER and CHARMM. The dynamic load-balancing procedure implemented in NAMD has a slow start, but performs well in the long run by scaling to a large number of processors. Recent benchmarks, however, show that NAMD scales worse than Desmond and GROMACS []. NAMD also offers biomolecular utilities in addition to common molecular simulation functionalities.
Outstanding Challenges
Long Timestep Approaches
One of the key challenges in scalable N-body computations is the serialization bottleneck of synchronized timesteps. Velocity Verlet is the most widely used scheme for integrating the Newtonian equations of motion in MD simulations due to its simplicity and symplectic nature. Studies of nonlinear stability indicate that the timestep lengths usable with Verlet integrators are upper bounded by √2·P/π [, ], where P is the period of the fastest oscillation in a given system. Considering the presence of hydrogen atoms in most simulations, P is on the order of 10 fs, and so the timestep length δt must be limited to a few femtoseconds; in fact, timesteps of 1–2 fs are used in most classical MD simulations. Lagrange multiplier-based constraining algorithms such as SHAKE [], RATTLE [], and LINCS [] constrain the bond lengths and freeze the fastest oscillations in the target system, allowing for longer timesteps. However, these methods are not extensible to the next-highest-frequency oscillations due to strong vibrational coupling. Even though SHAKE and RATTLE are efficient serial algorithms, their parallel implementations do not currently scale well due to the iterative schemes needed for solving the linear systems involved. LINCS can, however, be made efficient and scalable []. Another way to achieve longer timesteps is hierarchical timestepping, also known as multiple timestep (MTS) methods. These methods compute slower-changing interactions less frequently, using nested loops, reducing the average computation per timestep. MTS methods can be parallelized like regular MD methods, since they only change the frequency of computation of certain forces; the way these forces are computed stays the same.
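The nested-loop structure of an MTS integrator can be sketched as follows (an r-RESPA-style two-level scheme; the split into "fast" and "slow" forces, the toy system, and all constants are illustrative):

import numpy as np

def mts_step(pos, vel, fast_force, slow_force, mass, dt, n_inner):
    """One outer step: slow forces kick with timestep dt, fast forces are
    integrated with velocity Verlet at the smaller timestep dt/n_inner."""
    pos, vel = pos.copy(), vel.copy()
    vel += 0.5 * dt * slow_force(pos) / mass        # outer half kick (slow forces)
    h = dt / n_inner
    for _ in range(n_inner):                        # inner loop (fast forces)
        vel += 0.5 * h * fast_force(pos) / mass
        pos += h * vel
        vel += 0.5 * h * fast_force(pos) / mass
    vel += 0.5 * dt * slow_force(pos) / mass        # outer half kick
    return pos, vel

# Toy system: a stiff bond (fast) plus a weak external spring (slow)
x, v = np.array([0.0, 1.1]), np.zeros(2)
fast = lambda x: -100.0 * np.array([x[0] - x[1] + 1.0, x[1] - x[0] - 1.0])
slow = lambda x: -0.1 * x
x, v = mts_step(x, v, fast, slow, mass=1.0, dt=0.05, n_inner=10)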
Accelerated Dynamics
Parallel MD does not scale to long timescales because of time-serialization. Many interesting phenomena, however, take place at timescales beyond microseconds. Accelerated dynamics methods such as parallel replica dynamics, hyperdynamics, temperature accelerated dynamics, and on-the-fly kinetic Monte Carlo methods aim to extend the timescale limitations of molecular simulations by exploiting ideas from Transition State Theory (TST) []. In a system with N atoms, the potential energy surface is a function in 3N dimensions. This potential energy surface can be thought of as a composition of local minima separated by dividing surfaces. In a typical molecular simulation, the trajectory spends most of its time wandering around a basin, making only infrequent escapes to neighboring basins (which intuitively correspond to significant conformational changes or reactions). Accelerated dynamics methods focus on capturing the jumps from one basin to another accurately while accelerating the trajectory through its path inside a single basin by alternate means. Parallel replica dynamics [] is the simplest and most accurate of the accelerated dynamics techniques. As the name suggests, it replicates a system across processors, goes through a short dephasing stage in which correlations between these initially identical copies are annihilated, and then lets each copy run independently. When an escape is detected at any of the processors, all processors except the one that detects the escape are stopped. The trajectory that made the escape is traced further to detect any correlated escapes. The method itself is inherently parallel. Communication is minimal, but parallel efficiency is limited by both the dephasing stage (which does not advance the system clock) and the correlated-event stage, during which only one processor computes. The system size that can be simulated is bounded by the capabilities of a single node. Hyperdynamics [] is inspired by the basic concept of importance sampling. The potential energy surface in a simulation is modified by adding a nonnegative bias potential, which must be zero at and around the dividing surfaces. This way the energy barrier in front of a state transition is lowered, giving rise to an enhanced rate of trajectory escapes. While the dynamics inside a basin become meaningless with the addition of a bias potential, the evolution from state to state is correct, because the relative TST rates for escape paths are not altered. The acceleration comes from the boosting of timestep lengths around the minima. In hyperdynamics, time is advanced at each step by the regular timestep length scaled by the inverse Boltzmann factor for the bias potential at that point. The ideal bias potential should be easy to compute and give large boost
factors. Different bias potentials have been reported in the literature: one based on the lowest eigenvalue of the Hessian [], a flat bias potential [], and a local bias potential []. Very large boost factors have been reported using hyperdynamics []. A more approximate method is Temperature Accelerated Dynamics (TAD) [], which relies on the harmonic TST approximation. The idea here is to run the simulation at a high temperature, triggering more frequent trajectory escapes from a starting basin. The times of escapes from the starting basin are projected to their corresponding times at the original simulation temperature by using concepts from TST. Since the order of escapes at high temperature differs from that at the original temperature, once the earliest escape is detected with high probability, the TAD procedure is started again in the new basin. Boost factors achieved by TAD can be large enough to reach seconds of simulation time for certain systems []. The most important problem with accelerated dynamics methods is that they are practical only for small systems (tens to hundreds of atoms). Consequently, the underlying challenges associated with parallelization are not interesting. The most straightforward parallelization would be to couple parallel replica dynamics with other accelerated methods such as hyperdynamics and TAD, that is, to use the parallel replica scheme but, instead of running regular MD on each processor, to use hyperdynamics or TAD. Accelerated dynamics has been used to understand diffusion dynamics in solid-state systems and the folding of proteins. As these methods are further developed and used, scaling with system size and parallelization issues may emerge as important problems.
Bibliography
1. Aluru S () Greengard's n-body algorithm is not O(n). SIAM J Sci Comput ():–
2. Appel AW () An efficient program for many-body simulation. SIAM J Comput
3. Barnes J, Hut P () A hierarchical O(n log n) force calculation algorithm. Nature
4. Callahan PB, Kosaraju SR () A decomposition of multidimensional point-sets with applications to k-nearest-neighbors and n-body potential fields. In: Proceedings of the annual ACM symposium on theory of computing, Victoria, BC, Canada, pp –
5. Esselink K () The order of Appel's algorithm. Inform Process Lett :–
6. Grama A, Kumar V, Sameh A () Scalable parallel formulations of the Barnes–Hut method for n-body simulations. In: Proceedings of Supercomputing
7. Grama A, Gupta A, Karypis G, Kumar V () Introduction to parallel computing. Addison Wesley, Reading, MA
8. Greengard L () The rapid evaluation of potential fields in particle systems. MIT Press, Cambridge
9. Greengard L, Gropp W () A parallel version of the fast multipole method. Parallel Process Sci Comput, pp –
10. Greengard L, Rokhlin V () A fast algorithm for particle simulations. J Comp Phys :–
11. Singh J, Holt C, Hennessy J, Gupta A () A parallel adaptive fast multipole method. In: Proceedings of the Supercomputing conference
12. Warren M, Salmon J () A parallel hashed oct-tree N-body algorithm. In: Proceedings of Supercomputing, IEEE Computer Society Press, Washington, DC, pp –
13. Phillips JC, Braun R, Wang W, Gumbart J, Tajkhorshid E, Villa E, Chipot C, Skeel RD, Kalé L, Schulten K () Scalable molecular dynamics with NAMD. J Comput Chem ():–
14. Hess B, Kutzner C, van der Spoel D, Lindahl E () GROMACS 4: algorithms for highly efficient, load-balanced, and scalable molecular simulation. J Chem Theory Comput :–
15. Bowers KJ, Dror RO, Shaw DE () Zonal methods for the parallel execution of range-limited N-body simulations. J Comp Phys :–
16. Schlick T, Skeel RD, Brunger AT, Kale LV, Board JA Jr, Hermans J, Schulten K () Algorithmic challenges in computational molecular biophysics. J Comput Phys :–
17. Andersen HC () Rattle: a velocity version of the SHAKE algorithm for molecular dynamics calculations. J Comput Phys :
18. Garcia-Archilla B, Sanz-Serna JM, Skeel RD () Long-time-step methods for oscillatory differential equations. SIAM J Sci Comput :
19. Mandziuk M, Schlick T () Resonance in the dynamics of chemical systems simulated by the implicit midpoint scheme. Chem Phys Lett :
20. Schlick T, Mandziuk M, Skeel RD, Srinivas K () Nonlinear resonance artifacts in molecular dynamics simulations. J Comput Phys :
21. Ryckaert JP, Ciccotti G, Berendsen HJC () Numerical integration of the Cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. J Comput Phys :
22. Hess B, Bekker H, Berendsen HJC, Fraaije JGEM () LINCS: a linear constraint solver for molecular simulations. J Comput Chem :–
23. Voter AF, Montalenti F, Germann TC () Extending the time scale in atomistic simulation of materials. Annu Rev Mater Res :
. Voter AF () Parallel replica method for dynamics of infrequent events. Phys Rev B : . Voter AF () Hyperdynamics: accelerated molecular dynamics of infrequent events. Phys Rev Lett :– . Voter AF () A method for accelerating the molecular dynamics simulation of infrequent events. J Chem Phys : . Steiner MM, Genilloud PA, Wilkins JW () Simple bias potential for boosting molecular dynamics with the hyperdynamics scheme. Phys Rev B : . Fang QF, Wang R () Atomistic simulation of the atomic structure and diffusion within the core region of an edge dislocation in aluminum. Phys Rev B :– . Montalenti F, Voter AF () Applying accelerated molecular dynamics to crystal growth. Phys Stat Solidi B :– . Henkelman G, Jonsson H () Long time scale kinetic Monte Carlo simulations without lattice approximation and predefined event table. J Chem Phys : . Frenkel D () Understanding molecular simulation. Academic, New York . Ewald PP () The calculation of optical and electrostatic grid potential. Ann Phys : . Mardline RA, Aarseth SJ () Tidal interactions in star cluster simulations. Mon Not R Astron Soc : . Fryxell B, Olson K, Ricker P, Timmes FX, Zingale M, Lamb DQ, MacNeice P, Rosner R, Truran JW Tufo H () FLASH: An adaptive mesh hydrodynamics code for modeling astrophysical thermonuclear flashes. Astrophysical J (Suppl.): . Nosé S () A unified formulation of the constant temperature molecular dynamics methods. J Chem Phys :– . Andersen HC () Molecular dynamics simulations at constant pressure and/or temperature. J Chem Phys : . Berendsen HC, Postma JPM, van Gunsteren WF, DiNola A, Haak JR () Molecular dynamics with coupling to an external bath. J Chem Phys : . Parrinello M, Rahaman A () Polymorphic transitions in single crystals: a new molecular dynamics method. J Appl Phys : . Louwerse MJ, Baerends EJ () Calculation of pressure in case of periodic boundary conditions. Chem Phys Lett : . van Duin ACT, Dasgupta S, Lorant F, Goddard WA () ReaxFF: a reactive force field for hydrocarbons. J Phys Chem A :
nCUBE The nCUBE machines [] were hypercube-connected multicomputers designed and built by nCube Corporation in the s. Several models were built starting with the nCUBE/ten.
Related Entries Hypercubes and Meshes
Bibliography . CORPORATE Ncube. () The NCUBE family of high-performance parallel computer systems. In: Geoffrey Fox (ed) Proceedings of the third conference on Hypercube concurrent computers and applications: architecture, software, computer systems, and general issues - vol (CP), vol . ACM, New York, pp –
NEC SX Series Vector Computers Hiroshi Takahara NEC Corporation, Tokyo, Japan
Synonyms SIMD (Single Instruction, Multiple Data) Machines
Definition The SX Series is the parallel vector supercomputer that has been provided by NEC as its flagship high-performance computing system. After the advent of its pioneering models of SX-/ back in , NEC has continuously been enhancing this series toward the SX-, which has the world’s fastest single CPU core performance of . GFLOPS. It makes up a node system of . TFLOPS peak performance, which can be configured up to nodes, enabling almost petascale computing with the maximum vector performance of TFLOPS. The technology basis of the SX Series has led to the realization of the Earth Simulator at JAMSTEC (Japan Agency for Marine-Earth Science and Technology), which earned the number one position on the Linpack benchmark five times in a row during –, as well as its successor model (renewed Earth Simulator) put into operational use in . The SX Series has been utilized for a spectrum of applications ranging from scientific research and mission-critical weather forecasting to engineering and material design.
Discussion History of the Supercomputer SX Series In April , NEC made an entry in the supercomputer market with its simultaneous announcement for the
first two models in the series: the SX- and the SX- – the world’s fastest supercomputers at the time. The SX- achieved a peak performance of . GFLOPS (. billion floating-point operations per second), marking the beginning of the era of gigaflop computing. The successor model, SX-, announced in April attained new world speed with its peak performance of GFLOPS using four multiprocessors and maximum . GFLOPS per CPU. Offering a lineup of commercial application software, the SX- not only employed the shared-memory-type multiprocessor combined with the parallel processing technology, but also pioneered the era of open systems featuring a -bit SUPERUX operating system, which was developed based on UNIX. With an increase in expandability to a maximum of processors, the SX- was announced in November , achieving a maximum performance of TFLOPS. The SX- adopted CMOS for the CPU instead of conventional silicon bipolar LSI, which resulted in both reduced power consumption and higher density packaging, consequently leading to a cost reduction by allowing the user to utilize air cooling instead of previously used liquid-cooling systems. The system has enabled expansion in memory capacity by the use of DRAM instead of SRAM as the memory device and the consequent improvement in a price performance ratio combined with a comprehensive list of application software. In June , the SX-, the fourth generation in this series, was unveiled. Doubling both the clock frequency and the number of vector pipelines, the evolved model achieved a CPU vector performance of GFLOPS, resulting in the system peak vector performance of TFLOPS with a configuration of parallel processing with CPUs. The release of the fifth generation SX- in October enabled the integration in a single chip of what required LSI chips in the previous CPU model (SX-). With , CPUs each having a maximum CPU performance of GFLOPS, the SX- delivered a peak performance of TFLOPS. Equipped with a maximum of CPUs per node and a maximum of GB of shared memory combined with automatic parallelization, the SX- introduced in October realized further ease of operation.
Also during this period, the delivery of the Earth Simulator, for which NEC was in charge of the hardware and basic software development, was completed. The Earth Simulator began operation in March , and its performance not only stunned the world, but also has continued to make significant contributions to understanding challenging environmental issues such as global climate change and advancing a wide spectrum of scientific findings and industrial applications. Shipping of the sixth generation SX- started at the end of , achieving a CPU performance of GFLOPS and a system peak performance of TFLOPS with a maximum of , CPUs. Subsequently in October , the seventh generation SX- was announced. Featuring the world’s fastest . GFLOPS performance per single core, the large-scale shared memory up to TB, and the interconnects with a data transfer rate of GB/s, it can deliver near-PFLOP peak performance of TFLOPS. In addition, the improved LSI technology and high-density packaging technology have reduced both power consumption and required installation space to approximately a quarter of that required for conventional supercomputers. Figure shows the pictures of the major models of the SX Series. The evolution of the SX Series is summarized with its specifications in Table .
SX- Architecture
In subsequent sections, the description of the SX Series is made with the particular focus on the latest model SX-.
NEC SX Series Vector Computers. Fig. Pictures of the SX Series hardware (models SX-2, SX-3, SX-4, SX-5, SX-6, SX-6i, SX-7, SX-8, SX-8i, and SX-9)
NEC SX Series Vector Computers. Table SX Series system specification (for each model: year of announcement, single CPU core performance, memory bandwidth per CPU (GB/s), number of CPUs per node, clock cycle (ns), single node performance (GFLOPS), main memory capacity (GB), memory bandwidth per node (GB/s), and maximum system performance (GFLOPS))

The development of the SX- aims at facilitating both high performance for real application programs and economical operation. The SX- has been developed with the following features:

. Fast single-chip vector processor: Using the nm CMOS technology, the SX- realizes . GFLOPS per processor – the world’s fastest single-core performance.
. Multiprocessor system with excellent scalability: A single SX- node has a flat shared memory configuration for a maximum of CPUs with a peak single-node performance of . TFLOPS and a maximum memory capacity of TB. This shared memory node is interconnected with a special high-speed switch at a maximum of GB/s per node, enabling configurations of up to a maximum of nodes ( TFLOPS) to provide both shared and distributed memory systems with excellent scalability.
. Superior energy-saving and installation advantages: In addition to the reduced processor energy consumption and heat generation, the system adopts the high-efficiency cooling technology, resulting in improved system power consumption and easy installation.
Hardware Overview
Figure shows the SX- system configuration. The product lineup ranges from a shared-memory-type single-node model that tightly integrates a maximum of CPUs and the main memory unit (MMU) to a multinode system model connecting each node in a cluster configuration via a high-speed internode crossbar switch (IXS). The single-node system is available in two models: the model A with a maximum of CPUs (. TFLOPS), a maximum main memory capacity of TB, and a maximum of input/output slots, and the model B with eight CPUs at a maximum performance of . GFLOPS, a maximum main memory capacity of GB, and a maximum of input/output slots. The multinode system interconnects – nodes in a cluster configuration via the IXS with up to , CPUs. The CPU of the SX- uses eight sets of vector pipelines to achieve a peak performance of over GFLOPS with a single unit. The memory with a maximum capacity of TB can be shared by up to CPUs, the maximum data transfer rate between the CPU and the MMU is TB/s, and that between the I/O unit and the MMU is GB/s × .
NEC SX Series Vector Computers. Fig. SX Series system configuration (a multinode system in which the Internode Crossbar Switch (IXS) connects nodes #0 through #511; each node contains CPUs #0–#15, I/O and IXS connections, and the Main Memory Unit)

Processor
The CPU is composed of a vector unit and a scalar unit, both of which are connected to the MMU through the processor-memory network. The SX- features the addition of the ADB (Assignable Data Buffer) within a CPU for the selective buffering of data. Figure shows the configuration of the CPU of the SX-.

NEC SX Series Vector Computers. Fig. SX- CPU configuration (eight vector pipelines with Mask, Logical, Multiply, Add, Div/Sqrt, and Load or Store units together with mask and vector registers, and a scalar unit with a cache, scalar registers, Mul/Add, Mul/Add/Div, and ALU units)

Vector Unit
The vector unit is composed of the vector operation block and the vector control block.

Vector Operation Block
The vector operation block has basic operations including the Logical, Multiply, Add, and Divide/Sqrt, as well as the Mask and Load/Store functions. This block also includes mask registers and vector registers. The vector operation block supports the IEEE double- and single-precision floating-point data formats.

Vector Operation Pipeline
The vector operation block consists of eight pipelines. To reduce the time to transfer the result of one vector operation to another operation, the vector unit has the enhanced data forwarding capability (chaining), improving the processing speed of programs with a short vector loop length. The vector pipelines have vector data compression and decompression capabilities for the efficient use of memory bandwidth.

Vector register
The vector operation block has vector registers. Each vector register consists of bit-wide registers. The total capacity of the registers reaches KB.

Vector mask register
Some vector operations can be processed with a bit mask, which can be switched to enable/disable the operation for each element corresponding to the mask bit. Mask bits are generated in the pipeline according to the result of logical operations. There are mask registers prepared, each of which is bit-wide.

Vector control block
The SX- CPU has a vector control block which enables the out-of-order execution of vector instructions, contributing to hiding memory latency. Furthermore, this block controls chaining to perform vector processing with high efficiency.

Scalar Unit
The scalar unit of the SX- employs the -bit RISC architecture that is compatible with the SX Series products and incorporates general-purpose -bit registers and two KB L caches for instructions and data. Its features are as described below.

-way super-scalar configuration
The SX- uses a super-scalar configuration with a total of six execution pipelines (two for the floating-point arithmetic operations, two for integer arithmetic
operations, and two for memory operations), thus improving the performance of application codes with a high degree of instruction-level parallelism.
Instruction execution control The out-of-order execution mechanism contributes to the high performance by executing instructions that become executable regardless of the instruction order. For the fast operation over a large domain, the SX- enables search and retrieval across up to eight branch instructions. It also adopts a speculative instruction execution mechanism that performs tentative execution of subsequent instructions before the actual execution of a branch instruction and restarts an instruction with the appropriate condition in a case when the branch prediction fails. In order to expand the instruction window in the out-of-order execution mechanism, the SX- has the reorder buffer entries enhanced to entries.
Multinode Configuration
The multinode system of the SX- can connect up to nodes by grouping shared-memory-type single nodes into clusters and connecting them to the IXS ultra-high-speed crossbar switch. The multinode system features not only a wide bandwidth of internode data transfer but also reduced communication latency by capitalizing on the RCU (remote access control unit), which is the IXS connection unit for each node. Figure shows the configuration of the SX- multinode system. Each node incorporates up to RCUs, which are connected to the IXS via cables. Each RCU forms a single lane, which has two connection ports of GB/s × , offering a transfer rate of GB/s × . As a result, each node can incorporate a maximum of lanes with connection ports, and the total transfer performance reaches a maximum of GB/s × .
ADB (Assignable Data Buffer)
The ADB is located within a CPU for the selective buffering of data. The ADB enables a more efficient data transfer between the ADB and the vector unit with shorter latency than that between the processor and the memory, resulting in higher performance by storing frequently used data in the ADB and the reduced memory bank contention arising from the concurrent processing. The data stored in the ADB can be set to vector data only, scalar data only, or both vector and scalar data depending on the running application. The ADB for each vector load/store instruction is assigned for retaining the necessary data in the ADB as much as possible. A software-controlled prefetch instruction is supported.

Main Memory Unit (MMU)
The MMU adopts the shared memory system and is composed of MMU cards for its maximum configuration. The memory of the SX- has a large capacity thanks to the use of DDR-SDRAM (Double Data Rate SDRAM) devices for all the MMUs. The memory capacity per MMU card is GB, and the system supports capacities from GB up to TB. The MMU cards are capable of handling concurrent operations of memory access requests from the CPU with a data transfer rate of GB/s per MMU card, or a maximum of TB/s for the entire system. The maximum ,-way parallel memory interleaving is adopted for efficient data transfer with DDR-SDRAM.
NEC SX Series Vector Computers. Fig. SX- multinode system configuration (nodes #0 through #511, each containing CPUs #00–#15 with RCUs and a Main Memory Unit, connected through the Internode Crossbar Switch (IXS) at 128 GB/s ×2 (8 GB/s ×16 ×2) per node)
I/O Processing Unit
All of the processors in the system are allowed to access all of the I/O devices. The SX adopts the direct I/O method, with which the memory in the host bus adapter (HBA) is accessed directly. The I/O interface can be selected according to the types of packaged channel cards, facilitating the system expansion. The I/O processing unit of the SX- is composed of the IO Features (IOFs: host bridge units) and the PCI Express control units. Up to IOFs can be mounted per system and two PCI Express control units can be connected to each IOF. As each IOF is virtually represented as a single-channel device, all of the functions incorporated in the IOFs can be set and modified at the desired timing from the software using the same access method as a general-purpose channel card.
Core Technology Behind the SX-
Figure summarizes some of the core technologies built into the SX Series for enhanced performance and ease of use. The SX Series has the advanced architecture of large-scale shared memory, high-speed data transfer between the CPU and the memory, and an ultra-high-speed network interconnecting nodes. One of the key technologies indispensable to achieving higher system efficiency is the interconnection between processors. NEC has developed an optical interconnection technology that achieves a Gbps signal transmission rate between two LSIs of a high-performance computer, thus surpassing the communication speed made available with the existing electrical transmission.

NEC SX Series Vector Computers. Fig. Technologies for the SX-9: world’s fastest single-core processor (102 GFLOPS) and 1.6 TFLOPS/node; CMOS LSI with the 65 nm design rule; high-speed interface for high-performance optical interconnection; parallel processing and cluster control technology; high-density packaging technology; and cooling for a heat generation of 200 W+

Software Technology of the SX-
Outline of the SUPER-UX Operating System for the SX-
The SUPER-UX is an operating system based on the UNIX System V operating system that features functions inherited from the BSD and SVR MP as well as enhancements of the functions required to support supercomputers. The SUPER-UX features the flexible resource management and the high parallel processing capabilities of kernels and I/Os in order to guarantee high scalability. Four page sizes are supported, which are KB for general commands, MB for compiler and system commands, and and MB for large-scale user programs. This strategy improves the execution performance of programs that use large arrays
and reduces the overheads in the memory management. The SUPER-UX of the SX- has expanded the user’s virtual space to TB, enabling an efficient layout of parallel programs and MPI programs. The SX Memory File Facility (SX-MFF) is also provided for high-speed I/O by building a conventional file system on the large-capacity memory as a disk cache.
Basic OS Functions
The SUPER-UX supports not only large-scale multiprocessors with efficient resource management but also high-speed input/output and the sophisticated gang scheduling for maximizing parallel processing performance.

File Management Functions
The SUPER-UX makes it possible to create large-scale files and file systems. While retaining the advantages of a standard UNIX system, it realizes a high-speed file system. The high-speed shared file system gStorageFS enables efficient data transfer comparable to local file systems without involving the CPUs on remote servers. The NFS V compatibility is maintained as a user interface.

Batch Processing Functions
The SUPER-UX supports the concept of jobs (sets of processes) at the kernel level. NQSII is the batch processing system that enables appropriate processing by large-scale clusters, and JobManipulator is provided as an NQSII scheduler extension to maximize system operational efficiency with backfill scheduling. NQSII incorporates functional enhancements of NQS job management, resource management, and load balancing to adapt to cluster systems with the improved single system image (SSI) and system operability.

Operation Management Functions
SUPER-UX supports a checkpoint/restart function for the interruption of a program in progress at a user-specified point and its smooth resumption. The unified operation management software provides the integrated management of multiple host machines on the network from a single machine, as well as automatic operation control, thus resulting in reduced operation cost.

Software Development Environment
The compilers for Fortran, C, and C++ provide high-level optimization and automatic vectorization/
parallelization features for maximizing the performance. The vector data buffering function, which utilizes the ADB, is open to the user, thus enabling highly effective memory access optimization. Both the MPI library and the HPF compiler are provided for comprehensive distributed memory programming. The PSUITE tool is also available for the GUI-based program development, debugging, and tuning.
Compiler and Library
The SX- provides FORTRAN/SX and C++/SX, a Fortran compiler and a C/C++ compiler, respectively; both feature vectorization and parallelization functions. HPF/SX V (a compiler for High Performance Fortran, which is a standard language for distributed parallel processing) is also provided, as well as MPI/SX and MPI/SX (fully compliant with the distributed parallel processing interfaces MPI-. and MPI-.). The system also supports a number of the Fortran features. Both FORTRAN/SX and C++/SX support the OpenMP standard API for shared-memory-type parallel processing. Two proprietary mathematical libraries are provided; one is the ASL scientific library aimed at a wide usage in scientific and engineering applications with enhanced performance on key functions, such as linear algebra and FFT. The other is MathKeisan, which collects a set of public domain mathematical library components, such as BLAS.
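As a small illustration of the kind of loop that such vectorizing and parallelizing compilers target – a generic C/OpenMP sketch rather than SX-specific code, with array names and sizes invented for the example – the iterations below are independent, so the loop body can be vectorized and the iterations divided among the CPUs of a shared-memory node:

#include <stdio.h>
#include <stdlib.h>

/* y = a*x + y over n elements: every iteration is independent and the
   accesses are stride-one, so the loop is a good candidate both for
   vectorization and for shared-memory parallelization. */
static void daxpy(long n, double a, const double *x, double *y)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    long n = 1000000;                 /* problem size chosen arbitrarily */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    daxpy(n, 3.0, x, y);
    printf("y[0] = %f\n", y[0]);      /* expect 5.000000 */

    free(x);
    free(y);
    return 0;
}

When built with a compiler’s OpenMP option (e.g., -fopenmp for GCC), the directive splits the iterations across threads; without it the directive is simply ignored and the loop runs serially.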
Application Program and Performance
The application of the SX Series spans a wide spectrum of areas ranging from fundamental research such as nanomaterial sciences and environmental sciences to industrial designing, drug designing, and emerging life sciences. The SX Series realizes high performance on real application programs, as well as well-balanced performance from various architectural perspectives. Meanwhile, the Earth Simulator at JAMSTEC has inherited the parallel vector architecture, realizing the peak performance of TFLOPS with its updated model ( nodes of SX-/E). This system is ahead of the pack with respect to the HPC Challenge benchmark (HPCC), which is gaining popularity separately from the Linpack score. The updated Earth Simulator topped the Global FFT (Fast Fourier Transform), one of the measures of the HPC Challenge Awards,
with the performance number of . TFLOPS. The HPCC is designed to give multiple measures on performance under the US governmental project, including such factors as memory bandwidth and interconnect, as well as processor capability, all of which are suitable to the evaluation of the capabilities of a computer more comprehensively. The Earth Simulator showcases excellent performance and peak performance ratio on scientific application programs in such an area as earth environmental modeling. Figure shows the performance numbers on real scientific applications evaluated for six benchmark programs on various computer platforms, including the SX Series. The target applications are: () the computations of inter-plate earthquakes in a subduction zone, () the turbulent flow based on the direct
numerical simulation, () the electromagnetic analysis of various antennas, () the land mine detection with the Synthetic Aperture Radar (Maxwell’s equations), () the unsteady flows around a turbine based on the direct numerical simulation, and () the wave-particle interactions between electrons and plasma waves. Figure includes the comparison of single CPU performance of these benchmark programs for various NEC vector supercomputers and Intel multicore processors (Harpertown, Nehalem-EP, and Nehalem-EX) as indicated in Table . The Nehalem processors are directly connected to the memory system, enabling efficient data transfer. Each application program is set up to utilize all the cores for the Intel processors. As clearly seen in Fig. , the SX- processor shows much higher performance than the other processors due to its high theoretical peak performance for all the six programs.
NEC SX Series Vector Computers. Fig. Comparison of single-core performance for scientific application programs. (a) Measured performance (GFLOPS), (b) Peak performance ratio (%). Each bar represents SX-, SX-, SX-, Harpertown, Nehalem-EP, and Nehalem-EX respectively from left to right
NEC SX Series Vector Computers. Table Specifications of referenced hardware platforms for performance evaluation (computer model, peak performance (GFLOPS), clock (GHz), memory bandwidth (GB/s), number of cores, BYTES/FLOP, and cache size for the SX-7, SX-8, and SX-9 and the Intel Harpertown, Nehalem-EP, and Nehalem-EX processors)
NEC SX Series Vector Computers. Fig. Single-core performance of Lattice-Boltzmann CFD program. (Left) Cell update ratio (in Million updates per second) (Right) Measured performance (in GFLOPS)
In addition, the execution efficiency (peak performance ratio) is more than % for many application programs on the SX Series, which is significantly higher than on commodity scalar processors. The actual performance varies depending on such factors as the memory access frequency, which depends on the ratio of the CPU-memory bandwidth to the floating-point operation capability. Figure shows the performance of a Lattice Boltzmann (LB) CFD program measured with a single CPU on different models of the SX Series for various numbers of fluid cells. The LB method emerges from a highly simplified gas-kinetic description. It represents the behavior of fluid by the translational movement
and the collision of virtual particles. The left axis indicates the fluid cell update rate (per second), while the right one represents the measured performance (in GFLOPS). Taking into account the maximum theoretical performance of each CPU (SX-: . GFLOPS, SX-R: GFLOPS, SX-: GFLOPS), the computational efficiency (peak performance ratio) is proven to be high (–%) for a wide range of the number of fluid cells, thus surpassing the efficiency on conventional scalar computers [].
Concluding Remarks While the commodity-chip-based parallelism is getting pervasive, complex control mechanisms and massive
hardware resources are necessary for realizing high performance with the current scalar architecture, spurring the widening gap between the theoretical peak performance of a processor and the performance for real scientific application programs [, ]. The vector mechanism is a possible solution to such an issue; it employs a highly developed single instruction multiple data stream (SIMD) approach because of a larger number of arithmetic operations handled by one instruction, which is further enhanced with multiple pipelines. It provides an easy way to express the concurrency needed to take advantage of the memory bandwidth, with more tolerance of the memory systems’ latency than scalar designs, facilitating the realization of the high performance needed for scientific computing.
Related Entries Compilers Cray Vector Computers Earth Simulator Fortran and Its Successors Fujitsu Vector Computers HPC Challenge Benchmark HPF (High Performance Fortran) Hybrid Programming with SIMPLE Instruction-Level Parallelism MPI (Message Passing Interface) OpenMP Load Balancing Shared-Memory Multiprocessors SIMD (Single Instruction, Multiple Data) Machines
Bibliographic Notes and Further Reading The architectural aspects of the SX Series are described in [–]. The fundamental descriptions about vector processors (or SIMD approach) and the general trend of the performance of high-performance computers are available, for example, in [, ]. There are several papers that focus on the computing performance of the SX Series concerning real scientific application programs (References [, ]). As for more comprehensive performance evaluations of the SX-, for example, reference [] is suggestive. In relation to such benchmarks,
the descriptions of the referenced scalar computers can be found in [, ]. The elaborate descriptions of the HPC Challenge benchmark and its awards are available in [, ]. It is also useful to refer to the overview of the Earth Simulator featuring the vector-processing capabilities that inherit from the SX Series for deeper understanding of the NEC vector supercomputers (references [, ]).
Bibliography . Zeiser T, Hager G, Wellein G () Benchmark analysis and application results for Lattice Boltzmann simulations on NEC SX vector and Intel Nehalem systems. Parallel Process Lett (PPL) ():–. DOI: ./S . Gebis J, Patterson D () Embracing and extending thcentury instruction set architectures. Computer () (published by the IEEE Computer Society):– . The high-end computing revitalization task force (HECRTF) () Federal plan for high-end computing. Technical report . NEC SX- Supercomputer () http://www.nec.com/de/prod/ solutions/hpc-solutions/sx-/index.html . Supercomputer SX-/Special Issue () NEC Tech J (). http://www.nec.co.jp/techrep/en/journal/g/n/gmo.html #name- . Takahara H () HPC architecture from application perspectives. In: Resch M et al (eds) High performance computing on vector systems . Springer, Berlin, pp – . Soga T, Musa A, Shimomura Y, Okabe K, Egawa R, Takizawa H, Kobayashi H, Itakura H () Performance evaluation of NEC SX- using real science and engineering applications. SC technical paper (ACM SIGARCH/IEEE Computer Society). http:// scyourway.supercomputing.org/conference/view/pap . Kobayashi H, Egawa R, Takizawa H, Okabe K, Musa A, Soga T, Isobe Y () Lessons learnt from -year experience with SX- and toward the next generation vector computing. In: Resch M et al (eds) High performance computing on vector systems . Springer, Berlin, pp – . Intel corporation () Intel Xeon processor series product brief. http://www.intel.com/Assets/PDF/prodbrief/.pdf . Intel corporation () Intel Xeon processor series product brief. http://www.intel.com/Assets/PDF/prodbrief/.pdf . Luszczek P, Bailey D, Dongarra J, Kepner J, Lucas R, Rabenseifner R, Takahashi D () The HPC challenge (HPCC) benchmark suite. http://icl.cs.utk.edu/projectsfiles/hpcc/pubs/sc_hpcc.pdf . HPC Challenge Awards () http://www.hpcchallenge. org/custom/index.html?lid=&slid= . Oliker L et al () Leading computational methods on scalar and vector HEC platforms. In: IEEE/ACM SC, Seattle . System overview of ES (JAMSTEC) () http://www.jamstec. go.jp/es/en/system/system.html
NESL Guy Blelloch Carnegie Mellon University, Pittsburgh, PA, USA
Definition Nesl is a strongly typed functional programming language designed in the early s to make it easy to program and analyze parallel algorithms. It is loosely based on the ML class of languages and supports parallelism through operations on sequences. Its main features are nested data-parallelism, deterministic parallelism, and a built-in cost model for analyzing the asymptotic performance of algorithms.
Discussion Introduction Nesl is a high-level parallel programming language with an emphasis on concisely and clearly expressing parallel algorithms and the ability to run them on a variety of sequential, parallel, and vector machines [–]. It is a strongly typed call-by-value (strict) functional language loosely based on the ML language []. The language uses sequences as a primitive parallel data type, and parallelism is achieved exclusively through operations on these sequences. As such it is a dataparallel, or collection-oriented language []. The main features of Nesl are: . The support of nested parallelism. Nesl fully supports nested sequences, and the ability to apply any user-defined function over the elements of a sequence, even if the function is itself parallel and the elements of the sequence are themselves sequences. Such functionality is important for expressing parallel divide-and-conquer algorithms as well as irregular nested parallel loops. The support for nested parallelism differentiates it from most other data-parallel languages such as C∗ [], High Performance Fortran [], and ZPL []. Nesl was the first data-parallel language for which the implementation supported nested parallelism. More recently Data Parallel Haskell [] also supports nested data parallelism.
. The support of deterministic parallelism. Since Nesl has no side effects and hence no race conditions, its semantics is sequential and deterministic []. In particular a program with the same input will always return the same output independently of how it is run. Input and output and other nonfunctional features are only allowed at top-level and cannot be called in parallel (there are some well-specified exceptions, such as the random number generator). The philosophy behind Nesl is that it is much easier to design, analyze, and prove the correctness of algorithms if the semantics is deterministic. Deterministic parallelism is also supported by other purely functional parallel languages such as ID [], Sisal [], and Data Parallel Haskell [], and has been topic of recent interest []. . The support for a cost semantics and corresponding provable implementation bounds that maps costs derived by the semantics onto asymptotic runtime on specific machines []. The semantics define two costs, work and depth, for the evaluation of every expression, and simple composition rules for composing these costs across parallel and sequential constructs. The work corresponds to the sequential running time (the total number of instructions executed during the evaluation) and is composed by adding costs across constructs. The depth (sometimes referred to as the critical path or span) corresponds to the longest chain of dependencies and is derived by taking the maximum of costs when composing in parallel and adding costs when composing sequentially. The notions of work and depth have since been used in other languages such as Cilk []. Nesl has been implemented on a variety of architectures, including both SIMD and MIMD machines, with both shared and distributed memory. The released compiler generates a portable intermediate code called Vcode [], which runs on vector machines, distributed memory machines, and shared memory machines [, ]. The implementation supplies an interactive readeval-print loop.
Parallel Operations on Sequences Nesl supports parallelism through operations on sequences, which are specified using square brackets. For example,
[2, 1, 9, -3]

is a sequence of four integers. In Nesl, all elements of a sequence must be of the same type, and all sequences must be of finite length. Parallelism on sequences can be achieved in two ways: the ability to apply any function concurrently over each element of a sequence, and a set of built-in parallel functions that operate on sequences. The application of a function over a sequence is achieved using set-like notation similar to set-formers in SETL [] and list-comprehensions in Miranda [] and Haskell []. For example, the expression

{negate(a) : a in [3, -4, -9, 5]};
⇒ [-3, 4, 9, -5] : [int]

negates each element of the sequence [3, -4, -9, 5]. This construct can be read as “in parallel for each a in the sequence [3, -4, -9, 5], negate a.” The symbol ⇒ points to the result of the expression, and the expression [int] specifies the type of the result: a sequence of integers. The semantics of the notation differs from that of SETL, Miranda, or Haskell in that the operation is defined to be applied in parallel. This notation is referred to as the apply-to-each construct and each subcall on a sequence element as an instance. As with set-comprehensions, the apply-to-each construct also provides the ability to subselect elements of a sequence: the expression

{negate(a) : a in [3, -4, -9, 5] | a < 4};
⇒ [-3, 4, 9] : [int]

can be read as, “in parallel for each a in the sequence [3, -4, -9, 5] such that a is less than 4, negate a.” The elements that remain maintain their order relative to each other. It is also possible to iterate over multiple sequences. The expression

{a + b : a in [3, -4, -9, 5] ; b in [1, 2, 3, 4]};
⇒ [4, -2, -6, 9] : [int]

adds the two sequences elementwise. In Nesl, any function, whether primitive or user defined, can be applied to each element of a sequence. So, for example, one can define a factorial function

function factorial(i) = if (i == 1) then 1 else i*factorial(i-1);
⇒ factorial : int -> int

and then apply it over the elements of a sequence

{factorial(x) : x in [3, 1, 7]};
⇒ [6, 1, 5040] : [int]
In this example, the function name(arguments) = body; construct is used to define factorial. The function is of type int -> int, indicating a function that maps integers to integers. The type is inferred by the compiler. In addition to the apply-to-each construct, a second way to take advantage of parallelism in Nesl is through a set of sequence functions. The sequence functions operate on whole sequences and all have relatively simple parallel implementations. For example, the function sum sums the elements of a sequence. sum([2, 1, -3, 11, 5]); ⇒
16 : int
Since addition is associative, this can be implemented on a parallel machine in logarithmic time using a tree. Nesl supports many other built-in functions on sequences, including scans, permuting a sequence, appending sequences, extracting subsequences, and inserting into a sequence.
Nested Parallelism In Nesl, the elements of a sequence can be any valid data item, including sequences. This rule permits the nesting of sequences to an arbitrary depth. A nested sequence can be written as [[2, 1], [7, 3, 0], [4]]
This sequence has type: [[int]] (a sequence of sequences of integers). Given nested sequences and the rule that any function can be applied in parallel over the elements of a sequence, Nesl necessarily supplies the ability to apply a parallel function multiple times in parallel; this is referred to as nested parallelism. For example, one can apply the parallel sequence function sum over a nested sequence: {sum(v) : v in [[2,1], [7,3,0], [4]]}; ⇒
[3, 10, 4] : [int]
In this expression there is parallelism both within each sum, since the sequence function has a parallel implementation, and across the three instances of sum, since the apply-to-each construct is defined such that all instances can run in parallel. Nesl supplies a handful of functions for moving between levels of nesting. These include flatten, which takes a nested sequence and flattens it by one level, for example, flatten([[2, 1], [7, 3, 0], [4]]); ⇒
[2, 1, 7, 3, 0, 4] : [int]
and collect that takes a flat sequence of pairs and collects elements with an equal first value together into a nested sequence, for example, collect([(3, 7), (0, 1), (3, 9), (1, 11), (0, 6), (3, 5)]) ⇒
[(0, [1, 6]), (1, [11]), (3, [7, 9, 5])] : [(int,[int])]
As an example of nested parallelism, consider a parallel variation of quicksort (see Fig. ). When applied to a sequence s, this version splits the values into three subsets (the elements lesser, equal, and greater than the pivot) and calls itself recursively on the lesser and greater subsets. To execute the two recursive calls, the lesser and greater sequences are concatenated into a nested sequence and qsort is applied over the two elements of the nested sequences in parallel. The final line extracts the two results of the recursive calls and appends them together with the equal elements in the correct order.

function qsort(a) =
  if (#a < 2) then a
  else let
    pivot   = a[#a/2];
    lesser  = {e in a | e < pivot};
    equal   = {e in a | e == pivot};
    greater = {e in a | e > pivot};
    result  = {qsort(v) : v in [lesser, greater]}
  in result[0] ++ equal ++ result[1];

NESL. Fig. An implementation of quicksort

The recursive invocation of qsort generates a tree of calls that looks something like the tree shown in Fig. . In this diagram, taking advantage of parallelism within each block as well as across the blocks is critical
to getting a fast parallel algorithm. If one were to only take advantage of the parallelism within each quicksort to subselect the two sets (the parallelism within each block), the algorithm would do well near the root and badly near the leaves (there are n leaves which would be processed serially). Conversely, if one were only to take advantage of the parallelism available by running the invocations of quicksort in parallel (the parallelism between blocks but not within a block), the algorithm would do well at the leaves and badly at the root (it would take O(n) time to process the root). In both cases the parallel “depth” is O(n) rather than the ideal O(lg n) that can be achieved using both forms of parallelism, as described below.
The Cost Semantics There are two costs associated with all computations in Nesl. . Work: the total work done by the computation, that is to say, the amount of time that the computation would take if executed on a serial random access machine. . Depth: this represents the parallel depth of the computation, that is to say, the amount of time the computation would take on a machine with an unbounded number of processors. The depth cost of all the sequence functions supplied by Nesl is constant. The idea of using work and depth for analyzing the asymptotic behavior of parallel algorithms dates back at least to the work on Boolean circuit models, where work (typically referred to as size) is the number of circuit elements, and depth is the longest path in an acyclic circuit []. The power of using work and depth was demonstrated by Brent [] who showed that any circuit of size (work) W and depth D can be simulated on a P processor parallel random access machine [] (PRAM) in at most W/P + D steps. One should note that this bound is near optimal since in general the simulation must require at least max{W/P, D} steps – that is, at least D steps to follow the chain of dependencies in the circuit, and at least W/P steps to simulate each of the W operations. Work and depth costs were also used in the (VRAM) model [], a strictly flat data-parallel model.
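Restating the two bounds of the preceding paragraph together, with T the time to simulate the circuit on a P-processor PRAM:

max{W/P, D}  ≤  T  ≤  W/P + D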
NESL. Fig. The quicksort algorithm. Just using parallelism within each block yields a parallel running time at least as great as the number of blocks (O(n)). Just using parallelism from running the blocks in parallel yields a parallel running time at least as great as the largest block (O(n)). By using both forms of parallelism, the parallel running time can be reduced to the depth of the tree (expected O(lg n) in the Nesl cost model)
Nesl introduced the idea of including the costs into the semantics of the language to give a well-defined abstract cost model. The precise definition is given in terms of a profiling operational semantics []. The costs are combined using simple combining rules. Expressions are combined in the standard way – for both work and depth, the cost of an expression is the sum of the costs of the arguments plus the cost of the call itself. For example, the cost of the computation:

sum(dist(7,n)) * #a

(dist(a,n) creates a sequence of n as) can be calculated as:

            Work    Depth
dist        n       1
sum         n       1
# (length)  1       1
*           1       1
Total       O(n)    O(1)

The apply-to-each construct is combined in the following way. The work is the sum of the work of the instantiations, and the depth is the maximum over the depths of the instantiations. Specifically, let the work required by an expression exp applied to some data a be W(exp(a)), and the depth required be D(exp(a)), then these combining rules can be written as

W({e1(a) : a in e2(b)}) = W(e2(b)) + sum({W(e1(a)) : a in e2(b)})    ()

D({e1(a) : a in e2(b)}) = D(e2(b)) + max_val({D(e1(a)) : a in e2(b)})    ()

where sum and max_val just take the sum and maximum of a sequence, respectively. As an example, the cost of the computation:

{[0:i] : i in [0:n]}

([0:n] creates the sequence [0, 1, ..., n-1]) can be calculated as:

                        Work    Depth
[0:n]                   n       1
[0:i] (parallel calls)  ∑ i     max 1
Total                   O(n²)   O(1)

Once the work (W) and depth (D) costs have been calculated in this way, the formula

T = O(W/P + D lg P)    ()
places an upper bound on the asymptotic running time of an algorithm on the CRCW PRAM model (P is the number of processors) []. The lg P term shows up because of the cost of allocating tasks to processors, and the cost of implementing the sum and scan operations. On the scan-PRAM [], where it is assumed that the scan operations are no more expensive than
references to the shared memory (they both require O(lg P) on a machine with bounded degree circuits), the equation is:

T = O(W/P + D)    ()
As an example of how the work and depth costs can be used, consider the qsort algorithm described in Fig. . To calculate the work one can just use the runtime from the standard sequential analysis (e.g., []). Assuming a randomly selected pivot, this gives O(n log n) work with high probability. To calculate the depth one needs to analyze the depth of the recursion tree, but it is not hard to show that with high probability the tree has depth O(log n). All the sequence operations that are used (subselection and appending) are defined to have constant depth. The overall depth is therefore O(log n). This gives a running time on a PRAM of:

T(n) = O(n lg n / p + lg² n)    (PRAM)
     = O(n lg n / p + lg n)     (scan-PRAM)
To analyze quicksort by trying to map it onto a P processor PRAM or other P processor parallel model is quite complicated. The overall goal of Nesl is therefore not only to simplify expressing an algorithm such as quicksort, but also to be able to reasonably easily analyze the asymptotic performance of the algorithm based directly on the code.
Implementation
The released compiler for Nesl translates the language to Vcode [], a portable intermediate language. The compiler uses a technique called flattening nested parallelism [] to translate Nesl into the simpler flat data-parallel model supplied by Vcode. Vcode is a small stack-based language with about functions, all of which operate on sequences of atomic values (scalars are implemented as sequences of length ). A Vcode interpreter was implemented for running Vcode on vector machines such as the Cray C and J, on distributed memory machines based on an MPI back end, and on any serial machine with a C compiler []. The interactive Nesl environment runs within Common Lisp and can be used to run Vcode on remote machines. This allows the user to run the environment, including the compiler, on a local workstation while executing interactive calls to Nesl programs on the remote parallel machines. Another approach to implementing Nesl is to use various scheduling techniques such as work stealing [] or parallel depth first scheduling [].
Bibliographic Notes and Further Reading The original version of Nesl was released in []. It was influenced by SETL [], CM-Lisp [], ID [], and Miranda [] among other languages. An extensive set of documentation and examples is available online (Web search, NESL). A reasonable set of examples of using Nesl for algorithm design and analysis is given in a paper on programming parallel algorithms []. The technique of using flattening nested parallelism used in the released implementation is given in [], and further formalized in []. A description of the Nesl implementation and experimental results are summarized in []. The cost model is described informally in the manual [] and more formally in a paper on the semantics and provable implementation []. Other languages that are similar to Nesl in some ways include Cilk [] (supports nested parallelism and costs based on work and depth – although they refer to them as T and T∞ ) and Data-Parallel Haskell [] (supports nested parallelism in a functional language).
Bibliography . Blelloch GE () NESL: a nested data-parallel language (version .). Technical report CMUCS-–, School of Computer Science, Carnegie Mellon University, July . Blelloch GE, Chatterjee S, Hardwick JC, Sipelstein J, Zagha M () Implementation of a portable nested data-parallel language. J Parallel Distrib Comput ():– . Blelloch GE, Greiner J () A provable time and space efficient implementation of NESL. In: Proceedings of the ACM SIGPLAN international conference on functional programming, New York, pp –, May . Blelloch GE () Programming parallel algorithms. Commun ACM ():– . Milner R, Tofte M, Harper R () The definition of standard ML. MIT Press, Cambridge, Mass . Sipelstein J, Blelloch GE () Collection-oriented languages. Proc IEEE ():– . Rose J, Steele GL Jr () C∗ : an extended C language for data parallel programming. Technical report PL-, Thinking Machines Corporation, April . Müller A, Rühl R () Extending high performance fortran for the support of unstructured computations. In: Proceedings of the th international conference on supercomputing, ICS ’, pp –,
. Chamberlain BL, Choi SE, Christopher Lewis E, Lin C, Snyder L, Derrick Weathersby W () Zpl: a machine independent programming language for parallel computers. IEEE Trans Softw Eng :– . Jones SP, Leshchinskiy R, Keller G, Chakravarty MMT () Harnessing the multicores: nested data parallelism in Haskell. In: IARCS annual conference on foundations of software technology and theoretical computer science, . Arvind, Nikhil RS, Pingali KK () I-structures: data structures for parallel computing. ACM Trans Program Lang Syst ():– . McGraw J, Skedzielewski S, Allan S, Oldehoeft R, Glauert J, Kirkham C, Noyce B, Thomas R () SISAL: streams and iteration in a single assignment language, language reference manual version .. Lawrence Livermore National Laboratory, March . Bocchino RL Jr, Adve VS, Adve SV, Snir M () Parallel programming must be deterministic by default. In: Proceedings of the first USENIX conference on hot topics in parallelism, HotPar’, Berkeley, pp –, March . Blumofe RD, Joerg CF, Kuszmaul BC, Leiserson CE, Randall KH, Cilk YZ () An efficient multithreaded runtime system. J Parallel Distrib Comput ():– . Blelloch GE, Chatterjee S, Knabe F, Sipelstein J, Zagha M () VCODE reference manual (version .). Technical report CMUCS--, School of Computer Science, Carnegie Mellon University, July . Chatterjee S () Compiling data-parallel programs for efficient execution on shared-memory multiprocessors. PhD thesis, School of Computer Science, Carnegie Mellon University, October . Schwartz JT, Dewar RBK, Dubinsky E, Schonberg E () Programming with sets: an introduction to SETL. Springer-Verlag, New York . Turner D () An overview of MIRANDA. SIGPLAN Notices ():– . Hudak P, Wadler P () Report on the functional programming language HASKELL. Technical report, Yale University, April . Pippenger N () On simultaneous resource bounds. In: Proceedings of the th annual symposium on foundations of computer science, pp –, . Brent RP () The parallel evaluation of general arithmetic expressions. J ACM ():– . Fortune S, Wyllie J () Parallelism in random access machines. In: Conference record of the th annual ACM symposium on theory of computing, San Diego, CA, pp –, May . Blelloch GE () Vector models for data-parallel computing. MIT Press, Cambridge, MA . Blelloch GE () Scans as primitive parallel operations. IEEE Trans Comput C–():– . Cormen TH, Leiserson CE, Rivest RL, Stein C () Introduction to algorithms, nd edn. MIT Press and McGraw-Hill, Cambridge/New York
. Blelloch GE, Sabot GW () Compiling collection-oriented languages onto massively parallel computers. J Parallel Distrib Comput ():– . Blumofe RD, Leiserson CE () Scheduling multithreaded computations by work stealing. J ACM ():– . Blelloch G, Gibbons P, Matias Y () Provably efficient scheduling for languages with fine-grained parallelism. J ACM ():– . Blelloch GE () NESL: a nested data-parallel language. Technical report CMU-CS--, School of Computer Science, Carnegie Mellon University, January . Steele GL Jr, Daniel Hillis W () Connection machine LISP: fine-grained parallel symbolic processing. In: LISP and functional programming, Cambridge, pp –, . Keller G, Simons M () A calculational approach to flattening nested data parallelism in functional languages. In: Proceedings of the nd Asian computing science conference on concurrency and parallelism, programming, networking, and security, Springer-Verlag, pp –,
Nested Loops Scheduling Parallelism Detection in Nested Loops, Optimal
Nested Spheres of Control Transactions, Nested
NetCDF I/O Library, Parallel Robert Latham Argonne National Laboratory, Argonne, IL, USA
Synonyms High-level I/O library; Pnetcdf
Definition The serial netCDF library has long provided scientists with a portable, self-describing file format and a straightforward programming interface based on multidimensional arrays of typed variables (more detail in the History section). Parallel-NetCDF provides an API for parallel access to traditionally formatted netCDF files. Parallel-NetCDF both produces and consumes
files compatible with serial netCDF, while providing a programming interface similar, though not identical, to netCDF. Parallel-netCDF API modifications provide a more appropriate interface when expressing parallel I/O.
Discussion Parallel-NetCDF is one of a family of libraries built to meet the needs of computational scientists, as opposed to lower-level I/O libraries that deal more with the details of underlying storage. Before going into more detail about Parallel-NetCDF, let us see how it fits in the bigger picture of scientific computing and application I/O.
Parallel-NetCDF and the I/O Software Stack Today’s leading computing platforms contain hundreds of thousands of processors. In order to efficiently utilize these computers, applications use simplifying abstractions provided by tools and libraries. MPI libraries provide a standard programming interface for parallel computation on a wide array of hardware. Math libraries such as BLAS hide the details of CPU-specific optimizations. These lower-level math and communication libraries contribute to an overall software stack built to allow developers to focus on the science behind their applications. I/O performance on high-end computing platforms follows a similar story. Performance comes from aggregating many storage devices. In fact, CPU performance has consistently improved at a rate much faster than that of storage devices. This discrepancy makes the need for parallel I/O across multiple devices even more acute. Yet as the number of devices involved in I/O grows, coordinating I/O across those devices becomes more of a challenge. In the same way that communication and math libraries assist the parallel computation side of scientific simulations, an “I/O software stack” assists data management. The large number of storage devices mentioned above forms the foundation of the software stack. Several additional layers provide abstractions to help make the task of extracting maximum performance accessible to applications programmers. The programming API and data model that Parallel-NetCDF presents to the scientific programmer are conceptually closer to the tasks common to scientific applications
and hide many lower-level details. A discussion of the lower levels of the software stack, though hidden by Parallel-NetCDF, will provide a better appreciation of Parallel-NetCDF’s role at the upper level.

● Storage systems contain thousands of individual devices to increase the potential bandwidth. The parallel file system manages those separate devices and collects them into a single logical unit. The file system does lack some important features: typically file systems have no mechanism for coordinated I/O, and the data model is still fairly low level for computational scientists to use directly.
● The I/O forwarding layer typically exists only on the largest computer systems. Applications do not interact with this layer directly. Rather, this layer simplifies the scalability challenge: a large number of compute nodes communicate with a smaller number of I/O forwarding nodes, and these nodes in turn talk to the file system.
● The middleware, or MPI-IO [], layer introduces more sophisticated algorithms tailored to parallel computing. This layer coordinates I/O among a group of processes, introducing collective I/O to the stack. MPI datatypes allow applications to describe arbitrary I/O access patterns. Applications can use the MPI-IO layer directly, but the programming model is still rather low level and not a perfect fit to the needs of common applications. MPI-IO libraries, however, provide an excellent foundation for higher-level libraries such as Parallel-NetCDF.

Parallel-NetCDF sits atop these lower-level abstractions; it belongs to the high-level I/O library layer. Libraries in this layer introduce concepts such as multidimensional arrays of typed data, which are better suited to scientific applications. For example, a climate code can represent temperature in the atmosphere with a three-dimensional array – latitude, longitude, and altitude – containing the number of degrees Celsius. In addition to array storage, applications can also provide descriptive information in the form of attributes on variables, dimensions, or the dataset itself. These libraries also define the layout of data on disk. These annotated, self-describing file formats make exchanging data with colleagues now and in the future much easier.
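As a rough sketch of how this array-and-attribute model can look in code (using Parallel-NetCDF’s C interface; the file name, dimension sizes, variable name, and data values below are invented for illustration, and error handling is reduced to a single helper), each process defines the same three-dimensional temperature variable and then collectively writes its own slab into one shared file:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <pnetcdf.h>

/* Abort on any Parallel-NetCDF error. */
static void check(int err)
{
    if (err != NC_NOERR) {
        fprintf(stderr, "PnetCDF error: %s\n", ncmpi_strerror(err));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimids[3], varid;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Create the dataset collectively; all processes share one file. */
    check(ncmpi_create(MPI_COMM_WORLD, "temp.nc", NC_CLOBBER,
                       MPI_INFO_NULL, &ncid));

    /* A 3-D temperature array: one latitude band per process (example sizes). */
    check(ncmpi_def_dim(ncid, "lat", (MPI_Offset)nprocs, &dimids[0]));
    check(ncmpi_def_dim(ncid, "lon", 360, &dimids[1]));
    check(ncmpi_def_dim(ncid, "alt", 16,  &dimids[2]));
    check(ncmpi_def_var(ncid, "temperature", NC_DOUBLE, 3, dimids, &varid));
    check(ncmpi_put_att_text(ncid, varid, "units", 7, "Celsius"));
    check(ncmpi_enddef(ncid));

    /* Each process fills and writes its own latitude slab of the single
       shared file with one collective call. */
    double *slab = malloc(360 * 16 * sizeof(double));
    for (int i = 0; i < 360 * 16; i++) slab[i] = 20.0 + rank;

    MPI_Offset start[3] = { rank, 0, 0 };
    MPI_Offset count[3] = { 1, 360, 16 };
    check(ncmpi_put_vara_double_all(ncid, varid, start, count, slab));

    check(ncmpi_close(ncid));
    free(slab);
    MPI_Finalize();
    return 0;
}

The _all suffix marks the write as collective, which is what allows the MPI-IO layer underneath to coordinate and optimize the access pattern across processes.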
[Figure: the I/O software stack, from top to bottom – Application; High-Level I/O Library (maps application abstractions onto storage abstractions and provides data portability; e.g., HDF5, Parallel netCDF, ADIOS); I/O Middleware (organizes accesses from many processes, especially those using collective I/O; e.g., MPI-IO); I/O Forwarding (bridges between application tasks and the storage system and provides aggregation for uncoordinated I/O; e.g., IBM ciod); Parallel File System (maintains logical space and provides efficient access to data; e.g., PVFS, PanFS, GPFS, Lustre); I/O Hardware.]
NetCDF I/O Library, Parallel. Fig. The I/O software stack. Parallel-NetCDF sits at the top, allowing applications to express data manipulations more abstractly while lower levels deal with coordination among multiple processes and maximize performance of storage devices
History The netCDF library has long provided a straightforward programming API and file format for serial applications. In the summer of , Argonne National Laboratory and Northwestern University started a joint effort to build a parallel I/O library while keeping the best parts of netCDF. Parallel programs had been using netCDF for some time, though the serial library forced programmers to use serial approaches when writing out netCDF datasets. In one approach, each process writes out its own dataset. This “N-to-N” model can put significant strains on both the file system and any postprocessing tools, especially since large systems today can have hundreds of thousands of processes. Another approach sends all data to a master I/O process, and that process in turn writes out the dataset. This “send-to-master” technique imposes an obvious bottleneck because all data must be funneled through a single process. Without changing the existing netCDF file format, Parallel-NetCDF introduced a new programming API. This parallel-oriented API is not identical to serial netCDF, but is comfortable to anyone familiar with the existing serial netCDF API. The Parallel-NetCDF API allows computational scientists to write out datasets in parallel: all processes can participate in an “N-to-1” operation, producing a single dataset while leveraging a host of parallel I/O optimizations. The Parallel-NetCDF API and File Format section contains examples and more discussion of the Parallel-NetCDF API.
Parallel-NetCDF has become an important tool for high-performance I/O in the climate and weather domains. These domains have long-established netCDF-based workflows, using netCDF datasets for archiving, analysis, and data exchange. Since Parallel-NetCDF retains the same file format as netCDF, researchers in climate and weather can make changes to their simulation codes while leaving other components of their workflow the same. Parallel-NetCDF is useful in other domains as well: the programming API and file format make it possible to deliver good parallel I/O performance without having to master all of MPI-IO.
The Parallel-NetCDF API and File Format Discussions about Parallel-NetCDF need to cover two main points: the programming model and the on-disk data format. The API simplifies many MPI-IO concepts while retaining several key optimizations. The on-disk format affects I/O performance, but it also matters when exchanging data with collaborators and colleagues. If applications were to use MPI-IO directly, the “basic type” they would build on would be a linear stream of bytes. In contrast, Parallel-NetCDF offers a more application-friendly model based on multidimensional arrays of typed data. Operations on these arrays could be as simple as reading the entire array, or more complicated, such as having each process write a subcube representing a chunk of the Earth’s atmosphere. In addition to operations for manipulating data, the API contains routines to annotate the data in the file or the
file itself. Such annotations may include timestamps, machine information, experiment or workflow information, or other provenance information to better understand how an application produced the data in this file. The Standard Interface section goes into more detail about the programming interface as well as how Parallel-NetCDF interacts with the underlying MPI-IO library. In addition to the programming interface, the file format plays a major role in how useful Parallel-NetCDF can be to application groups. Parallel-NetCDF applications create a self-describing, portable file with support for embedded metadata. API routines allow programs to query the contents of data files without any prior knowledge of the contents. The structure of the datasets extends to the representation of bytes on disk: even if a dataset is moved to a different machine, the library will still be able to understand that file. The IBM Blue Gene/P system, for example, has a 32-bit, big-endian architecture, while the Cray XT has a 64-bit little-endian architecture. Despite these differences, files created on one can be read on the other. Figure depicts a simple Parallel-NetCDF application. A climate code wants to create a checkpoint file to store some key pieces of information. Atmospheric temperature is stored as a three-dimensional array of double-precision floating point values. Barometric pressure at the Earth’s surface requires only two
dimensions and only single-precision floating-point values. Further, the application wants to ensure that future consumers of this dataset know which units these variables use. It therefore stores this information as an attribute on each variable. On disk, the dataset contains a small header describing the size and type of both variables and any attributes, followed by the actual data for the variables. This fairly simple file format comes with some restrictions, but the trade-off results in efficient file accesses for parallel I/O. Parallel-NetCDF uses the same file format as netCDF, making it easier for application groups to adopt Parallel-NetCDF. A programmer can change the computationally intensive simulation code to use Parallel-NetCDF while the other components of the workflow (e.g., visualization, analysis, archiving) can continue to use serial netCDF. Annotated Examples
Parallel-NetCDF shares many netCDF concepts but uses a slightly different API. The following simple example programs should make the differences clear.
Standard Interface Figure contains a full, correct Parallel-NetCDF program to create a simple dataset and write data into that dataset in parallel. This program, though brief, demonstrates many key features of the library.
[Figure: on the left, the application data structures – a 1,024 × 1,024 × 26 array of doubles (“temp”) and a 512 × 512 array of floats (“surface_pressure”); on the right, the layout of the netCDF file “checkpoint07.nc” by offset in the file. A netCDF header describes the contents of the file – typed, multidimensional variables and attributes on variables or the dataset itself – for example, Variable “temp” { type = NC_DOUBLE, dims = {1024, 1024, 26}, start offset = 65536, attributes = {“Units” = “K”} } and Variable “surface_pressure” { type = NC_FLOAT, dims = {512, 512}, start offset = 218103808, attributes = {“Units” = “Pa”} }. The data for each variable is stored in contiguous blocks, encoded in a portable binary format according to the variable’s type.]
NetCDF I/O Library, Parallel. Fig. Left: how an application views netCDF objects. Right: how netCDF and Parallel-NetCDF store those objects on disk
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv)
{
    int ncfile, nprocs, rank, dimid, varid1, varid2, ndims = 1;
    MPI_Offset start, count = 1;
    char buf[13] = "Hello World\n";

    MPI_Init(&argc, &argv);

    /* Collective create: every process passes the same communicator and Info object */
    ncmpi_create(MPI_COMM_WORLD, "demo.nc", NC_WRITE | NC_64BIT_OFFSET,
                 MPI_INFO_NULL, &ncfile);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Define mode: one dimension, two integer variables, one global attribute */
    ncmpi_def_dim(ncfile, "d1", nprocs, &dimid);
    ncmpi_def_var(ncfile, "v1", NC_INT, ndims, &dimid, &varid1);
    ncmpi_def_var(ncfile, "v2", NC_INT, ndims, &dimid, &varid2);
    ncmpi_put_att_text(ncfile, NC_GLOBAL, "string", 13, buf);
    ncmpi_enddef(ncfile);

    /* Data mode: each process collectively writes its rank at offset 'rank' */
    start = rank;
    ncmpi_put_vara_int_all(ncfile, varid1, &start, &count, &rank);
    ncmpi_put_vara_int_all(ncfile, varid2, &start, &count, &rank);

    ncmpi_close(ncfile);
    MPI_Finalize();
    return 0;
}
NetCDF I/O Library, Parallel. Fig. Basic Parallel-NetCDF program using the standard interface. Each process collectively writes its MPI rank into two variables. For brevity, error checking/handling has been omitted
Because Parallel-NetCDF relies on MPI-IO, the program must include the mpi.h header file (line ) and initialize (line ) and finalize (line ) the MPI library. The ncmpi_create routine (line ) adds two arguments to netCDF’s creation function: an MPI communicator and an MPI Info object. These two arguments do expose a bit of the underlying MPI library, but they give the programmer some flexibility and the ability to tune operations if needed. In this example, the communicator is just MPI_COMM_WORLD, or all processes in this MPI program. In some situations, though, the application may wish to have only a subset of processes participate in a dataset operation. The Info object in this example is empty, but the Tuning Parallel-NetCDF section will cover some situations where the object is used. Parallel-NetCDF, like netCDF, has a bimodal interface: when creating a dataset, an application starts in define mode, where it must first describe what variables it will write, the size and type of those variables, and any attributes. This example defines a single dimension (line ) and assigns that dimension to two variables (lines and ).
Datasets can contain a great deal of metadata. Both variables and dimensions have human-readable labels. In addition to those labels, attributes may be defined and placed on dimensions, variables, or the dataset itself. Line places an attribute on the entire dataset. In the example, line calls ncmpi_enddef to switch from define mode to data mode. By asking the programmer to predefine variables and attributes, the library can allocate space with little overhead. Reentering define mode can potentially trigger an expensive rewriting of the dataset. For a large class of problems, however, one knows ahead of time which information will be written out. Lines - do the actual work of storing information into the dataset. Parallel-NetCDF provides one set of functions for uncoordinated independent I/O and another parallel set for coordinated collective I/O. This example uses the collective version (as shown by the _all suffix). Every MPI process participates in these writes, each writing out its MPI rank. In a traditional serial netCDF program, each rank would send data to a master process, and this process would in turn write out the information. This “send-to-master” model
quickly becomes untenable, however, as the number of MPI processes increases and as the amount of memory available to each MPI process gets smaller. The collective Parallel-NetCDF routines in turn call collective MPI-IO routines. MPI-IO collective routines can use powerful optimizations such as two-phase I/O [] or data shipping []. It is possible to use independent I/O with Parallel-NetCDF, but the opportunities for optimization are greatly limited in that case. Once executed, this example will produce a file “demo.nc.” This dataset will contain two variables, as can be verified with either netCDF or Parallel-NetCDF utilities. The Parallel-NetCDF standard interface shares much in common with the serial netCDF interface. The function name explicitly states whether the
NetCDF I/O Library, Parallel. Fig. Writing a subarray of a two-dimensional array
operation is a read or write (_put or _get), the type of access (e.g., _var, _vara), and the data type (e.g., _int, _float, _double). For example, from the name alone a programmer knows that the function ncmpi_put_vara_int writes integers into a subarray.
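To illustrate this naming convention, the following hedged sketch (not taken from the original text) reads the “demo.nc” dataset produced by the earlier program back in; the ncmpi_open, ncmpi_inq_varid, and ncmpi_get_vara_int_all routines mirror the put-side calls shown above, and the variable names are the ones defined in that example.

/* Sketch only: each process reads back the value it wrote to "v1".
 * Error checking is omitted for brevity, as in the figures above. */
#include <mpi.h>
#include <pnetcdf.h>

int read_back(MPI_Comm comm)
{
    int ncfile, varid, rank, value;
    MPI_Offset start, count = 1;

    MPI_Comm_rank(comm, &rank);

    /* Open the existing dataset read-only; every process passes the
     * same communicator, as with ncmpi_create. */
    ncmpi_open(comm, "demo.nc", NC_NOWRITE, MPI_INFO_NULL, &ncfile);

    /* Look up the variable defined in the earlier example by name. */
    ncmpi_inq_varid(ncfile, "v1", &varid);

    /* Collectively read one integer per process: _get, _vara, _int, _all. */
    start = rank;
    ncmpi_get_vara_int_all(ncfile, varid, &start, &count, &value);

    ncmpi_close(ncfile);
    return value;   /* should equal this process's rank */
}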
A more realistic example
This simple example, operating on a one-dimensional array, may not give the most insight into all the work going on behind the scenes. Figure depicts a two-dimensional array of which this example program wants to write only a smaller subarray. Perhaps this array represents one frame of a movie and a parallel program renders portions of each frame. Figure contains the Parallel-NetCDF code needed to write the indicated region. First, the program must describe the entire variable: lines , , and describe the overall shape of the variable and the type of data it will contain. Lines and set up the shape of the desired selection or subregion of the variable and pass these arguments as parameters to the call at line . These few lines of Parallel-NetCDF code trigger a lot of activity behind the scenes. Even though the desired region is logically contiguous, when the array is stored on disk, this selection will end up scattered across the file in a noncontiguous access pattern. Fortunately for Parallel-NetCDF, MPI-IO makes it easy to describe these accesses with a file view. Storage systems yield the best performance with a few large, contiguous accesses. The MPI-IO library will
...
#define NDIMS 2
int dims[NDIMS], varid1, ndims = NDIMS;
MPI_Offset start[NDIMS], count[NDIMS];

ncmpi_def_dim(ncfile, "y", 4, &(dims[0]));
ncmpi_def_dim(ncfile, "x", 6, &(dims[1]));
ncmpi_def_var(ncfile, "frame", NC_DOUBLE, ndims, dims, &varid1);
ncmpi_enddef(ncfile);

/* Select the 2x2 region whose upper-left corner is row 0, column 4;
 * 'data' points to the 4 doubles to be written */
start[0] = 0; start[1] = 4;
count[0] = 2; count[1] = 2;
ncmpi_put_vara_double_all(ncfile, varid1, start, count, data);
...
NetCDF I/O Library, Parallel. Fig. Writing the selected portion of the above array
apply several optimizations to transform a noncontiguous access into one more likely to give good bandwidth. The Parallel-NetCDF call in this example is collective, which means that the lower MPI-IO layer can in turn use collective I/O optimizations: the MPI-IO library can communicate with the other processes participating in this I/O call and rearrange the requests. To the file system, this subarray access will end up looking like a more performance-friendly contiguous request. While this discussion glosses over many of the details, the important point is that a few Parallel-NetCDF functions convey a great deal of information to the lower levels of the I/O stack. Applications using Parallel-NetCDF thus get the benefits of a self-describing portable file format and a relatively straightforward API while still achieving good performance without requiring the application programmers to master MPI-IO.
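As a rough illustration of the file-view mechanism mentioned above – this is not Parallel-NetCDF’s internal code, only a hedged sketch of the underlying MPI-IO concept, and the file name and sizes are invented for the example – an application writing the same 2 × 2 region of a 4 × 6 array of doubles directly through MPI-IO might describe the noncontiguous file layout with a subarray datatype:

/* Hedged sketch: describe a 2x2 subarray of a 4x6 array of doubles as an
 * MPI datatype and install it as the file view, so one collective write
 * covers the scattered file locations. In a real parallel program each
 * process would build a view for its own region of the frame. */
#include <mpi.h>

void write_subarray(MPI_Comm comm, const double *local_block)
{
    MPI_File fh;
    MPI_Datatype filetype;
    int sizes[2]    = {4, 6};   /* whole array */
    int subsizes[2] = {2, 2};   /* selected region */
    int starts[2]   = {0, 4};   /* upper-left corner of the region */

    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(comm, "frame.dat", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    /* The file view makes the noncontiguous file locations appear
     * contiguous to the subsequent write call. */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, local_block, 4, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
}

Parallel-NetCDF constructs conceptually similar datatypes and file views on the application’s behalf, which is why the single ncmpi_put_vara_double_all call in the figure above suffices.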
The Flexible Interface Parallel-NetCDF provides an additional set of routines to allow for even more freedom in describing an application’s data model. In Fig. , a write operation from Fig. is converted to the Flexible Data Mode interface. This interface allows an application to use MPI datatypes to describe how information is laid out in memory. The datatype in this example is simple (MPI_INT), but a more complicated MPI datatype might, for example, write out all the data in a multidimensional array while skipping over the “ghost cells.” Ghost cells, copies of data belonging to other MPI processes, are used to optimize communication in nearest-neighbor simulations and do not actually need to be written to disk. In many cases, the Flexible Data Mode interface permits
an application to avoid an additional buffer copy before making a Parallel-NetCDF call.
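As a hedged sketch of the ghost-cell case just described – the array sizes, ghost width, and function name here are illustrative choices, not taken from the text – an MPI subarray datatype can describe the interior of a locally allocated block so that ncmpi_put_vara_all writes it directly from the padded buffer:

/* Sketch: write the 4x6 interior of a 6x8 local buffer that carries a
 * one-cell ghost layer on each side, without first copying the interior
 * into a contiguous staging buffer. Sizes are illustrative only. */
#include <mpi.h>
#include <pnetcdf.h>

void put_interior(int ncfile, int varid, const double *padded,
                  const MPI_Offset start[2], const MPI_Offset count[2])
{
    MPI_Datatype interior;
    int sizes[2]    = {6, 8};   /* local buffer including ghost cells */
    int subsizes[2] = {4, 6};   /* interior to be written */
    int starts[2]   = {1, 1};   /* skip the ghost layer */

    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &interior);
    MPI_Type_commit(&interior);

    /* Flexible Data Mode: one instance of the datatype describes the
     * in-memory layout; start/count still describe the region in the file. */
    ncmpi_put_vara_all(ncfile, varid, start, count, padded, 1, interior);

    MPI_Type_free(&interior);
}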
Nonblocking Interface Parallel-NetCDF further extends the traditional netCDF API through the introduction of nonblocking I/O routines. Programmers familiar with MPI nonblocking communication routines will note strong similarities between MPI nonblocking routines and those in Parallel-NetCDF. A program posts one or more operations and then waits for completion of those operations. In Fig. , the example from Fig. has been modified yet again, this time to use Parallel-NetCDF nonblocking operations. Throughout the I/O software stack, more information about I/O activity results in more opportunities for optimization. In this example, even though the calls operate on separate variables, those variables are still in the same dataset and appear to the MPI-IO layer as parts of a single file. Parallel-NetCDF can take these nonblocking requests and optimize them into a single, larger I/O operation []. Parallel-NetCDF can optimize these nonblocking requests in this way because the library makes no guarantees about when work will occur (sometimes called “make progress” in the MPI context). It may appear to the program that I/O work happens in the background when calling these nonblocking routines. Actually, when the operation is posted (lines and ), no work happens. Instead, the library defers all work until the application makes an additional “wait for completion” function call (line ). Once Parallel-NetCDF has a list of all the outstanding operations, it can then construct a single I/O request encompassing all operations. Typically, storage systems perform better with larger
/* function prototype */
int ncmpi_put_vara_all(int ncid, int varid,
        /* start and count describe the access in the file */
        const MPI_Offset start[], const MPI_Offset count[],
        /* 'buf', 'bufcount', and 'datatype' describe the data in memory */
        const void *buf, MPI_Offset bufcount, MPI_Datatype datatype);
...
start = rank;
ncmpi_put_vara_all(ncfile, varid1, &start, &count, &rank, count, MPI_INT);
...
NetCDF I/O Library, Parallel. Fig. Prototype for and usage of one of the Flexible Data Mode routines
int requests[2], statuses[2], data[2];
...
data[0] = rank + 1000;
data[1] = rank + 10000;
ncmpi_iput_vara_int(ncfile, varid1, &start, &count, &(data[0]), &(requests[0]));
ncmpi_iput_vara_int(ncfile, varid2, &start, &count, &(data[1]), &(requests[1]));
ncmpi_wait_all(ncfile, 2, requests, statuses);
NetCDF I/O Library, Parallel. Fig. The “implicit” nonblocking interface: no progress occurs in the background, but operations can be batched together when completed
I/O requests, so coalescing operations in this way can yield good performance improvements.
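The figures above omit error handling for brevity; as a small hedged addition that follows the variable names of the nonblocking example, the per-request status codes returned by ncmpi_wait_all can be checked as follows:

/* Sketch: report any posted request that did not complete successfully. */
#include <stdio.h>
#include <pnetcdf.h>

void check_statuses(const int statuses[], int nreqs)
{
    int i;
    for (i = 0; i < nreqs; i++) {
        if (statuses[i] != NC_NOERR)
            fprintf(stderr, "request %d failed: %s\n",
                    i, ncmpi_strerror(statuses[i]));
    }
}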
Tuning Parallel-NetCDF Extracting good performance out of Parallel-NetCDF comes down largely to following a handful of best practices: perform collective I/O to a small number of files. Doing so provides the most information to the lower I/O software stack layers, and allows those layers to make as many optimizations as possible. For even more fine-tuning, the library does provide some additional tuning mechanisms for applications. For the most part, these tuning knobs help tweak parameters for layers farther down the software stack. The MPI Info parameter passed to the create and open calls can direct the optimizations the underlying MPI-IO library chooses to make. For example, on Lustre file systems file locks are exceedingly expensive. By setting the MPI Info hint romio_ds_write to “disable”, the ROMIO implementation of the MPI-IO library (the most common implementation on systems with Lustre installed) will avoid using locks; a brief sketch of passing such a hint appears at the end of this section. Knowing more about the characteristics of the storage system and the application workload will make the selection of appropriate hints easier and more productive. The available MPI-IO tuning parameters vary based on the MPI-IO implementation and file system used. Users should consult their MPI-IO library’s documentation. Often, the most effective way for a computational scientist to improve the I/O performance of a program is to enlist the aid of the Parallel-NetCDF community (see the Bibliographic Notes and Further Reading section). If the scientist can remove the simulation and science aspects of the program, leaving only the representative data structures, the resulting “I/O kernel” becomes a
valuable resource for exploring all tuning options available. I/O kernels leave no doubt as to what the scientist requires from the Parallel-NetCDF library and storage system. Because these small programs have few if any dependencies on additional libraries, they can become part of the library’s correctness and performance tests, ensuring that modifications and improvements made to Parallel-NetCDF also benefit applications. A discussion about tuning Parallel-NetCDF would not be complete without a few words concerning record variables. In Parallel-NetCDF and netCDF, variables can have an unlimited dimension. Often, variables that change over time use this feature: each iteration of the simulation can append data along the unlimited “time” dimension. When more than one variable contains an unlimited dimension, those variables are stored on disk in an interleaved fashion. The writing and reading of data interleaved in this way will often yield poor performance. Programmers must weigh the flexibility of record variable storage against its performance cost. With an I/O kernel, Parallel-NetCDF experts can likely find a set of tuning parameters to mitigate some of the performance loss, and the project will be looking at more sophisticated ways of dealing with record variables in the future.
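As a hedged sketch of passing such a hint – the specific hint shown is the romio_ds_write setting discussed above, the creation flag is illustrative, and whether the hint helps depends entirely on the file system and MPI-IO implementation in use – an MPI Info object is built and handed to ncmpi_create in place of MPI_INFO_NULL:

/* Sketch: create an Info object carrying an MPI-IO tuning hint and pass
 * it at dataset-creation time. Hints are only requests; unrecognized
 * hints are silently ignored by the MPI-IO layer. */
#include <mpi.h>
#include <pnetcdf.h>

int create_with_hints(MPI_Comm comm, const char *path, int *ncfile)
{
    MPI_Info info;
    int err;

    MPI_Info_create(&info);
    /* Disable ROMIO's data sieving for writes, e.g., on Lustre, to avoid
     * expensive file locks (see the discussion above). */
    MPI_Info_set(info, "romio_ds_write", "disable");

    err = ncmpi_create(comm, path, NC_CLOBBER, info, ncfile);

    MPI_Info_free(&info);
    return err;
}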
Conclusion The Parallel-NetCDF library provides a more natural programming interface for high-performance storage systems. Because Parallel-NetCDF expresses parallel I/O concepts in terms of multidimensional arrays, many computational science simulations are able to express their data structures naturally. Parallel-NetCDF provides a further benefit in that it encapsulates a great deal of I/O expertise. Computer scientists working on high-performance storage may not know a great deal about weather, climate, combustion,
or other scientific domains but know much about how to get the best performance out of storage devices, the file system, and MPI-IO. Application scientists in turn specialize in modeling phenomena, developing numerical methods, and exploring other topics more relevant to computational science. Parallel-NetCDF and other high-level I/O libraries provide a middle ground where scientists and storage experts can come together to maximize productivity.
Related Work The HDF5 library [] was the first high-level I/O library to be built on top of MPI-IO. HDF5 provides a large and flexible API. For example, HDF5 applications do not need to enter a define mode to describe the variables and dimensions. HDF5 writes all metadata for the file on an as-needed basis. The trade-off for this flexibility is some additional overhead if the application is creating many new variables or performing other metadata-intensive workloads. HDF5 has the further benefit of allowing variables to grow without bound in any dimension. The HDF5 library allocates space for variables on demand. Consider the storage of a sparse matrix: only those values actually present in the matrix would be stored on disk. When multiple processes are updating variables in this way, however, this on-demand space allocation can result in some performance and consistency challenges. HDF5 currently forces all parallel metadata updates through rank 0, though the group is working on more sophisticated techniques. More recently, the serial netCDF developers released netCDF-4 [], a library with the netCDF programming API on top of the HDF5 file format. This project brought several HDF5 features to netCDF, including parallel I/O support, support for variables with multiple unlimited dimensions, and support for much larger variables. While introducing a new file format, it still can operate on older netCDF datasets. The netCDF-4 API does not have some of the more sophisticated Parallel-NetCDF features like the Flexible Mode interface or the Nonblocking interface.
Related Entries Benchmarks Distributed-Memory Multiprocessor HDF MPI (Message Passing Interface)
I/O MPI-IO
Bibliographic Notes and Further Reading For more technical insight into Parallel-NetCDF, Jianwei Li et al.’s SC paper [] presents the design choices made in developing the library; experimental results are also presented. The multivariable I/O optimizations are covered further in []. The Parallel-NetCDF website is www.mcs.anl.gov/parallel-netcdf. The site hosts pointers to additional Parallel-NetCDF papers and projects, as well as links to production releases and the latest code. The Parallel-NetCDF community of users and developers communicates on the parallel-netcdf@mcs.anl.gov mailing list. The quality of discourse on the list has been high from the beginning. Any topic related to Parallel-NetCDF is open for discussion. Parallel-NetCDF developers regularly give tutorials and workshops covering the library.
Acknowledgment This work was supported in part by the U.S. Dept. of Energy under Contract DE-AC-CH.
Bibliography . Gao K, Liao WK, Choudhary A, Ross R, Latham R () Combining i/o operations for multiple array variables in parallel netcdf. In: Proceedings of the workshop on interfaces and architectures for scientific data storage, held in conjunction with the IEEE Cluster Conference, New Orleans, Louisiana, September . Li J, Liao WK, Choudhary A, Ross R, Thakur R, Gropp W, Latham R, Siegel A, Gallagher B, Zingale M () Parallel netCDF: a high-performance scientific I/O interface. In: Proceedings of SC: high performance networking and computing, Phoenix, AZ, November . IEEE Computer Society Press, Los Alamitos . Prost JP, Treumann R, Hedges R, Jia B, Koniges A () MPI-IO/GPFS, an optimized implementation of MPI-IO on top of GPFS. In: Proceedings of SC, November . Thakur R, Gropp W, Lusk E () Data sieving and collective I/O in ROMIO. In: Proceedings of the Seventh Symposium on the Frontiers of Massively Parallel Computation, IEEE Computer Society Press, pp –, February . The HDF Group () HDF. http://www.hdfgroup.org . The MPI Forum. MPI-: extensions to the message-passing interface, July . http://www.mpi-forum.org/docs/docs.html . Unidata. netCDF. http://www.unidata.ucar.edu/software/netcdf/index.html
Network Adapter Network Interfaces
Network Architecture Collective Communication, Network Support For Cray XT and Seastar 3-D Torus Interconnect Ethernet InfiniBand Myrinet Networks, Direct Quadrics
Network Interfaces Holger Fröning University of Heidelberg, Heidelberg, Germany
Synonyms NI (Network Interface); NIC (Network Interface Controller or Network Interface Card); Network adapter
Definition A network interface – or network interface controller, network interface card, network adapter – is a hardware component that is designed to allow computers to access an interconnection network for communication and synchronization purposes. Therefore, a network interface typically provides two different kinds of interfaces, one toward the computer (host) side and one toward the network side. The network interface translates the protocol of the host interface to the network protocol and vice versa, and translates between the different physical media. From the network’s point of view, network interfaces are the end points of the network, as here the network packets are injected into, and retrieved from, the network. As all end points of a network must be uniquely addressable, each network interface is assigned a unique number. The major task of a network interface is to insert packets into the network on the sending node, and to receive them from the network on the target node. The accomplishment
of this task, however, is highly dependent on the network interface architecture. In particular, the network interface architecture defines which subtasks are to be performed by software layers and which are performed by hardware modules.
Discussion Introduction The main task of a network interface is to provide a computer access to an interconnection network in a performant, transparent, and safe manner []. A well-designed network interface is performant by providing a high bandwidth and low latency. Similarly, it is transparent because it only adds minimal overhead to the performance properties of the network, which are again bandwidth and latency. A safe network interface should prevent erroneous processes from disabling the network or reducing the network’s performance. Typically, a well-designed network interface is unobtrusive. Opposed to this, a badly designed network interface will turn into a bottleneck, preventing a client from leveraging the network’s full performance. Considering the Open System Interconnection Reference Model (OSI Reference Model) [], which divides the task of communication into seven layers, the main task of a basic network interface is to handle the lowest two layers. These are the Physical Layer (PHY) and Data Link Layer (DLL). In the Data Link Layer in particular the Media Access Control (MAC) is handled by the network interface. However, more sophisticated network interfaces can also inherit tasks from upper layers. Then CPU off-loading takes place, and tasks are handled by the network interface instead of a software instance running on a CPU. Implementing only the PHY and MAC layer on a network interface is typically true for indirect networks []. Here, central switches are used to connect an arbitrary number of network nodes, and routing only takes place in these switches. In a direct network the switching resources are distributed over the network nodes; thus, each network interface additionally includes the switching and routing modules. The network degree defines the number of links per node, and packets are routed among these links (forwarding of packets) and the host interface (injection and retrieval of packets). Therefore, a network interface for a direct network implements, besides the PHY and MAC layers,
also the Network Layer, which is responsible for switching and routing. If, for instance, the network (and the network interface) also provides Quality of Service (QoS), the network interface must also implement parts of the Transport Layer.
Basic Architecture Network interfaces can be classified by how messages (or more generally: packets) are injected and retrieved from the network – which is the basic and unique task of all network interfaces. The most basic approach uses one register located on the network device, the so-called register-based interface. A process running on a Central Processing Unit (CPU) generates the message by consecutively writing into this register: as messages are usually separated into header, payload, and tail part, these three parts are written word after word into the device register. The network interface collects the data written and assembles a message out of this information. The message is forwarded over the network to the destination, where a process can receive the message by reading it word after word from a device register. This approach requires a fixed message format in terms of header, payload, and tail size, which restricts the use of it. A solution to overcome this is to use a set of registers instead, where the different registers serve for different message parts. In the register-based interface, the CPU (i.e., a process running on the CPU) writes data into, or reads data from, a register for each word of the message. This Programmed I/O (PIO) is a viable solution for small messages []. For larger messages the CPU is occupied for a long time by copying data to and from the network. A solution to off-load this task from the CPU to the network interface is Direct Memory Access (DMA). Here, the CPU describes the message to be sent – including header and tail information and location of the data payload in the main memory – in a descriptor. Only this descriptor is transferred by the CPU to the network interface, typically using PIO. Using the information contained in this descriptor, the network interface is able to independently send or receive a message. In the case of sending it accesses the main memory to fetch the data payload and assembles a message. In the case of receiving it writes the data payload to the main memory location
specified in the receive descriptor. Because such a descriptor-based network interface works asynchronously, the CPU has to be explicitly informed when a task has been completed. The descriptor-based approach entails a higher overhead compared to the register-based approach: the descriptor must be constructed, it must be handed over to the network interface, and – as an indirection stage – the network interface must access the main memory to fetch or store the payload. Because for small payloads this overhead is a large fraction of the communication time, the descriptor-based approach is typically used for large payloads. In this case, the overhead is small compared to the complete communication time. Additionally, communication and computation can overlap because the CPU is not actively handling messages.
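The entry does not give a concrete descriptor layout; purely as a hypothetical illustration of the descriptor-based scheme just described – all field names and widths here are invented – a send descriptor might bundle the header information with the location and length of the payload in main memory:

/* Hypothetical example only: real network interfaces define their own
 * descriptor formats; this merely illustrates the idea. */
#include <stdint.h>

struct send_descriptor {
    uint32_t dest_node;      /* network-wide unique address of the target */
    uint32_t length;         /* payload length in bytes */
    uint64_t payload_addr;   /* DMA-able main-memory address of the payload */
    uint32_t flags;          /* e.g., "raise interrupt on completion" */
    uint32_t tag;            /* header/tail information for the receiver */
};

/* The CPU fills in such a descriptor and hands only this small structure
 * to the device (typically via PIO); the device later fetches the payload
 * from main memory by DMA using payload_addr and length. */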
Notifications Notifications are an important part of the communication architecture, as they trigger certain operations in the CPU or network interface. Basically, for every action the CPU has to inform the network interface and vice versa. This is typically done using explicit notifications. An exception is device registers, because accessing these on-device structures can result in side effects which equal an explicit notification. Basic notifications are possible using device registers, where the CPU can read out the current state of the network interface, for example, if a message in a register set has been consumed or if a new message has been received and is available in a register set. Similarly, the network interface uses side effects to get notified of new requests in the register sets. However, this mechanism cannot be used to notify the CPU, because accesses to main memory have no such side effects. The usual solution to notify a CPU is either an interrupt that is asserted by the network interface, or to ensure that a CPU is periodically checking certain memory or register locations for changes (polling). While the polling method offers the lowest latency possible, it consumes a lot of CPU time. On the other hand, interrupts do not consume CPU time when no event is present, but the latency of the notification is much higher due to the interrupt handling which has to be done by the Operating System (OS).
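As another purely hypothetical sketch – none of these names come from the entry, and a production implementation would need the platform's memory barriers – polling a completion flag that the network interface writes into a main-memory notification entry could look as follows; an interrupt-driven design would instead let the OS wake the waiting process:

/* Hypothetical polling loop: the device sets 'done' to nonzero when the
 * corresponding request has completed. 'volatile' prevents the compiler
 * from caching the flag in a register. */
#include <stdint.h>

struct notification {
    volatile uint32_t done;
    uint32_t status;
};

static uint32_t wait_for_completion(struct notification *n)
{
    while (n->done == 0) {
        /* spin: lowest latency, but burns CPU time (see text above) */
    }
    return n->status;
}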
Queues While the basic architecture presented above is sufficient in principle, it is desirable to allow more outstanding send or receive requests for an improved decoupling between CPU and network interface. The register (set) – either being used for the message itself or for a descriptor – allows only one outstanding request: the CPU writes the information into the register set, and then waits until this data has been consumed by the network interface. Only then can the register (set) be used for the next request. The solution to this problem is queues, which are used as the interface between CPU and network interface. They provide a decoupling between CPU and network interface. By storing either a message or a descriptor as an entry in a queue, as many requests as queue entries exist can be outstanding. For sending a packet, the CPU inserts a new entry in the appropriate queue and the network interface consumes it. For receiving, the queue is used vice versa. Queues can be located in an arbitrary space – for instance main memory, on-chip memory of the network interface, or DRAM on the network interface card. For the send queue, it has to be ensured that the network interface is informed about new entries in the queues. This is in particular true for queues located in main memory, while for on-device queues side effects can possibly be leveraged. For the receive queue the CPU has to be notified about new entries. The only exception is a polling approach with guaranteed change of data in a queue entry. Queues are applicable for all interactions between CPU and network interface, including notifications. Several approaches exist for where these queues are located. So-called memory-less network interfaces use queues in main memory, thus they do not require on-card memory. The reason is that main memory is one of the cheapest memories available, whereas on-card or on-chip memory is more expensive. In addition, main memory is also more scalable. Advanced Architecture The network interface is typically a single instance located between highly parallel environments. On the host side the degree of parallelism is increasing due to the multi-core trend []. More and more CPU cores are integrated into CPU sockets and – in the worst case – all of them want to communicate using
the network interface. The resurgence of interest in Virtual Machine (VM) environments [, ] is increasing the number of running processes potentially accessing the network. On the network side, a large number of remote network nodes are potentially accessing local resources over the local network interface. Only a network interface that adds minimal overhead allows a performant network connection and does not turn into a bottleneck. Many network interfaces rely on the OS to check requests and multiplex the network interface between competing processes. Only if requests are not erroneous are they forwarded to the network interface, where they can be processed safely. Such network interfaces are typically of very low complexity. However, the involvement of the OS requires system calls (or inter-process communication, IPC), which results in an increase of overhead, in particular CPU time and latency. An example work flow for a multiplexed access is shown in Fig. . For each process, work packets and notifications are held in exclusive data structures located in main memory. A privileged process supervises the network access and multiplexes the network device among the competing user-level processes. For this purpose, each user process notifies the privileged process and the network device updates its status to the privileged process. If the network device is unoccupied, the privileged process can schedule a new work request from one of the user processes or a receive request to the network device. For a Virtual Machine environment, the privileged process is typically part of the Hypervisor, as a guest OS does not have access to hardware. User-level communication [] avoids the involvement of the OS (OS bypass) and allows user-level processes to directly communicate with the device. In this case, however, it is up to the network interface to check requests for errors. Only then is a safe operation guaranteed and erroneous packets cannot disable the network or reduce its performance. Because the OS is bypassed, no multiplexing between processes can take place. This multiplexing must now be performed by the device, either in a time-sliced manner, using module replication [] or device virtualization. Device virtualization [] is an approach that virtualizes the network interface in such a way that an arbitrary number of processes can use the network interface directly and without involvement of the OS. This
Network Interfaces. Fig. Work flow for multiplexed access. Several processes are accessing the device by storing an optional payload () and a work request () in the appropriate exclusive data structures. Once a process has prepared such a work packet (WP), it notifies the supervising process or software instance using IPC or system calls (). In the case of an unoccupied network device, the privileged process selects one user process and fetches the corresponding context information (). It reads the work request entry from the queue () and checks it for correctness. It then forwards it to the network device’s functional unit (FU) (). The work request is enriched with information from the context, so the FU receives all information needed for processing. Also, if referenced it fetches the payload from the main memory data structure (). It performs the processing (i.e., generation of a network packet) and then optionally writes back a notification (), either with or without an associated payload (). It also notifies the privileged process that it is free to receive a new work request (). The user process consumes the notification (), which optionally includes a reference to the payload ()
typically relies on a context-switching approach, where a number of contexts are held on the device and upon an incoming request the corresponding context is used to configure the device for this process. The work flow for a virtualized network interface is shown in Fig. : two processes are accessing a device for work processing, while an incoming packet from a remote process also requires processing. The interface between processes and device is here implemented by exclusive queues in main memory; only the issue queue on the device is shared. The combination of exclusive queue sets with only the issue queue being shared allows
bypassing any kind of privileged software instance. On the network interface, the scheduling/dispatching unit selects work packets either from local processes (i.e., from the issue queue) or from remote processes by incoming packets. It configures an unoccupied functional unit with the appropriate context, which can then process the work request. Generally, a network interface must behave according to the properties of the whole network. For instance, if a network ensures reliability – that is, it guarantees that packets once injected are delivered without errors within finite time – the network interface must
Network Interfaces. Fig. Work flow for a virtualized device []. Several processes are accessing the device by storing optional payloads () and work requests () in appropriate exclusive data structures located in main memory. Once a process has prepared such a work packet (WP), it notifies the device by enqueueing an entry in the issue queue (), which is consumed by the scheduler/dispatcher unit (). This unit then configures a free functional unit (FU) for this process (). The configured FU fetches the work request and optionally the payload. It performs the processing (i.e., generation of a network packet) and then optionally writes back a notification (), either with or without an associated payload (). The user process consumes the notification (), which optionally includes a reference to the payload ()
also ensure this. The same applies to in-order delivery of packets: if packets are not allowed to bypass one another within the network, the same must hold for the network interface. Other examples are Quality of Service or congestion management techniques. The latter is of particular interest, as many congestion management techniques use throttling of the injection rate to avoid or reduce congestion [].
Considering the requirements above, the implementation of a high-performance network interface requires advanced architectural concepts. In particular, with regard to the highly parallelized environment a parallelization of the network interface’s functional units seems appropriate. Concepts from modern processor architectures can be leveraged [], for instance superscalar functional units, Simultaneous
Multi-Threading (SMT) [] or Chip Multi-Processing (CMP) [].
Programming Paradigm Architecture and functional behavior of a network interface are in particular dependent on the programming paradigm used. Conceptually, there are several paradigms for how communication and synchronization take place in such a parallel computing system. The first paradigm is based on the exchange of messages, and this exchange is based on SEND and RECEIVE operations. For the purpose of communication or synchronization, or both, one process embeds the data as payload into a message and sends out this message to a remote process, which receives the message and unpacks the payload. The receiver decides where the payload is stored. Note that it is not possible to read remote locations directly; instead, a message has to be sent in order to trigger a SEND operation with the data from the desired location. As processes on both the sending and the receiving node are actively involved, this is also known as two-sided communication. The second paradigm is also based on the exchange of messages, but here PUT/GET operations are used. Here, the sender specifies where the data is to be stored or read from. Thus, no receiving process is actively involved and one-sided communication takes place. Usually, notifications are used to inform target processes of completed operations. The last paradigm does not rely on the exchange of messages; instead, a global address space is set up. Each network node contributes to this global address space by allowing direct remote access to parts of its local memory. For communication and synchronization purposes, LOAD and STORE instructions on remote memory locations are used. Compared to the exchange of messages this approach is considered to be more intuitive for programmers, as it is similar to the multithreaded programming paradigm used for multi-core architectures. Depending on the programming paradigm used – either message passing or shared memory – a network interface must provide the appropriate functionality. A message passing network interface allows a process to send and receive messages, which includes for instance routing calculation (for source-path routed networks) and payload assembly (for descriptor-based interfaces).
Note that each operation – either send, receive, put, or get – is process dependent, because an appropriate set of queues or registers must be selected. Thus, support for multiple processes has to be implemented separately, for instance by hardware replication or by device virtualization. In contrast to this, a shared memory network interface (or "shared memory mapper") has to perform an address translation from source local address to global address, and from global address to target local address. The destination is typically determined using parts of the global address. Then the operation is forwarded to the target node, where it is either completed or an appropriate response is sent back to the source node. As neither a LOAD nor a STORE instruction is process dependent, virtualization is provided inherently.
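These paradigms surface almost directly in programming interfaces. As a hedged illustration – expressed with MPI rather than with any particular network interface, and assuming exactly two processes – two-sided message passing requires a matching receive, whereas a one-sided PUT only names the target rank and offset:

/* Illustrative contrast between two-sided and one-sided communication. */
#include <mpi.h>

void two_sided(int rank, int *buf)
{
    if (rank == 0)
        MPI_Send(buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* sender active */
    else if (rank == 1)
        MPI_Recv(buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                        /* receiver active */
}

void one_sided(int rank, int *window_mem)
{
    MPI_Win win;
    MPI_Win_create(window_mem, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 0) {
        int value = 42;
        /* Only the origin is active; it names the target rank and offset. */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);   /* completes the access epoch on all ranks */
    MPI_Win_free(&win);
}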
Bibliographic Notes and Further Reading An introduction to network interfaces covering the basics of different network interface types is provided by Dally and Towles in []. Practical examples for network interfaces can be separated into message passing and shared memory network interfaces: a well-documented example for a message-passing I/O-attached network interface is Myrinet [], which uses a network processor for protocol handling. Other examples for message passing network interfaces include Quadrics [] and Infiniband []. Various others do exist, but for commercial reasons they are mostly only sparsely documented. Further approaches to virtualize a network interface are described in [, , ]. Compared to the vast amount of message-passing-based networks, the number of shared memory networks is rather small. One of the first shared memory research projects is the Stanford DASH/FLASH [, ]. Examples for commercial projects implementing a global coherency include the SGI Origin [] and SGI Altix []. A research project targeting a shared memory system with globally noncoherent but locally coherent memory is described by Yalamanchili et al. in []. Similar to this, the HyperTransport High Node Count specification [] allows performing memory transactions on remote nodes, thereby forming a noncoherent shared memory system.
Bibliography . Dally W, Towles B () Principles and practices of interconnection networks. Morgan Kaufmann Publishers Inc . Tanenbaum AS () Computer networks. Prentice Hall International . Duato J, Yalamanchili S, Ni L () Interconnection networks: an engineering approach. Morgan Kaufman Publishers Inc . Litz H, Fröning H, Nüssle M, Brüning U () VELO: A novel communication engine for ultra-low latency message transfers. In Proceedings of the th international Conference on Parallel Processing (September – , ). ICPP. IEEE Computer Society, Washington, DC, – . Olukotun K, Nayfeh BA, Hammond L, Wilson K, Chang K () The case for a single-chip multiprocessor. SIGPLAN Notices , (Sep. ), – . Goldberg RP () Survey of virtual machine research. In Computer, pp –, June . Barham P, Dragovic B, Fraser K, Hand S, Harris T, Ho A, Neugebauer R, Pratt I, Warfield A () Xen and the art of virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (Bolton Landing, NY, USA, October – , ). SOSP ’. ACM, New York, NY, – . Felten EW, Alpert RD, Bilas A, Blumrich MA, Clark DW, Damianakis SN, Dubnicki C, Iftode L, Li K () Early experience with message-passing on the SHRIMP multicomputer. SIGARCH Computer Architecture News , (May ), – . Fröning H, Nüssle M, Slogsnat D, Haspel PR, Brüning U () Performance evaluation of the ATOLL interconnect. In Proceedings of IASTED Conference: Parallel and Distributed Computing and Networks (PDCN), February – , , Innsbruck, Austria . Fröning H, Litz H, Brüning U () Efficient virtualization of high-performance network interfaces. In Proceedings of the Eighth international Conference on Networks (March –, ). ICN. IEEE Computer Society, Washington, DC, – . Duato J, Johnson I, Flich J, Naven F, Garcia P, Nachiondo T () A new scalable and cost-effective congestion management strategy for lossless multistage interconnection networks. In Proceedings of the th international Symposium on High-Performance Computer Architecture (February – , ). HPCA. IEEE Computer Society, Washington, DC, – . Fröning H () Architectural improvements of interconnection network interfaces. Doctoral Thesis, University of Mannheim, Germany . Tullsen DM, Eggers SJ, Levy HM () Simultaneous multithreading: maximizing on-chip parallelism. In Years of the international Symposia on Computer Architecture (Selected Papers) (Barcelona, Spain, June – July , ). G. S. Sohi, Ed. ISCA ’. ACM, New York, NY, – . Boden NJ, Cohen D, Felderman RE, Kulawik AE, Seitz CL, Seizovic JN, Su W () Myrinet: A gigabit-per-second local area network. IEEE Micro , (Feb. ), – . Petrini F, Feng W, Hoisie A, Coll S, Frachtenberg E () The quadrics Network (QsNet): High-Performance Clustering Technology. In Proceedings of the the Ninth Symposium on High
Performance interconnects (August – , ). HOTI. IEEE Computer Society, Washington, DC, . Shanley T () Infiniband network architecture. Addison-Wesley Longman Publishing Co., Inc . Liu J, Huang W, Abali B, Panda DK () High performance VMM-bypass I/O in virtual machines. In Proceedings of the Annual Conference on USENIX ’ Annual Technical Conference (Boston, MA, May – June , ). USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, – . Raj H, Schwan K () High performance and scalable I/O virtualization via self-virtualized devices. In Proceedings of the th international Symposium on High Performance Distributed Computing (Monterey, California, USA, June – , ). HPDC ’. ACM, New York, NY, – . Lenoski D, Laudon J, Gharachorloo K, Weber W, Gupta A, Hennessy J, Horowitz M, Lam MS () The Stanford DASH multiprocessor. Computer , (Mar. ), –
. Kuskin J, Ofelt D, Heinrich M, Heinlein J, Simoni R, Gharachorloo, K, Chapin J, Nakahira D, Baxter J, Horowitz M, Gupta A, Rosenblum M, Hennessy J () The Stanford FLASH multiprocessor. In Years of the international Symposia on Computer Architecture (Selected Papers) (Barcelona, Spain, June – July , ). Sohi GS, Ed. ISCA ’. ACM, New York, NY, – . Laudon L, Lenoski D () The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the th Annual International Symposium on Computer Architecture . Saini S, Jespersen DC, Talcott D, Djomehri J, and Sandstrom T () Application-based early performance evaluation of SGI altix systems for SGI systems. In Proceedings of the th Conference on Computing Frontiers, Ischia, Italy . Yalamanchili S, Young J, Duato J, Silla F () A Dynamic, Partitioned Global Address Space Model for High Performance Clusters. In CERCS Technical Reports, Georgia Institute of Technology, Georgia . Duato J, Silla F, Holden B, Miranda P, Underhill J, Cavalli M, Yalamanchili S, Brüning U, Fröning H () Scalable Computing: Why and How. HyperTransport Consortium Whitepaper
Network Obliviousness Kieran T. Herley University College Cork, Cork, Ireland
Synonyms Architecture independence
Definition A parallel algorithm is network-oblivious if it is formulated in a manner that is completely independent of any machine-specific parameters (such as the number of
processors and communication characteristics) of any parallel machine that might ultimately be employed to execute it. The objective is to design algorithms which, although oblivious to machine parameters, nonetheless exhibit good performance across a range of parallel machines with differing computing and communication characteristics.
Discussion A central issue in the study of parallel algorithms is the search for a model of computation that strikes a balance between the three desirable qualities of usability, the suitability of the model to provide a simple, expressive, and convenient basis for the design and development of algorithms; effectiveness, its ability to capture enough of the essence of specific parallel machines to ensure that the algorithms produced are likely to perform well on such machines; and portability, the quality that allows an algorithm to execute with good performance across a range of machines with differing computing and communication characteristics. The large and varied menagerie of existing models for parallel computation attests to the lack of a universal consensus on a model that blends all three qualities in a satisfactory way. The fact, for example, that many texts on the subject of parallel algorithms, such as that of Leighton [], are organized by machine architecture (mesh algorithms, hypercube algorithms, and so on), demonstrates that, in many cases at least, the most efficient algorithms are highly architecture-specific, essentially pointing to a conflict between effectiveness and portability. The well-known PRAM (Parallel Random Access Machine) model is admirably usable, but is often criticized for its lack of effectiveness. Since the model does not explicitly model communication costs, it does not discourage profligacy with regard to communication, with the result that PRAM algorithms often perform poorly when run on real parallel machines. Valiant’s BSP (Bulk Synchronous Parallel) model [] arguably strikes a better balance between usability and effectiveness. But, while it shields the algorithm designer from the finer details of the communication fabric of the machine being modeled, the model does force the algorithm designer to tune his algorithm specifically to a number of machine characteristics,
namely the number of processors and the bandwidth-latency parameters that BSP uses to model communication costs. In addition, BSP assumes a symmetric model of communication that takes no account of the fact that for machines involving a large number of processors, communication between “nearby” processors is likely to be less costly than that between more distant ones. A natural question is whether it is possible to devise a framework that balances usability on the one hand with effectiveness and portability on the other. First, the model should allow algorithms to be formulated in a manner that frees the algorithm designer from any consideration of machine characteristics such as the number of processors, the interconnection structure, and the communication capabilities of the machine ultimately used to execute it. Second, the model should ensure that the algorithms produced exhibit good performance across a range of different machines with widely varying characteristics. Somewhat surprisingly, at least for certain problems and certain classes of machine architectures, the answer to the question posed above is yes. The results sketched in the remainder of this article consider only machine architectures that can be modeled by processor networks. In this model, a set of processor-memory nodes are interconnected by means of a set of channels that link pairs of nodes, and nodes may exchange messages directly only with their immediate neighbors. Algorithms will be described in terms of a high-level model that takes no account of the number of processors or the specific interconnection structure. In that sense the algorithm is network-oblivious in the same spirit as the cache-oblivious algorithms studied by Frigo et al. []. It will be shown that network-oblivious algorithms can be devised for a variety of problems that, while formulated to be completely independent of any machine characteristics, exhibit good (often optimal) performance for a wide range of machine architectures of widely varying characteristics.
Cache-Oblivious Algorithms
The term "network-oblivious" echoes the concept of cache-oblivious algorithms in the sequential computing setting. Since the two concepts share certain ideas, the cache-obliviousness concept is summarized briefly in this section.
It has long been recognized that the interplay between the execution of a sequential algorithm and the cache(s) of the underlying machine can have a significant effect on performance, yet the familiar RAM model traditionally used to formulate and evaluate sequential algorithms completely ignores all caching effects. The cache-oblivious algorithm-design framework, formulated by Frigo et al. [], allows the algorithm designer to formulate his algorithm within the familiar RAM framework, but the algorithm's performance is evaluated using an idealized cache model that captures the amount of cache activity that the algorithm entails. Informally, the model assumes a processor connected to a single-level cache of capacity Z words, partitioned into lines of L ≤ √Z words each. The cache is in turn connected to the main memory, also partitioned into lines. A cache miss occurs whenever a request is issued for a word w that is not present in the cache. In this case, the line including w is copied from memory into the cache, replacing another line to make room. Algorithm performance is evaluated in terms of the work involved (i.e., the number of processor steps, as for the RAM model) and also in terms of the number of cache misses. It should be stressed that the algorithm designer must formulate his algorithm in a manner that is completely independent of the cache parameters Z and L. The objective is for the algorithm to have optimal or near-optimal performance regardless of the cache characteristics of the machine ultimately used to execute it. To illustrate the idea, consider the problem of multiplying two square √n × √n matrices A and B that are stored in row-major order in memory. The following identity, based on a partition of the matrices into quadrants of size √n/2 × √n/2, suggests a simple divide-and-conquer algorithm:

\[
\begin{pmatrix} A_{0,0} & A_{0,1} \\ A_{1,0} & A_{1,1} \end{pmatrix}
\begin{pmatrix} B_{0,0} & B_{0,1} \\ B_{1,0} & B_{1,1} \end{pmatrix}
=
\begin{pmatrix}
C_{0,0} = A_{0,0}B_{0,0} + A_{0,1}B_{1,0} & C_{0,1} = A_{0,0}B_{0,1} + A_{0,1}B_{1,1} \\
C_{1,0} = A_{1,0}B_{0,0} + A_{1,1}B_{1,0} & C_{1,1} = A_{1,0}B_{0,1} + A_{1,1}B_{1,1}
\end{pmatrix}
\]
The idea is that the eight √n/2 × √n/2-sized subproblems A_{0,0}B_{0,0}, A_{0,1}B_{1,0}, ⋯, A_{1,1}B_{1,1} that feature on the right-hand side of the identity are computed recursively, and the results are then combined as indicated to form the product A⋅B. Although this simple algorithm is not explicitly tuned to the cache parameters Z and L, it turns out that it uses the cache optimally: no algorithm for this problem, including those specifically tuned to the cache parameters, can have asymptotically better performance in terms of the work or the number of cache misses. In other words, the algorithm, though completely oblivious to the cache characteristics, is competitive with those formulated in a cache-aware manner. Moreover, this algorithm is also optimal (under certain conditions) on machines with multiple levels of caches. The above algorithm exhibits a certain "self-tuning" quality. The recursive divide-and-conquer structure, entailing a series of subproblems of geometrically decreasing sizes √n, √n/2, ⋯, implicitly exposes spatial and temporal locality within the computation that the caching mechanism exploits to yield good performance. Once the subproblem size is sufficiently small that all the data related to it fits in the cache, the subproblem can be completed without triggering any further cache misses. Efficient cache-oblivious algorithms are also known for a variety of other problems, such as matrix transposition, the FFT (Fast Fourier Transform), and sorting. In most cases, the algorithm takes the form of a divide-and-conquer algorithm with a highly regular structure. Some negative results are also known. Frigo et al. have also shown that algorithms that are efficient in the cache-oblivious sense often exhibit good performance when executed on real machines.
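To make the recursive structure concrete, here is a minimal Python sketch of the divide-and-conquer scheme (illustrative only, not taken from the cited work; it assumes square matrices whose side is a power of two and omits the base-case cutoff and layout optimizations a practical cache-oblivious implementation would use):

def add(X, Y):
    """Element-wise sum of two equally sized square matrices (lists of lists)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def quadrants(X):
    """Split a matrix of even side m into four m/2 x m/2 quadrants."""
    m = len(X) // 2
    return ([row[:m] for row in X[:m]], [row[m:] for row in X[:m]],
            [row[:m] for row in X[m:]], [row[m:] for row in X[m:]])

def multiply(A, B):
    """Recursive (cache-oblivious style) matrix multiplication."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A00, A01, A10, A11 = quadrants(A)
    B00, B01, B10, B11 = quadrants(B)
    # The eight half-sized products of the block identity, computed recursively.
    C00 = add(multiply(A00, B00), multiply(A01, B10))
    C01 = add(multiply(A00, B01), multiply(A01, B11))
    C10 = add(multiply(A10, B00), multiply(A11, B10))
    C11 = add(multiply(A10, B01), multiply(A11, B11))
    # Reassemble the quadrants into the full product.
    top = [r0 + r1 for r0, r1 in zip(C00, C01)]
    bottom = [r0 + r1 for r0, r1 in zip(C10, C11)]
    return top + bottom

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(multiply(A, B))  # [[19, 22], [43, 50]]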
A Model for Network Obliviousness
To explore the concept of network obliviousness, Bilardi et al. [] introduced the following framework for the formulation and evaluation of parallel algorithms. Algorithms are formulated in terms of a specification model and their quality assessed with respect to an evaluation model. The objective is that the framework provided by the pair of models offers a space for algorithm design and development that is conceptually simple, expressive
and easy to use, yet it gently constrains design in a manner that fosters the development of algorithms that are effective and portable. The specification model, denoted M(n), is an abstract machine embodying n numbered nodes P0, ⋯, Pn−1, each equipped with a processor and a private memory module. The computation consists of a sequence of supersteps, in each of which a node may first complete some local computation involving data held within the node and then transmit messages to one or more other nodes. An i-superstep is a superstep in which nodes are constrained to send/receive messages only to/from nodes whose node numbers share the same i most significant bits. Note that the numbering of nodes induces a natural decomposition of the nodes of M(n): the n/2^i nodes that share the same i most significant bits form a cluster, referred to as an i-cluster, and each i-cluster contains two (i+1)-clusters, each containing two (i+2)-clusters, and so on. The (log n)-clusters are individual nodes. It must be stressed that n is an algorithmic parameter, typically related to the size of the problem instance (e.g., the matrix size in the case of a matrix algorithm) and not intended to reflect the number of processors of the machine ultimately used to execute the algorithm. The evaluation model, denoted M(p, B), incorporates two parameters: p, denoting the number of processing elements (each consisting of a processor plus local memory and referred to subsequently simply as a "processor" for brevity), and B, the block size. The model shares the clustering structure of M(n) as well as its superstep-based mode of algorithm execution. In the M(p, B) model, however, messages are aggregated into blocks of size B for transmission at the end of each superstep. Consider the problem of matrix multiplication on the M(n) model. Suppose that two √n × √n matrices A and B are distributed in row-major fashion among the nodes of M(n). The product C = A⋅B may be computed in a manner akin to that employed in the cache-oblivious setting seen earlier. Let P(i, j) denote the node that holds entries A[i, j] and B[i, j] (and ultimately also C[i, j]).
1. Partition the nodes of M(n) into eight clusters of size n/8. Distribute the elements of matrices A and B so that each cluster has the matrix elements required for one of the eight √n/2 × √n/2 subcomputations listed on the right-hand side of the identity above.
2. Recursively complete the eight subcomputations within the individual clusters.
3. Distribute the elements of the results of the eight subcomputations so that P(i, j) receives the data it requires to compute C[i, j].
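The clustering induced by the node numbering can be made concrete with a few lines of Python (the helper names are illustrative, not part of the framework); an i-superstep is legal exactly when sender and receiver share their i most significant address bits, i.e., lie in the same i-cluster:

def i_cluster(node, i, n):
    """Index of the i-cluster containing `node` in M(n): its i most significant bits."""
    bits = n.bit_length() - 1          # log2(n); n is assumed to be a power of two
    return node >> (bits - i)

def legal_in_i_superstep(src, dst, i, n):
    """True if a message from src to dst is allowed in an i-superstep."""
    return i_cluster(src, i, n) == i_cluster(dst, i, n)

n = 16
print(i_cluster(5, 2, n))                  # nodes 4..7 form one 2-cluster -> 1
print(legal_in_i_superstep(4, 7, 2, n))    # True: same 2-cluster
print(legal_in_i_superstep(4, 8, 2, n))    # False: crosses a 2-cluster boundary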
Note that the algorithm is expressed purely in terms of the problem-size parameter n and is completely oblivious to any machine characteristics. An algorithm A for M(n) may be interpreted as an M(p, B) algorithm A′ by assigning groups of n/p consecutive nodes of the former to each processor of the latter, whose role is to simulate the actions of the M(n) nodes assigned to it. For communication steps, i-supersteps where i ≥ log p are handled internally within the M(p, B) processors (as the nodes of each such i-cluster of M(n) are mapped to a single M(p, B) processor). Supersteps where i < log p are handled as processor-to-processor communication, but with messages bound for a common destination bundled together into blocks of size B prior to transmission. The communication complexity of a superstep is defined to be the maximum, over all processors, of the number of message blocks transmitted/received. The communication complexity of an algorithm is the sum of the complexities of its constituent supersteps. It turns out that the communication complexity of the matrix-multiplication algorithm on M(p, B) is O(n/(B·p^(2/3))) (provided that B ≤ n/p). An algorithm A for M(n) is optimal if the communication complexity C of its simulation A′ on M(p, B) is optimal, i.e., every M(p, B) algorithm for the same problem has a communication complexity that is Ω(C). The O(n/(B·p^(2/3))) communication complexity of the matrix multiplication algorithm is in fact optimal (subject to certain technical assumptions). So the algorithm, though not specifically tuned to the number of processors or the block size, is optimal with respect to the M(p, B) model.
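One way such a bound can arise is from a simple recurrence over the levels of the simulation (a sketch, writing H(n, p) for the block communication complexity and assuming the eightfold cluster split and the O(n/(Bp)) per-processor block cost of the top-level redistribution described above):

\[
H(n,p) = H\!\left(\frac{n}{4}, \frac{p}{8}\right) + O\!\left(\frac{n}{Bp}\right)
\;\Longrightarrow\;
H(n,p) = O\!\left(\frac{n}{Bp}\sum_{i=0}^{\log_8 p - 1} 2^{i}\right)
       = O\!\left(\frac{n}{Bp}\, p^{1/3}\right)
       = O\!\left(\frac{n}{B\,p^{2/3}}\right),
\]

since the data size shrinks by a factor of 4 per level while the cluster size shrinks by a factor of 8, and \(2^{\log_8 p} = p^{1/3}\).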
To explore the effectiveness and portability of the algorithm, consider its performance on a network of p processors with a two-dimensional √p × √p mesh interconnection, in which each processor can exchange messages directly only with its immediate neighbors. Step 1 of the algorithm involves a communication step for the n elements of matrix A in which each processor is the source of n/p elements and the destination of 2(n/p) elements (since each matrix quadrant is replicated twice). This can be accomplished in O((n/p)·√p) = O(n/√p) time on the mesh. The same bound applies to the distribution of matrix B and to the communication involved in Step 3. Overall, the communication cost of the algorithm is captured by a recurrence of the form C(n, p) = C(n/4, p/8) + O(n/√p), which has the solution C(n, p) = O(n/√p). (Note that once a cluster size of one is reached, subproblems can be completed internally within a processor and no further communication is required.) It turns out that this O(n/√p) complexity is optimal for the matrix multiplication problem on a square mesh in the network-of-processors model. In fact, the Ω(n/√p) lower bound holds (subject to some technical restrictions) even in models that, like M(p, B), take no account of the nonconstant cost of communication between distant processors in a processor network such as the mesh. Thus the algorithm, developed in a manner that is completely oblivious to the structure and communication cost of the target architecture (in this case the two-dimensional mesh), is nonetheless optimal with respect to that architecture. Intuitively, the optimality of the algorithm stems from the communication locality implicit in the hierarchical decomposition into subproblems of geometrically decreasing size, and from the mapping of these subproblems to smaller and smaller clusters (submeshes) in a way that exploits the lower communication costs within smaller submeshes. This phenomenon of efficient network-oblivious algorithms yielding efficient algorithms when executed on specific machine architectures is not unique to matrix multiplication or to the two-dimensional mesh architecture. It extends to a variety of algorithmic problems and to a range of different network architectures. Optimal network-oblivious algorithms are also known for matrix transposition, the FFT, and sorting, for example. In each case, optimality in the network-oblivious framework guarantees good performance when the algorithm is executed on a √p × √p mesh.
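Unfolding the recurrence for the mesh makes the geometric decay of the per-level costs explicit (a sketch using the same level costs as above):

\[
C(n,p) \;=\; \sum_{i \ge 0} O\!\left(\frac{n/4^{i}}{\sqrt{p/8^{i}}}\right)
\;=\; O\!\left(\frac{n}{\sqrt{p}} \sum_{i \ge 0} 2^{-i/2}\right)
\;=\; O\!\left(\frac{n}{\sqrt{p}}\right),
\]

so the top level dominates: the data shrinks by a factor of 4 per level while the submesh side shrinks only by a factor of \(\sqrt{8}\).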
Efficient network-oblivious algorithms have also been explored for a variety of other problems, such as prefix sums, various dynamic programming problems, and some variants of Gaussian elimination. Furthermore, the effectiveness of optimal network-oblivious algorithms applies to a wide class of network architectures, loosely speaking ones that exhibit a hierarchical structure in their pattern of interconnections and in their communication costs. Bilardi et al. [] used the DBSP (Decomposable BSP) model, a hierarchical variant of the well-known BSP model, to address the issues of effectiveness and portability of network-oblivious algorithms. The DBSP model shares both the superstep-based structure of algorithm execution and the hierarchical clustering structure of the M(n) and M(p, B) models, but it has a more refined model of the costs of communication that better captures the communication characteristics of various network architectures. Specifically, it captures a notion of proximity: communication with "nearby" processors should be cheaper than with those that are more distant. An i-superstep, in which processors may transmit messages only to other processors within the same i-cluster, has a cost of ⌈δ/B_i⌉·g_i, where the parameters g_i and B_i model the bandwidth-latency characteristics of communication within an i-cluster and δ denotes the maximum number of words transmitted/received by any processor during the superstep. The hierarchical nature of both the processor clustering and the cost model for communication allows the DBSP model, for suitable choices of the parameters (g_0, g_1, ⋯, g_{log p−1}) and (B_0, B_1, ⋯, B_{log p−1}), to model a wide variety of architectures, such as meshes of various dimensions as well as fat trees. For example, a d-dimensional mesh architecture can be modeled by a p-processor DBSP with g_i = (p/2^i)^(1/d) and B_i = (p/2^i)^(1/d). It should be noted that meshes of various dimensions are a common choice of interconnection on modern parallel machines. It can be shown that, under some reasonable assumptions on the pattern of communication embodied within the algorithm, an optimal network-oblivious algorithm yields an optimal DBSP algorithm, and so optimal network-oblivious algorithms translate into efficient algorithms on architectures modeled by the DBSP, such as multidimensional arrays.
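As a small numerical illustration of the cost function (a sketch only, using the d-dimensional-mesh parameterization of g_i and B_i stated above; the concrete values are arbitrary):

from math import ceil

def dbsp_params(p, d, i):
    """g_i and B_i for a p-processor DBSP modeling a d-dimensional mesh,
    using the parameterization assumed above (both grow with the i-cluster size)."""
    cluster = p / 2 ** i              # processors per i-cluster
    return cluster ** (1.0 / d), cluster ** (1.0 / d)   # (g_i, B_i)

def superstep_cost(delta, p, d, i):
    """Cost ceil(delta / B_i) * g_i of an i-superstep in which every processor
    sends/receives at most delta words."""
    g_i, B_i = dbsp_params(p, d, i)
    return ceil(delta / B_i) * g_i

# With a fixed, small message volume, supersteps confined to smaller clusters
# (larger i) are cheaper -- the model's notion of proximity:
for i in (0, 2, 4):
    print(i, superstep_cost(delta=16, p=4096, d=2, i=i))   # 64.0, 32.0, 16.0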
Future Directions
Broadening the range of problems analyzed within the network-oblivious framework is an obvious line of further enquiry. An investigation of how near-optimal network-oblivious algorithms perform on various architectures, as well as negative results (showing that certain problems do not admit optimal network-oblivious algorithms), might be valuable in casting light on which algorithm characteristics foster or frustrate network obliviousness. Clearly, further work that broadens the space of architectures to which these results apply would also be worthwhile. Work that extends these ideas into other settings, such as the multicore work referred to below, also looks promising.
Related Entries
Bandwidth-Latency Models (BSP, LogP)
FFT (Fast Fourier Transform)
Models of Computation, Theoretical
Parallel Prefix Algorithms
PRAM (Parallel Random Access Machines)
Bibliographic Notes and Further Reading
For a survey of parallel algorithms for various processor network architectures, see Leighton's text []. For a sample of algorithms designed for the PRAM model, see JáJá []. The BSP model was introduced by Valiant []. The decomposable BSP variant DBSP was first introduced by de la Torre and Kruskal in [], but the variant used here was presented by Bilardi et al. [, ]. The latter paper also includes an interesting discussion of the interplay between usability, effectiveness, and portability. Cache obliviousness was introduced by Frigo et al. in []. The concept of network obliviousness was first formulated and explored by Bilardi et al. in []. A recent paper by Chowdhury et al. [] develops a concept of multicore obliviousness in the same spirit as cache/network obliviousness. The paper also presents a number of network-oblivious results.
Bibliography
Bilardi G, Pietracaprina A, Pucci G () A quantitative measure of portability with application to bandwidth-latency models for parallel computing. In: Euro-Par ’: Proceedings of the international Euro-Par conference on parallel processing, Toulouse. Springer, London
Bilardi G, Pietracaprina A, Pucci G () Decomposable BSP: a bandwidth-latency model for parallel and hierarchical computation. In: Reif J, Rajasekaran S (eds) Handbook of parallel computing: models, algorithms and applications. CRC Press, Boca Raton
Bilardi G, Pietracaprina A, Pucci G, Silvestri F () Network-oblivious algorithms. In: Proceedings of the international parallel and distributed processing symposium (IPDPS), Long Beach. IEEE, Piscataway
Chowdhury RA, Silvestri F, Blakeley B, Ramachandran V () Oblivious algorithms for multicores and network of processors. In: Proceedings of the IEEE international parallel and distributed processing symposium (IPDPS), Atlanta
de la Torre P, Kruskal CP () Submachine locality in the bulk synchronous setting (extended abstract). In: Euro-Par ’: Proceedings of the second international Euro-Par conference on parallel processing, vol II, Lyon. Springer, London
Frigo M, Leiserson CE, Prokop H, Ramachandran S () Cache-oblivious algorithms. In: FOCS ’: Proceedings of the annual symposium on foundations of computer science, Washington, DC. IEEE Computer Society, Los Alamitos
JáJá J () An introduction to parallel algorithms. Addison Wesley Longman, Redwood City
Leighton FT () Introduction to parallel algorithms and architectures: arrays, trees, hypercubes. Morgan Kaufmann, San Francisco
Valiant LG () A bridging model for parallel computation. Commun ACM ():–
Network of Workstations
Networks of workstations (abbreviated NOW) are clusters of commodity workstations intended to function together as a cost-efficient parallel processing system. Their effectiveness in parallel processing is often facilitated by special parallel communication and management software layers, and enhanced by high-performance system area networks.
Related Entries
Clusters
Bibliography
Anderson TE, Culler DE, Patterson DA, the NOW team () A case for NOW (Networks of Workstations). IEEE Micro ():–
Network Offload
See Collective Communication, Network Support For
Networks, Direct
Olav Lysne, Frank Olaf Sem-Jacobsen
The University of Oslo, Oslo, Norway

Synonyms
Distributed switched networks; Hypercube; Interconnection networks; K-ary n-cube; Mesh; Network Architecture; Ring; Router-based networks; Torus

Definition
A direct network is a network for interconnecting a set of nodes in such a way that no external switching components are required. These nodes are themselves usually programmable computers. A switch is an integral part of each node, so that the node can be directly connected to a set of neighbor nodes. Non-neighbor nodes can be reached through multiple hops in the network. The integration of the switch with the node implies that there is a set of network topologies associated with direct networks that is distinct from the set of topologies associated with indirect networks.

Networks, Direct. Fig. A generic architecture of a node for direct networks: functional units connected through a switch to input and output channels

Discussion
Introduction
The preferred way to connect a small set of compute nodes to each other is to attach them all to a shared medium (a bus). The main limitation of this approach lies in scalability. As the number of attached nodes increases, the total aggregate bandwidth of the bus remains constant; thus, the bus becomes a bottleneck. A popular way to get around this scalability problem is to integrate switching capabilities into the compute nodes. This allows a compute node to be directly connected to a subset of the other nodes in the system. Nodes that are directly connected to each other are generally referred to as neighbors. A generic architecture of a node is depicted in Fig. . The functional units depicted will typically be processors or memory, but other types of units, such as graphics
processors or specialized vector processors, are conceivable. Sometimes output channels and input channels are paired into bidirectional channels. In these cases, the channel interfaces denote the endpoints of bidirectional links. Still, the architectural layout of the node remains the same regardless of whether the links are unidirectional or bidirectional. The direct network is constructed from a set of such nodes by repeatedly coupling output channels of one node with input channels of another node until a topology of connected nodes has been built. There are many interconnection patterns that can be used in doing this, and each of them results in a defined class of topologies. The best-known classes of topologies are defined in a later section. It is common to refer to the topologies generally used for direct networks as "direct network topologies," and this chapter will follow the same convention. Although this wording hardly ever leads to misunderstandings, it is a bit inaccurate. All the topologies that can be applied to direct networks can equally well be the basis of the design of indirect networks. (The opposite is, however, not always the case. Although it is possible to construct, e.g., a multistage network from switches that are integrated with nodes, the routing limitations of multistage networks make it hard to obtain a fully connected and deadlock-free routing strategy in these cases.) Indeed, at the time of writing, there is a trend in industry to build switches as separate units, in accordance with some communication standard, e.g., InfiniBand. Even if these switches are not integrated with a compute node, they are used to build systems with both direct- and indirect-network topologies.
Direct networks provide very good flexibility, both with respect to scalability and topology. Extending a system with additional nodes is in many cases cost-effective, because the needed extension of the communication infrastructure is embedded in the nodes themselves, and because the necessary connection points in the existing system are provided by the nodes already present.
Topologies
The network topology defines the interconnection pattern among nodes. The usual way of modeling network topologies is as a graph G(N, C), where the vertices, denoted by N, represent the set of processing nodes and the edges C represent the set of links. The links may be defined to be unidirectional or bidirectional, based on the properties of the technology under study. In some cases it is necessary to model a bidirectional link of some technology as a pair of unidirectional links. In particular, this is necessary for the analysis of the deadlock properties of routing algorithms. In the study of topologies, there is a set of central notions. These stem partly from the requirements that the topology places on the nodes, and partly from the properties of the topology itself. Node Degree is a notion from the first category: it is the (minimal) number of external input and output channels that the topology requires of each node. Another example is Forwarding Logic. Topologies that have a regular structure can be supported by specialized logic for efficient and simple forwarding of packets tailored to that particular topology. Network diameter is a direct property of the topology itself. It is the maximal length of the minimal path between any pair of nodes, measured in the number of intermediate nodes that have to be traversed. Much early research on topologies focused on solutions that minimized the diameter. The introduction of Cut Through Switching in the early s did, however, make packet latency in a network largely independent of the number of hops. Bisection Bandwidth is a notion that captures the ability of a topology to handle bandwidth requirements. It is defined as the minimum aggregate bandwidth of a set of links that, when removed, divides the network into two equal-sized halves. Although this notion is important, its relevance is not independent of the application that the network should support. The main strength of some direct-network topologies is that they support very well applications that require communication only with neighboring nodes. In such a scenario, the bandwidth of a single link is important, but the aggregate bandwidth across the bisection is of less importance.
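To make the diameter notion concrete, the following short Python sketch (illustrative only) builds a ring and a two-dimensional mesh as adjacency lists and computes the diameter by breadth-first search; note that it counts hops, which is one more than the number of intermediate nodes traversed:

from collections import deque
from itertools import product

def ring(n):
    """Adjacency list of a bidirectional ring with n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def mesh2d(rows, cols):
    """Adjacency list of a 2-D mesh; nodes are (row, col) tuples."""
    adj = {}
    for r, c in product(range(rows), range(cols)):
        nbrs = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        adj[(r, c)] = [(x, y) for x, y in nbrs if 0 <= x < rows and 0 <= y < cols]
    return adj

def diameter(adj):
    """Longest shortest path (in hops) over all node pairs, via BFS from every node."""
    best = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

print(diameter(ring(8)))       # 4: opposite side of the ring
print(diameter(mesh2d(4, 4)))  # 6: corner to corner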
Rings
The simplest extendible direct network topology is the ring (Fig. ). The ring may be unidirectional or bidirectional. For rings, the nodes need only one input and one output channel. The forwarding logic can be made quite simple, in that every node compares the address of each packet passing by: if the address matches the node's own address, the packet is delivered to the internal parts of the node itself; otherwise, it is forwarded to the next node in the ring. Some technologies, like SCI, are tailor-made for ring structures, and thus take advantage of easy forwarding and low node degree. The bisection bandwidth and the network diameter of rings are generally considered the weak points of these structures. The network diameter grows linearly with the number of nodes, and the bisection bandwidth remains constant whatever the size of the ring. This severely limits the size of systems built as a single ring. Systems of rings can, however, be constructed to improve on this weakness. The most prominent example of this is the torus class of topologies, discussed below.
Networks, Direct. Fig. This is an example of a ring with eight nodes. Note that the nodes on each side of the ring are interleaved, so that the length of the cable between each pair of nodes becomes uniform

Meshes
The n-dimensional mesh topology can be explained by imagining an n-dimensional grid, with a node at each cross-point. Each node has one bidirectional link connecting it to each of its closest neighbors in each dimension.
This means that a node has at most 2n neighbors, and thus the switching element of a node must have at least 2n ports to support an n-dimensional mesh. Note also that nodes placed at the extreme of a dimension have fewer than 2n neighbors; nodes at the extreme of all dimensions have exactly n neighbors. A more formal definition is as follows: an n-dimensional mesh has k0 × k1 × ⋯ × kn−1 nodes, with ki nodes along each dimension i. Let ki be the radix of dimension i. Each node X in the topology is uniquely identified by a tuple of n coordinates identifying its position in the grid. If a node X is identified by {x0, x1, …, xn−1}, then for all i, 0 ≤ xi < ki. Two nodes are neighbors, and thus directly connected to each other through a link, if and only if their identifying tuples of coordinates are identical in all places except for exactly one coordinate i, where they differ by exactly one (Fig. ). The bisection bandwidth of a mesh equals the link speed multiplied by the product of the number of nodes in each dimension, divided by the number of nodes in the dimension with the highest number of nodes. This means that when the radix of the dimensions is balanced, with M the number of nodes and n the number of dimensions, the bisection bandwidth of a mesh grows as M divided by the n-th root of M. The diameter of the mesh is approximately the number of dimensions multiplied by the n-th root of M. If specialized forwarding logic for meshes is implemented in a node, the addressing scheme is generally based on the n-tuple that identifies the position of each node in the mesh.
Typically, each forwarding step then consists of moving the packet one step closer to its destination along one dimension. This is done by first identifying one digit in the n-tuple where there is a mismatch between the destination address of the packet and the address of the node that currently handles it. In the dimension identified by that digit, the packet is forwarded in the direction that decreases the mismatch. When all mismatches have been nullified, the packet has reached its destination.
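A minimal Python sketch of such a dimension-order forwarding decision (illustrative only; coordinates are tuples, and dimensions are corrected in a fixed order):

def next_hop(current, dest):
    """One dimension-order forwarding step in an n-dimensional mesh.

    current, dest: tuples of n coordinates. Returns the neighbor to forward to,
    or None if the packet has arrived.
    """
    for dim, (c, d) in enumerate(zip(current, dest)):
        if c != d:
            step = 1 if d > c else -1
            return current[:dim] + (c + step,) + current[dim + 1:]
    return None  # all coordinates match: deliver locally

# Route a packet from (0, 3) to (2, 1) in a 2-D mesh:
pos = (0, 3)
while pos is not None:
    print(pos)                 # (0,3) (1,3) (2,3) (2,2) (2,1)
    pos = next_hop(pos, (2, 1))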
Networks, Direct. Fig. A two-dimensional and a three-dimensional mesh

Torus
Torus topologies are extensions of meshes with additional links. In the same way as for meshes, the placement of the nodes can be described by referring to cross-points in a grid, and, as for meshes, the nodes can be identified by tuples defining their placement in each of the n dimensions of the grid. The only topological difference is that a torus has additional links compared to a mesh. These links connect the first and the last node in a dimension, so that each sequence of nodes along each dimension forms a ring. (Some authors define a torus such that the extra links are not present when the size of a dimension is two. Following that definition, the definitions of torus and mesh coincide for this case. The discussion of bisection bandwidth for tori below would not be valid for a dimension size of two under that definition.) (Fig. ). The implication of this is that in a torus each node has exactly the same number of neighbors, equaling twice the number of dimensions of the torus. It also means that the bisection bandwidth of the torus is two times that of the mesh. The diameter of the torus approaches half that of the mesh when the network is large. (There is a subtlety in that the diameter of each ring in the torus is calculated differently depending on whether the number of nodes in the ring is odd or even. The effect of this subtlety approaches zero when the rings grow large.) Specialized forwarding logic is possible for tori in the same way as for meshes. The notion of distance will, however, need to be more involved, to capture the possibility of following the shortest path along wraparound links.

Networks, Direct. Fig. Example of a -dimensional torus
Hypercube
The hypercube can be defined in several ways. One way is to define it as a mesh where the radix of each dimension is exactly two. The following recursive definition is more formal:
● A single node is a hypercube.
● Two identical hypercubes, where each isomorphic pair of nodes from the two hypercubes has a link between them, form a hypercube.
The construction of hypercubes from this definition is illustrated in Fig. . Since hypercubes are a special case of meshes, the discussion of the port requirements of the nodes, of the forwarding logic in the nodes, of the diameter, and of the bisection bandwidth follows from the discussion above. Still, it is worth reflecting on the implications of limiting the radix to two. First, it means that the only way to extend the network is to add another dimension. This means that for each node that has been designed, the number of ports implemented in the node limits the size of the network that can be built. On the other hand, the bisection bandwidth grows linearly with the number of nodes, meaning that the bisection bandwidth per node remains constant. Furthermore, the diameter of the network equals the number of dimensions in the system, giving a diameter that is logarithmic in the number of nodes in the system.
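The recursive definition is equivalent to saying that two nodes are neighbors exactly when their binary addresses differ in a single bit, as the following short Python sketch (illustrative only) makes explicit:

def hypercube_neighbors(node, dims):
    """Neighbors of `node` in a dims-dimensional hypercube: flip one address bit."""
    return [node ^ (1 << d) for d in range(dims)]

def hypercube_edges(dims):
    """All links of a dims-dimensional hypercube on 2**dims nodes."""
    return [(u, v) for u in range(2 ** dims)
            for v in hypercube_neighbors(u, dims) if u < v]

print(hypercube_neighbors(0b101, 3))   # [4, 7, 1]: addresses differing in one bit
print(len(hypercube_edges(4)))         # 32 links in a four-dimensional hypercube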
Other Topologies
In addition to rings, meshes, tori, and hypercubes, a plethora of topologies have been defined in the literature. Most of these additional topologies were designed with the goal of minimizing the network diameter for a given number of nodes and/or a given node degree. With the introduction of pipelined switching techniques, like wormhole routing and virtual cut-through, end-to-end packet latency is close to independent of the network diameter. None of these topologies is therefore likely to be implemented. For a short introduction to some of these topologies, we refer to the section on popular network topologies in [].
Routing Functions
The routing functions in use for these topologies are quite similar to each other. At the base there is shortest-path routing that follows one dimension at a time: the dimensions are ordered in a sequence and traversed in this order. For meshes and hypercubes, this guarantees deadlock-free routing. For rings and tori, however, the cyclic structures in the topologies create additional challenges, as shortest-path routing will create cycles of dependencies between the buffers and the channels, and thus there is a danger of deadlock. In these cases, the solution is to introduce virtual channels, such that the cycle of dependencies between the different resources in the network is broken. More information on this is given in the entry on routing in this encyclopedia. As the size of networks grows, it becomes important for the system to remain in operation even if some of the nodes fail. Several routing strategies that cope with faulty nodes in direct networks have been proposed, but very few of them have so far been implemented. More information on this can be found in the entry on fault-tolerant networks.
Networks, Direct. Fig. This is an illustration of the development of hypercubes from the recursive definition. Top left is a one-dimensional hypercube consisting of two nodes connected together. Top right is a two-dimensional hypercube, bottom left a three-dimensional hypercube, and bottom right a four-dimensional hypercube
Direct Networks Today
For smaller networks, the choice of topology is relatively easy; most solutions will work. Presently, one-dimensional meshes (or daisy chains) are typically used for HyperTransport, and simple rings are used for SCI and for interconnecting the processing elements in the Cell processor and the Intel Larrabee. The discussion does not really get involved until the system size increases to interconnect more than – nodes. The properties of hypercubes with respect to diameter and bisection bandwidth are very appealing. Still, real machines have to be built in the three available physical dimensions. This has limited the use of hypercubes to medium-sized systems. For the recent systems that have been built with thousands of nodes, the preferred direct network topology has therefore been a three-dimensional torus. When looking at the Top500 list of supercomputers in the world, one finds both direct topologies and indirect topologies – typically multistage networks. It is a well-established fact that for arbitrary traffic, multistage networks give superior price/performance compared to direct topologies. Still, tori and meshes appear in the list. The reason for this is that many of the computational problems that run on such computers analyze physical systems in a three-dimensional world. A torus topology allows the physical world to be divided into three-dimensional cubes of equal size, and lets these cubes be mapped isomorphically onto the compute nodes in the system.
Since the physical interactions between these cubes occur between neighbors, the communication in the computer system will mainly go between neighboring nodes, and thus the locality of the torus/mesh topology fits ideally with the communication needs. Therefore, the current trend is that for multi-purpose installations, multistage topologies are used. For installations that are specialized to run a limited set of applications modeling the physical world, tori or meshes are still the topology of choice.
Future Directions
There have been few new developments in network topologies since the turn of the millennium. Still, recent changes in technology have revitalized the field somewhat. The development in processor technology has made it possible to put many compute cores onto one chip. This has spawned the need for an on-chip interconnection network. The square layout of a chip in two dimensions (or three, when one considers the different metal layers) has made meshes and tori look like obvious candidates for such networks. The quest for simple integration of different components onto the same chip seems to indicate that the integration of switching capacity and functionality onto the same logical unit will increase. From this observation it is a reasonable guess that direct topologies will be preferred. There are, however, several reasons why our knowledge of direct network topologies will have to be extended to support on-chip networks. The first reason is that the different functional units that are to be integrated onto the same chip will in many
cases not be uniform. They will have different characteristics and different footprint needs on the chip. For that reason, it is not at all obvious that the on-chip network will be regularly structured in the same way as tori and meshes are. The second reason is that the traffic requirements will not be uniform either. On-chip networks will frequently be used in single-application scenarios. The quest to minimize power and area utilization will make designers deviate from regular structures whenever the communication needs allow it. Finally, there are currently initiatives working in the direction of streamlining on-chip, on-board, and between-board communication. These initiatives may result in common addressing solutions. Still, the different requirements of these domains will most likely mean that there will be different network topologies in each of them. Efficient co-working of several networks with different topologies is therefore likely to be a topic addressed in the future.
Related Entries
Buses and Crossbars
Cray XT and Seastar 3-D Torus Interconnect
Ethernet
Hypercubes and Meshes
InfiniBand
Myrinet
Networks, Fault-Tolerant
Network Interfaces
Networks, Multistage
Routing (Including Deadlock Avoidance)
SCI (Scalable Coherent Interface)
Switch Architecture
Switching Techniques
Topology Aware Task Mapping
Quadrics
Bibliographic Notes and Further Reading
There are several sources on the state of the art in direct networks. The two most important books on interconnection networks both contain good overviews [, ]. Furthermore, there is a treatment of topologies for interconnection networks in an appendix of the book by Hennessy and Patterson [].
Direct networks were used in some of the earliest parallel computers [, ]. Later, in particular inspired by the Caltech Cosmic Cube [], there was a series of approaches using hypercube-like structures [, ]. William Dally did much seminal work on direct networks in the late s. The publication most relevant to the discussion of topology here was [], where he made the case that low-dimensional, high-radix networks have better scalability properties than high-dimensional hypercubes. Other important contributions were made together with Charles Seitz on mechanisms for deadlock-free routing on tori [, ]. From this point, two- and three-dimensional meshes and tori have been the direct networks of choice [, ]. A recent example that demonstrates the lingering relevance of these topologies is the Red Sky machine at Sandia National Labs. This machine was delivered in November and entered the Top500 list at the time. Its interconnection network is a three-dimensional torus.
Bibliography
Duato J, Yalamanchili S, Ni L () Interconnection networks – an engineering approach. Morgan Kaufmann, San Francisco
Dally WJ, Towles B () Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco
Duato J, Pinkston T () Appendix E: Interconnection networks. In: Hennessy JL, Patterson DA (eds) Computer architecture – a quantitative approach, 4th edn. Morgan Kaufmann, San Francisco
Dally WJ () Performance analysis of k-ary n-cube interconnection networks. IEEE Trans Comput ():–
Dally WJ, Seitz CL () Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans Comput C-():–
Dally WJ, Seitz CL () The torus routing chip. J Distributed Syst ():–
Barnes GH () The ILLIAC IV computer. IEEE Trans Comput ():–
Slotnick DL, Borck WC, McReynolds RC () The Solomon computer. In: Proceedings of the AFIPS Spring Joint Computer conference, Spartan Books, New York
Seitz CL () The cosmic cube. Commun ACM ():–
Close P () The iPSC/ node architecture. In: Proceedings of the conference on hypercube concurrent computers and applications, Pasadena
Palmer JF () The NCUBE family of parallel supercomputers. In: Proceedings of the international conference on computer design, San Diego
Kessler RE, Schwarzmeier JL () Cray T3D: a new dimension for Cray Research. In: Proceedings of the IEEE Computer Society International Conference (COMPCON), San Francisco
Scott SL, Thorson GM () The Cray T3E network: adaptive routing in a high performance 3D torus. In: Proceedings of the symposium on hot interconnects, Stanford
Networks, Fault-Tolerant
María Engracia Gómez Requena
Universidad Politécnica de Valencia, Valencia, Spain
Synonyms Reliable networks
Definition A fault-tolerant network is a network that can provide full connectivity among all the processing nodes connected to the network without losing any messages despite the presence of faults in the network elements.
Discussion Introduction Large parallel computers with thousands of nodes have been and are being built as can be seen in the top list []. In such systems, the high number of components significantly increases the probability of failure. Each individual component can fail with a certain probability and, thus, the probability of failure of the entire system increases dramatically as the number of components increases. In addition, the failure probability of each individual component increases as transistor integration density increases. Hence, in these systems, it is critical to keep the system running even in the presence of faults. Moreover, faults in the interconnection network may isolate a large fraction of the machine, containing healthy processing nodes. Therefore, faulttolerant mechanisms for the interconnection network are critical design issues for large parallel computers. Fault-tolerance in an interconnection network refers to the availability of the network to function in the presence of component faults. However, techniques used to implement faulttolerance in the interconnection network are often at
the expense of considerable performance degradation, which is not desirable. A good fault-tolerance mechanism should degrade network performance gradually with the number of faults. Also each fault-tolerance mechanism can tolerate a certain number of faults. Beyond this number, an additional fault may disconnect the network. The solution adopted to tolerate faults in a given system depends on the types of faults that can occur and the assumed fault-model. The pattern of faults and expectations about the behavior of processors and switches in the presence of these faults determines the approaches to provide fault-tolerance. This information is gathered in the fault model.
Fault Models In the design of fault-tolerant techniques for the interconnection network, a fault model is assumed. In what follows, the most commonly-used fault models are presented []. First, there are two main kinds of faults in the interconnection network: transient and permanent. Permanent faults remain in the system until it is repaired. On the other hand, transient faults are dynamic in nature as integrated circuit sizes continue to decrease and speeds continue to increase. As it can be read below, the approaches followed to deal with both fault types are completely different. Another consideration is the level at which components are detected as faulty. Usually, detection mechanisms differentiate between switch faults and link faults since most types of faults will simply manifest as link or switch faults. In the case of a switch fault, all links incident on the faulty switch are also considered as faulty. The fault model also specifies the area of the fault information available at each node to tolerate the faults. On one hand, there are techniques in which each node only has fault information about adjacent nodes. On the other hand, in other techniques, every node knows the fault status of every other node. The advantage of the latter techniques is that messages can be forwarded along the shortest feasible path in the presence of faults. However, in practice it is difficult to have global fault information in a timely manner. These techniques require storage and computation time which have a significant impact on performance, and the occurrence
of faults during update periods requires complex synchronization protocols. In the case of local information, nodes only have fault information about adjacent nodes, so routing decisions are simpler and can be computed quickly, and updating the fault information about adjacent nodes can be performed in a very simple way. However, in this case, routing decisions are not optimal and messages may be routed toward network zones with faults, leading to longer paths. In practice, fault-tolerant techniques lie between purely local and purely global fault-status information. Fault diagnosis deals with developing algorithms for obtaining and maintaining timely fault information about neighbor nodes. On the other hand, depending on the actions performed after fault detection, two different types of fault models can be defined: static and dynamic. In a static fault model, once a fault is detected, all the processes in the system are halted, the network is emptied, and a management application is run in order to deal with the faulty component. This application detects where the fault is and computes the information required by the nodes in order to tolerate the fault. This information is distributed by the management application; then the system is rebooted and the processes are resumed. This fault model needs to be combined with checkpointing techniques in order to be effective. Applying checkpointing minimizes the fault's impact on applications, because they are restarted from the latest checkpoint. In a dynamic fault model, once a new fault is found, actions are taken to appropriately handle the faulty component while the system keeps running. For instance, a source node that detects a faulty component along a path can switch to a different path that does not use the failed component.
Transient and Permanent Faults
As commented above, faults can be classified as transient or permanent [, ]. Transient faults are usually due to electromagnetic interference and alpha-particle impacts, and they are usually corrected by retransmitting the packet either at the link level or at the end-to-end level. A link-level treatment is performed by encoding redundant information in the packet using an error control code (ECC). Simple parity is enough to detect a single-bit error; however, most links use a cyclic redundancy check (CRC) of sufficient length in order to reduce the probability that an error affecting several bits goes undetected. The error check can be performed at different levels of granularity: flit or packet. When the sender transmits a packet or flit, it retains a copy until correct reception is notified by the receiver. The redundant information included in the packet or flit is checked when it is received. If the packet or flit passes the test, the receiver sends an acknowledgment indicating that it has been received correctly. Each packet or flit is tagged to facilitate its identification, and this tag is used in the acknowledgment to identify the packet or flit. The acknowledgment also indicates the reception status, that is, received correctly or received in error. Upon receiving this acknowledgment, the transmitter discards its copy of the packet. If a packet is received in error, the sender retransmits it. In most cases, the error is transient and the packet is received correctly on the second attempt. Transient faults can also be treated by end-to-end mechanisms that operate in a similar way to link-level retransmission, except that packets are retransmitted over the whole network path up to the final destination []. In this case, as in the link-level mechanism, the sender must store a copy of each packet until its correct reception is confirmed. The destination sends an acknowledgment, containing the tag assigned to the packet, when the packet arrives correctly. Because the paths are longer, a timer is usually used at the sender node for each packet sent. If the timer of a given packet expires before an acknowledgment of that packet arrives, or if a negative acknowledgment is received, then the source resends the packet. However, two copies of the same packet can be received at the destination if the acknowledgment is delayed and arrives at the source after the timer has expired. This situation is easily detected, since both packets carry the same tag, and the second packet is discarded. Permanent faults, on the contrary, are due to components that stop working within specifications and cannot be recovered from by retransmitting packets with the help of some higher-layer software protocol. In order to tolerate this type of fault, an alternative physical path must exist in the network topology. Three main groups of techniques are used to deal with permanent faults in the interconnection network: resource replication, network reconfiguration, and fault-tolerant routing.
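The acknowledgment/retransmission handshake described above can be sketched in a few lines of Python (a simplified stop-and-wait style model, not the protocol of any particular interconnect; the CRC and tagging are reduced to placeholders):

import zlib

def make_frame(tag, payload):
    """Attach a tag and a CRC-32 checksum to the payload (bytes)."""
    return {"tag": tag, "payload": payload, "crc": zlib.crc32(payload)}

def receive(frame):
    """Receiver side: check the CRC and return (tag, status)."""
    ok = zlib.crc32(frame["payload"]) == frame["crc"]
    return frame["tag"], ("ok" if ok else "error")

def send_with_retry(tag, payload, link, max_tries=3):
    """Sender side: keep a copy and retransmit until a positive ack arrives."""
    for attempt in range(max_tries):
        frame = make_frame(tag, payload)
        delivered = link(frame)             # the link may corrupt the frame
        ack_tag, status = receive(delivered)
        if ack_tag == tag and status == "ok":
            return attempt + 1              # the retained copy can now be discarded
    raise RuntimeError("permanent fault suspected: retries exhausted")

# A link that corrupts the first transmission and then behaves correctly:
state = {"first": True}
def flaky_link(frame):
    if state.pop("first", False):
        frame = dict(frame, payload=b"garbled")
    return frame

print(send_with_retry(tag=7, payload=b"flit-data", link=flaky_link))  # 2 attempts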
Replication
Replication provides tolerance to permanent faults by replicating network components. The spare components are switched on in case of a fault, and the faulty components are switched off (or bypassed). As an example, the ServerNet interconnection network is designed with two identical switch fabrics. Only one of the switch fabrics is available at any given time. In case of a fault in one of them, the other fabric is used. This technique is used by a large number of fault-tolerant proposals for MINs (multistage interconnection networks). They add either links between switches in the same stage, more links between stages, more switches per stage, or extra stages to increase the number of alternative paths. In [], a new topology is proposed that consists of two parallel fat-trees with crossover links between the switches in the same position in both networks. In this topology, when a packet encounters a fault in its path, it is forwarded through the crossover link to the other parallel fat-tree. The main drawbacks of this approach are the high extra cost of the spare components and the non-negligible probability of failure of the circuits required to switch to the spare components. This technique can also be implemented without using spare resources, leading to a degraded mode of system operation after a fault. The IBM Blue Gene/L supercomputer, for instance, offers the ability to bypass faulty network resources while keeping its topology and routing algorithm. If a fault is detected, all the nodes included in the midplane node boards that contain the faulty node/link are bypassed to isolate the fault. In this way, the network topology remains the same and the routing is not changed. The main drawback of this technique is the relatively large number of healthy resources (i.e., midplane node boards) that may be switched off after a fault.
Network Reconfiguration
Network reconfiguration relies on the use of routing tables as the routing mechanism. It is a more general technique for dealing with changes in the network topology, whether due to failures or to some other cause. When a fault appears, the routing tables are reconfigured to adapt to the new topology resulting from the change. In order to perform the network reconfiguration, the fault-free portions of the topology must first be discovered, then the new routing tables must be computed, and finally they must be distributed to the corresponding network components (i.e., switches and/or end-node devices). Network reconfiguration requires the use of programmable switches and/or network interfaces, depending on how routing is performed. It may also make use of generic routing algorithms (e.g., up*/down* routing) that can be configured for all the possible network topologies that may result after faults. This technique relieves the designer from having to supply alternative paths for each possible fault combination at design time, and it is extremely flexible because it allows faults to be tolerated as long as the network remains physically connected. The drawbacks of this technique are that the generic routing algorithms, while valid for any topology, achieve poor performance when used in regular networks, and that programmable network components have higher cost and latency. Most standard and proprietary interconnection networks for clusters and SANs – including Myrinet, Quadrics, InfiniBand, Advanced Switching, and Fibre Channel – incorporate software for (re)configuring the network routing in accordance with the new topology. Another issue related to network reconfiguration is whether the interconnection network supports hot swapping, that is, whether the network can continue operation while a component is added to or removed from it. Most networks allow hot swapping, which is usually supported by performing a dynamic network reconfiguration without stopping network traffic. The main problem with network reconfiguration is guaranteeing deadlock-free routing while routing tables are updated in switches and/or end-node devices, as more than one routing algorithm may be in use in the network at the same time. Most WANs solve the possible deadlocks by dropping packets, but dynamic network reconfiguration is much more complex in lossless networks.
Fault-Tolerant Routing
Fault-tolerant routing takes advantage of the multiple paths usually available in the network topology between each source-destination pair in order to route packets through the fault-free paths. Adaptive routing is usually
used in this case. The faulty components are made unavailable to the adaptive routing algorithm, and all packets are routed using the remaining components. Nevertheless, it is also possible to achieve network fault-tolerance with deterministic routing, by routing a packet in two or more phases and storing it at some intermediate nodes. The main difficulty for these routing algorithms is guaranteeing that routing remains deadlock-free and livelock-free after a fault appears, given that arbitrary fault patterns may occur. The solution adopted to provide fault-tolerance depends on the type and pattern of the component faults and on the network topology. In what follows, different routing algorithms for regular networks are presented, grouped by the type of the regular topology, that is, direct or indirect.
Fault-Tolerant Routing in Direct Networks
A large number of fault-tolerant routing algorithms for multiprocessor systems have been proposed, especially for mesh and torus network topologies. In multidimensional direct topologies, if the number of faulty components is less than the degree of a node, then the topology still provides at least one physical path between each source-destination pair, and therefore the routing algorithm may connect any two nodes of the system. In direct networks, guaranteeing that the routing algorithm remains deadlock-free after a fault appears, while using the alternative paths that avoid the fault, is especially difficult, since the faults can compromise the network's regularity. Some approaches used in direct networks use global status information about the network, whereas others use only local status information. In the case of local information, faulty components can be avoided by using a dimension orthogonal to the dimensions leading to the faulty components. Chien and Shin [] propose such an algorithm for hypercubes. When all of the shortest paths to the destination from an intermediate node are blocked by faults, the message is transmitted using an extra dimension that is not in the minimal path. The algorithm is (n − 1)-fault-tolerant. Another alternative approach is the use of randomization to produce probabilistically livelock-free,
non-minimal, fault-tolerant routing algorithms. The idea is to avoid repetitive sequences of link traversals in order to avoid livelock. The Chaos router incorporates randomization. In this case, messages are normally routed along any profitable output port. When a message has been stored in an input buffer for too long, blocked by busy or faulty output buffers, it is removed from the input buffer and stored in a local buffer that has a higher priority and allows routing through non-minimal paths. The previously presented approaches assumed the availability of information only about neighboring links and switches. One approach to maintaining global fault information is to encode the node state. Lee and Hayes [] introduce the concept of an unsafe node. A non-faulty node is unsafe if it is adjacent to two or more faulty or unsafe nodes. Each node must know the state of its neighbors, so each node must transmit its state information to the neighboring nodes. The unsafe state indicates to the routing algorithm that going through another path, that is, through a safe neighbor node, is a better routing option. When maximum flexibility must be ensured, routing algorithms based on graph-search techniques are used. A message can be routed to traverse a set of links corresponding to a systematic search of all possible paths between a pair of nodes. The routing overhead is significant, but the probability of delivering messages is maximized. A main issue in this case is that of passing through a node more than once, which results in unnecessary overhead. Most approaches maintain information in the header to avoid passing through the same node several times. The aforementioned fault-tolerant routing algorithms are defined for SAF (store-and-forward) or VCT (virtual cut-through) switching. When wormhole switching is used, new challenges appear in the design of fault-tolerant routing algorithms. The use of small buffers means that blocked messages span multiple nodes, and hence a fault affects several nodes. In this case, the most common approach is to define fault regions and route packets around those regions. Adjacent faulty links and faulty switches are joined into fault regions. The idea is not to introduce new dependencies in the routing algorithm, in order to guarantee deadlock freedom. The difficulty of routing messages around faulty regions depends on the shape of the region. Fault-tolerant routing algorithms usually impose constraints
on the regions. In particular, they usually assume that the fault regions are convex, since concave regions present more difficulties. Moreover, in some algorithms convex regions are further constrained to be block fault regions, that is, regions whose shape is rectangular. Given a set of faults in a direct network, rectangular fault regions can be constructed by marking some healthy nodes as faulty to fill out the entire region. In this case, the fault-tolerant routing algorithm usually needs additional resources, such as virtual channels, to avoid deadlocks. Assuming block fault regions, when a message encounters a block fault region, the message follows an alternative non-minimal path. The alternative path should not introduce new dependencies in the routing algorithm. A solution to avoid deadlocks in meshes, introduced by Chien and Kim [], is to use planar adaptive routing, which adds an additional virtual channel. The technique can also be applied to multidimensional meshes and tori. In the case of torus topologies, this solution requires six virtual channels for each physical link. Moreover, it presents problems when fault regions are adjacent or include nodes on the mesh boundary. Both situations are equivalent to the occurrence of a concave fault region. Ensuring that the fault region remains convex requires marking more healthy nodes as faulty, which is not desirable. One solution to reduce the number of healthy nodes that are included in the fault region is based on the idea of fault rings. A fault ring is the sequence of links and nodes that are adjacent to and surround a fault region. If the fault region is rectangular, then the fault ring is also rectangular. If a fault region includes boundary nodes, then a fault ring cannot be defined and a fault chain is used instead. These fault rings and fault chains are used to misroute messages that are blocked by faults in their paths. Meshes with non-overlapping fault rings and with no fault chains require two virtual channels to avoid deadlocks. If there are overlapping fault regions or fault chains, then four virtual channels are required. Some fault-tolerant routing algorithms are able to work with non-convex fault regions. In particular, some of them assume that faults are covered by solid fault regions. A solid fault region in an n-dimensional mesh is a region where any -D cross section of the fault region produces a rectangular fault region. In this case, the concept of fault ring is used to route messages around
the fault region. Four virtual channels are required in this case to ensure deadlock-freedom.
In [], the authors propose a new fault-tolerant routing methodology based on the use of intermediate nodes. In order to avoid faults, for some source-destination pairs, packets are first sent to an intermediate node and then from this node to the destination node. Deadlocks are avoided by moving packets to a different virtual channel when crossing the intermediate node. Fully adaptive routing is used along both subpaths. In order to increase the level of fault tolerance, intermediate nodes are combined with disabling adaptive routing and/or using misrouting on a per-packet basis. The methodology uses a static fault model but tolerates a reasonably large number of faults without disabling any healthy node, and it requires only one additional virtual channel.
The preceding fault-tolerant routing algorithms rely on the use of virtual channels; however, virtual channels affect the speed and complexity of the switches. Software-based fault-tolerant routing is an approach that does not rely on virtual channels. When a packet encounters a fault, it is ejected from the network and is later reinjected and routed through an alternative path. This mechanism is very flexible and supports many fault patterns, without requiring healthy nodes to be marked as faulty and without requiring additional virtual channels. However, some packets may suffer high latencies due to the packet ejection and reinjection, and memory bandwidth is consumed at the node where the ejection and reinjection are performed.
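As an illustration of the block-fault-region idea discussed above, the following sketch grows a set of faulty nodes in a 2-D mesh into a rectangular region by marking the healthy nodes inside the bounding box as faulty as well. The coordinates, names, and the 2-D restriction are assumptions of the sketch, not part of any specific algorithm; real schemes also merge adjacent or overlapping regions iteratively.

import java.util.*;

/** Illustrative sketch: turn an arbitrary set of faulty nodes in a 2-D mesh
 *  into a block (rectangular) fault region by including every node of the
 *  bounding box, i.e., sacrificing the healthy nodes inside it. */
public class BlockFaultRegion {
    /** A mesh node identified by its (x, y) coordinates. */
    record Node(int x, int y) {}

    /** Returns the block fault region covering the given faults. */
    static Set<Node> blockRegion(Set<Node> faults) {
        int minX = Integer.MAX_VALUE, maxX = Integer.MIN_VALUE;
        int minY = Integer.MAX_VALUE, maxY = Integer.MIN_VALUE;
        for (Node f : faults) {
            minX = Math.min(minX, f.x()); maxX = Math.max(maxX, f.x());
            minY = Math.min(minY, f.y()); maxY = Math.max(maxY, f.y());
        }
        Set<Node> region = new HashSet<>();
        for (int x = minX; x <= maxX; x++)
            for (int y = minY; y <= maxY; y++)
                region.add(new Node(x, y));   // healthy nodes inside are marked faulty too
        return region;
    }

    public static void main(String[] args) {
        Set<Node> faults = Set.of(new Node(2, 3), new Node(4, 5));
        System.out.println(blockRegion(faults).size()); // 3 x 3 = 9 nodes in the block region
    }
}

The example shows the price of the approach: two actual faults force seven healthy nodes to be treated as faulty so that the region keeps a rectangular shape.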
Fault-Tolerant Routing in Indirect Networks
Fault-tolerant proposals can be classified depending on whether they are applicable to unidirectional or bidirectional MINs. In UMINs with the minimal number of stages, there is a unique path for each source-destination pair. In BMINs, the topology provides several paths for each source-destination pair. In particular, in BMIN topologies, if the number of faulty components is smaller than the arity of the switches, then the topology provides at least one path between each source-destination pair, and therefore the routing algorithm can connect any pair of nodes. This is true except for switch and link failures in the first and last stages, where the nodes directly
attached to the faulty components get disconnected from the rest of the network. Defining deadlock-free fault-tolerant routing algorithms is therefore easier in BMIN topologies.
Some fault-tolerant routing algorithms for UMINs are based on misrouting packets by routing them in multiple passes [, ]. The packets that would cross the faulty elements are routed through other nodes of the network, and in this way the packets can avoid the faults. The drawback of such a technique is that these algorithms rely on a static fault model, inject the packet into the network several times, provide non-minimal paths, and increase the average packet latency. This technique can also be applied in BMINs. In MINs, there are also proposals that disable healthy nodes or network elements with poor connectivity, with the aim of making routing easier after the occurrence of the faults. This approach is followed in [], where the authors propose to create a new, but equivalent, topology from the faulty one, thus preserving the routing algorithm at the cost of disabling some healthy nodes and network elements.
The previous proposals are based on a static fault model, but there are other proposals that use a dynamic fault model. In particular, some proposals exploit the high number of alternative paths of BMINs to provide fault tolerance, such as the proposals made in [, ]. In these papers, when a fault is detected, the network reconfigures the routing information at the switches or nodes, and paths that would use any of the faulty network elements are disabled. This provides good fault tolerance at almost no overhead in cost. Several works combine this approach with replication of links to increase the number of routing options, at the expense of increasing the cost of the network. A more recent technique is based on misrouting packets when they encounter a fault, without ejecting the packet from the network [] and without using additional hardware. The authors provide a thorough analysis to ensure that the misrouted packets cannot form a deadlock.
Related Entries
Flow Control
Interconnection Networks
Networks, Direct
Networks, Multistage
Routing (Including Deadlock Avoidance)
Switch Architecture
Switching Techniques
Bibliographic Notes and Further Readings
The main bibliographic source for the elaboration of this entry has been the book []. Most of the discussion about fault-tolerant routing in direct networks is inspired by this book. Two other main sources have been the appendix [] and the book []; both of them have been especially useful for writing the beginning of the entry. The references [–] are basic papers on fault-tolerant routing in direct regular networks, and the references [–] are basic papers on fault tolerance in indirect regular networks.
Bibliography . http://www.top.org/ . Duato J, Yalamanchili S, Ni L () Interconnection networks: An engineering approach. Morgan kaufmann Publishers, San Francisco, California . Dally W, Towles B () Principles and practices of interconnection networks. Morgan kaufmann Publishers, San Francisco, California . Pinkston T, Duato J Appendix E: Interconnection networks. Computer Architecture: A Quantitative Approach . Chien M, Shin K () Adaptive fault-tolerant routing in hypercube multicomputers. IEEE Trans Parallel Distrib Syst ():–, December . Lee T, Hayes J () A fault-tolerant communication scheme for hypercube computers. IEEE Trans Comput C-():–, October . Chien A, Kim J () Planar-adaptive routing: Low-cost adaptive networks for multiprocessors. In: Proceedings of the th international symposium on computer architecture. pp –, May . Karlin A, Nelson G, Tamaki H () On the fault tolerance of the butterfly. In: Proceedings of the th ACM symposium on theory of computing. pp – . Shen J, Hayes J () Fault-tolerance of dynamic-full-access interconnection networks. IEEE Trans Comput ():– . Leighton F, Maggs B, Sitamaran R () On the fault tolerance of some popular bounded-degree networks. SIAM J Comput – . DeHon A, Knight T, Minsky H () Fault-tolerant design for multistage routing networks. In: Proceedings of the international symposium on shared memory multiprocessing. April . Gómez C, Gómez M, López P, Duato J () FTEI: a dynamic fault-tolerant routing methodology for fat trees with
exclusion intervals. IEEE Trans Parallel Distrib Syst (): – . Sem-Jacobsen F, Skeie T, Lysne O, Duato J () Dynamic fault tolerance with misrouting in fat trees. In: Proceedings of the international conference on parallel processing. pp – . Sem-Jacobsen F, Skeie T, Lysne O, Torudbakken O, Rongved E, Johnsen B () Siamese-twin: a dynamically fault-tolerant fattree. In: Proceedings of the international parallel and distributed processing symposium. Denver, Colorado . Gómez ME, Nordbotten N, Flich J, Lopez P, Robles A, Duato J, Skeie T, Lysne O () A routing methodology for achieving fault tolerance in direct networks. IEEE Trans Comput (): –
Networks, Multistage Olav Lysne, Frank Olaf Sem-Jacobsen The University of Oslo, Oslo, Norway
Synonyms Butterfly; Crossbar; Distributed switched networks; Fat tree; Interconnection networks; k-ary n-fly; k-ary n-tree; MIN; Multistage interconnection networks; Router-based networks
Definition A multistage network is a network for interconnecting a set of nodes through a switching fabric. These nodes can either be programmable computers or memory blocks. The switching fabric consists of a set of switches interconnected to form a topology with defined connection points for the nodes. The switches are organized in stages. A switch may have zero, one, or more connection points, which differentiates multistage networks from direct networks, where every switch must have at least one connection point, and from indirect networks in general, which have a less strict switch organization.
Discussion Introduction Multistage networks are an approximation to the ideal network in terms of bisectional bandwidth, communication latency, and the number of concurrently communicating pairs in the network. In order to best understand this it is helpful to view the multistage networks in terms of a telephone exchange system, which is where the concept first emerged.
An old-fashioned telephone switchboard had a number of input channels, output channels, and internal switching points (the leads used to connect to telephone lines). As long as the switchboard had enough internal switching points, it was possible to connect any two free phone lines together and allow them to communicate. In that case the switchboard was nonblocking. However, without enough leads to connect to telephone lines, two parties wishing to initiate a phone call might have had to wait until one of these leads was available before the call could be initiated. In that case the switchboard was blocking. As the telephone network grew, a single switchboard was not sufficient to connect all the endpoints without becoming prohibitively large. The solution was to create a network of switchboards, so that a telephone call might have to traverse multiple switchboards. As the size of the switchboards and the number of switchboards increased, the challenge became to maintain the nonblocking capabilities of the network to as large an extent as possible. This led to the advent of multistage networks (Clos networks), which will be discussed below. The concepts from the telephone network can be more or less directly transferred to computer networks, with the important addition of packet switching alongside the line/circuit switching used by the telephone network. The rest of this chapter focuses on packet-switched multistage networks, as they are the ones most commonly found in current computer systems. Multistage networks are commonly found in tightly coupled networks, such as the networks used in computer clusters, within switches, or on single chips. Here the nonblocking concept is often modified to statistically nonblocking, as discussed further down. It is in these domains that multistage networks have been most extensively studied.
Topologies As for direct networks, the topology of a multistage network can be modelled as a graph G(V, C) where V is the set of switches and C is the set of links (unidirectional or bidirectional) interconnecting the switches. The processing nodes are usually not included in the model. In addition, a regular multistage network can be modelled as an N × M network with N inputs and M outputs. It is not uncommon that M equals N.
Depending on the sizes of N and M, there are practical limitations on how these can be interconnected in a topology. The ideal topology is the crossbar (discussed below), and for small N and M this is indeed a practical implementation. However, as N and M increase, implementing a crossbar becomes impractical. A common solution to this problem is to implement the network in multiple stages as a multistage network.
Crossbar
The crossbar is perhaps the most ideal network topology. Given N nodes to be interconnected, an N × N crossbar allows N one-to-one interconnections without contention, i.e., it is nonblocking. A node can send data to any other node that is not already receiving data without having to wait for capacity in the crossbar. This lack of contention allows the connection points in the crossbar to be implemented as very simple arbiters. The crossbar has very high bisectional bandwidth that scales linearly with the size of the crossbar, and because of its simplicity the time used to cross it is very low. A simple crossbar is visible in Fig. . The high degree of connectivity of the crossbar makes it a quite expensive topology to implement. Its cost is O(N²), indicating that it does not scale well beyond small sizes. Because of this, the crossbar is often used to interconnect the input and output ports of a switch itself, and the switches are again interconnected
Networks, Multistage. Fig. An × crossbar
to create larger networks to overcome the scaling problem of the crossbar. The largest single-chip crossbar switches in commercial production at the time of writing typically have around ports.
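As a minimal illustration of the nonblocking property just described, the sketch below models an N × N crossbar with only per-port state: a connection between an idle input and an idle output always succeeds, regardless of the connections already established. The class and method names are illustrative, and the O(N²) crosspoints are left implicit.

/** Minimal model of an N x N crossbar: setting up a connection between an
 *  idle input and an idle output never fails, independent of existing
 *  connections (the nonblocking property). Only per-port state is tracked. */
public class Crossbar {
    private final int[] outputOfInput;   // -1 means the input is idle
    private final boolean[] outputBusy;

    public Crossbar(int n) {
        outputOfInput = new int[n];
        outputBusy = new boolean[n];
        java.util.Arrays.fill(outputOfInput, -1);
    }

    /** Try to connect input i to output o; succeeds iff both ports are idle. */
    public boolean connect(int i, int o) {
        if (outputOfInput[i] != -1 || outputBusy[o]) return false;
        outputOfInput[i] = o;
        outputBusy[o] = true;
        return true;
    }

    /** Tear down the connection originating at input i, if any. */
    public void disconnect(int i) {
        int o = outputOfInput[i];
        if (o != -1) { outputBusy[o] = false; outputOfInput[i] = -1; }
    }
}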
Multistage Networks
In order to solve the scalability problem of the crossbar and still maintain its nonblocking properties, Charles Clos introduced nonblocking switching networks in . While the work was originally intended for the telephone networks, the concept of nonblocking switching networks has, together with the crossbar, been transferred to the packet-switched computer networks domain. The types of networks he introduced are often called Clos networks. The basic idea behind Clos networks is to connect switches in an odd number of switching stages, with one or more switches at each stage. This idea has been greatly extended since its inception, resulting in a class of networks called multistage networks. The common way of denoting a multistage network is by giving the number of stages in the network, the sizes of the switches that make up the stages, and the permutations that determine the interconnections between the stages. Going back to the original work by Clos, the minimum number of stages for a Clos network is three: an input stage, an output stage, and a middle stage. The number of switches and the number of ports on each switch at the three different stages determine the number of nodes that can be connected and the communication properties of the network. In a three-stage network, every switch at the input side has n₁ input ports and m output ports; there are m middle-stage switches, each with r₁ input ports and r₂ output ports; finally, the output stage consists of r₂ switches, each with m input ports and n₂ output ports. It is the relationship between these variables that determines the blocking characteristics of the multistage network, as discussed in the next section. The figure shows a three-stage Clos network with four middle-stage switches. The permutation defines how the switches are interconnected between the stages. This affects scheduling of the packets that cross the topology, and also any degree of redundancy that may be available for fault tolerance.
Networks, Multistage. Fig. A three-stage Clos network
For a fixed switch size, increasing the number of inputs and outputs supported by the topology is achieved by increasing the size of the middle-stage switches. By recursively replacing the middle-stage switches with three-stage Clos networks built from smaller switches, the large switches can be replaced by small fixed-size switches at the cost of increasing the number of stages in the network. Even though the number of stages in the network is increased, this makes sense from a cost perspective because small switches are quite inexpensive and fast. However, the overall latency of the network will in many cases increase.
Blocking Characteristics
The switch sizes, the number of stages, and the number of switches at the middle stage also determine the blocking characteristics of the MIN. Pure Clos networks are strictly nonblocking, i.e., regardless of any preexisting paths set up through the network, it is always possible to find a path between two idle connection points. In terms of the parameters defined above, the network is strictly nonblocking if m ≥ 2n − 1, where n₁ = n₂ = n. As the size (in terms of the number of switches and switch sizes) of the middle stage decreases, the degree of blocking in the network increases. Some MINs are rearrangeable (e.g., Benes networks), where it is always possible to set up a connecting path between two idle nodes by moving pre-existing paths around in the network. Other blocking characteristics also exist, based on how and when pre-existing paths through the topology are moved around to accommodate further communication pairs.
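The classical thresholds can be captured in a few lines. The sketch below assumes the standard symmetric three-stage parameterization (n input ports per first-stage switch, m middle-stage switches) and the textbook conditions m ≥ 2n − 1 for strict nonblocking and m ≥ n for rearrangeability; the class and enum names are illustrative.

/** Classify a symmetric three-stage Clos network by its middle-stage size:
 *  strictly nonblocking if m >= 2n - 1, rearrangeably nonblocking if m >= n,
 *  blocking otherwise. Sketch under the standard symmetric assumptions. */
public class ClosCharacteristics {
    enum Kind { STRICTLY_NONBLOCKING, REARRANGEABLE, BLOCKING }

    static Kind classify(int n, int m) {
        if (m >= 2 * n - 1) return Kind.STRICTLY_NONBLOCKING;
        if (m >= n)         return Kind.REARRANGEABLE;
        return Kind.BLOCKING;
    }

    public static void main(String[] args) {
        System.out.println(classify(4, 7)); // STRICTLY_NONBLOCKING (7 >= 2*4 - 1)
        System.out.println(classify(4, 5)); // REARRANGEABLE
        System.out.println(classify(4, 3)); // BLOCKING
    }
}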
Unidirectional MINs
Multistage networks can be either unidirectional or bidirectional. This section considers the unidirectional topologies. A unidirectional MIN consists of a series of two or more switching stages where packets always move from an injection point at the left side to an ejection point at the right side. To allow bidirectional communication between nodes, each injection and ejection pair is connected to a single node, usually by a short link to the ingress and a long link to the egress. MINs are usually designed in such a way that it is possible to move from any injection point to any ejection point in a single pass through the network. A large number of connectivity permutations (different ways of interconnecting switches at neighboring stages) have been proposed for unidirectional MINs. These include the shuffle-exchange, the Butterfly, Baseline, Cube, Omega, and the Flip, to name a few. All of the topologies mentioned here are isomorphic, or topologically equivalent. This means that any one topology can be converted to any of the other topologies simply by moving the switching elements and the endpoints up and down within their stages. Aside from these, a large number of other unidirectional MIN topologies also exist, such as the Benes network, which provides more than one path between each pair of nodes. Each of the existing topologies represents a trade-off between implementation complexity, blocking versus nonblocking behavior, the ability to route from a single source to multiple destinations in one pass through the topology, partitionability, fault tolerance, and the number of switching stages required to realize these properties. An important characteristic of unidirectional multistage networks is that the path length to any destination is constant. This is important with regard to scheduling packets, as it helps to make the network very predictable.
Butterfly The Butterfly, or k-ary n-fly, is one of the most
commonly mentioned topologies when discussing multistage networks. k and n represent the number of ports in each direction of every switch and the number of switching stages, respectively. The connection between any two switches at two neighbouring stages is defined
Networks, Multistage. Fig. A common fat tree, a -ary -tree
Networks, Multistage. Fig. A four stage butterfly
by the following function. The interconnection pattern is illustrated in Fig. . Let every outgoing switch port be identified by an n-digit number (tuple) {x_{n−1}, x_{n−2}, . . . , x_0}, where each digit can take values from 0 to k − 1. The digits {x_{n−1}, x_{n−2}, . . . , x_1} correspond to the switch number, and x_0 is the port number on that switch. The connection from a port a_{i−1} at stage i − 1 (stages being enumerated from 0 on the left) to a switch b_i = {b_{n−1}, b_{n−2}, . . . , b_1} at stage i is given by exchanging the first digit (x_0) with the i'th digit x_i. In this manner, a Butterfly network of any size may be constructed.
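The digit-exchange rule just stated can be made concrete with a small helper. The representation below (an array of radix-k digits indexed so that digits[j] holds x_j) and the method name are choices of this sketch, not part of the original definition.

import java.util.Arrays;

/** Destination switch of a k-ary n-fly link, following the digit-exchange
 *  rule in the text: exchange x_0 with x_i and keep the digits x_{n-1}..x_1
 *  as the switch number at stage i. */
public class ButterflyWiring {
    /** digits[j] holds x_j; stage is the destination stage i (>= 1),
     *  with stages enumerated from 0 on the left. */
    static int[] nextSwitch(int[] digits, int stage) {
        int[] d = digits.clone();
        int tmp = d[0]; d[0] = d[stage]; d[stage] = tmp;   // exchange x_0 and x_i
        int[] sw = new int[d.length - 1];                  // keep x_{n-1} ... x_1
        System.arraycopy(d, 1, sw, 0, sw.length);
        return sw;
    }

    public static void main(String[] args) {
        // Port x_2 x_1 x_0 = 1 0 1 in a 2-ary 3-fly, leaving stage 0 toward stage 1:
        int[] port = {1, 0, 1};                // port[0] = x_0, port[1] = x_1, port[2] = x_2
        System.out.println(Arrays.toString(nextSwitch(port, 1)));
        // prints [1, 1]: the next switch has digits x_1 = 1, x_2 = 1
    }
}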
Bidirectional MINs
Unidirectional MINs are a very old development, and most modern-day systems use bidirectional channels. The bidirectional channels can be used to construct bidirectional multistage networks. These topologies have usually been derived from unidirectional MINs by “folding.” In simple terms, all end nodes are connected to only one side of the topology. In order to reach any other end node, packets are moved to the right in the network until they reach a switch that is also reachable by the destination node. At this point the direction
reverses and the packet travels to the left, towards its destination. The consequence is that, contrary to unidirectional MINs where all path lengths were equal, the path length from a specific source to different destinations may be different. In fact, the destinations can be grouped by locality with respect to a specific source, yielding a hierarchical representation of the distance of the destinations. Fat Tree The most commonly used class of multistage
networks are fat trees. The fat tree was introduced as a topology by C. Leiserson in . In its original conception, the fat tree was presented as a tree topology with a single root (a tree topology is a multistage network). What made the topology unique was that the aggregate bandwidth of all links between any two tiers in the tree was constant. This was achieved simply by increasing the link bandwidth at every tier. Increasing the link bandwidth and switch capacity is not a viable approach for large-bandwidth systems, so fat trees are in practice most often implemented as folded Clos topologies. Contrary to unidirectional networks, fat trees/folded Clos networks are most often represented vertically instead of horizontally, with the end nodes connected to the bottom of the network. Depending on the exact interconnection pattern and the number of switches at the different tiers, the connections of a fat tree are often described as a k-ary n-tree or an m-port n-tree, where m is the number of switch ports, k = m/2, and n is the number of tiers in the tree. An example topology is depicted in Fig. .
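For a rough sense of scale, the sketch below uses the commonly cited k-ary n-tree formulas: k^n end nodes and n·k^(n−1) switches built from 2k-port switches. Treat the formulas and names as assumptions of the sketch rather than definitions taken from this entry.

/** Size of a k-ary n-tree under the commonly cited formulas:
 *  k^n end nodes and n * k^(n-1) switches of 2k ports each. */
public class KaryNtree {
    static long endNodes(int k, int n) { return pow(k, n); }
    static long switches(int k, int n) { return (long) n * pow(k, n - 1); }

    private static long pow(int base, int exp) {
        long r = 1;
        for (int i = 0; i < exp; i++) r *= base;
        return r;
    }

    public static void main(String[] args) {
        // A 4-ary 3-tree, i.e., built from 8-port switches (k = m/2 = 4):
        System.out.println(endNodes(4, 3));  // 64 end nodes
        System.out.println(switches(4, 3));  // 3 * 16 = 48 switches
    }
}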
The fat tree is a very popular topology in modern supercomputers because of its universal nature. Averaged over all possible traffic patterns, the fat tree shows the best mean performance. This means that the fat tree is well suited for general-purpose computers designed to solve a multitude of different problems. There exist, however, a number of specific problem domains, such as physical simulations, that are better handled by direct networks such as meshes or tori (see the entry Networks, Direct).
Routing Functions
Many multistage networks are self-routing. This means that the path through the network to the destination is completely determined by the destination address: at each stage, the output port is determined by the corresponding digit (tuple element) of the destination address. This allows for very simple routing logic and helps to maintain the high performance of the networks. Many direct networks must follow routing restrictions in order to guarantee deadlock freedom, which makes it difficult to achieve the maximum theoretical performance of the topology. On the other hand, all the types of MINs described here support minimal deadlock-free routing without any limitations. This means that as long as a packet moves towards the destination, any output port can in principle be selected without risking deadlock. For modern large-scale implementations, it is common to use routing tables, since the systems are often built using commodity hardware that is routing-table-based. So, even though any path is deadlock-free, the exact path chosen for a given destination is determined by the specific routing algorithm in use, whether it is adaptive or deterministic, and in the latter case, by the deterministic path that has been set up for the destination.
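A minimal sketch of this destination-tag rule is shown below. Which digit is consumed at which stage depends on the concrete wiring permutation, so the most-significant-digit-first convention used here is only illustrative.

/** Destination-tag (self-)routing sketch for a radix-k, n-stage MIN: the
 *  output port taken at each stage is one radix-k digit of the destination
 *  address, consumed here from the most significant digit downwards. */
public class DestinationTagRouting {
    /** Output port selected at the given stage (0 .. n-1) for destination dest. */
    static int outputPort(int dest, int stage, int k, int n) {
        int div = 1;
        for (int i = 0; i < n - 1 - stage; i++) div *= k;   // weight of the digit used at this stage
        return (dest / div) % k;
    }

    public static void main(String[] args) {
        int k = 2, n = 3, dest = 5;                          // destination 101 in binary
        for (int stage = 0; stage < n; stage++)
            System.out.print(outputPort(dest, stage, k, n)); // prints 101
        System.out.println();
    }
}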
Multistage Networks Today
Multistage networks are a very old research topic. Unidirectional MINs have been extensively studied in order to create topologies that support all desired properties for any given application. This interest has abated significantly in the last decade, mainly because industry has not shown enough interest in the developments in this field.
In contrast to this, the interest in bidirectional MINs, and specifically the fat tree, has grown in recent years. This is especially evident when looking at the list of the fastest supercomputers in the world: the systems that are not specifically tailored towards applications that require torus-like structures are typically fat trees. This has led to a sustained research interest in how to most efficiently route traffic through the fat tree in a manner that achieves good balancing for a wide variety of traffic patterns. Another actively researched challenge is how best to exploit the large degree of redundancy in fat trees, either for increased performance through multipath routing or to achieve fault tolerance.
Future Directions
Fat trees will continue to be an important topology for high-performance computing. As the number of ports supported by commodity off-the-shelf switching hardware increases, it becomes possible to build larger and larger fat trees without increasing the number of tiers of the topology. Alternatively, the increased switch sizes can be exploited to keep the same topology size and reduce the number of tiers. This is equivalent to a reduction in latency, which is an important parameter in high-performance computing. Another area to which fat trees lend themselves well is hierarchical communication schemes. A typical example is a memory system in which a computing core is connected to different levels of cache; each cache level is larger than the previous one and is located further away, and at the very end there are typically the RAM chips. By connecting the processing cores to one subset of the tree, and the memory modules to other subsets of the tree that are sequentially further and further away, the hierarchical nature of the memory is replicated to some degree. As chip feature sizes shrink and more and more memory fits onto a single chip, there might be a need for such approaches to allow scalability and flexibility in memory management.
Related Entries
Buses and Crossbars
InfiniBand
Myrinet
Networks, Direct
Networks, Fault-Tolerant
Network Interfaces
Quadrics
Routing (Including Deadlock Avoidance)
SCI (Scalable Coherent Interface)
Switch Architecture
Switching Techniques
Bibliographic Notes and Further Reading
The first presentation of multistage networks was given by Charles Clos []. There exists a huge number of publications on typical multistage networks, and a simple overview of the characteristics of Clos networks is given in []. Other good introductions to the topic of multistage networks can be found in the books by Duato et al. [] and Dally et al. []. The seminal paper on fat trees was published by Charles Leiserson []. It also spawned an extensive amount of work focusing both on reducing the redundancy to achieve cheaper topologies with the same characteristics [], and on utilizing the redundancy for fault tolerance [] and high performance. The original fat tree structure is quite strict; it has been extended [] to support a more flexible structure. This is important since most actual implementations of this topology deviate from the strict fat tree definition. The common multiple-root definitions of fat trees are k-ary n-trees [] and m-port n-trees [].
Bibliography . Clos C () A study of non-blocking switching networks. Bell Syst Tech J :– . Dally W, Towles B () Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco . Duato J, Yalamanchili S, Ni L () Interconnection networks: an engineering approach. Morgan Kaufmann, San Francisco . Jajszczyk A () Nonblocking, repackable, and rearrangeable clos networks: years of the theory evolution. IEEE Commun Mag ():– . Leiserson CE () Fat-trees: universal networks for hardware-efficient supercomputing. IEEE Trans Comput C-:– . Lin X, Chung Y, Huang T () A multiple LID routing scheme for fat-tree-based infiniband networks. In: Proceedings of IEEE international parallel and distributed processing symposium . Petrini F, Vanneschi M () K-ary N-trees: high performance networks for massively parallel architectures. Retrieved from citeseer.ist.psu.edu/petrinikary.html . Valerio M, Moser LE, Melliar-Smith PM () Fault-tolerant orthogonal fat-trees as interconnection networks. In: Proceedings st international conference on algorithms and architectures for parallel processing, vol , pp – . Valerio M, Moser LE, Smith PM () Recursively scalable fat-trees as interconnection networks. In: Proceeding of th IEEE annual international phoenix conference on computers and communications . Øhring SR, Ibel M, Das SK, Kumar MJ () On Generalised Fat-trees. In: Proceedings of the th international symposium on parallel processing, p
NI (Network Interface) Network Interfaces
NIC (Network Interface Controller or Network Interface Card) Network Interfaces
Node Allocation Job Scheduling
Non-Blocking Algorithms Danny Hendler Ben-Gurion University of the Negev, Beer-Sheva, Israel
Synonyms Lock-free algorithms
Definition Non-blocking algorithms are shared-memory multiprocessor algorithms that can avoid race conditions and guarantee correctness without using mutual exclusion
locks. They are in general more resilient to asynchronous conditions than lock-based algorithms since they guarantee progress even if some of the threads participating in the algorithm are delayed for a long duration or even fail-stop. (The definition of the terms “non-blocking” and “lock-free” has changed in recent years, creating some terminology confusion. In the past, an algorithm was called “non-blocking” if it guaranteed global progress and “lock-free” if it did not rely on mutual exclusion locks. In recent years, most authors call an algorithm “lock-free” if it guarantees global progress and “non-blocking” if the delay of some threads cannot delay other threads indefinitely. This entry adopts the latter terminology since it is more widely used nowadays. Precise definitions are provided later in the sequel.)
Discussion Introduction To ensure correctness, shared-memory multiprocessor algorithms (also called concurrent algorithms) must employ inter-thread synchronization for precluding concurrency patterns that result in race conditions – situations in which correctness depends on the order in which steps by different threads are scheduled. The traditional approach for inter-thread synchronization is to use blocking (lock-based) mechanisms, such as mutex locks, condition variables, and semaphores, for ensuring that conflicting critical sections of code do not execute concurrently. Synchronization based on only a small number of locks (coarse-grained locks), each protecting a large portion of data, may restrict parallelism too much and is not scalable. On the other hand, when numerous locks are being used (fine-grained locks), it becomes difficult to avoid correctness issues such as deadlocks, convoying, and priority inversion. Moreover, the failure of a thread that holds a lock may prevent other threads from making progress. Non-blocking algorithms take a different approach and can ensure correctness without using locks. Nonblocking synchronization is more resilient to asynchrony conditions than blocking synchronization since thread delays or failures do not prevent other threads from making progress.
Progress and Correctness Guarantees for Non-blocking Algorithms
Non-blocking algorithms can be characterized according to the type of progress they ensure. The key progress guarantees provided by non-blocking algorithms are the following:
1. Wait-freedom [] ensures that a thread will always complete its operation in a finite number of its own steps, regardless of the progress of other threads. Wait-free algorithms thus guarantee the individual progress of any non-failed thread.
2. Lock-freedom [] ensures that some thread will complete its operation after a finite number of steps is taken by the participating threads. Lock-free algorithms thus guarantee global progress, but individual threads may starve (they might never complete their operation).
3. Obstruction-freedom [, ] ensures that any thread that runs in isolation (i.e., is the only thread taking steps) from some point on will complete its operation in a finite number of its own steps. If no thread runs solo long enough, then obstruction-free algorithms may suffer from live-lock, where threads constantly interfere with one another and none of them completes its operation.
It is easily seen that wait-freedom implies lock-freedom which, in turn, implies obstruction-freedom. The converse implications are false: an obstruction-free algorithm is not necessarily lock-free, and the latter is not necessarily wait-free.
The correctness condition of a concurrent data structure specifies how to derive the semantics of concurrent implementations of the data structure from the semantics of the corresponding sequential data structure. This requires disambiguating the expected results of concurrent operations on the data structure. Two common ways to do so are sequential consistency [] and linearizability []. Both require that the values returned by the operations appear to have been returned by a sequential execution of the same operations; sequential consistency only requires this order to be consistent with the order in which each individual thread invokes the operations, while linearizability further requires this order to be consistent with the real-time order of non-overlapping operations.
The vast majority of the non-blocking synchronization literature considers linearizable (also called atomic) implementations.
Base Objects and Primitive Operations
Concurrent algorithms in general, and non-blocking algorithms in particular, are used to implement concurrent objects such as concurrent counters, queues, hash-tables, and so on. Concurrent objects are typically implemented from simpler base objects. The simplest base objects in a given system are its memory locations. The system's hardware or operating system provides primitive operations that can be applied to memory locations. The familiar atomic read and write primitive operations are assumed to be provided by all systems. In the shared-memory nomenclature, a memory location that supports only the read and write operations is called a register. It has been shown [] that registers alone cannot support non-blocking implementations of even the most basic data structures, like queues and stacks. This and other evidence of the relative weakness of reads and writes have led much of the research on non-blocking algorithms to assume the existence of stronger primitive operations. The most notable of these is compare and swap (often denoted CAS), which takes three arguments: a memory address, an expected value, and a new value (Fig. ). If the address stores the expected value, it is replaced with the new value; otherwise it is unchanged. The success or failure of this operation is then reported back to the program. It is crucial that this operation is executed atomically; thus an algorithm can read a datum from memory, modify it, and write it back only if no other thread modified it in the meantime. CAS can be used, together with reads and writes, to implement any object in a wait-free manner []. It has consequently become a synchronization primitive
compare-and-swap(w, expected, new)
    if (∗w = expected) then
        ∗w = new
        return success
    else
        return failure

fetch-and-add(w, Δ)
    curVal = ∗w
    ∗w = curVal + Δ
    return curVal
Non-Blocking Algorithms. Fig. The compare and swap and fetch and add primitive operations
of choice, and hardware support for it is provided in many multiprocessor architectures. Another widely available primitive operation is the fetch and add primitive (denoted FAA), which takes two arguments: a memory address storing an integer value and an integer Δ. The FAA operation atomically increases the integer stored at the argument address by Δ and returns its previous value. The pseudo-code of the CAS and FAA primitive operations is shown in the figure. To exemplify the use of the CAS and FAA operations, consider the implementations of a concurrent counter object shown in Fig. . A counter supports a single operation, fetch-and-inc, which atomically increments the counter value by 1 and returns its previous value. The fetch-and-inc operation can be trivially implemented in a wait-free manner by applying the FAA operation with Δ = 1. Using CAS, the fetch-and-inc operation can be implemented by reading the current counter value (this is the expected value), adding 1, and then using CAS to attempt to update the counter. If no other thread modifies the counter between the read and CAS operations, the CAS will succeed, and the counter will be updated. However, if a concurrent modification does occur, the CAS will fail, and the thread will repeatedly retry the update (by first reading the new value of the counter) until it is successful (if it ever is). This algorithm is not wait-free, since other threads may keep incrementing the counter and make a specific thread repeatedly fail. It is lock-free, since a failure of a CAS operation implies that a CAS operation by a different thread succeeded between when the expected value was read and the failing CAS was attempted.
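The CAS-based retry loop just described can be written directly against java.util.concurrent.atomic. The sketch below uses AtomicInteger.compareAndSet; as noted above, it is lock-free but not wait-free.

import java.util.concurrent.atomic.AtomicInteger;

/** The CAS retry loop from the text: lock-free (some thread always makes
 *  progress) but not wait-free (a particular thread may fail its CAS forever
 *  under contention). */
public class LockFreeCounter {
    private final AtomicInteger val = new AtomicInteger(0);

    public int fetchAndInc() {
        while (true) {
            int exp = val.get();                     // read the expected value
            if (val.compareAndSet(exp, exp + 1)) {   // succeeds only if val is still exp
                return exp;                          // previous value, as FAA would return
            }
            // CAS failed: another thread incremented first; re-read and retry.
        }
    }
}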
Theoretical Foundations
As shown above, the FAA operation can be used to implement a counter in a wait-free manner. Can a wait-free counter be implemented by using only read and write operations? Can the CAS operation be implemented in a wait-free manner by using only read, write, and FAA operations? It turns out that the answer to both these questions is negative. The foundations of the rich theory underlying non-blocking synchronization were laid down by a seminal paper of Herlihy []. Quoting from []: “The fundamental problem of wait-free synchronization can be phrased as follows: given two concurrent objects X and Y, does there exist
class Counter {
    private int val = 0
    int fetch-and-inc() {
        return fetch-and-add(&val, 1)
    }
}

class Counter {
    private int val = 0
    int fetch-and-inc() {
        int exp
        do { exp = val } until CAS(&val, exp, exp+1) = success
        return exp
    }
}
Non-Blocking Algorithms. Fig. Implementing a counter object from fetch and add and from compare and swap

class Consensus {                 // Code for thread i ∈ {0,1}
    private Counter c             // initially 0
    private register vals[0..1]
    V decide(V v) {
        vals[i] = v
        if c.fetch-and-inc() = 0
            return v
        else
            return vals[1-i]
    }
}
Non-Blocking Algorithms. Fig. A two-thread consensus implementation from counter objects and registers
a wait-free implementation of X by Y?” Herlihy introduced powerful techniques for obtaining the answers to such questions. These techniques are based on reductions to the consensus object. A consensus object supports a single operation called decide, which every thread may call with its input at most once. The decide operation must satisfy the following requirements:
● Agreement: all terminating calls of decide (that is, all calls made by threads that do not fail while executing the decide operation) must return the same value. In other words, all threads agree on the same value.
● Validity: the common decision value is the input of some thread.
A consensus number is associated with each object type, specifying the maximum number of threads for which the object type can implement consensus in a wait-free manner. More specifically, if object type X has consensus number CN(X), then a wait-free consensus object, shared by CN(X) or fewer threads, can be implemented by objects of type X and registers; however, a wait-free consensus object that is shared by more than CN(X) threads cannot be implemented by objects of type X and registers.
For instance, Herlihy proves [] that the consensus number of a register is 1. This means that registers can (trivially) implement wait-free consensus for only a single thread but cannot be used to implement wait-free consensus for two or more threads. As shown in Fig. , a counter object can implement wait-free consensus for two threads. Herlihy also proves [] that wait-free consensus shared by three or more threads cannot be implemented by any number of counter objects and registers. Thus, the consensus number of a counter is 2. It follows that a wait-free implementation of a counter from registers, shared by two threads or more, is impossible. An object type X is called universal if any object can be implemented from instances of X and registers in a wait-free manner. Herlihy proves that an object is universal in a system of n threads if and only if it has consensus number at least n. The universality of consensus is established by presenting a universal construction algorithm (using consensus objects and registers) that can implement any object in a wait-free manner. Consensus is not the only universal object: Herlihy shows that the CAS object has consensus number ∞. It follows that CAS can be used, together with read and write, to implement any object (shared by any number of threads) in a wait-free manner. The CAS operation has consequently become a synchronization primitive of choice, and hardware support for it is provided in many multiprocessor architectures. The concepts of consensus numbers and universality imply that there is a hierarchy of objects (called the wait-free hierarchy), where objects are placed in hierarchy levels according to their consensus numbers, such that (1) an object at level l cannot implement objects at level m > l, shared by more than l threads, in a wait-free manner, and (2) an object at level l (together with registers) is universal in l-thread systems.
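For illustration, here is a Java rendering of the two-thread consensus construction shown in the figure, using an AtomicInteger as the wait-free counter (its getAndIncrement is a hardware-supported FAA). Thread ids 0 and 1 and the class name are assumptions of this sketch.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

/** Two-thread consensus from a shared counter, mirroring the figure: each
 *  thread publishes its proposal, then does fetch-and-inc on the counter;
 *  the thread that obtains 0 wins, and its proposal is the decision. */
public class TwoThreadConsensus<V> {
    private final AtomicInteger counter = new AtomicInteger(0);
    private final AtomicReferenceArray<V> vals = new AtomicReferenceArray<>(2);

    public V decide(int i, V v) {          // i is the calling thread's id, 0 or 1
        vals.set(i, v);                    // publish my proposal before touching the counter
        if (counter.getAndIncrement() == 0)
            return v;                      // I was first: my value is decided
        else
            return vals.get(1 - i);        // the other thread was first: adopt its value
    }
}

Because the winner publishes its proposal before incrementing the counter, the loser is guaranteed to observe the winner's value, which yields both agreement and validity.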
The table below presents the consensus numbers of several concurrent objects.

Consensus number    Objects
1                   Read/write registers
2                   Fetch and add, swap, test and set, stack, queue
⋮                   ⋮
m                   m-enqueues augmented queue
⋮                   ⋮
∞                   Compare and swap, load-link/store-conditional, augmented queue
CAS, the load-link/store-conditional pair of operations (a weak version of which is implemented on some IBM architectures), and the augmented queue object (a queue that supports a peek operation that returns the value of the first queue item, in addition to enqueue and dequeue operations) all have consensus number ∞ and are therefore universal in any system. It is noteworthy that all levels of the hierarchy are populated (an m-enqueues augmented queue is an augmented queue that allows at most m enqueue operations and starts returning a response after the (m + 1)'st enqueue operation is applied to it []). The impossibility and universality results presented in [] and summarized above hold also for lock-free implementations but do not hold for obstruction-free implementations: there are obstruction-free implementations of any object from read and write operations only []. An object hierarchy is called “robust” if objects at lower levels cannot be used to implement objects at higher levels. Jayanti [] showed that the wait-free hierarchy is not robust if consensus numbers are defined so that implementations cannot use registers or can only use a single object of the lower type.
Algorithms and Design Considerations
Non-blocking synchronization has been the focus of intense research in recent years, and numerous non-blocking algorithms have been published. A few such algorithms are described in the following, mainly for highlighting the design principles underlying them. A register is called safe if a read operation that does not overlap a write operation returns the value
written by the most recent write operation (or the initial value if there are no preceding write operations) but may return any value if there is a concurrent write operation. A register is single-writer single-reader if it can only be used for unidirectional communication between two specific threads, one of which may write the register and the other may read it. A register is multi-writer multi-reader if multiple threads may read and write to it. Lamport [, ] defined various register types based on the number of threads that are permitted to read and write them and their correctness guarantees. A series of constructions established the rather surprising result that binary safe single-writer single-reader registers can be used to implement the much stronger atomic multi-writer multi-reader registers in a wait-free manner. (They are simply referred to as atomic registers in the rest of the text.) The atomic snapshot object [] allows a thread to take an instantaneous snapshot of an array of registers. It supports two operations: the update operation writes a value to the calling thread’s register, and the scan operation returns an instantaneous snapshot, that is, a state of the array that existed at some point during the execution of the scan operation. The ability to obtain an instantaneous snapshot in a wait-free manner can simplify the design of non-blocking algorithms. Indeed, the snapshot object is used as a building block of non-blocking implementations of key objects such as k-set consensus, approximate agreement, and renaming (see [], Chapter ).
Helping A wait-free implementation of the snapshot object from atomic registers is given in []. In this implementation, update operations help scan operations by performing an embedded scan themselves and storing the resulting snapshot value along with the value they write. This is an example of a general design principle underlying many wait-free implementations, called helping: operations may need to assist other operations to prevent starvation. A helping mechanism is also used in Herlihy’s universal wait-free construction [], where each thread, before performing its own operation, chooses another thread and checks if it requires help; if it does, it first performs the other thread’s operation and only then performs its own operation.
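The scan/update interplay is easier to see in code. Below is a simplified "double collect" scan sketch in Java (class and field names are mine): scan() repeatedly collects the array until two consecutive collects return identical tags, which guarantees that the returned state existed at some instant in between. This simplified version is only lock-free; the wait-free algorithm of [] additionally has update operations embed a scan of their own, the helping mechanism described above, which is omitted here.

import java.util.concurrent.atomic.AtomicReferenceArray;

/** Simplified snapshot sketch. Each register holds a value tagged with a
 *  per-writer sequence number, so scan() can tell whether a register was
 *  overwritten between two collects. Register i is written only by thread i,
 *  matching the single-writer snapshot interface in the text. */
public class SnapshotSketch {
    record Cell(int seq, int value) {}

    private final AtomicReferenceArray<Cell> regs;
    private final int[] writerSeq;                     // used only by each register's writer

    public SnapshotSketch(int n) {
        regs = new AtomicReferenceArray<>(n);
        writerSeq = new int[n];
        for (int i = 0; i < n; i++) regs.set(i, new Cell(0, 0));
    }

    public void update(int i, int value) {             // called only by thread i
        regs.set(i, new Cell(++writerSeq[i], value));
    }

    public int[] scan() {
        while (true) {                                 // lock-free, not wait-free
            Cell[] first = collect();
            Cell[] second = collect();
            if (identical(first, second)) {            // no register changed in between,
                int[] snap = new int[first.length];    // so this state existed at some instant
                for (int i = 0; i < snap.length; i++) snap[i] = first[i].value();
                return snap;
            }
        }
    }

    private Cell[] collect() {
        Cell[] view = new Cell[regs.length()];
        for (int i = 0; i < view.length; i++) view[i] = regs.get(i);
        return view;
    }

    private boolean identical(Cell[] a, Cell[] b) {
        for (int i = 0; i < a.length; i++)
            if (a[i].seq() != b[i].seq()) return false;
        return true;
    }
}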
The ABA Problem and Safe Memory Reclamation
Many non-blocking implementations of concurrent objects based on strong synchronization operations have been presented, including (but not limited to) counters, stacks, queues and double-ended queues, linked lists, sorting networks, hash tables, skip lists, priority queues, and renaming objects (see [] for a comprehensive discussion of such implementations). Most of these implementations use unary atomic operations (i.e., operations that are applied to a single memory location) such as CAS, FAA or test-and-set, but some implementations are based on multi-word operations such as double compare and swap. Since the CAS operation is available on most modern multiprocessor architectures, and because it can be used to implement any object in a wait-free (hence also lock-free) manner, CAS is used by many non-blocking algorithms. CAS-based algorithms often have to deal with a problem of CAS called the ABA problem. The ABA problem occurs when a thread t reads some value A from a memory address m, then some other threads change m's value from A to a different value, say B, and back again to A, and finally t attempts to swap m's value from A to some other value by using CAS; the CAS will then succeed. The correctness of the implementation may be compromised if it depends on the assumption that the CAS must fail if m's value changes between the read and the CAS applied to it by t. A simple solution to the ABA problem is to associate a reference counter with the application value and to increment it whenever the application value is modified by CAS. This requires packing the application value and reference counter within a single machine word, reducing the number of bits that can be used for storing application values. For some applications, and especially on 64-bit machines, this may be a satisfactory solution; for other applications, however, reducing the domain size of application values is infeasible. The ABA problem is closely related to the problem of safe memory reclamation. A thread must not recycle (i.e., initialize and then re-use) a dynamic memory structure, even after it was logically removed from the data-structure, while other threads may still be holding references to this structure, which they may later use to access it. Violating this rule may compromise the implementation's correctness. Specifically, unsafe
memory reclamation may result in the ABA problem occurring for CAS operations that attempt to swap a pointer value; in such scenarios, the swap may succeed instead of failing and corrupt the implementation's data-structure. General schemes for safe non-blocking memory reclamation exist [, ]. Michael [] proposed a wait-free safe memory reclamation technique that uses hazard pointers. (The scheme described in [] uses similar ideas.) Informally, an access to a dynamic memory structure is called hazardous if the structure may be concurrently recycled by other threads. With Michael's scheme, each thread owns some (typically small) number of hazard pointers. Before making a hazardous access to structure s, a thread t sets one of its currently unused hazard pointers to point to s; then, t verifies (in some application-dependent manner) that s is still logically part of the algorithm's data-structure. If the verification succeeds, t can safely access s, as the scheme now guarantees that s will not be recycled as long as t's hazard pointer points to it. If the verification fails, implying that s is no longer part of the data-structure, t proceeds to handle this situation in an application-dependent manner. After a dynamic memory structure is logically removed from the data-structure, it is designated as a candidate for memory recycling but will only be recycled when no hazard pointers point to it.
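One standard way to realize the counter/tag idea mentioned above in Java is java.util.concurrent.atomic.AtomicStampedReference, which pairs a reference with an integer stamp and CASes both atomically. The class and method below are illustrative, not taken from any particular algorithm.

import java.util.concurrent.atomic.AtomicStampedReference;

/** Tagged CAS via AtomicStampedReference: the CAS succeeds only if both the
 *  reference and its stamp are unchanged, so an A -> B -> A change of the
 *  reference is detected because the stamp has advanced in the meantime. */
public class TaggedSlot<T> {
    private final AtomicStampedReference<T> slot =
            new AtomicStampedReference<>(null, 0);

    /** Replace the current value only if it has not changed (not even A->B->A). */
    public boolean tryReplace(T expected, T update) {
        int[] stampHolder = new int[1];
        T current = slot.get(stampHolder);          // read reference and stamp together
        if (current != expected) return false;
        int stamp = stampHolder[0];
        return slot.compareAndSet(expected, update, stamp, stamp + 1);
    }
}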
Memory Contention and Adaptive Algorithms
The number of steps performed by a thread in the course of performing an operation on the implemented data-structure is not the only factor that contributes to the time complexity of non-blocking algorithms. In practice, the performance of non-blocking algorithms is often limited by memory contention [], the extent to which multiple threads access widely shared memory locations simultaneously. The degradation in performance that is caused by contention is the result of limitations on the bandwidth of both the memory and the processor-to-memory interconnect. Important properties of a non-blocking algorithm are therefore (1) the extent to which threads executing the algorithm may encounter memory contention as they execute their steps, and (2) whether the algorithm's step complexity is a function of the total number of processes that may participate in the execution or
a function of the number of threads that actually participate in the execution (as captured by the memory contention encountered by an operation). There are several ways to measure contention during the interval of an operation op. The total contention [] of an execution is the number of threads that ever took steps in the execution. The interval contention [] of op is the number of threads whose operations are concurrent with op. The point contention [] of op is the maximum number of operations simultaneously concurrent with op. Finally, the step contention [] of op is the number of threads that take steps during op's execution interval. An implementation is adaptive to (total, interval, point, or step) contention if the step complexity of an operation is a function of the contention during the operation's interval. By definition, algorithms with adaptive step complexity are wait-free. Yet another measure of contention is hot-spot contention [], defined as the maximum number of threads simultaneously poised to access the same memory location. Reducing memory contention and designing algorithms that are adaptive to different measures of contention have been the focus of the design of many non-blocking data structures. Some examples include counting networks [, ], which are highly concurrent wait-free implementations of a counter that eliminate sequential bottlenecks and reduce hot-spot contention; adaptive non-blocking implementations of Collect [, , , ]; and adaptive wait-free implementations of Renaming [].
Non-blocking Algorithms: Pros and Cons
Non-blocking algorithms are one approach for implementing concurrent data-structures, but alternative approaches exist as well. In this section, the key alternative paradigms for concurrent multi-threaded programming are briefly described, and the advantages and disadvantages of the non-blocking approach in comparison with these alternatives are discussed. The traditional approach to inter-thread synchronization is to use mutual-exclusion locks. Fine-grained lock-based algorithms can be highly efficient but are error-prone and difficult to program. Coarse-grained lock-based algorithms, on the other hand, are much easier to program but, in general, are not scalable. The key advantage of non-blocking synchronization in comparison to lock-based programming is its
resilience to asynchrony, since, in general, threads executing a non-blocking algorithm can make progress even when other threads are delayed or even fail. On the other hand, the design of wait-free and (to a lesser extent) lock-free algorithms is notoriously hard. Due to this design difficulty, obtaining efficient wait-free or lock-free implementations of basic data-structures has been an active research topic in recent years. Another disadvantage of wait-free and (to a lesser extent) lock-free algorithms is that they often incur high overhead in terms of time complexity. As shown by Attiya et al. [], there are problems for which the step complexity of wait-free implementations is inherently higher than that of non-wait-free implementations. Wait-free and lock-free algorithms often use strong synchronization operations, such as CAS, that are slower than read and write operations, which also hurts their performance. Designing obstruction-free algorithms, on the other hand, is easier, and they have the potential of being faster in the absence of contention. However, obstruction-free algorithms may be susceptible to live-lock scenarios and require contention-management mechanisms for overcoming them. Transactional memory (TM) [, ] is an emerging concurrent programming abstraction. Memory transactions allow a thread to execute a sequence of shared memory accesses whose effect is atomic: similarly to database transactions, a TM transaction either has no effect (if it fails) or appears to take effect instantaneously (if it succeeds). Some software transactional memory (STM) implementations are non-blocking, while others are lock-based. A key advantage of the TM programming abstraction is its simplicity: due to the atomicity provided by transactions, programmers can apply sequential reasoning and only need to consider executions that consist of transaction sequences for ensuring correctness. On the other hand, since TM is a generic mechanism, it is likely that optimized direct non-blocking implementations of specific data-structures will be more efficient than TM-based implementations (assuming both provide the same progress and correctness guarantees) but will be much harder to derive. It is now accepted that no single programming methodology is a panacea for multi-threaded
programming. Instead, multiple approaches must be considered and possibly combined, depending on the application type and the target architecture.
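As a small illustration of the CAS-based, lock-free style discussed above, the following C11 fragment sketches a lock-free shared counter; it is a generic textbook-style sketch, not one of the algorithms cited in this entry, and the type and function names are choices of the example.

```c
#include <stdatomic.h>

/* A minimal lock-free counter: the increment is retried with CAS until it
   succeeds. Any failed attempt means some other thread completed its own
   increment, so the system as a whole always makes progress (lock-freedom),
   even though an individual thread may be forced to retry. */
typedef struct { _Atomic unsigned long value; } counter_t;

unsigned long counter_increment(counter_t *c)
{
    unsigned long old = atomic_load(&c->value);
    /* On failure, 'old' is reloaded with the current value; try again. */
    while (!atomic_compare_exchange_weak(&c->value, &old, old + 1)) {
        /* empty: retry */
    }
    return old + 1;
}
```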
Related Entries
Asynchronous Iterative Algorithms
Atomic Operations
Processes, Tasks, and Threads
Race Conditions
Shared-Memory Multiprocessors
Synchronization
Transactional Memories
Further Reading
The textbooks by Attiya and Welch [], Herlihy and Shavit [], Lynch [], and Taubenfeld [] contain a wealth of information about distributed computing in general, and the theory and practice of non-blocking algorithms in particular. Numerous lower bounds and impossibility results for both deterministic and randomized non-blocking algorithms, under various measures, have been published over the years. One of the key developments of this very active area of research is the use of algebraic topology tools for proving impossibility results on the computability of wait-free decision tasks from read and write operations [, , ]. A detailed description of this research area is outside the scope of this entry, and the interested reader is referred to a comprehensive survey by Fich and Ruppert [].
Bibliography
. Afek Y, De Levie Y () Efficient adaptive collect algorithms. Distrib Comput ():–
. Afek Y, Attiya H, Dolev D, Gafni E, Merritt M, Shavit N () Atomic snapshots of shared memory. J ACM ():–
. Afek Y, Stupp G, Touitou D () Long-lived adaptive splitter and applications. Distrib Comput ():
. Aspnes J, Herlihy M, Shavit N () Counting networks. J ACM ():–
. Attiya H, Fouren A () Adaptive and efficient algorithms for lattice agreement and renaming. SIAM J Comput ():–
. Attiya H, Fouren A () Algorithms adaptive to point contention. J ACM ():–
. Attiya H, Welch J () Distributed computing: fundamentals, simulations and advanced topics. Wiley, Hoboken
. Attiya H, Lynch NA, Shavit N () Are wait-free algorithms fast? J ACM ():–
. Attiya H, Fouren A, Gafni E () An adaptive collect algorithm with applications. Distrib Comput ():–
. Attiya H, Guerraoui R, Hendler D, Kuznetsov P () The complexity of obstruction free implementations. J ACM ():–
. Borowsky E, Gafni E () Generalized FLP impossibility result for t-resilient asynchronous computations. In: STOC, San Diego, pp –
. Dwork C, Herlihy M, Waarts O () Contention in shared memory algorithms. J ACM ():–
. Fich F, Ruppert E () Hundreds of impossibility results for distributed computing. Distrib Comput (–):–
. Gafni E, Merritt M, Taubenfeld G () The concurrency hierarchy, and algorithms for unbounded concurrency. In: PODC, Newport, pp –
. Herlihy M () Wait-free synchronization. ACM Trans Program Lang Syst ():–
. Herlihy M, Moss E () Transactional memory: architectural support for lock-free data structures. In: ISCA, San Diego, pp –
. Herlihy M, Shavit N () The topological structure of asynchronous computability. J ACM ():–
. Herlihy M, Shavit N () The art of multiprocessor programming. Morgan Kaufman, Amsterdam
. Herlihy M, Wing J () Linearizability: a correctness condition for concurrent objects. ACM Trans Program Lang Syst ():–
. Herlihy M, Shavit N, Waarts O () Linearizable counting networks. Distrib Comput ():–
. Herlihy M, Luchangco V, Moir M, Scherer III WN () Software transactional memory for dynamic-sized data structures. In: PODC, Boston, pp –
. Herlihy M, Luchangco V, Moir M () Obstruction-free synchronization: double-ended queues as an example. In: ICDCS, Providence, pp –
. Herlihy M, Luchangco V, Martin PA, Moir M () Nonblocking memory management support for dynamic-sized data structures. ACM Trans Comput Syst ():–
. Jayanti P () Robust wait-free hierarchies. J ACM ():–
. Lamport L () How to make a multiprocessor that correctly executes multiprocess programs. IEEE Trans Comput C():–
. Lamport L () On interprocess communication. Part I: basic formalism. Distrib Comput ():–
. Lamport L () On interprocess communication. Part II: algorithms. Distrib Comput ():–
. Lynch N () Distributed algorithms. Morgan Kaufman, San Francisco
. Michael M () Hazard pointers: safe memory reclamation for lock-free objects. IEEE Trans Parallel Distrib Syst ():–
. Saks ME, Zaharoglou F () Wait-free k-set agreement is impossible: the topology of public knowledge. SIAM J Comput ():–
. Shavit N, Touitou D () Software transactional memory. Distrib Comput ():–
. Taubenfeld G () Synchronization algorithms and concurrent programming. Pearson/Prentice Hall, Harlow
Nondeterminator
Race Detectors for Cilk and Cilk++ Programs
Nonuniform Memory Access (NUMA) Machines
NUMA (Non Uniform Memory Access) Machines
NUMA Caches
NOW
Clusters
NUMA Caches Alessandro Bardine, Pierfrancesco Foglia, Cosimo Antonio Prete, Marco Solinas Università di Pisa, Pisa, Italy
Synonyms Nonuniform memory access (NUMA) machines
Definition
NUMA is the acronym for Non-Uniform Memory Access. A NUMA cache is a cache memory in which the access time is not uniform but depends on the position of the referenced block inside the cache. Among NUMA caches, it is possible to distinguish: () the NUCA (Non-Uniform Cache Access) architectures, in which the memory space is deeply sub-banked and the access latency depends on which sub-bank is accessed; and () the shared and distributed cache of a tiled Chip Multiprocessor (CMP), in which the latency depends on which cache slice has to be accessed.
Discussion
Introduction
Over the past decades, microprocessors’ overall performance has steadily improved thanks to the continuous reduction of transistor size achieved by silicon fabrication technology. This scaling has contributed both to () allowing designers to put more and more transistors on a chip, and thus to implement more and more complex microarchitectures on the same die, culminating in Chip Multiprocessor (CMP) systems, and () increasing processors’ clock frequencies. As the memory bandwidth requirements of cores have increased as well, the larger number of on-chip transistors has also been used to integrate deeper and larger memory hierarchies on the same die. However, shrinking feature sizes has also introduced the wire-delay problem: the sizes of on-chip wires have been reduced as well, resulting in larger signal propagation delays []. In particular, for a huge traditional on-chip cache, the wire delay and the increase in clock frequency lead to an increase in the latency experienced on each access: the bulk of the access time is spent routing signals to and from the cache banks rather than in the bank accesses themselves. Exploiting a non-uniform access time for on-chip caches is a viable solution to the wire-delay problem. By allowing a non-uniform access time also for cache memories, it is possible to design a cache composed of independently accessible banks, connected via a scalable connection fabric, typically a Network-on-Chip (NoC) [, , ]. The overall cache access latency is proportional to the physical distance between the requesting unit and the accessed bank. This organization is called NUCA (Fig. b and c): banks close to the requesting unit can be accessed faster than a traditional monolithic cache. If the cache management policies succeed in keeping the blocks with the highest performance impact in such banks, overall performance will be boosted. NUCA architectures have been proposed for both single-core [, , , ] and multi-core [, , –] environments. The need for scalability in CMP designs has led to tiled CMP systems that are composed of many identical nodes, called tiles. Each tile is composed of one cpu, its private cache levels, and the LLC. Nodes are typically connected through a NoC. In these systems,
the LLCs can be used as a globally shared cache, the LLC inside each node being a slice of the whole cache. Hence, the access time depends on the respective positions of the requesting node and of the memory slice involved in the access, and this results in a non-uniform access time. Such systems are implicitly NUMA caches as a consequence of their tiled architecture, differently from NUCA architectures, whose design explicitly focuses on the non-uniformity of the access time in order to tolerate wire-delay effects.
NUMA Caches in the Single-Core Environment: NUCA
In a traditionally organized cache, each bank is connected to the controller via a fixed-length path, usually organized according to the H-Tree model (Fig. a). Thus, in a wire delay–dominated context, the access latency of each bank is dominated by the constant length of the path followed by control and data signals, and it is independent of the physical position of the bank. This organization is referred to as UCA (Uniform Cache Access), as its access time is uniform. The basic idea of NUCA, first proposed by Kim et al. [], is to design the cache so that banks are connected among themselves and with the controller via a connection fabric, typically a NoC. Figure b and c show a system equipped with a Level NUCA, partitioned into independent banks, connected with a partial -D mesh NoC.
The adoption of a NoC allows each bank to be accessed independently of the others. In particular, in a NUCA there is a different access latency for each bank, depending on its physical distance from the controller. In a wire delay–dominated environment, the signal propagation delay dominates the overall access latency. Hence, banks that are closer to the controller (Fig. b) can be accessed faster than those that are farther away (Fig. c). As a consequence, the access time is non-uniform, as it is proportional to the physical distance to be traversed by signals. This architectural design is complemented by data management policies that keep critical blocks in the faster banks. This allows better performance with respect to UCA architectures by achieving a lower average cache access latency, thus hiding wire-delay effects.
Data Management Policies: S-NUCA and D-NUCA
When designing a NUCA, one of the main choices is the mapping rule that dictates which bank can hold a given memory block. The simplest mapping rule is static mapping: a block can exclusively reside in a single predetermined bank, based on its memory address. A NUCA cache adopting the static mapping is called a Static NUCA (S-NUCA). An alternative rule is dynamic mapping: a block can be hosted in a set of banks, called a bankset, and can be moved inside the bankset. A NUCA cache adopting the dynamic mapping is called a Dynamic NUCA (D-NUCA) [].
NUMA Caches. Fig. Examples of systems adopting a traditional sub-banked UCA cache (a) and a NUCA cache (b and c). In all the designs the memory space is partitioned into banks (white squares in the picture). The UCA adopts the H-Tree connection model, while the NUCAs use a network-on-chip made up of routers (black circles in the picture) and links. In this way, each bank is independently accessible from the others, and the access latency of a block that is in a bank near the controller (b) is lower than that of a block in a far bank (c), because the network path that must be traversed in order to reach the farther bank is longer
In an S-NUCA made up of N banks, the bank to be accessed is selected based on the value of log(N) bits of the address. Proposed policies for static mapping are the sequential mapping, which uses the most significant bits of the index field of the physical address, and the interleaved mapping, which uses the low-order bits of the same field in order to distribute memory-contiguous blocks among different banks, thus reducing bank contention. When searching for a block, the controller injects into the NoC a request that is delivered to the pertaining bank. If the block is present, i.e., the cache access results in a hit, the bank sends it to the controller. If the block is not present, i.e., there is a miss, the block is retrieved from main memory, stored in the cache bank, and finally delivered to satisfy the request. Despite its simple design and management policies, an S-NUCA usually exhibits lower average access latencies than traditional UCA caches. However, as the data mapping does not take the data usage frequency into account, it may happen that the most frequently used data are mapped to banks that are far from the controller, resulting in higher access latencies and poor performance.
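As a rough illustration of the two static policies, the C fragment below sketches how a controller might derive the target bank index from a physical address. The geometry constants (NUM_BANKS, OFFSET_BITS, INDEX_BITS) are hypothetical parameters of the sketch, not values taken from this entry.

```c
#include <stdint.h>

/* Hypothetical cache geometry (illustrative only). */
#define NUM_BANKS    16          /* must be a power of two          */
#define BANK_BITS    4           /* log(NUM_BANKS)                  */
#define OFFSET_BITS  6           /* 64-byte cache lines             */
#define INDEX_BITS   10          /* width of the set-index field    */

/* Interleaved mapping: use the low-order bits of the index field,
   so memory-contiguous blocks are spread across different banks. */
static unsigned bank_interleaved(uint64_t addr)
{
    return (addr >> OFFSET_BITS) & (NUM_BANKS - 1);
}

/* Sequential mapping: use the most significant bits of the index
   field, so contiguous blocks tend to fall into the same bank. */
static unsigned bank_sequential(uint64_t addr)
{
    return (addr >> (OFFSET_BITS + INDEX_BITS - BANK_BITS)) & (NUM_BANKS - 1);
}
```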
D-NUCA
In a D-NUCA, banks are grouped into banksets, and each block can be stored in any of the banks that belong to the pertaining bankset: for example, in the systems of Fig. b and c, a bankset could be made up of all banks belonging to the same row. The bankset is chosen based on the address of the block itself, using log(number of banksets) bits of the index field and adopting either the sequential or the interleaved mapping. Blocks can move from one bank to another of the same bankset thanks to the block migration mechanism. Migrating the most frequently used data from the farthest banks to the closest banks reduces the average access latency, and thus achieves better performance with respect to S-NUCA. The three main topics in designing a D-NUCA are:
● Bank Mapping: how are banksets mapped to physical banks?
● Data Search: how are the possible banks searched to find a block?
● Movement and replacement: how and when should the data migrate? In which bank of the bankset should a new block be placed?
Bank Mapping
The bank mapping policy of a D-NUCA dictates how to group the available physical banks into banksets. Many different policies can be adopted, each offering a different trade-off between implementation complexity and the fairness of the latency distribution among the various banksets. Proposed mapping policies are []: () simple mapping, () fair mapping, and () shared mapping. The simple mapping (Fig. a) maps a bankset to all the banks belonging to the same row. Such a policy is simple to implement, but the access latencies of the banksets mapped to the central rows of the cache differ considerably from those of the banksets mapped to the other rows, since reaching the latter requires traversing a longer network path along the vertical direction. In the fair mapping (Fig. b), this is avoided at the cost of additional complexity: the banks are allocated to banksets so that the average access times of all the banksets are equalized. In the shared mapping (Fig. c), the banksets share the banks closest to the controller so that a fast bank access is provided to all the sets.
NUMA Caches. Fig. Three examples of bank mapping: simple mapping (a), fair mapping (b) and shared mapping (c)
Data Search
When a search is performed, all the banks of the bankset that can contain the referenced block must be searched. The controller injects into the network a request that is delivered to all the pertaining banks, and the block is searched inside each bank. If all the banks miss the searched block, there is a cache miss, and the block must be retrieved from the next level in the memory hierarchy. Otherwise, one bank contains the searched block, there is a cache hit, and the block is sent to the controller. Examples of proposed search policies for D-NUCA caches are multicast search and incremental search []. In the multicast search, the request is delivered in parallel to all the possible banks, and each of them starts the search as soon as possible. Figure shows an example of a memory operation in a D-NUCA adopting the simple mapping scheme and the multicast search. In contrast, in the incremental search, the banks of a bankset are sequentially searched from the closest to the farthest; the request is delivered to each bank only if the previous bank has missed the block. In this way, the number of bank accesses and the associated energy consumption are reduced at the cost of an increase in total latency. Other solutions have been proposed [] that mix the two techniques: the banks of one bankset are divided into two or more groups; inside each group the multicast search is adopted, while the groups themselves are searched incrementally.
NUMA Caches. Fig. The multicast search operation in a D-NUCA adopting the simple mapping: the request first traverses the vertical links to the pertaining bankset, then traverses the horizontal links and routers, where it is replicated to reach all the banks of the bankset (a). The bank containing the block sends a reply to the controller (b) that follows the inverse path of the request message
Data Movement and Replacement
According to the locality principle, when a block is referenced, chances are that it will be referenced again in a short time. So, when a block is accessed, it makes sense to move it to another bank of the bankset that is closer to the controller. This movement is called data promotion []. An ideally desirable policy would be to use LRU ordering for the blocks in the banksets, with the closest bank holding the MRU block, the second closest holding the second most-recently used block, and so on. Maintaining this ordering would require a heavy block redistribution among banks on each access, with a consequent impact on network traffic and energy consumption. For this reason, more practical low-impact policies are usually employed, based on generational promotions [] of blocks according to their access pattern. For example, in the D-NUCA cache of Fig. c, on every hit the accessed block is moved in the left direction by one
bank. In this way, the next access to the same block will incur a lower latency. This migration policy is called per-hit promotion. More generally, designing a generational data promotion scheme for a D-NUCA requires the definition of three elements: the promotion trigger (the number of hits after which a promotion is triggered), the promotion distance (the number of banks by which the promoted block is advanced), and the initial placement of a newly loaded block after a cache miss. Another important design decision involves what to do with the block that may already occupy the line in which a promoted block must be inserted in the new bank. The basic choices are the demotion of that block to the bank previously hosting the promoted block, or its eviction from the cache. Demotion requires more bank accesses and network traffic than eviction; however, the latter hurts performance, as it evicts blocks without taking their actual usefulness into account. Different choices result in different power, network traffic, and performance trade-offs; however, the majority of the proposed D-NUCA designs [–, , ] adopt per-hit promotion (i.e., one hit as promotion trigger and one bank as promotion distance) and the insertion of new blocks in the bank farthest from the controller. Increasing the promotion trigger has been shown to hurt performance, as it limits the promotion of newly loaded data that, by the temporal locality principle, are likely to be requested by the next accesses. At the same time, increasing the promotion distance causes the rapid demotion of data that, because of the spatial locality principle, may be needed again in the near future. Using a different insertion point for newly loaded data means evicting from the cache blocks that are still useful, as they occupy banks that are close to the controller.
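The promotion parameters described above can be made concrete with a small simulation-style sketch. The fragment below models a single set of one bankset, with banks ordered from closest (index 0) to farthest; PROMOTION_TRIGGER and PROMOTION_DISTANCE are the hypothetical knobs of the example, set here to the per-hit values, and the whole sketch is an illustration of the policy rather than the mechanism of any particular design.

```c
/* A minimal model of generational promotion inside one bankset.
   bankset[0] is the bank closest to the controller. */
#define BANKS_PER_SET      8
#define PROMOTION_TRIGGER  1   /* hits needed before promoting (per-hit = 1)   */
#define PROMOTION_DISTANCE 1   /* banks moved toward the controller each time  */

typedef struct {
    unsigned long tag;
    int           valid;
    int           hit_count;
} Line;

/* One line per bank of the bankset, for a single cache set. */
static Line bankset[BANKS_PER_SET];

/* Called on a hit in bank 'b'; returns the bank now holding the block. */
static int promote_on_hit(int b)
{
    if (++bankset[b].hit_count < PROMOTION_TRIGGER)
        return b;
    bankset[b].hit_count = 0;

    int dst = b - PROMOTION_DISTANCE;
    if (dst < 0)
        dst = 0;                       /* already in the fastest bank */

    /* Demote the displaced block rather than evicting it: swap the lines. */
    Line tmp = bankset[dst];
    bankset[dst] = bankset[b];
    bankset[b]   = tmp;
    return dst;
}

/* New blocks enter the bankset in the bank farthest from the controller. */
static void insert_on_miss(unsigned long tag)
{
    bankset[BANKS_PER_SET - 1] = (Line){ tag, 1, 0 };
}
```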
NUMA Caches in the CMP Environment
The increase in the number of available on-chip transistors and the demand for ever higher performance have driven the design of CMP systems that integrate several processing cores and a multilevel cache memory hierarchy on a single die. The adoption of a large on-chip multilevel cache guarantees fast memory access to each core.
CMP caches are usually organized in two or more levels in which the Last-Level-Cache (LLC) can be shared among all on-chip cores, while all the other levels are private for each core. As the LLC is usually quite large, it is likely to be affected by the wire-delay problem. For this reason, a CMP system can adopt a shared NUMA cache as its LLC [, , , ].
System Topologies
Taking into account the respective positions of cache banks and cpus, CMP systems can be classified into two main topology families: dancehall and tiled. In the dancehall topology, the shared LLC is concentrated in one area of the chip, while the cpus sit at one or more sides of the cache. In a dancehall architecture, cpus and cache banks can be connected via a NoC (or another connection fabric), resulting in NUCA CMP designs [, , , , ] that are analogous to the single-core systems previously discussed. Figure a shows an example of a dancehall CMP in which all cpus are plugged into the same side of the NUCA. Variations of this simple configuration have half of the cpus plugged at one side of the shared NUCA and the others plugged at the opposite side (Fig. b), or all the cores distributed along all the sides of the bank matrix []. The tiled topology, as shown in Fig. , is composed of many identical nodes, called tiles. Each tile is composed of one cpu, its private cache levels, and an LLC that can be used as a private cache or as a local slice of a globally shared LLC [–]. In the latter case, the access time to the shared LLC depends on the positions of both the requesting node and the looked-up LLC slice. In particular, for a core, an access to a remote slice incurs a greater latency than an access to the local slice, because of network paths whose length depends on the physical distance from the local to the remote tile. When designing a CMP system, choosing between dancehall and tiled introduces a trade-off in terms of performance and scalability. The three main components of the NUMA access time are: () the number of NoC hops to be traversed to reach the hosting bank, () the latency of each hop, and () the access time of the bank that stores the block. Hence, if the bank to be accessed is small and close to the requesting node, the access latency will be limited. If the bank involved is large and far from the requesting node, the cost of the cache access will increase proportionally.
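The three components just listed can be summarized with a back-of-the-envelope latency model, sketched below in C; the cycle counts are illustrative placeholders for the qualitative trade-off (many short, cheap hops in a dancehall design versus fewer but more expensive hops and larger slices in a tiled design), not figures from this entry.

```c
/* Illustrative cycle counts (placeholders, not measured values). */
#define HOP_LATENCY_DANCEHALL   1   /* short links around small banks      */
#define HOP_LATENCY_TILED       3   /* long links that surround a tile     */
#define BANK_LATENCY_DANCEHALL  2   /* many small, fast banks              */
#define BANK_LATENCY_TILED      6   /* few large, slower slices            */

/* Access latency = network traversal cost + bank (slice) access time. */
static int nuca_access_latency(int hops, int hop_latency, int bank_latency)
{
    return hops * hop_latency + bank_latency;
}
```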
NUMA Caches. Fig. NUCA-based CMP systems adopting the dancehall topology, with cpus with private L caches and a shared L NUCA cache composed of × banks. All the cpus, together with their private caches, can be plugged at the same side of the NUCA (a), or at two opposite sides (b)
Adopting the dancehall topology results in a large number of small banks, each characterized by a low response latency. Moreover, the NoC links, which only need to surround small banks, are short and thus have a limited traversal delay. In the tiled organization, instead, there is a smaller number of larger and slower banks. Links are longer since they surround an entire tile, and consequently their traversal delay is higher; a request to any nonlocal slice therefore traverses a small number of higher-cost network hops. Adopting a tiled topology thus increases scalability at the cost of a higher cache access latency, due to the large cache slices and the high network-hop traversal cost. Adopting a dancehall topology, instead, minimizes the latency of each hop as well as the access time of each single bank, but offers a smaller degree of scalability than the tiled case.
Coherency in CMP Systems Adopting NUMA Caches
In a CMP system, the private cache levels must be kept coherent. Given that NUMA LLCs adopt a communication fabric that does not guarantee the ordering of coherence traffic (e.g., the NoC), a directory-based coherence protocol, similar to the ones proposed for
NUMA Caches. Fig. A NUCA-based CMP system adopting the tiled topology, with cpus with private L caches, and a shared NUCA cache composed of slices, one for each tile
classical DSM systems, must be adopted. The directory is a node of the system that tracks which of the private caches hold a copy of each block. In CMP systems with a NUMA LLC, the directory can be centralized or distributed. A centralized directory is a monolithic node of the system, separate from the remaining nodes. This solution presents some scalability issues because all the nodes must access it to guarantee the correctness of memory operations: when the number of such nodes increases, directory contention increases as well, resulting in higher response times and thus hurting performance. A distributed directory is typically implemented inside each NUMA cache bank. Directory information is stored next to the TAG field of each cached block, using some dedicated status bits. As a consequence, directory accesses are distributed among all banks, and when a miss in the last private cache level occurs, a single access retrieves both the block and the directory information.
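As a rough illustration of the distributed organization, the sketch below attaches an in-line directory field to each cached line, with one presence bit per private cache. The field widths, the number of tiles, and the coherence states are hypothetical choices of the example, not details of any protocol cited here.

```c
#include <stdint.h>

#define NUM_TILES 16   /* hypothetical number of private caches to track */

/* One line of a NUMA LLC bank with in-line directory state: the presence
   bits record which private caches hold a copy, so a single bank access
   returns both the data and the directory information. */
typedef struct {
    uint64_t tag;
    uint8_t  state;     /* e.g., Invalid / Shared / Modified (illustrative) */
    uint16_t sharers;   /* one presence bit per tile                        */
    uint8_t  data[64];  /* cache line payload                               */
} LLCLine;

static void add_sharer(LLCLine *line, int tile)
{
    line->sharers |= (uint16_t)1u << tile;
}

static int is_shared_by(const LLCLine *line, int tile)
{
    return (line->sharers >> tile) & 1u;
}
```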
Mapping Policies
Also for the NUMA cache of a CMP system, the two possible mapping policies are static and dynamic. Due to the presence of many traffic sources, in both cases the NoC traffic increases proportionally to the number of cores per chip. At the same time, contention for shared blocks introduces new design issues that are related to both the mapping policy and the topology.
Static Mapping
Static mapping has been adopted in both tiled [] and dancehall [, ] designs. The static mapping associates each memory address with a single cache bank, which satisfies both read/write and coherence requests coming from any of the private next-level caches. The path followed by request and coherence messages to reach the target bank depends on the respective positions of requestor and bank: if the bank is far from the requestor, the message has to traverse many network hops before reaching it; if the bank is close to the requestor, the network path is short. As the bank to be accessed is chosen based on the block address, the banks that a processor accesses depend on how the working set of the running process/thread is mapped in memory, i.e., which physical addresses are accessed. So, if the running process/thread’s working set is mapped to remote banks, the cache access latency will be high; otherwise it will be low. Consequently, both topology aspects and memory mapping rules contribute to degrading or enhancing overall performance.
Dynamic Mapping
Block migration can be used to improve performance also in CMP systems adopting a NUMA LLC. Dancehall [, , , , ] and tiled [–] designs have been proposed in which cache blocks are able to move among banks in order to reduce response latency. In the dancehall case, the design results in a D-NUCA similar to the single-core case. Also in this context, the shared D-NUCA is usually considered as partitioned into many banksets, and cache blocks are able to migrate among banks belonging to the same bankset [, , , ]. For example, in the systems shown in Fig. a and b, banksets may be constituted by the rows of the NUCA bank matrix. If the D-NUCA succeeds in bringing the most frequently accessed blocks near to the
referring cpu(s), overall performance will be boosted. However, the performance gain that can be obtained with migration comes with some design issues, as in a CMP there are many traffic sources that complicate the management of the block migration process. Blocks that are shared by two or more threads can be accessed by different nodes if the sharers are running on different cpus. This is not necessarily a disadvantage, but it depends on the overall system topology and on process scheduling. Considering the system of Fig. a, as all the cpus are placed at the same D-NUCA side, all the blocks migrate in the same direction because the faster way is the same for all cpus. In the system of Fig. b, instead, the faster way for half of the cpus is the slowest for the others, and vice versa. If the sharers are running on cpus plugged at opposite D-NUCA sides, shared blocks will alternately migrate in both directions, and none of the cpus will succeed in bringing such blocks into its respective faster way. This phenomenon is known as ping-pong, and if not properly managed, it reduces the performance gain deriving from block migration. Another design issue of D-NUCA CMP systems regards the initial placement of a new block after a cache miss (i.e., choosing which bank of each bankset will store incoming blocks). Considering the dancehall configuration of Fig. a, choosing the fastest way as the block entry point would interfere with the most referenced blocks, and considerations similar to the single-core case apply. For the system shown in Fig. b, instead, the farthest way for half of the cpus is the closest for the other half, so blocks loaded by half of the cpus would interfere with the most frequently used blocks of the other half. Consequently, a possible trade-off is choosing the initial placement in the central banks. Alternative schemes, for both the dancehall and the tiled case, have been proposed [, ], in which blocks are initially placed near the first requestor and stay there until they are requested by a different cpu. In this case, the blocks migrate toward a bank (or a slice) that is determined from their memory address according to a static mapping policy. On subsequent requests, the block is not moved further. Dually, migration schemes have been proposed in which blocks are initially placed based on their memory address and
subsequently migrated based on the position of the referring cpu []. Dynamic block movement designs must face the false miss problem. This is a particular race condition that can occur when a subsequent access to a migrating block arrives while the block is still in flight. A block migration involves two banks: the source and the destination. If the new request is received by the source after the block has been sent, and by the destination before the migrating block has arrived, both accesses will result in a miss, even though the block is actually present in the cache and must not be retrieved from the next-level memory. This causes two problems: () a performance loss, because off-chip accesses are always more expensive than cache accesses; () a correctness problem for memory operations, as there would be two (or more) unmanaged copies of the same block. Techniques for resolving this problem have been proposed: for example, a directory holding all the cached block addresses can be accessed in order to recognize false misses [], or a specific false-miss-avoidance algorithm can be designed [].
Research Directions
Promising lines of study for NUMA caches are:
. Power-consumption saving techniques. While able to reduce wire-delay effects, NUCA caches still suffer from another typical problem of nanometer technologies: the static power consumption due to the increase of leakage currents. Besides, due to the additional NoC traffic and bank accesses, D-NUCA caches also exhibit an increase in dynamic power consumption []. Even if NUCA caches are made up of traditional cache banks, the direct adoption of well-known techniques such as drowsy memories and decay cache lines may not be effective, particularly in D-NUCA, because of the data movement that modifies bank access patterns. An alternative proposed technique [] exploits the different usage of banks due to block migration to power-gate the farther and less used banks, and has been introduced for both the single-core case and the CMP environment in which all the cpus are connected to the same cache side. However, when cpus are distributed along several NUCA sides, due to the problems that arise in D-NUCA-based CMP systems, the concept of farthest banks is ambiguous and has to be redefined; thus the direct application of such a technique is no longer possible.
. Performance improvement via remapping policies. Memory remapping at compile time is traditionally used to improve cache memory performance, in particular for reducing conflict misses. In the context of NUCA caches, similar techniques can also be used to place data blocks in banks (or banksets) that reduce their average access latency []. Typical metrics used by mapping policies are the data usage frequency and whether data are shared or private. Remapping can also be implemented in hardware, for example, by designing specific protocols that dynamically change the hosting bank (in the case of S-NUCA, similarly to []) or bankset (in the case of D-NUCA).
. Efficient block migration schemes are an open issue for CMP caches. Generational promotion schemes, as in the case of single-core D-NUCA, are effective in dancehall systems in which cores are all plugged at the same cache side. In all the other cases, such simple schemes are no longer effective due to the ping-pong problem that can arise on shared data. Alternative solutions are likely to take into account the position of the cores that share a specific block in order to migrate it to a position that optimizes overall performance (similarly to []). Moreover, block replication could be evaluated as a viable solution to the ping-pong problem, so that each copy can migrate toward a specific requestor []. Such a scheme must take into account the needs of each application: while the convenience of replicating Shared-Read-Only blocks is obvious, there may be classes of applications in which replicating Shared-Read-Write blocks as well can improve performance.
Related Entries
Cache Coherence
Interconnection Networks
Memory Models
Memory Wall
Routing (Including Deadlock Avoidance)
Shared-Memory Multiprocessors
Bibliographic Notes and Further Reading
The first NUCA cache architecture was proposed in [], where it is shown that a dynamic NUCA structure achieves an IPC . times higher than a traditional Uniform Cache Architecture (UCA) of the same size and manufacturing technology. Leveraging alternative data mapping policies, NuRapid [] and the Triangular D-NUCA cache [, ] have been proposed as alternative NUCA designs that optimize the trade-off between area occupation and performance. An energy model for NUCA architectures is proposed in [], together with an energy/performance trade-off evaluation of NUCAs and a comparison with UCA. The Way Adaptable D-NUCA architecture, proposed in [], shows that it is possible to dynamically adapt the cache size to the application needs, thus considerably reducing NUCA static power consumption while only marginally affecting performance. The impact of NoC element parameters on NUCA performance has been analyzed in []. The same work proposes an alternative design for NUCA caches that relaxes the constraints imposed on network router latency by clustering multiple banks around each router. In the context of on-chip multiprocessors, the behavior of D-NUCAs as shared L caches has been analyzed in [, , ]; these works evaluate the performance of different data mapping and migration policies. Topology studies for S-NUCA and D-NUCA, related to the dancehall configurations, can be found in [, ]. These works also propose a directory coherence protocol specifically suited for the CMP NUCA environment and a solution to some design issues of block migration policies in D-NUCA-based CMP systems. The analysis in [] is based on the tiled architecture and employs an optimized block placement and migration scheme that takes into account both block usage characteristics (Read-Only or Read-Write) and thread migration. A scheme called SP-NUCA, in which private blocks are dynamically recognized and then stored in S-NUCA banks that are very close to the owner, is proposed in []. The system proposed in [] is based on a tiled architecture and implements a migration mechanism that places blocks in slices depending on the sharers’ position.
Bibliography
. Matzke D () Will physical scalability sabotage performance gains? IEEE Comput ():–
. Kim C, Burger D, Keckler SW () Non uniform cache architectures for wire-delay dominated on-chip caches. IEEE Micro ():–
. Chisti Z, Powell MD, Vijaykumar TN () Distance associativity for high-performance energy-efficient non-uniform cache architectures. Proc. th int. symp. on microarchitecture. San Diego, CA, pp –
. Huh J, Kim C, Shafi H, Zhang L, Bourger D, Keckler SW () A NUCA substrate for flexible CMP cache sharing. Proc. of the th int. conf. on supercomputing. Cambridge, MA, pp –
. Beckmann BM, Wood DA () Managing wire delay in large chip-multiprocessors caches. Proc. of th int. symp. on microarchitecture. San Diego, CA, pp –
. Foglia P, Mangano D, Prete CA () A cache design for high performance embedded systems. J Embedded Comput ():–
. Foglia P, Mangano D, Prete CA () A NUCA model for embedded systems cache design. IEEE workshop on embedded systems for real-time multimedia (ESTIMEDIA). New York Metropolitan Area, USA, pp –
. Chisti Z, Powell MD, Vijaykumar TN () Optimizing replication, communication, and capacity allocation in CMPs. Proc. of the nd int. symp. on computer architecture. Madison
. Foglia P, Panicucci F, Prete CA, Solinas M () An evaluation of behaviors of S-NUCA CMPs running scientific workload. Proc. of the th euromicro conference on digital system design (DSD). Patras, Greece, pp –
. Foglia P, Panicucci F, Prete CA, Solinas M () Analysis of performance dependencies in NUCA-based CMP systems. Proc. of the st international symposium on computer architecture and high performance computing (SBAC-PAD), Sao Paulo
. Hardavellas N, Ferdman M, Falsafi B, Ailamaki A () Reactive NUCA: near-optimal block placement and replication in distributed caches. Proc. of the th international symposium on computer architecture (ISCA-). Austin, pp –
. Merino J, Puente V, Prieto P, Gregorio JÁ () SP-NUCA: a cost effective dynamic non-uniform cache architecture. ACM SIGARCH Computer Architecture News ():–, New York
. Hammoud M, Cho S, Melhem R () ACM: an efficient approach for managing shared caches in chip multiprocessors. Proc. th int. conf. on high performance embedded architectures and compilers (HiPEAC-). Paphos, Cyprus, pp –
. Bardine A, Foglia P, Gabrielli G, Prete CA () Analysis of static and dynamic energy consumption in NUCA caches: initial results. Proc. of the MEDEA workshop. Brasov, Romania, pp –
. Bardine A, Comparetti M, Foglia P, Gabrielli G, Prete CA () Way-adaptable D-Nuca caches. International Journal of High Performance Systems Architecture (/):–
. Bartolini S, Foglia P, Prete CA, Solinas M () Feedback driven restructuring of multi-threaded applications for NUCA cache performance in CMPs. nd international symposium on computer architecture and high performance computing. Petropolis, Brazil
. Bardine A, Comparetti M, Foglia P, Gabrielli G, Prete CA () Impact of on-chip network parameters on NUCA cache performance. IET Computers & Digital Techniques ():–
. Foglia P, Monni G, Prete CA, Solinas M () Re-Nuca: boosting CMP performances through block replication. In: th EUROMICRO conference on digital system design, architectures, methods and tools (DSD), Lille, – Sep. IEEE CS, Los Alamitos, pp –
Numerical Algorithms
Algebraic Multigrid
Asynchronous Iterative Algorithms
Dense Linear System Solvers
Eigenvalue and Singular-Value Problems
Linear Algebra, Numerical
Linear Least Squares and Orthogonal Factorization
Multifrontal Method
Preconditioners for Sparse Iterative Methods
Rapid Elliptic Solvers
Reordering
SPAI (SParse Approximate Inverse)
Numerical Libraries
ATLAS (Automatically Tuned Linear Algebra Software)
BLAS (Basic Linear Algebra Subprograms)
ILUPACK
LAPACK
libflame
MUMPS
PARDISO
PETSc (Portable, Extensible Toolkit for Scientific Computation)
PLAPACK
SPIKE
ScaLAPACK
SuperLU
Numerical Linear Algebra
Linear Algebra, Numerical
NVIDIA GPU Michael Garland NVIDIA Corporation, Santa Clara, CA, USA
Synonyms Graphics processing unit
Definition NVIDIA’s GPUs (Graphics Processing Units) are a family of microprocessors that provide a unified architecture for both visual and parallel computing. They generate and process complex -D and -D imagery, typically via application programming interfaces such as Microsoft’s DirectX or OpenGL. NVIDIA’s CUDA architecture provides a platform for executing massively parallel computations, written in standard languages such as C or Fortran, on these processors.
Discussion
Graphics Processing Units, or GPUs, are massively parallel microprocessors. They are present in every workstation, game console, desktop computer, and laptop in use today. They are also increasingly common in mobile devices such as smartphones. The GPU is responsible for generating the -D and -D graphics required by modern applications, and it provides a platform for real-time image and video processing. These visual computing capabilities are typically accessed via standard application programming interfaces (APIs) such as OpenGL or Microsoft’s DirectX. Modern NVIDIA GPUs also support the CUDA parallel programming architecture, and can be programmed directly in standard languages such as C and Fortran. This makes them a platform for parallel computing in general.
Accelerating Computer Graphics Historically, the GPU has evolved to meet the needs of applications that generate and process complex
computer graphics. To understand the design of a GPU processor, it is instructive to understand the graphical tasks that it has been designed to accelerate. The real-time rendering of -D computer graphics is a computationally challenging task. Essentially all APIs designed for real-time rendering adopt a pipeline execution model. As an example, Fig. shows a high-level block diagram of the DirectD pipeline []. Geometric primitives – typically points, lines, and polygons – are generated by the application and enter the graphics pipeline at the left. Primitives and their vertices are independently processed by the geometry and vertex shader stages, respectively. The rasterization stage converts each primitive into a set of pixel fragments, one for each pixel on the screen touched by the primitive. Each of these pixel fragments is processed in the pixel shader stage, and the results are composited into the framebuffer at the end of the pipeline. The gray-colored stages of the pipeline are fixed-function components; their operation is restricted to a set of functionality defined in the relevant API standards. The others allow the application to provide custom procedures – typically referred to as “shaders” – that process individual primitives, vertices, and pixels. Shader programs are typically written in C-like domain-specific languages such as Cg [], OpenGL’s GL Shading Language (GLSL), or the DirectX High-Level Shading Language (HLSL). The design and evolution of GPU processors have been influenced in a number of ways by the needs of interactive graphics. Foremost among these is the fact that graphics represents a workload with tremendous inherent parallelism. Even the simplest of -D scenes typically contain thousands of polygon primitives, and scenes containing a million or more primitives are not uncommon. At high-definition resolutions, a display device contains million pixels. Rendering the scene involves processing each primitive, each vertex of every primitive, and each pixel touched by every primitive. The graphics pipeline model mandates that every primitive/vertex/pixel can be processed independently of every other primitive/vertex/pixel. Thus, at any one instant in time, a GPU may be given a workload with many millions of independent tasks, all of which could be processed in parallel given the necessary computational resources.
Input assembly → Vertex shader → Geometry shader → Rasterize → Pixel shader → Raster Op. & output
NVIDIA GPU. Fig. Block diagram of the graphics pipeline execution model of DirectD
Interactive computer graphics is also a fundamentally throughput-oriented problem. Applications typically render their scenes at a fixed frame rate, such as frames per second. The amount of work that can be completed in / of a second is thus of primary importance, since this will largely determine the richness of the resulting visual experience. This has led GPU processors toward aggressive throughput-oriented designs; they are generally willing to sacrifice performance of a single task in order to increase total throughput over all tasks.
Parallel Computing Architecture Current GPUs, as illustrated in Fig. , are designed around a large array of parallel processors and various fixed-function units for tasks such as work distribution, rasterization, and image processing. The pipeline execution model of the graphics APIs is mapped onto the machine by circulating work through these processors. Shader programs provided by the application for the various programmable stages of the pipeline execute on the processor array and fixed function logic handles the scheduling of these tasks. This basic architecture has been used in all NVIDIA GPUs beginning with the G processor first released in the GeForce GTX in late . This first-generation processor was later followed by the GT processor, which first introduced double precision floating point support. NVIDIA recently released its third major generation of unified processors, a GPU architecture code-named “Fermi.” In addition to executing all shader programs for graphics applications, this parallel processor array can also be programmed directly. NVIDIA’s CUDA architecture provides a programming model and application interface for executing parallel programs – referred to as “kernels” – on the GPU. This programming model can be adapted to most typical sequential programming languages, and compilers for CUDA C, C++, and Fortran are currently available. The programmer organizes a CUDA program into a sequential host program, intended to run on the CPU, and a collection of parallel kernels, that are
intended to run on the GPU. A single kernel is a blocked SPMD (Single Program Multiple Data) parallel computation. It consists of a collection of concurrent threads, all executing the same program. These threads are further grouped into thread blocks. When launching a kernel, the application specifies both the number of thread blocks and the number of threads per block to instantiate. Figure shows a simple example of a CUDA C program fragment for computing y ← αx + y for a scalar α and two n-vectors x and y. This is the so-called SAXPY kernel of the BLAS linear algebra library. The parallel_saxpy routine is part of the host program, as indicated by the __host__ attribute attached to its declaration. Given pointers to the x and y vectors allocated in the GPU memory, it invokes the parallel kernel. This code is designed to process each corresponding pair of elements with a single thread. Therefore, the host program selects a thread block size of threads – an arbitrary choice in this case since there is no inter-thread communication – and determines the number of -thread blocks needed to create at least thread per element. The function call-like saxpy_kernel<<>>() actually launches the kernel with B blocks of T threads each. The procedure saxpy_kernel is a kernel entry point, indicated by the __global__ attribute. Although each thread may subsequently follow its own execution path, all threads of the kernel begin their execution at the beginning of this procedure. Thus, the kernel procedure itself is a standard sequential program describing the action of a single thread. Individual threads within a kernel are differentiated by unique integer indices. The thread blocks of a kernel are numbered on a -D or -D grid. CUDA C exposes these indices through special variables visible within kernel functions: blockIdx.x, blockIdx.y give the index of the current block and gridDim.x, gridDim.y provide the number of blocks in each dimension. Similarly, each thread is given a unique index within its block. These indices may be one-, two-, or three-dimensional. A thread’s index is available via
N
NVIDIA GPU. Fig. High-level structure of a modern discrete GPU __global__ void saxpy_kernel(int n, float a, const float *x, float *y) { // Each thread processes 1 element, determined from the thread’s index. int i = blockIdx.x*blockDim.x + threadIdx.x; if( i
y[i] = a*x[i] + y[i];
} __host__ void parallel_saxpy(int n, float a, const float *x, float *y) { // Create kernel with 256-thread blocks. int blocksize = 256; // We will need n]256 blocks to process n elements. int nblocks = ceil(n/blocksize); saxpy_kernel<<
NVIDIA GPU. Fig. CUDA C implementation of y ← αx + y given two n-element vectors x and y
the special variables threadIdx.x, threadIdx.y, and threadIdx.z, while the dimensions of a block are blockDim.x, blockDim.y, and blockDim.z. The example shown in Fig. uses only -D indexing; all other dimensions are ignored. It begins by computing a linear index i from the block and thread indices of the current thread. This assigns unique elements xi and yi to the thread. The conditional test that i < n is required since there will be more threads than elements when n is not a multiple of the block size, which is in this case. Those threads that are within the bounds of the array conclude their computation by performing the element-wise update yi ← αxi + yi . Figure illustrates the basic mechanics of a CUDA program, but it also highlights an important qualitative difference between CPU and GPU programming. Typical CPUs tend to treat threads as heavyweight
objects of which there are relatively few. Concurrency runtimes like OpenMP and Intel’s Threading Building Blocks typically create a persistent pool of worker threads proportional in size to the number of cores available. On current machines, this would typically imply a thread count of up to roughly , and these threads most likely persist over the application’s entire lifetime. Their schedulers are then responsible for mapping tasks created by the application onto the available threads. In contrast, GPUs treat threads as extremely lightweight objects that may also have extremely brief lifetimes. Given two vectors of ten million elements, this single SAXPY kernel will launch, execute, and retire ten million threads. This significantly different viewpoint is largely a result of the GPU’s graphics background. Graphics applications require GPUs to execute many millions of shader invocations per frame. Unlike CPUs,
which tend to assume that parallelism in their given workload is fairly scarce, GPUs assume that they are provided with workloads containing abundant parallelism.
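For completeness, a host program must first place x and y in GPU memory before invoking the parallel_saxpy routine of the SAXPY example shown earlier; a minimal sketch using the standard CUDA runtime calls is given below (error checking omitted, and the wrapper name saxpy_on_gpu is a choice of this example).

```c
#include <cuda_runtime.h>

/* Assumes the saxpy_kernel / parallel_saxpy definitions from the figure
   above are present in the same compilation unit. */
void saxpy_on_gpu(int n, float a, const float *x_host, float *y_host)
{
    float *x_dev, *y_dev;

    // Allocate device memory and copy the input vectors to the GPU.
    cudaMalloc((void**)&x_dev, n * sizeof(float));
    cudaMalloc((void**)&y_dev, n * sizeof(float));
    cudaMemcpy(x_dev, x_host, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y_dev, y_host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch the kernel through the host-side wrapper.
    parallel_saxpy(n, a, x_dev, y_dev);

    // Copy the updated y back and release the device buffers.
    cudaMemcpy(y_host, y_dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(x_dev);
    cudaFree(y_dev);
}
```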
The Streaming Multiprocessor
The parallel processor array at the heart of the GPU is composed of a number of multithreaded Streaming Multiprocessors (SMs). When a kernel is launched on the GPU, the hardware work distributor enumerates its thread blocks and launches them on SMs as resources are available. If no SM can accommodate any additional blocks, the work distributor waits for some running block to retire, at which point it may launch a pending block on that SM. The SM is thus “streaming” in the sense that thread blocks “stream” through the multiprocessor: they launch, run, and then retire without being suspended and evicted to memory. Each multiprocessor supports on the order of , coresident threads, with the precise number varying between processor generations. Table provides specific data for the three most recent generations of NVIDIA GPUs. The SM manages the creation, execution, and scheduling of its threads directly in hardware.
Memory Organization All threads running on the GPU have access to a global random-access memory. For discrete GPUs – those housed on PCIe cards – global memory is serviced by a high-bandwidth memory system that can deliver roughly GB/s of bandwidth on the highest end cards. For GPUs in other form factors, such as GPUs integrated with the motherboard chipset, this memory may be the same system memory used by the CPU.
NVIDIA GPU. Table Capacity of each SM for three GPU generations

                         Gx/Gx    GTxx    "Fermi"
Scalar cores
Shared memory (KB)                         /
L cache (KB)            –         –        /
L cache (KB per chip)   –         –
Registers (-bit)
Coresident threads
An SM is provisioned with a large register file so that every resident thread can be given its own dedicated register set. For example, the register file of a Fermi SM has , -bit registers, for a total of KB of register space. At its maximum occupancy level of , resident threads, the Fermi SM can allocate registers to every thread. These registers provide primary storage for the local variables of the thread, with excess variables spilling to external memory. In addition to the register file, which provides private per-thread storage, the SM is also equipped with a small on-chip shared memory. This memory is high speed and low latency. It is allocated on a per-block basis, thus providing a space for sharing data among all the threads of a block and for intra-block communication. The Fermi architecture augments this on-chip RAM with an L/L cache hierarchy. Each SM maintains its own private L cache. The L shares a fixed KB memory space with the shared memory RAM. This memory may be divided into either KB L and KB shared memory, or KB L and KB shared memory. The L cache is a global cache across all SMs and is KB in size.
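On Fermi-class devices, the split between L1 and shared memory can be requested per kernel through the CUDA runtime. The fragment below is a brief sketch; the kernel name my_kernel is a placeholder of the example, not a name from this entry.

```c
#include <cuda_runtime.h>

__global__ void my_kernel(void) { /* placeholder kernel body */ }

void configure_cache(void)
{
    // Request the configuration that favors shared memory for this kernel;
    // cudaFuncCachePreferL1 requests the opposite split.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
}
```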
Thread Synchronization Parallel threads running on the GPU may synchronize with each other at one of three levels. At the highest level, there is an implicit barrier between successive kernels. By default, no thread of a given kernel may be launched until all threads of previously launched kernels have completed. For truly independent kernels, CUDA provides an additional mechanism to launch kernels in separate streams, thus specifying that they may be run concurrently. The threads of a single thread block may synchronize with each other by calling a barrier function. This is a lightweight barrier that translates to a single SM instruction. It thus incurs essentially no overhead, except for the time required to physically wait for all threads to reach the barrier. Individual threads may also perform atomic operations on memory addresses. These operations include, but are not limited to, atomic compare and swap, atomic increment/decrement, and atomic reduction for commutative operations such as addition, min/max,
and binary and/or/xor. Atomic operations may be performed in both global and on-chip shared memory.
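The kernel sketch below illustrates the two intra-kernel mechanisms just described: a block-level barrier coordinating a shared-memory reduction, followed by a single atomic addition into global memory. It assumes a launch with 256-thread blocks (a power of two), which is an assumption of the example rather than a requirement of the hardware.

```c
__global__ void block_sum(const float *in, float *total, int n)
{
    __shared__ float partial[256];          // one slot per thread of the block

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                        // block-level barrier

    // Tree reduction in shared memory; each step is separated by a barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One atomic update per block combines the partial sums in global memory.
    if (tid == 0)
        atomicAdd(total, partial[0]);
}
```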
SIMT Execution To manage its large population of threads efficiently, the SM employs a SIMT – Single Instruction, Multiple Thread – architecture. The threads of a given block are grouped into -thread units called “warps.” A warp of threads can execute a single instruction at a time across all its threads. Since each thread may follow its own independent path of execution, threads within a warp may potentially follow different code paths. The hardware automatically tracks and handles such execution divergence. When a warp of threads executes a divergent conditional, for instance, it will execute first one branch and then the other of the conditional. Threads of the warp will be activated only in the branch of the conditional which they chose to follow. The SIMT architecture used by the GPU is an instance of the broader class of SIMD (Single Instruction, Multiple Data) architectures. However, it is qualitatively different from the vector SIMD extensions prevalent on other platforms. The action of each thread is described by a normal sequence of scalar instructions. When executing the same code path, these threads will automatically execute together and realize the benefits of SIMD execution. When they follow different paths, the execution of the program remains correct, although there is a performance loss proportional to the number of diverging paths. In contrast, a vector SIMD model embeds vector instructions directly in the instruction stream. They must be inserted explicitly, either directly by the programmer or by compiler autovectorization of loops. Warps are the fundamental unit of scheduling in the SM. On each clock cycle, an SM warp scheduler is free to select from its available pool any runnable warp to issue an instruction. Runnable warps are simply those not waiting for some pending operation to complete. Thus, the scheduler interleaves the execution of separate warps at a very fine level. When there are multiple runnable warps, this has the effect of switching between warps at each instruction.
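A short kernel fragment makes the divergence behavior concrete. In the hypothetical example below, odd- and even-numbered threads of each warp take different branches, so the hardware serializes the two paths and each thread is active only on the path it chose.

```c
__global__ void divergent_example(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads of the same 32-thread warp follow different paths here,
    // so the warp executes the two branches one after the other.
    if (threadIdx.x % 2 == 0)
        data[i] = data[i] * 2.0f;   // even lanes active
    else
        data[i] = data[i] + 1.0f;   // odd lanes active
}
```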
Processing Elements The SM contains a collection of processing elements that execute the instructions issued by running threads.
NVIDIA GPU. Fig. Basic structure of a single streaming multiprocessor in a Fermi GPU: two warp schedulers, a 128 KB register file, scalar cores, load/store units, SFUs, and 64 KB of L1 cache and on-chip shared memory
The precise configuration of these elements varies between GPU generations. Figure sketches the configuration for a Fermi SM. Each SM is built around 32 scalar cores. Each of these cores provides fully pipelined integer and floating point arithmetic units. The floating point units support IEEE 754-2008 single (32-bit) and double (64-bit) precision arithmetic based on the FMA (Fused Multiply Add) operation. The scalar cores are grouped in two sets of 16, and each group is associated with a separate warp scheduler. Thus, two separate warps may be executing simultaneously. A 32-thread warp is executed on one set of cores by running two half-warps of 16 threads each, back to back on the scalar cores. The scalar cores are augmented with a set of 16 Load/Store units that handle address calculations and dispatch for memory transfers, and four Special Function Units (SFUs) that execute transcendental instructions such as sine, cosine, reciprocal, and square root.
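A small sketch (invented kernel; fmaf() and __sinf() are standard CUDA device functions) shows how these units are exercised: the fused multiply-add maps onto the scalar cores' FMA hardware, while the fast transcendental intrinsic is serviced by the SFUs:

__global__ void fmaSin(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = fmaf(a[i], b[i], 1.0f);   // one fused multiply-add: a*b + 1 with a single rounding
        out[i] = __sinf(t);                 // fast sine intrinsic, handled by a Special Function Unit
    }
}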
Throughput-Oriented Design As observed above, one of the key characteristics of the workloads generated by interactive graphics applications is that they are fundamentally throughput oriented. The goal is to finish as many tasks as possible within the fixed time period of a single frame. Since the number of tasks that must be completed easily reaches into the tens of millions, the amount of time required to complete any particular task is essentially insignificant. For such workloads, the amount of available parallelism is obviously abundant. This abundance of parallelism, coupled with the throughput-oriented nature of the
problems it is typically targeted for, combines to drive a number of GPU architectural design decisions. One major consequence of these assumptions is that the GPU relies extensively on multithreading to hide the latency of long-running operations. This is true even of relatively short latencies, such as the time required for instructions to make their way through the instruction pipeline, but it is particularly true for accesses to global memory. On all modern processors, the time that elapses between issuing a memory request to external DRAM and its completion is typically hundreds of clock cycles. No high-performance processor can afford to remain idle for such a length of time. Traditional processors, like CPU cores, which emphasize running a single sequential task as quickly as possible attempt to avoid long-latency memory operations through large, sophisticated on-chip caches. In contrast, GPUs largely rely on multithreading to hide this latency; it is perfectly acceptable for some threads to be blocked waiting for memory transactions to complete as long as many others are ready to run. NVIDIA GPUs prior to the Fermi architecture did not have any caches for load/store operations on external memory. Even the Fermi architecture, which employs a full cache hierarchy, still relies heavily on multithreading to hide latency. Another consequence of the throughput-oriented design of GPUs is that they use relatively simple processor control schemes. Modern superscalar CPUs employ a variety of sophisticated techniques to accelerate sequential programs. These include out-of-order execution, branch prediction, and speculative execution. GPUs do not employ any of these techniques. The scalar cores of the SM are simple in-order pipelined processors. They perform neither branch prediction nor speculative execution. While this means that a single thread may take longer to run, simpler control logic translates into less chip area devoted to control. Hence, there is more area available for functional units which leads to higher total throughput for a fixed chip area. The use of SIMD execution in the SM similarly increases total peak throughput by amortizing the cost of control over many functional units. This comes at the cost of lower performance on completely divergent tasks. However, in practice, a massively parallel computation that provides a pool of tens of thousands of concurrent tasks is likely to show a high degree of uniformity. Rather than , disparate tasks, it is generally
much more likely that there will be a fairly small number of task types, each of which provides ample uniformity for efficient SIMD execution.
Related Entries
Barriers
BSP (Bulk Synchronous Parallelism)
Shared-Memory Multiprocessors
SPMD Computational Model
Bibliographic Notes and Further Reading The physical design of early hardware systems for -D graphics acceleration, exemplified by the Silicon Graphics Reality Engine [], tended to mimic the structure of the graphics pipeline execution model. The hardware itself was a feed-forward pipeline of fixed-function hardware units. Over time, graphics hardware evolved, introducing programmable units for vertex, pixel, and geometry processing. Lindholm et al. [] provide an overview of the original G chip architecture, and an NVIDIA whitepaper outlines the basic characteristics of the upcoming Fermi architecture []. Nickolls et al. [] explain the basics of the CUDA programming model. More complete information can be found in the official CUDA Programming Guide [] and a recently published book on parallel programming by Kirk and Hwu []. Owens et al. [] survey the field of GPU computing and its historical development. Garland et al. [] report on experiences with CUDA in a collection of real-world applications. An extensive collection of documentation, training materials, examples applications, and the CUDA development environment itself is all available online at http://www.nvidia.com/CUDA.
Bibliography . Akeley K () Reality engine graphics. In: Proceedings of SIGGRAPH ’, ACM, Spencer, pp – . Blythe D () The direct D system. ACM Trans Graph ():– . Garland M, Grand SL, Nickolls J, Anderson J, Hardwick J, Morton S, Phillips E, Zhang Y, Volkov V () Parallel computing experiences with CUDA. IEEE Micro ():– . Kirk DB, Hwu WMW () Programming massively parallel processors: a hands-on approach. Morgan Kaufmann, Burlington
. Lindholm E, Nickolls J, Oberman S, Montrym J () NVIDIA Tesla: a unified graphics and computing architecture. IEEE Micro ():– . Mark WR, Glanville RS, Akeley K, Kilgard MJ () Cg: a system for programming graphics hardware in a C-like language. In: Proceedings of SIGGRAPH , ACM, San Diego, pp – . Nickolls J, Buck I, Garland M, Skadron K () Scalable parallel programming with CUDA. Queue ():– . NVIDIA’s next generation CUDA compute architecture: Fermi () http://www.nvidia.com/fermi . NVIDIA Corporation () NVIDIA CUDA Programming Guide, July . Version . . Owens JD, Houston M, Luebke D, Green S, Stone JE, Phillips JC () GPU computing. In: Proceedings of the IEEE , No. , Santa Clara

NWChem
N. Govind, Eric J. Bylaska, Wibe A. de Jong, K. Kowalski, Tjerk P. Straatsma, Marat Valiev, H. J. J. van Dam
Pacific Northwest National Laboratory, Richland, WA, USA

Synonyms
Computational chemistry; Quantum chemistry

Definition
NWChem (Northwest Computational Chemistry Package) is DOE’s premier massively parallel computational chemistry package. It is an open-source high-performance platform with extensive capabilities for large-scale quantum and classical atomistic simulations that can be applied to a broad range of application areas.

Discussion
Computational atomistic modeling has emerged as a powerful approach for fundamental studies in chemistry, materials science, geosciences, and biology. It has the ability to provide detailed insights into complex phenomena and, over the years, has grown into a powerful tool not only to validate experiments, but also to predict new phenomena. In order to harness computational modeling to the fullest, superior computational tools are required that not only offer a broad spectrum of capabilities, but that are also capable of taking advantage of modern massively parallel computer architectures efficiently to solve real-world problems.
The NWChem program suite is an example along these lines. It offers a range of capabilities to perform first-principles electronic structure calculations on molecular and periodic systems, classical molecular dynamics simulations, and hybrid QM/MM simulations that permit one to combine quantum and classical methods within a calculation. The code is designed to run on high-performance parallel supercomputers as well as conventional workstation clusters with the goal to provide scalable solutions for large-scale atomistic simulations. It has been ported to almost all high-performance computing platforms, workstations, PCs running LINUX, as well as clusters of desktop platforms or workgroup servers. The package is scalable, both in its ability to treat large problems efficiently and in its utilization of available parallel computing resources. The parallel framework for NWChem is provided by the Global Array (GA) toolkit developed at PNNL. NWChem is developed and maintained by the Environmental Molecular Sciences Laboratory (EMSL) located at the Pacific Northwest National Laboratory (PNNL) in Washington State. The latest release (version .) of the code is distributed under the terms of the Educational Community License (ECL) version ..

Overview
NWChem [] provides a number of quantum mechanical or electronic structure approaches for computing various properties of molecules, finite clusters, and periodic systems []. These include self-consistent field (SCF) or Hartree–Fock (HF) theory, density functional theory (DFT) including support for a broad range of state-of-the-art exchange-correlation (xc) functionals, plane-wave-based DFT methods for Car–Parrinello molecular dynamics (CPMD) and band structure calculations of periodic systems, higher-order many body methods like second-order Møller–Plesset (MP) perturbation theories and a full range of coupled cluster (CC) theories, relativistic approximations (DKH, ZORA), excited state approaches like time-dependent density functional theory (TDDFT), and equation of motion coupled cluster (EOMCC) theories. Most of the above approaches can be used to perform geometry optimizations (structure minimization and transition state location), frequency calculations, and/or molecular response calculations either analytically or
numerically. Macromolecular simulations and free-energy computations can be performed using classical molecular dynamics with parameterized classical force fields. These methodologies may also be combined to perform multi-physics mixed quantum mechanics/molecular mechanics (QM/MM) simulations. Other capabilities include hybrid approaches like ONIOM, continuum solvation models like COSMO, electron transfer (ET), as well as interfaces to a number of other programs.
Core Capabilities Since an exhaustive discussion of all the capabilities is beyond the scope of this essay, only a brief overview of the core capabilities is provided here. The interested reader is referred to published review articles [, , , , ] for a complete description.
Self-consistent Field and Density Functional Theories The Hartree–Fock (HF) or self-consistent field (SCF) method is a single determinant–based theory [] and is an essential component of any computational chemistry suite. It forms the basis for higher level electronic structure theories like Møller–Plesset perturbation theory (MP), coupled cluster (CC) theory, and other post-HF approaches. Density functional theory (DFT) [, , ] provides an alternative route to the many-electron problem and offers an excellent balance between accuracy and computational performance. It has emerged as a powerful theory that is broadly applicable. In analogy with the HF equations, DFT is rooted in the Kohn–Sham (KS) equations, with the only difference being in the way the exchange and correlation is treated. In DFT, the HF exchange is replaced with a (in general nonlocal) density-dependent exchange and correlation contribution which encapsulates the complexities of the many-electron interactions. Since the exact form of the exchange-correlation contribution in DFT is an unsolved problem, a number of approximations of increasing complexity have been developed over the years [], most of which are available in NWChem. The HF/KS equations are very similar in structure and are typically solved using an iterative procedure with the help of basis set expansions [, , , ].
Two popular basis set approaches are implemented in NWChem: (1) Gaussians and (2) plane waves.
Gaussian Basis Set Approach Within the Gaussian basis set approach [, ], both HF/KS can be expressed as a set of matrix equations:

$FC = SC\epsilon$  ()

where F, C, S, and ε represent the Fock, coefficient, overlap, and diagonal orbital energy matrices, respectively. The problem of solving the HF/KS equations thus boils down to solving a generalized eigenvalue problem. The Fock matrix for both HF and KS formalisms can be encompassed under a single framework as:

$F_{\mu\nu} = H^{\mathrm{core}}_{\mu\nu} + G^{j}_{\mu\nu} + \alpha G^{k}_{\mu\nu} + \beta G^{x\text{-}\mathrm{dft}}_{\mu\nu} + \gamma G^{c\text{-}\mathrm{dft}}_{\mu\nu}$  ()
where $H^{\mathrm{core}}$ is the one-electron contribution (kinetic and ion-electron), $G^{j}$ and $G^{k}$ represent the two-electron contributions (Coulomb and explicit exchange), and $G^{x\text{-}\mathrm{dft}}$ and $G^{c\text{-}\mathrm{dft}}$ are the DFT exchange and correlation parts, respectively. The mixing coefficients α, β, and γ help span the HF and DFT limits. With α = 1, β = 0, γ = 0 one gets the pure HF limit, while the pure DFT limit is obtained with α = 0, β = 1, γ = 1, and 0 < α < 1, 0 < β < 1, γ = 1 covers the phase space of nonlocal hybrid-DFT forms. The time-consuming steps in any standard implementation are computation of the two-electron integrals, construction of the two-electron contribution to the Fock matrix, computation of the exchange-correlation contribution to the Fock matrix in the case of DFT, diagonalization of the Fock matrix, and computation of the density matrix using the molecular orbital coefficients. Within this context, only the parallelization approaches implemented in NWChem will be discussed in the following paragraphs. The reader is also referred to other detailed work on the subject [, , , ]. The calculation of the two-electron integrals is the most expensive step and formally scales as O(N⁴), even though it can be shown that the scaling is ≈ O(N²) for large molecular systems. In most parallel implementations, including NWChem, the relevant matrices are either handled in a replicated or distributed fashion or a combination [, ]. The replicated data approach offers a straightforward way to achieve task parallelism and low communication. Each processor maintains a copy of the necessary data. In the case of
the two-electron contribution, the Fock (F) and density (D) matrices are replicated over all the processors and the integral quartets are assigned to each processor as blocks. A partial F matrix is constructed on each processor and then consolidated into a full F matrix using a global sum operation. For a reasonable work distribution and an efficient global summation, this approach is perfectly parallelizable. It is also efficient for large systems because of the amount of parallel work available, especially if the integrals are evaluated on-the-fly (also known as the direct approach). However, there is a potential O(N²) memory bottleneck. The distributed approach avoids the memory bottleneck by distributing the F and D matrices, thus placing a smaller constraint on the available local memory. However, this approach is potentially more challenging from an implementation standpoint because of the extra bookkeeping. Figure illustrates the strong scaling of the Gaussian basis set DFT module. Both ground and excited state calculations on finite systems can be performed using the Gaussian basis set–based (HF/DFT) module in NWChem.
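For concreteness, the quantities being accumulated in such a Fock build are the standard Coulomb and exchange contributions (textbook closed-shell expressions, written here only as an illustration and not as a statement of NWChem's internal conventions, and up to the spin-factor convention):

$G^{j}_{\mu\nu} = \sum_{\lambda\sigma} D_{\lambda\sigma}\,(\mu\nu|\lambda\sigma), \qquad G^{k}_{\mu\nu} = -\tfrac{1}{2}\sum_{\lambda\sigma} D_{\lambda\sigma}\,(\mu\lambda|\nu\sigma).$

Each processor evaluates its assigned block of integral quartets (μν|λσ), accumulates their contributions into a private partial F, and a global summation over processors then produces the full Fock matrix.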
Plane Wave Approach The plane wave DFT module in NWChem is composed of three components that use a common infrastructure: PSPW (a pseudopotential-based Γ-point code for isolated and periodic systems), BAND (a code to calculate the band structure of crystals and surfaces), and PAW (a projector-augmented-wave-based Γ-point code for isolated and periodic systems). Within the plane wave approach the single-particle wavefunctions of a periodic system can be expanded in general as [, , ]:

$\psi_{j\mathbf{k}}(\mathbf{r}) = e^{i\mathbf{k}\cdot\mathbf{r}} \sum_{\mathbf{G}} \tilde{\psi}_{j\mathbf{k}}(\mathbf{G})\, e^{i\mathbf{G}\cdot\mathbf{r}}$  ()

where k is a vector in the first Brillouin zone, G is the reciprocal lattice vector, and j represents the orbital index, respectively. For isolated systems like molecules, the Brillouin zone sampling is limited to the Γ-point (k = 0), which in turn yields:

$\psi_{j}(\mathbf{r}) = \sum_{\mathbf{G}} \tilde{\psi}_{j}(\mathbf{G})\, e^{i\mathbf{G}\cdot\mathbf{r}}$  ()
The size of the plane wave basis set expansion is determined by the maximum kinetic energy cutoff (Ecut ) via
NWChem. Fig. Parallel scaling of the Gaussian basis set HF/DFT implementation on the C system. Calculations were performed using the PBE hybrid exchange-correlation functional and the -G∗ basis with a total of , basis functions for the whole system. Integral evaluations were performed using the on-the-fly (or direct) approach and Fock matrix replication. The figure shows the scaling of the various components (Total, Fock 2e, Fock xc, and Diagonalization) as CPU time in seconds versus the number of processors (32–4,096)
the following relation:

$\tfrac{1}{2}\,|\mathbf{G}|^{2} < E_{\mathrm{cut}}$  ()
Some favorable features of the plane wave expansion include: the ability to treat periodic systems in a seamless way, efficient calculation of the expansion coefficients using Fast Fourier Transform (FFT) techniques, and independence of the basis with respect to nuclear positions, which makes it immune to the basis set superposition and over-completeness problems that are critical issues in local basis set approaches. Because plane wave basis sets are typically much larger than a local basis, explicit Fock matrix construction and diagonalization is avoided in favor of direct optimization approaches like conjugate gradient minimization []. The most time-consuming steps of plane wave–based algorithms are the evaluation of the pseudopotential (specifically the nonlocal contribution), wavefunction orthogonalization, and exact exchange (if needed) in the description of the exchange-correlation. Several parallelization strategies [, , , , ] have been implemented in NWChem to mitigate issues such as parallel data distribution over the wavevectors, distribution of the orbitals, and distribution of the real and reciprocal space grids. The wavevector distribution is a popular approach for solid-state applications, but the scalability is limited by the size of the Brillouin zone and the method is not applicable to pure Γ-point calculations. Orbital distribution has been shown to scale favorably, but it entails storage of an entire one-electron orbital on a single node and is not practical for large molecular systems, which typically require a large plane wave expansion. The most commonly implemented strategy, including in NWChem, is a spatial distribution in which the real and reciprocal space grid points (typically distributed along a single dimension) are stored on different processors. This approach is not only scalable, but also fairly economical from a memory management standpoint. The trade-off is the increased communication cost of the parallel three-dimensional FFTs, whose scaling plateaus once the number of processors exceeds a limit set by the FFT grid dimensions (a modest count for typical applications). This limitation may be avoided by distributing the spatial and orbital degrees of freedom on a two-dimensional processor grid []. This approach is also available in NWChem.
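As a sketch of why the 3D FFT dominates the communication cost (standard plane-wave machinery, not NWChem-specific notation): each orbital is moved between reciprocal and real space so that every term of the Hamiltonian is applied where it is diagonal or cheap,

$\tilde{\psi}_j(\mathbf{G}) \xrightarrow{\ \mathrm{FFT}^{-1}\ } \psi_j(\mathbf{r}), \qquad \hat{T}:\ \tfrac{1}{2}|\mathbf{G}|^{2}\,\tilde{\psi}_j(\mathbf{G}), \qquad \hat{V}_{\mathrm{loc}}:\ V_{\mathrm{loc}}(\mathbf{r})\,\psi_j(\mathbf{r}),$

followed by a forward FFT back to reciprocal space. Every orbital therefore requires distributed 3D FFTs at each iteration, and the chosen distribution of the real and reciprocal grids determines how well this step scales.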
NWChem. Fig. Parallel scaling of hybrid DFT plane wave calculations on the Fe₂O₃ system for unit cells containing 80 and 160 atoms, respectively (CPU time per step versus number of processors)
The plane wave DFT module in NWChem also has a highly scalable algorithm [, , ] for exact exchange, which makes it possible to perform calculations using hybrid exchange-correlation functionals. Using this approach, scaling to many thousands of CPUs for modest-size problems has been achieved. This is shown in Fig. , where the parallel scaling is demonstrated for Fe₂O₃ with unit cells containing 80 and 160 atoms, respectively.
Coupled Cluster Theory Coupled cluster theory [, , ] (CC) is a method rooted in many-body perturbation theory (MBPT) and is considered by many to be the gold standard for accurate quantum mechanical calculations of ground and excited states. Over the years, it has grown into one of the most reliable approaches for describing correlation effects in atoms and molecules. In CC theory the wavefunction is expressed as an exponential ansatz:

$|\Psi\rangle = e^{T}\,|\Phi\rangle$  ()

where $|\Psi\rangle$ is the ground-state wavefunction and $\hat{T}$ is the cluster operator, which produces a linear combination of excited determinants when acting on the reference function $|\Phi\rangle$ (typically represented by the HF determinant). In CC theory, the energy is obtained as:

$E = \langle\Phi|\, e^{-T} H e^{T} \,|\Phi\rangle$  ()
The systematic inclusion of excitations of higher rank in the cluster operator results in a series of more accurate approximations. In the basic CCSD [] approach (CC with singles and doubles) the cluster operator is approximated as T ≈ T₁ + T₂. Similarly, in the more accurate CCSDT [] approach (CC with singles, doubles, and triples) the cluster operator is represented as T ≈ T₁ + T₂ + T₃. The higher accuracy of the CCSDT energies comes at a steep price because of the extra numerical complexity, which scales as N⁸ with system size. Consequently, many non-iterative approaches which mediate the cost between the CCSD (N⁶) and CCSDT (N⁸) methods have dominated this field, the CCSD(T) [] approach, which scales as N⁷ with system size, being the most well known and most frequently used. Excited state calculations can be accomplished within single-reference CC theory via the linear response CC [] (LR-CC) and equation-of-motion CC formalisms (EOMCC). Analogous to ground-state calculations, various iterative (EOMCCSD [], EOMCCSDT []) and non-iterative approaches have been implemented in NWChem [, ]. Since the CC equations involve contractions between tensors corresponding to the one- and two-electron integrals and cluster amplitudes, the task of writing efficient parallel CC code is extremely challenging. The main issues that define the overall parallel efficiency are related to (a) data storage and data transfer, (b) efficient load balancing, and (c) data flow during
the execution phase. Another significant issue is associated with the most efficient reduction of the numerical complexity of a given CC method. NWChem provides a complete implementation of CC theory geared toward efficient execution on high-performance parallel architectures. Two classes of parallel CC codes have been implemented: a spin-free code for closed-shell systems [] and a code based on the tensor contraction engine (TCE) []. It was recently demonstrated that the non-iterative (T) part of the spin-free code can be made to scale across tens of thousands of processors []. The TCE-based codes are automatically generated spin-orbital codes that can be used for a broad range of reference functions (RHF, ROHF, UHF, DFT). Here the block or tile structure of all the tensors is extensively used in the code for dynamic load balancing. This code is also parallelized using the GA library, which is used to store all the tensors. The scaling of the CR-EOMCCSD(T) method is shown in Fig. . The reader is also referred to more recent applications that demonstrate the scaling of the CC module [, ].
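For reference, the lowest-rank cluster operators have the standard second-quantized form (textbook definitions; the amplitudes t are the quantities held in the distributed tensors mentioned above):

$T_1 = \sum_{i,a} t_i^{a}\, a_a^{\dagger} a_i, \qquad T_2 = \tfrac{1}{4}\sum_{i,j,a,b} t_{ij}^{ab}\, a_a^{\dagger} a_b^{\dagger} a_j a_i,$

where i, j label occupied and a, b virtual spin orbitals; the CCSD amplitude equations are obtained by projecting $e^{-T}He^{T}|\Phi\rangle$ onto the singly and doubly excited determinants.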
Classical Molecular Dynamics NWChem provides a highly scalable parallel framework for classical molecular dynamics (MD) simulations based on classical potentials (Amber []). The parallelization strategy is based on a domain decomposition approach which provides the best theoretical scalability of memory use and communication cost. In the domain
NWChem. Fig. Parallel scaling of the non-iterative triples correction ((T)) in CR-EOMCCSD(T) calculations of the green fluorescent protein (GFP) with the cc-pVTZ basis set (time in seconds versus number of cores, 1,000–100,000)
decomposition model [, ] the physical space occupied by the system is distributed and the atoms are assigned to a processor based on their location. This is accomplished by dividing the total finite or periodic system being studied into slices along the three dimensions. Communication is made more efficient by logically arranging the processors in a three-dimensional grid. This permits one or more sub-domains to be assigned to a processor and preserves locality of the sub-domains, which is essential for efficient communication. The domain decomposition approach has advantages and disadvantages. The former include avoiding replication of atomic data on each processor and reducing the memory needed on each processor, since no single processor must hold the entire system in core. The latter include the need for periodic reassignment of atoms that move between sub-domains and the need for a sophisticated dynamic load-balancing scheme for heterogeneous systems. Other challenges include dealing with systems with incommensurate spatial decompositions. The MD implementation in NWChem takes advantage of the one-sided asynchronous capabilities provided by the GA toolkit, which allow for more sophisticated dynamic load-balancing algorithms to handle these issues. Electrostatic contributions are evaluated using the smooth Particle Mesh Ewald (sPME) method [], which employs the efficient three-dimensional Fast Fourier Transform (3D-FFT) approach. The data access pattern of the 3D-FFT, however, hinders efficient parallelization over large numbers of processors. To address this issue, a slab-based 3D-FFT is used so that the transform in two dimensions takes place on a single processor, thereby avoiding interprocessor communication; this leaves one dimension in which communication is required. The main scaling difficulty of this algorithm is that it only allows as many processors to be used for the 3D-FFT as there are slabs in the grid. To remedy this limitation and to achieve strong scaling to a larger number of processors than the number of slabs used to define the charge grid, the reciprocal space calculations are performed on a subset of the total number of processors used to carry out the simulation. Simultaneously with this subset of processors (or subgroup) performing the FFTs, the potentially much larger group of remaining processors handles the direct space interactions. The number of processors to be used for the FFTs can be specified in the input. The use of subgroups
NWChem. Fig. Time per molecular dynamics step for a SPC/E water simulation in the microcanonical ensemble with , atoms (NWChem MD versus ideal linear scaling; time per MD step in seconds versus number of processor cores)
is a standard feature of the Global Arrays toolkit. Figure illustrates the time per molecular dynamics step for a SPC/E water simulation in the microcanonical ensemble with , atoms.
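The direct/reciprocal split exploited by the subgroup scheme is the standard Ewald decomposition (textbook form with splitting parameter β, omitting exclusion and boundary corrections; sPME evaluates the reciprocal-space sum on the interpolated charge grid via 3D-FFTs):

$E_{\mathrm{elec}} \approx \tfrac{1}{2}\sum_{i\neq j} q_i q_j \frac{\operatorname{erfc}(\beta r_{ij})}{r_{ij}} \;+\; \frac{1}{2V}\sum_{\mathbf{k}\neq 0} \frac{4\pi}{k^{2}}\, e^{-k^{2}/4\beta^{2}} \Big|\sum_j q_j e^{i\mathbf{k}\cdot\mathbf{r}_j}\Big|^{2} \;-\; \frac{\beta}{\sqrt{\pi}} \sum_j q_j^{2}.$

The first (direct-space) term is short ranged and is handled by the large processor group, while the second (reciprocal-space) term is the FFT-based part computed by the subgroup.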
Combined Quantum Mechanical Molecular Mechanics The combined quantum mechanical molecular mechanics (QM/MM) module in NWChem provides an effective tool to study localized molecular transformations in large-scale systems such as those encountered in solution chemistry or enzyme catalysis. This approach entails using an accurate quantum mechanical (QM) method in a small region of interest embedded in an environment which is treated using a lower-level approximate theory (e.g., classical molecular mechanics). With this decomposition, the total energy of the system is given by the sum of the quantum ($E_{\mathrm{qm}}$) and classical energies:

$E = E_{\mathrm{qm}}[\mathbf{r}, \mathbf{R}; \psi] + E_{\mathrm{mm}}[\mathbf{r}, \mathbf{R}]$  ()

where r and R represent the coordinates of the QM and MM regions and ψ denotes the ground or excited electronic wavefunction of the QM region. The QM contribution can be further decomposed into internal and external contributions as []:

$E_{\mathrm{qm}} = E^{\mathrm{int}}_{\mathrm{qm}}(\mathbf{r}; \psi) + E^{\mathrm{ext}}_{\mathrm{qm}}(\mathbf{r}, \mathbf{R}; \rho)$  ()

where the internal part $E^{\mathrm{int}}_{\mathrm{qm}}(\mathbf{r}; \psi)$ is the gas-phase energy expression and the external part contains the electrostatic interactions of the classical charges ($Z_I$) of the MM region with the electron density (ρ):

$E^{\mathrm{ext}}_{\mathrm{qm}}(\mathbf{r}, \mathbf{R}; \rho) = \sum_I \int \frac{Z_I\, \rho(\mathbf{r}')}{|\mathbf{R}_I - \mathbf{r}'|}\, d\mathbf{r}'$  ()
All the classical interactions as well as the van der Waals and electrostatic interactions with the quantum region are subsumed in the classical energy term (Emm). Since a clean separation between the quantum and classical regions is not always possible, different schemes (e.g., link atoms, bond capping schemes) have to be utilized [, ]. The NWChem implementation supports two different treatments of link atoms – hydrogen and pseudo-carbon capping schemes. The QM/MM module in NWChem is built as a top-level interface between the classical MD module and various QM modules and supports a wide variety of calculations – optimizations, dynamics, and free-energy calculations, to name a few. The scaling of QM/MM calculations is largely determined by the scaling of the underlying QM and MM implementations. We refer the reader to published review articles for further details regarding the implementation of this module [, ].
Parallel Framework NWChem uses the Global Array (GA) toolkit to scale and perform on parallel platforms [, ]. Figure illustrates how the various modules are integrated into a single framework. Since the GA programming model includes message passing as a subset, the libraries within the toolkit can interoperate with MPI [] for message passing. In other words, the programmer can make full use of MPI on GA and non-GA data. The GA toolkit comprises the Memory Allocator (MA), which provides access to local memory; the Global Arrays (GA) library, which provides the necessary portable shared-memory programming tools; the Aggregate Remote Memory Copy Interface (ARMCI) [] for portable and efficient remote memory copy operations (also known as one-sided communication), optimized for noncontiguous (strided, scatter/gather, I/O vector) data transfers;
NWChem. Fig. Schematic showing how the various modules in NWChem are integrated within a single framework along with the parallel libraries: the run-time database; high-level modules (Hartree–Fock, density functional theory, coupled-cluster theory, plane-wave DFT, classical molecular dynamics, QM/MM, properties, ...); shared chemistry infrastructure (integrals, basis sets, geometry, linear algebra libraries, ...); and the parallel layer (Parallel IO, Memory Allocator, Global Arrays)
and the Parallel I/O (ParIO) [] tool to extend the nonuniform memory architecture model to disk. The GA toolkit combines the best features of both the shared and distributed memory programming models. It implements a shared-memory programming model on distributed memory architectures in which data locality can be managed directly by the programmer. This is accomplished by explicit function calls that transfer data between a global address space (e.g., a distributed array) and local storage. This makes the GA model similar to distributed shared-memory (DSM) models that provide explicit acquire/release protocols. However, the GA model acknowledges that remote data is slower to access compared with local data and therefore allows data locality to be explicitly managed. By optimizing and selectively moving the data requested by the user, the GA model avoids problems like redundant data transfers. The GA model exposes the programmer to the hierarchical memory of modern high-performance computer systems and promotes data reuse and locality by recognizing the communication overhead for remote data transfer. The GA library can be used in C, C++, Fortran , Fortran , and Python programs. The GA toolkit offers
N
N
NWChem
support for both task and data parallelism. Task parallelism is supported via one-sided or non-collective copy operations that transfer data between global memory (distributed/shared array) and local memory. Each process can directly access data in a section of a Global Array that is logically assigned to it. Atomic operations are provided for synchronization and to assure correctness of an accumulate operation (e.g., a floating-point addition that combines local and remote data), which is crucial to several electronic structure algorithms. The data parallel computing model is supported through a set of collectively called functions that operate on either entire Global Arrays or sections of them. This set comprises BLAS-like operations (copy, additions, transpose, dot products, and matrix multiplication). Some are defined and supported for all array dimensions (e.g., addition), whereas operations like matrix multiplication are limited to two-dimensional arrays. Nevertheless, multiplication may be performed on two-dimensional subsets of higher dimensional arrays. The GA toolkit also includes interfaces to parallel linear algebra libraries like PeIGS and ScaLAPACK [].
Acknowledgments The NWChem software and its documentation are developed at the EMSL located at PNNL, a multiprogram national laboratory, operated for the US Department of Energy by Battelle under the Contract Number DE-AC-RL. The code development is supported by the Department of Energy Office of Biological and Environmental Research, Office of Basic Energy Sciences, and the Office of Advanced Scientific Computing. Calculations reported here were performed using the Molecular Science Computing Capability (MSC) in the EMSL as well as resources of the National Energy Research Scientific Computing Center (NERSC) located at the Lawrence Berkeley Laboratory (LBL), California. The EMSL is funded by the Office of Biological and Environmental Research in the US Department of Energy.
Acronyms
ARMCI: Aggregate Remote Memory Copy Interface
CC: Coupled Cluster Theory
CPMD: Car–Parrinello Molecular Dynamics
DFT: Density Functional Theory
DKH: Douglas–Kroll–Hess Approximation
EOMCC: Equation-of-Motion Coupled Cluster Theory
FFT: Fast Fourier Transform
GA: Global Arrays Toolkit
HF: Hartree–Fock Theory
MA: Memory Allocator
MBPT: Many-Body Perturbation Theory
MP: Møller–Plesset Perturbation Theory
MPI: Message Passing Interface
NWChem: Northwest Computational Chemistry Package
QM/MM: Quantum Mechanics-Molecular Mechanics
RHF: Restricted Hartree–Fock
ROHF: Restricted Open-Shell Hartree–Fock
SCF: Self-consistent Field Theory
TCE: Tensor Contraction Engine
TDDFT: Time-Dependent Density Functional Theory
UHF: Unrestricted Hartree–Fock
ZORA: Zeroth-Order Regular Approximation
Related Entries
Amdahl’s Law
Anton, a Special-Purpose Molecular Simulation Machine
BLAS (Basic Linear Algebra Subprograms)
Car-Parrinello Method
Computational Sciences
Eigenvalue and Singular-Value Problems
Exascale Computing
FFT (Fast Fourier Transform)
FFTW
Global Arrays Parallel Programming Toolkit
LAPACK
NAMD (NAnoscale Molecular Dynamics)
N-Body Computational Methods
Non-Blocking Algorithms
ScaLAPACK
Spiral
Bibliographic Notes and Further Reading References [, , , , , , , ] provide the interested reader with basic and advanced background to understand and utilize the various electronic structure methods implemented in NWChem.
Bibliography . Aprà E, Bylaska EJ, Dean DJ, Fortunelli A, Gao F, Kristic PS, Wells JC, Windus T () Comp Mat Sci : . Aprà E, Harrison RJ, de Jong WA, Rendell AP, Tipparaju V, Xantheas SS, Olsen RM () Liquid Water: Obtaining the right answer for the right reasons. In: Proceedings of the ACM/IEEE supercomputing conference, New York . Bartlett RJ, Musial M () Rev Mod Phys : . Bylaska E, Tsemekhman K, Govind N, Valiev M () Large-scale plane-wave-based density functional theory: formalism, parallelization, and applications. In: Reimers JR (ed) Computational methods for large systems: electronic structure approaches for biotechnology and nanotechnology. Wiley, New York . Bylaska EJ, Valiev M, Kawai R, Weare JH () Comput Phys Comm : . Cizek J () J Chem Phys : . Clarke LJ, Stich I, Payne M () Comput Phys Comm : . Comeau DC, Bartlett RJ () Chem Phys Lett : . Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA () J Am Chem Soc : . de Jong WA, Bylaska EJ, Govind N, Janssen CL, Kowalski K, Müller T, Nielsen IMB, van Dam HJJ, Veryazov V, Lindh R () Phys Chem Chem Phys : . Essmann U, Perera L, Berkowitz ML, Darden T, Lee H, Pedersen LG () J Chem Phys : . Foster IT, Tilson JL, Wagner AF, Shepard RL, Harrison RJ, Kendall RA, Littlefield RJ () J Comput Chem : . Gao JL, Truhlar DJ () Ann Rev Phys Chem : . Gygi F () IBM J Res Dev : . Harrison RJ, Guest MF, Kendall RA, Bernholdt DE, Wong AT, Stave M, Anchell JL, Hess AC, Littlefield RJ, Fann GI, Nieplocha J, Thomas GS, Elwood D, Tilson JL, Shepard RL, Wagner AF, Foster IT, Lusk E, Stevens R () J Comput Chem : . Helgaker T, Jorgensen P, Olsen J () Molecular electronicstructure theory. Wiley, Chichester . Hirata S () J Phys Chem A : . Janssen Cl, Nielsen IMB () Parallel computing in quantum chemistry. CRC Press/Taylor & Francis Group, Boca Raton . Kendall RA, Aprà E, Bernholdt DE, Bylaska EJ, Dupuis M, Fann GI, Harrison RJ, Ju J, Nichols JA, Nieplocha J, Straatsma TP, Windus TL, Wong AT () Comp Phys Comm : . Kobayashi R, Rendell AP () Chem Phys Lett : . Kohn W, Sham LJ () Phys Rev :A . Kowalski K, Hammond JR, de Jong WA, Fan PD, Valiev M, Wang D, Govind N () Large-scale plane-wave-based density functional theory: formalism, parallelization, and applications. In: Reimers JR (ed) Computational methods for large systems: electronic structure approaches for biotechnology and nanotechnology, Wiley, New York . Kowalski K, Krishnamoorthy S, Villa O, Hammond JR, Govind N () J Chem Phys : . Kowalski K, Piecuch P () J Chem Phys : . Martin RM () Electronic structure: basic theory and practical methods. Cambridge University Press, Cambridge
. Marx D, Hutter J () Ab initio molecular dynamics: basic theory and advanced methods. Cambridge University Press, Cambridge . Message Passing Interface Forum () MPI: a message-passing interface standard. http://www.mpi-forum.org . Monkhorst HJ () Int J Quantum Chem . Nelson JS, Plimpton SJ, Sears MP () Phys Rev B : . Nieplocha J, Carpenter C () ARMCI: a portable remote memory copy library for distributed Array Libraries and Compiler Run-time Systems. In: Proceedings of rd Workshop on Runtime Systems for parallel programming (RTSPP) of international parallel processing symposium IPPS/SPDP ï£i . Lecture notes in computer science, vol . Springer, Berlin/Heidelberg/NewYork . Nieplocha J, Foster I, Kendall R () Int J Supercomput Appl High Perform Comput : . Nieplocha J, Harrison RJ () J Supercomput : . Nieplocha J, Palmer B, Tipparaju V, Krishnan M, Trease H, Aprà E () Int J High Perform Comput Appl : . Noga J, Bartlett RJ () J Chem Phys : . Nogueira F, Marques M () A primer in density functional theory (Lecutre Notes in Physics) (v. ). In: Fiolhais C, Nogueria F, Marques M. A. L (eds). Springer, Berlin/Heidelberg . Paldus J, Li XZ () Critical assessment of coupled cluster method in quantum chemistry. Adv Chem Phy vol , Wiley, pp ï£i . Parr RG, Yang W () Density-functional theory of atoms and molecules. Oxford University Press, New York . Payne MC, Teter MP, Allan DC, Arias TA, Joannopoulos JD () Rev Mod Phys : . Perdew JP, Schmidt K () Large-scale plane-wave-based density functional theory: formalism, parallelization, and applications. In: Van Doren V, Van Alsenoy C, Geerlings P (eds) AIP conference proceedings: density functional theory and its application to materials, American Institute of Physics, vol . Purvis GD, Bartlett RJ () J Chem Phys : . Raghavachari K, Trucks GW, Pople JA, Headgordon M () Chem Phys Letts : . Scalapack specifications can be found at http://www.netlib.org/ scalapack/scalapack_home.html . Straatsma TP, McCammon JA () IBM Syst J : . Straatsma TP, Philippopoulos M, McCammon JA () Comp Phys Comm : . Szabo A, Ostlund NS () Modern quantum chemistry. McGraw-Hill, New York . Valiev M, Bylaska EJ, Govind N, Kowalski K, Straatsma TP, van Dam HJJ, Wang D, Nieplocha J, Aprà E, Windus TL, de Jong WA () Comp Phys Comm : . Valiev M, Garrett BC, Tsai MK, Kowalski K, Kathmann SM, Schenter GK, Dupuis M () J Chem Phys : . Windus TL, Bylaska EJ, Andzelm J, Govind N () J Comp Theor Nano : . Zhang YK, Liu HJ, Yang WT () J Chem Phys :
O

Omega Calculator
See Omega Test

Omega Library
See Omega Test

Omega Project
See Omega Test
Omega Test
David Wonnacott
Haverford College, Haverford, PA, USA
Synonyms Omega calculator; Omega library; Omega project
Definition
The Omega Test is an algorithm for dependence analysis that was developed by Bill Pugh’s Omega Project in the 1990s to enable work on advanced loop optimizations. It is notable for its manipulation of constraints on integer variables to produce a representation of dependences that is more precise than distance/direction vectors, thereby enabling advanced analyses and transformations.

Exact Instance-Based Array Dependence Information
Optimization of loop nests requires information about dependences that is both detailed and accurate. Even for the restricted case in which array subscripts and loop bounds must be affine functions of loop indices, testing for the existence of a dependence is equivalent to testing satisfiability of a conjunction of affine constraints on integer variables, which is NP-complete []. Fortunately, in practice most array subscripts and loop bounds are even simpler, and low-complexity tests such as the GCD test and Banerjee’s Inequalities produce accurate results in many cases. In the late 1980s and 1990s, Paul Feautrier and Bill Pugh led independent efforts to improve array dependence analysis with techniques that were exact for the entire affine domain and fast for the common cases [, , ]. While some researchers felt exponential algorithms had no place in compilers, others saw value in the more precise information produced by these techniques. This approach, known variously as “constraint-based analysis,” the “polyhedral model,” or “instance-wise analysis,” remains an important tool for advanced restructuring compilers, some of which make use of the original code of Pugh or Feautrier. The remainder of this article reviews the work of Pugh’s Omega Project; for information about other work in this area see the web sites [, ].
Traditional Dependence Abstractions To illustrate some of the issues involved in dependence analysis, consider the codes shown in Fig. – each of these loops might write an element of array A that is read in a later iteration, so each is said to exhibit an inter-iteration flow dependence based on A, which inhibits parallel execution of the loop iterations. Earlier work on dependence analysis not only approximated some affine cases, but also reported only simple dependence abstractions, e.g., identifying loops that carry a dependence or classifying dependences with direction vectors or distance vectors. Regardless of its accuracy or domain, any test that produces this information must
report that both loops carry a dependence with a distance that is not known at compile time (assuming we lack information about n and k). A more detailed examination of the dependences of Fig. reveals that the loops can be parallelized with different strategies, as illustrated in Fig. (circles correspond to iterations, with lower values of i to the left, and each arrow indicates a flow dependence between iterations). For the illustrated case (n = and k = ), Fig. a has dependences from iteration (which writes a value into A[]) to iteration (which reads a value from A[]), from to , from to , and from to ; Fig. b has dependences from to , to , to , to , and to . These patterns suggest that there may be some cases in which we can run some of the iterations in parallel, e.g., when n = and k = we could execute iterations – of Fig. a simultaneously, and iterations – of Fig. b simultaneously. However, conclusions about program transformation must be made about the general case.
Dependence Relations The Omega Test describes the dependences in terms of relations among tuples of integers. To retain correspondence with the program being analyzed, these integers may be identified as inputs (e.g., the source of a dependence), outputs (the sink), or symbolic constants. A finite set of dependences can be given as a list (union) of such tuple relations, e.g., for Fig. a as {[] → []} ∪ {[] → []} ∪ {[] → []} ∪ {[] → []}. Integer variables and constraints allow a representation that is both concise (allowing {[i] → [ − i] ∣ ⩽ i ⩽ }
Omega Test. Fig. Simple loops exhibiting data dependences ((a) Example 1a and (b) Example 1b; each runs for (i = 0; i < n; i++) and updates array A with the subscripts given by the dependence relations derived below)
for Fig. a) and general ({[i] → [n − 1 − i] ∣ 0 ⩽ 2i < n − 1} when n is not known, as in Fig. a). These dependence relations follow directly from the application of the definition of data dependence to the program to be analyzed. A flow dependence exists between a write and a read (e.g., from iteration [i] to [i′]) when the subscript expressions are equal (e.g., i = n − 1 − i′ in Fig. a), the loop index variables are in bounds (0 ⩽ i < n ∧ 0 ⩽ i′ < n), and the dependence source precedes the sink (i < i′). Combining these constraints produces {[i] → [i′] ∣ i = n − 1 − i′ ∧ 0 ⩽ i < n ∧ 0 ⩽ i′ < n ∧ i < i′} for a and {[i] → [i′] ∣ i′ = i + k ∧ 0 ⩽ i < n ∧ 0 ⩽ i′ < n ∧ i < i′} for b. These dependence relations are exact in the following sense: for any given values of the symbolic constants and loop indices, the constraints evaluate to true if and only if a dependence would actually exist. The constraint-manipulation algorithms of the Omega Library (see Sect. Representation and Manipulation of Sets and Relations) can confirm that these relations are satisfiable and produce a simpler form, in these cases {[i] → [n − 1 − i] ∣ 0 ⩽ i ∧ 2i ⩽ n − 2} (2i ⩽ n − 2 is equivalent to i < (n − 1)/2 but does not introduce division) and {[i] → [i + k] ∣ 0 ⩽ i < n − k}. This representation of dependences extends naturally to nested loops with the introduction of additional variables in the input and output tuple – dependences among references nested in loops i and j would be represented as relations from [i, j] to [i′, j′]. An imperfect loop nest such as that shown in Fig. raises the additional challenge of identifying the lexical position of each statement or inner loop (i.e., is it the first, second, etc. component within the loop?). This challenge can be met by interspersing (among the loop index values) constant values that give the statement position. If the i loop in Fig. is the first (or only) statement in its function, and the update to A is the first (or only) statement in this loop, we could represent the dependence above as a relation from [0, i, 0] to [0, i′, 0]. For Fig. , the dependence from the write of A[i] to the read A[i − 1] is {[0, t, 1, i, 0] → [0, t′, 0, i′, 0] ∣ i′ − 1 = i ∧ t < t′ ∧ 0 ⩽ t, t′ < T ∧ 1 ⩽ i, i′ < N − 1}.
Omega Test. Fig. Iteration spaces and flow dependences for Fig. , when n = and k = ((a) Example 1a; (b) Example 1b)
While it is possible to produce dependence distance/direction vectors from the dependence relations produced by the Omega Test, the results are often no more accurate than those produced by earlier tests such as a combination of Banerjee’s inequalities and the GCD test. However, the added precision of the dependence relation can enable other forms of analysis or transformation if it is passed directly to later steps of the compiler.
Advanced Program Analysis and Transformation The detailed information present in a dependence relation can be used as a basis for a variety of program analyses, including analysis of the flow of values within a program’s arrays, analysis of conditional dependences, and analysis of dependences that exist over only a subset of the iteration space. This information can then be used to drive a number of program transformation/optimization techniques on perfectly or imperfectly nested loops.
Advanced Analysis Techniques: The Omega Test Information about the iterations involved in a dependence can reveal opportunities for parallelization after transformations such as index set splitting or loop peeling. Note that any iterations that are not the source of a dependence arc (e.g., iterations – of Fig. a) can be executed concurrently after all other loop iterations have been completed; and those that are not a sink (iterations – in Fig. a) can be executed concurrently before any other has started. The Omega Library’s ability to represent sets of integer tuples as well as relations can be used to show that there are never any iterations other than these nonsink and nonsource iterations for Fig. a, so this loop (unlike that in b) can be run in two sequential phases of parallel iterations. The symbolic information present in a dependence relation can be used to detect conditional dependences. The flow dependence relation for Fig. b clearly implies that the flow dependence exists only when < k < n; the Omega Library’s projection and “gist” operations (see Sect. The Gist Operation) can be used to extract this information. A compiler could introduce conditional parallelism (e.g., run the loop in parallel when n ⩽ k)
for (t = 0; t < T; t++) {
    for (i = 1; i < N-1; i++) {
        new[i] = (A[i-1] + A[i] + A[i+1]) * (1.0/3);
    }
    for (i = 1; i < N-1; i++) {
        A[i] = new[i];
    }
}

Omega Test. Fig. A set of “imperfectly nested” loops
or use additional analysis or queries to the programmer to attempt to prove no dependence ever exists. By combining instance-wise information about memory aliasing with information about iterations that overwrite array elements, the Omega Test can produce information about the flow of values (termed “value-based dependence information” by the Omega Project). For example, the “memory-based” flow dependence shown above for Fig. connects writes to reads in all subsequent iterations of the t loop – i.e., when t′ > t. However, the value written in iteration t is overwritten in iteration t + 1, and thus never reaches any subsequent iteration. The value-based dependence relation includes this information, showing dependences only when t′ = t + 1, as shown in Sect. Disjunctive Normal Form. Value-based dependence information can be used for a variety of optimizations, e.g., to identify opportunities for parallelization after array privatization or expansion. The Omega Project uses the term “Omega Test” for the production of symbolic instance-wise memory- or value-based dependence relations via the steps outlined above. More detail about the specific algorithms involved in dependence testing can be found in [] as well as earlier publications by these authors.
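A hedged sketch of what such a relation looks like for the write/read pair used above (derived here simply by adding the condition t′ = t + 1 to the memory-based relation; the form actually produced by the Omega Test is given in Sect. Disjunctive Normal Form) is

$\{[0, t, 1, i, 0] \rightarrow [0, t+1, 0, i+1, 0] \mid 0 \le t \le T-2 \wedge 1 \le i \le N-3\},$

which records that the value of A[i] written in time step t is read (as A[i′ − 1]) only in the first inner loop of time step t + 1.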
Iteration Space Transformation Iteration space transformations can be viewed as relations between elements of the original and transformed iteration space, and thus represented and manipulated with the same infrastructure as dependence relations. Iteration spaces can be related to execution order via a simple rule such as “iterations are executed in lexicographical order,” thereby turning iteration space transformation into a tool for transforming order of execution. A similar approach can be used to describe and transform the association of values to memory cells [], though this has not been explored as thoroughly.
The “transformation as relation” framework unifies all unimodular loop transformations (e.g., {[i, j] → [j, i]} describes loop interchange) with a number of other transformations. Perhaps more importantly, this framework handles imperfect loop nests, which lie outside the domain of unimodular techniques. For example, consider the problem of fusing the inner loops of Fig. . Simple loop fusion can be accomplished by making A[i] = new[i] into the second statement in the first i loop, i.e., {[0, t, 1, i, 0] → [0, t, 0, i, 1]}. However, fusion without first aligning the loops is illegal; {[0, t, 1, i, 0] → [0, t, 0, i + 1, 1]} is a correct alignment and fusion (now element A[1] is overwritten in iteration i = 2, just after its last use in the computation of new[2]). This transformation produces a loop nest with much better memory locality, since (barring interference) it moves each array into cache only once per time step rather than twice. Algorithms for finding useful program transformations in this framework, and for generating code from the resulting iteration space sets, are discussed in [, , , ].
Iteration Space Slicing Iteration space transformations can also be defined in terms of “iteration-space slicing”: the instance-wise analog of program slicing. This approach can express very general approaches to program transformation in terms that are relatively concise and clear. To illustrate iteration space slicing, consider the challenge of parallelizing the code in Fig. b even when the dependence exists, as in Fig. b. Figure shows the iteration space of this loop when n = and k = (as in Fig. b), together with the backward iteration space slice for iteration (the set of other iterations that must be executed to produce the value in the target iteration). A similar backward slice for iteration would include iterations and . It is possible to run Fig. b as k concurrent threads, each of which updates every kth element. A simple test could be used to identify and parallelize this specific
case (a one-dimensional loop with a single constantdistance dependence), but iteration space slicing provides a concise way to generalize this idea: find the backward slices to the iterations that are not the source of any dependence – if these slices do not intersect, they can be run concurrently to produce the result of the loop nest. In some cases, it may be necessary to find the iteration space slice needed to compute one result given that another slice has already been completed. For example, consider the challenge of increasing the memory locality of Fig. by more than the factor of two achieved in Sect. Iteration Space Transformation. Figure shows the iteration space of a sample run of the original form of this code: each time step is shown as a column of circles representing the computations of new[i], followed by a column of squares representing the copy into A[i]. The slice needed to compute A[] is shown as dashed iterations surrounded by a dashed border; the marginal slice needed for A[] given that A[] has been computed is shown as shaded iterations. Executing this code as a series of sequential slices can greatly improve the memory locality (given a cache large enough to hold the results of a few slices, this can reduce memory traffic from O(T ⋅N) to O(N) in the absence of interference, producing dramatic speedups for slow memory systems []). Algorithms for iteration space slicing, for producing transformations with it, and for generating code to execute slices are described in [, , ].
Representation and Manipulation of Sets and Relations The Omega Library (or simply “Omega” when in context) employs a number of techniques to transform the constraints extracted from programs in Sect. Exact Instance-Based Array Dependence Information into answers to the questions of Sect. Advanced Program
Omega Test. Fig. The backward iteration space slice for iteration of Fig. b
Omega Test. Fig. Iteration space and two slices for Fig. , T = , N = (showing the slice for [1,2,2,1,1] and the marginal slice for [1,2,2,2,1] given [1,2,2,1,1])
Analysis and Transformation. The details of representation and manipulation depend on the nature of the constraints and the question being asked. Most algorithms can be expressed either in terms of individual (in)equations or higher-order operations on relations and/or sets. For example, a compiler could construct each memory-based flow dependence relation by creating appropriate tuples for input and output variables and then examining loop bounds, subscript expressions, etc., to produce the appropriate constraints on the tuples’ variables. Alternatively, it could first define a set for the iteration space I of each statement S (I_S), a relation M mapping loop iteration to array index for each array reference R (M_R), and relations T for forward-in-time order for various loop depths, and then construct a flow dependence by combining these, i.e., the dependence from access x to y could be found by taking (M_x ● (M_y)⁻¹) ∩ T_xy and restricting its domain to I_x and its range to I_y. Some operations on relations or sets simply involve re-labeling of free variables, e.g., swapping the input and output tuples to find R⁻¹ from a relation R. Others such as intersection (∩) involve matching up variables from input and output tuples, and possibly converting free variables to quantified variables; in the relational join and composition operations (for R₁ ● R₂ or the equivalent R₂ ○ R₁), R₁’s input tuple becomes the input tuple, R₂’s output tuple the output tuple, and R₁’s outputs and R₂’s inputs become existentially quantified variables, all constrained to the intersection of the constraints from R₁ and R₂. Finally, some operations such as transitive closure are defined in Omega only as relational operators.
The remainder of this section gives an overview of the algorithms used in the Omega Library. It progresses from less-expressive to more-expressive constraint languages, roughly following the historical development of the algorithms of the Omega Library, and gives descriptions in terms of collections of (in)equations rather than relations or sets. For more detail on these algorithms or the relationship between high- and low-level operations, see the cited papers, or [] for transitive closure algorithms.
Conjunctions of Affine Equations and Inequations The fundamental data structure of the Omega Library represents a conjunction of affine equations and inequations with integer coefficients (Omega does not currently represent disequations such as n ≠ m, so hereafter the term inequation refers only to <, ⩽, ⩾, and > constraints, each of which is converted to ⩾ form internally). Each individual (in)equation is immediately reduced to lowest terms (e.g., turning x + y + = into x + y + = ), and hash keys that ignore constant terms are used to quickly detect (and remove) simple redundancies (e.g., a − b ⩾ and a − b + ⩾ have the same hash key and the latter can be removed). The hashing system also detects (and replaces) pairs of inequations that are equivalent to an equation (e.g., i ⩾ and i ⩽ are converted to i = ). High-level operations for these conjunctions rely on lower-level operations on sets of (in)equations, such as variable elimination and redundancy detection. Variable elimination is also referred to as “projection” or
finding the “shadow” of a set or relation, since finding the two-dimensional shadow of a three-dimensional object can be thought of as an example of this operation. Note, however, that the shadow of the integer points in a set/relation is not always the same as the integer points in the shadow (as per the discussion of Fig. below). To illustrate Omega’s variable elimination and redundancy detection abilities, recall the dependence from Fig. b. Checking for possible dependence corresponds to existentially quantifying a dependence relation’s free variables (inputs, outputs, and symbolic constants), in this example checking the formula (∃i, i′ , n, k : i′ = i + k ∧ ⩽ i < n ∧ ⩽ i′ < n ∧ i < i′ ). None of the six (in)equations of this formula is redundant with respect to any one other, so all are retained to the start of our query. Omega begins by using equations to eliminate variables: given i′ = i + k it could replace all occurrences of i′ with i + k, producing the formula (∃i, n, k : ⩽ i < n ∧ ⩽ i + k < n ∧ < k). (Additional properties of integer arithmetic are used to eliminate an equation even when no variable has a unit coefficient, as discussed in [].) Upon running out of equations, Omega moves on to eliminate variables involved in inequations. Inequations are removed in several passes, with quick passes occurring first. Omega checks for variables that are bounded only on one side (i.e., have no lower bounds or no upper
bounds). Such variables cannot affect the satisfiability of the system, so they can be removed, along with all constraints on them – in this example, n, then k, and then i are removed in turn, producing an empty conjunction, i.e., True. To illustrate some of the other algorithms used for affine conjunctions, we consider what would happen in this example if we knew n ⩽ , (perhaps the loop in Fig. is guarded by the test if n <= ,), i.e., (∃i, n, k : ⩽ i < n∧ ⩽ i+k < n∧ < k∧n ⩽ ). In the absence of variables bounded on only one side, Omega examines groups of three inequalities to determine if any two contradict a third or make it redundant. In our running example, i + k < n and < k make i < n redundant, and ⩽ i and < k make ⩽ i + k redundant, reducing the query to (∃i, n, k : ⩽ i ∧ i + k < n ∧ < k ∧ n ⩽ , ). Omega then moves on to its most general technique: an adaptation of Fourier’s method of variable elimination. Fourier’s method can be used to eliminate a variable v in a conjunction of affine inequations, by replacing the set of inequations on v with the set of all possible combinations of one upper and one lower bound on v – for example, replacing ⩽ i and i < n − k with < n − k. The original conjunction has a rational solution if and only if the new one does; the new system has one variable fewer, but may have many more inequations.
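As a small worked illustration of Fourier's method (this example is not drawn from the running program), consider eliminating v from the conjunction 1 ⩽ v ∧ x ⩽ v ∧ v ⩽ y ∧ v ⩽ 10. Pairing each lower bound with each upper bound gives 1 ⩽ y ∧ 1 ⩽ 10 ∧ x ⩽ y ∧ x ⩽ 10, a system over x and y alone that has a rational solution exactly when the original does; here two lower and two upper bounds produced four inequations, and in general m lower and n upper bounds produce m⋅n.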
(Figure: a number line showing, for an example pair of inequations, the rational shadow, the dark shadow, and the splinters left after eliminating one variable; the integer shadow is the union of the dark shadow and the splinters.)
Omega Test. Fig. Shadow and dark shadow metaphors for integer variable elimination
The Omega Library uses an extension of Fourier’s method that preserves the presence of integer solutions. This process eliminates a variable by comparing the result of Fourier’s original technique (which produces the “shadow”) with a variant Pugh calls a “dark shadow” – a lower-dimensional conjunction that has integer solutions only if the original did. The set of integer solutions to the lower-dimensional system is the union of this dark shadow and the “splinter” solutions that lie within the shadow but outside the dark shadow []. Omega repeatedly eliminates variables to produce a trivially testable system with at most one variable, possibly producing (∃n, k : < n − k ∧ < k ∧ n ⩽ , ) and then (∃n : < n ⩽ , ) in our running example. Since the query is now clearly satisfiable, the Omega Test concludes that there must be some values of the constants n and k for which a dependence exists between some iterations. The example above, like most occurring during dependence analysis, does not illustrate the use of splintering to project integer solutions. Figure shows an example for the inequations y ⩽ x ∧ x ⩽ y, which have integer solutions for x = , , , , , , and x ⩾ . If Omega were to eliminate y during satisfiability testing of y ⩽ x ∧ x ⩽ y ∧ x ⩽ , it would find that the dark shadow ( ⩽ x ⩽ ) is satisfiable, and report True; if it were to eliminate y during satisfiability testing of y ⩽ x ∧ x ⩽ y ∧ x ⩽ , it would find the rational shadow is satisfiable but the dark shadow is not, and proceed to explore the splinters. The number of inequations can grow exponentially in the number of variables eliminated by Fourier’s method, and the need to compute splinters can only make matters worse. To control this problem in practice, Omega first eliminates variables that do not introduce splintering (projecting away x instead of y in Fig. , since the rational and dark shadows on the y axis are identical), and from these nonsplintering variables selects those that minimize the growth in the number of inequations. These steps appear to control the growth of inequalities during dependence analysis [, ].
The Gist Operation As noted in Sect. Advanced Analysis Techniques: The Omega Test, we may wish to classify the flow dependence of Fig. b as “conditional,” since there are values
of n and k for which no dependence exists and the iterations of the loop can be run in parallel. A simple definition of conditional dependence could be implemented by eliminating the loop index variables and identifying any nontautology as conditional. For this example, (∃i, i′ : i′ = i + k ∧ ⩽ i < n ∧ ⩽ i′ < n ∧ i < i′ ), or simply ( < k < n), is not a tautology, and this dependence would thus be marked as conditional. Unfortunately, this simple definition would categorize as “conditional” almost any dependence inside a loop with a symbolic upper bound, such as that in a – projecting away i and i′ leaves ( ⩽ n), but of course when n < there is no point in parallelizing the loop. To avoid such “false positives,” the Omega Test defines as conditional any dependence that does not exist under some conditions when we still wish to optimize the code, e.g., when a loop we wish to parallelize runs multiple iterations. This definition makes use of the Omega Library’s “gist” operation as well as projection. Informally, the “gist” of one set of constraints p (i.e., those in which a dependence exists) given some other conditions q (i.e., those in which we care about the dependence) is the “interesting” information about p given that q is true. To give precise semantics while allowing flexibility in the processing of “interesting information,” gist is formally defined in terms of the values in the sets/relations being constrained, not the collection of constraints giving the bounds: “(gist p given q)” is any set/relation such that p ∧ q = (gist p given q) ∧ q. Figure gives a graphical illustration of the gist operation. Both examples illustrate, with a speckled region, the gist of a vertically striped region given a horizontally striped region. Both cases illustrate the formal definition of gist, as the speckled region overlaps the horizontally striped region in the same way the vertically striped region does. The option on the left illustrates the “usual” behavior that is produced by the algorithm described below: the gist is a superset of the original vertically striped region, with simple extensions of boundaries that defined that region. The possibility of results like that on the right allows flexibility in the definition of “simple” that is important when working with nonconvex sets. When p and q are conjunctions of affine (in)equations, the Omega Library computes (gist p given q) by first using the redundancy detection steps described above to check the (in)equations of p for
Simple constraints for gist
Complex yet legal constraints for gist
Omega Test. Fig. Examples of the gist operation: usual and unusual-but-legal options
redundancy with respect to p ∧ q (to check an (in)equation pi, it uses pi's hash key to determine if any other single (in)equation might make pi redundant; checks for variables that are not bounded in one direction and thus might show pi cannot be redundant; and (if pi is an inequation) compares pi with pairs of inequations). If these "quick tests" fail to classify an (in)equation as redundant or not, Omega optionally performs the definitive but potentially expensive satisfiability test of a conjunction of all (in)equations from p ∧ q but with pi negated. The conjunction of (in)equations of p that are not marked as redundant in p ∧ q is then returned as (gist p given q). Returning to the codes of Fig. , the flow dependence from Example b is marked as conditional, as (gist ( < k < n) given n > ) = ( < k < n). Example a's dependence is not, since (gist ( ⩽ n) given n > ) = True. More details on the gist operation, such as the algorithms used when p and q are not both conjunctions of (in)equations, can be found in [, , ].
Disjunctive Normal Form Value-based dependence analysis, and certain cases of memory-based analysis (e.g., for programs with conditional statements) may produce constraints that are not simple conjunctions of equations and inequations. In Fig. , the memory-based dependence from the write to A[i] to the read of A[i − ] is {[, t, , i, ] → [, t′ , , i′ , ]∣i′ − = i ∧ ⩽ t, t′ < T ∧ ⩽ i, i′ < N − ∧ t < t ′ }, or more concisely {[, t, , i, ] → [, t′ , , i + , ]∣ ⩽
t < t′ < T ∧ ⩽ i ⩽ N − }. To describe the actual flow of values, the Omega Test must rule out all cases in which the value written to A[i] is overwritten before the read, i.e., any iteration {[, t′′, , i′′, ]} that writes to the same location (so i′′ = i) and occurs between the original write and the read (so t < t′′ < t′) during a valid execution of the statement ( ⩽ t′′ < T ∧ ⩽ i′′ < N − ); in other words it must represent the dependence relation {[, t, , i, ] → [, t′, , i + , ] ∣ ⩽ t < t′ < T ∧ ⩽ i ⩽ N − ∧ ¬(∃t′′, i′′ : i′′ = i ∧ t < t′′ < t′ ∧ ⩽ t′′ < T ∧ ⩽ i′′ < N − )}. Omega handles such formulae by converting them into disjunctive normal form, i.e., a list of conjunctions of the form given in Sect. Conjunctions of Affine Equations and Inequations whose disjunction is equivalent to the original formula. Conversion to disjunctive normal form can increase the size of the formula exponentially; in practice, the explosion in formula size during dependence analysis is largely due to constraints that, once negated, contradict the positive constraints (on the first line of our example formula). This problem can be controlled by applying the rule a ∧ ¬b = a ∧ ¬(gist b given a). (Refer back to Fig. for the intuition behind this rule – since the speckled region (gist b given a) must intersect a in exactly the way b
did, so must the complement of the speckled region ¬(gist b given a) intersect a as ¬b does.) This use of gist turns the formula above into {[, t, , i, ] → [, t′ , , i + , ]∣
⩽ t < t′ < T ∧ ⩽ i ⩽
N − ∧ ¬(t ⩽ t′ − )}. The inequation t ⩽ t ′ − can then be negated to produce t ′ ⩽ t + rather than the disjunction of eight inequations that would have arisen without gist. In this context, Omega applies only the “quick tests” for gist, as there is no point in the optional full computation here. Techniques from Sect. Conjunctions of Affine Equations and Inequations show our example formula is satisfiable and further simplify it to our final description of the value-based flow dependence: {[, t, , i, ] → [, t + , , i + , ]∣ ⩽ t < T − ∧ ⩽ i ⩽ N − }. The Omega Library can apply this conversion to disjunctive normal form recursively, allowing it to operate on the full language of “Presburger Arithmetic”: arbitrary formulae involving conjunction, disjunction, and negation of linear equations and inequations. Note that this logic has super-exponential complexity [], and queries lacking the properties discussed above may well exhaust all available time or memory or produce integer overflow. The Omega Test actually performs value-based analysis in a slightly different way from that described here, to “factor out” the effects of an overwrite on many different dependences (see [, ] for this and other details). However, the need for, and implementation of, negated constraints follows the principles stated here.
Non-affine Terms and Uninterpreted Function Symbols The constraint-manipulation techniques of the previous paragraphs can be used for memory-based, value-based, and conditional dependence analysis as long as the terms that are gathered from the program are affine functions of the loop bounds and symbolic constants. When some program term is not affine, many dependence tests simply report a maximally conservative approximation ("some dependence of some sort might exist here") without further analysis. The Omega
Test can use two abilities of the Omega Library to produce more precise results, in many cases ruling out dependences despite the presence of nonaffine terms. The simplest approach to handling nonaffine terms is to approximate the troublesome constraint but retain precise information about other constraints. The Omega Library provides a special constraint UNKNOWN, to be used in place of any constraint that cannot be represented exactly. Its presence indicates an approximation of some sort, but does not prevent Omega from manipulating other constraints or producing an exact final result in some cases (note that (UNKNOWN ∨ True = True) and (UNKNOWN ∧ False = False); however, since the two unrepresentable constraints may arise from unrelated terms in a program, ¬UNKNOWN does not contradict UNKNOWN, and (UNKNOWN ∧ ¬UNKNOWN = UNKNOWN)). For example, suppose Fig. b updated A[i∗k] rather than A[i+k] — the constraint i′ = i+k must be replaced with i′ = i⋅k, but of course this is outside the domain of the Omega Library. (Polynomial terms in dependence constraints were explored in [] and by a number of authors not related to the Omega Project, but never released in Omega.) Thus, the Omega Test instead produces the relation {[i] → [i′] ∣ UNKNOWN ∧ ⩽ i < n ∧ ⩽ i′ < n ∧ i < i′}. A more detailed description can be created by replacing the unrepresentable term rather than the entire unrepresentable constraint. Omega records such a term as an uninterpreted function symbol – a term that indicates a value that may depend on some parameters. For example, i⋅k depends only on i and k, and can thus be represented as f(k, i) for some function f. While UNKNOWNs cannot be combined in any informative way, two occurrences of the same function can be combined in any context in which their parameters can be proved to be equal, e.g., (f(k, i) > x ∧ i = i′ ∧ f(k′, i′) < x − ∧ k′ = k) can be simplified to False. This principle holds even when f is not a familiar mathematical function, most notably any expression e nested in loops i1, i2, …, in can be represented as fe(i1, i2, …, in). The use of function symbols can improve the precision of dependence analysis, and when this does not eliminate a dependence, the Omega Test may be able to disprove it by including user assertions involving function symbols (if the programmer has provided them).
Presburger Arithmetic with uninterpreted function symbols is undecidable in general, and the Omega Library currently restricts function parameters to be a prefix of the input or output tuple of the relation. The Omega Test uses these tuples to represent the values of loop index variables at the source and sink of a dependence, and thus must approximate in cases in which a function is applied to some other values, in particular the values of loop indices at the time of an overwrite that kills a value-based dependence (i.e., t′′ and i′′ in the relation in Sect. Disjunctive Normal Form). Further detail about the treatment of nonaffine constraints in the Omega Test, including empirical studies, can be found in [].
Current Status and Future Directions The Omega Project's constraint manipulation algorithms are described in the aforecited references and the dissertations of Bill Pugh's students. Algorithms with relatively stable implementations were generally released as the Omega Library, a freely available code base which can still be found on the Internet (an open-source version containing updates from a number of Omega Library users can be found at http://github.com/davewathaverford/the-omega-project/). Notable omissions from the Omega Library code base include the code for iteration-space slicing, implementations of algorithms for polynomial constraints, some "corner cases" of the full Presburger satisfiability-testing algorithm, and any way of handling cases in which the core implementation of integer variable elimination produces extremely large coefficients. The Omega Library is generally distributed with the Omega Calculator, a text-based interface that allows convenient access to the operations of the library. The constraint-based/polyhedral approach that was pioneered by Bill Pugh's Omega Project and by Paul Feautrier and his colleagues in France remains central to a number of ongoing compiler research projects. It is also a valuable tool in industrial compilers such as that produced by Reservoir Labs, Inc. For a discussion of the state-of-the-art before this work, see Michael Wolfe's "High-Performance Compilers for Parallel Computing" []; additional discussion of the instance-wise approach to reasoning about program transformations can be found in Jean-Francois Collard's "Reasoning
About Program Transformations: Imperative Programming and Flow of Data" []. A number of constraint-manipulation program analysis/transformation techniques and associated libraries have been developed since the release of the Omega Library. These typically contain more modern algorithms for a number of advanced functions of Omega, such as code generation, or more general implementations of underlying constraint algorithms, for example avoiding Omega's use of limited-range integers. As of the writing of this article, most do not support all of the techniques described in this article, notably:
● Algorithms that are (at least in principle) exact for integer variables
● Algorithms for the full domain of Presburger Arithmetic
● Separation of dependence analysis and transformation from the core constraint system via
  − An API allowing high-level relation/set operations and low-level operations
  − A text-based interface to allow easy exploration of the abilities of the system
● Special terms for communicating with the core constraint system about information not in its primary domain, e.g., function symbols
● Tagging of approximations to distinguish them from exact results
● Iteration space slicing (never released with Omega)
Bibliography . Collard J-F () Reasoning about program transformations. Springer, New York . Feautrier P () Array expansion. In: ICS, St. Malo, pp – . Fischer MJ, Rabin MO () Super-exponential complexity of Presburger arithmetic. In: Karp RM (ed) Proceedings of the SIAM-AMS Symposium in Applied Mathematics, vol , pp –, Providence, AMS . Garey MR, Johnson DS () Computers and intractability: a guide to the theory of NP-completeness. W.H. Freeman and Company, New York . Kelly W, Pugh W () Finding legal reordering transformations using mappings. In: LCPC, pp – . Kelly W, Pugh W () Minimizing communication while preserving parallelism. In: Proceedings of the th international conference on supercomputing, ICS ', ACM, Philadelphia, New York, pp – . Kelly W, Pugh W, Rosser E, Shpeisman T () Transitive closure of infinite graphs and its applications. In: Proceedings of the th
international workshop on languages and compilers for parallel computing. LCPC ', Springer-Verlag, London, pp – . Maslov V, Pugh W () Simplifying polynomial constraints over integers to make dependence analysis more precise. In: Proceedings of the third joint international conference on vector and parallel processing: parallel processing, CONPAR -VAPP VI, Springer-Verlag, London, pp – . Pugh W () The omega test: a fast and practical integer programming algorithm for dependence analysis. In: Proceedings of the ACM/IEEE conference on supercomputing, Supercomputing ', Albuquerque, ACM, New York, pp – . Pugh W () Uniform techniques for loop optimization. In: Proceedings of the th international conference on supercomputing, ICS ', Cologne, ACM, New York, pp – . Pugh W () A practical algorithm for exact array dependence analysis. Commun ACM, New York, ():– . Pugh W, Rosser E () Iteration space slicing and its application to communication optimization. In: Proceedings of the th international conference on supercomputing, ICS ', Vienna, ACM, New York, pp – . Pugh W, Rosser E () Iteration space slicing for locality. In: Proceedings of the th international workshop on languages and compilers for parallel computing, LCPC ', Springer-Verlag, London, pp – . Pugh W, Wonnacott D () Eliminating false data dependences using the Omega test. In: SIGPLAN Conference on Programming Language Design and Implementation, pp –, San Francisco, California, June . Pugh W, Wonnacott D () Experiences with constraint-based array dependence analysis. In: Principles and Practice of Constraint Programming, Second International Workshop, vol , Lecture Notes in Computer Science, pp –. Springer, Berlin, May . Also available as Tech. Report CS-TR-, Department of Computer Science, University of Maryland, College Park . Pugh W, Wonnacott D () Constraint-based array dependence analysis. ACM Trans Program Lang Syst ():– . Wolfe M () High performance compilers for parallel computing. Addison-Wesley, Boston . Wonnacott D () Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In: Proceedings of the th international symposium on parallel and distributed processing, IEEE Computer Society, Washington, pp – . Wonnacott D () Achieving scalable locality with time skewing. Int J Parallel Program ():– . Wonnacott DG () Constraint-based array dependence analysis. PhD thesis, Department of Computer Science, The University of Maryland, August . Available as ftp://ftp.cs.umd.edu/pub/omega/davewThesis/davewThesis.ps . PolyLib - A library of polyhedral functions, Universite de Strasbourg, icps.u-strasbg.fr/polylib . Frameworks supporting the polyhedral model. Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Frameworks_supporting_the_polyhedral_model
One-to-All Broadcast Broadcast
Open Distributed Systems Peer-to-Peer
OpenMP Barbara Chapman, James La Grone University of Houston, Houston, TX, USA
Definition OpenMP is an application programming interface for parallelizing sequential programs written in C, C++, and Fortran on shared-memory platforms. It provides a collection of compiler directives, a runtime library, and environment variables to enable programmers to specify the parallelism they desire to exploit in a program.
Discussion Introduction The OpenMP Application Programming Interface (API) is a parallel programming model for shared-memory computer systems intended to provide a straightforward means of exploiting concurrency inherent in many algorithms. With the insertion of compiler directives, the programmer directs the compiler to parallelize portions of the code at a high level. OpenMP was originally designed to target loop-centric algorithms; version . of the API introduced the ability to define explicit tasks that may be executed concurrently. The OpenMP API is an agreement among hardware and software vendors that make up the OpenMP Architecture Review Board (ARB). The origins of the API are found in a set of compiler directives for writing parallel Fortran compiled by the Parallel Computing Forum (PCF) in the late s. In , the newly formed ARB introduced the first OpenMP specification for a set of
directives for use with Fortran and OpenMP compilers soon followed. Bindings for C and C++ have since been defined and the feature set has grown. Version ., the most recent version, was ratified in . The ARB is a non-profit corporation composed of members from industry and academia, which owns the OpenMP brand and manages the OpenMP specification. OpenMP does not require the programmer to explicitly decompose data and control flow to produce parallel computation. This fragmented style of programming is characterized by low-level programming to control the details of the concurrency. Rather, OpenMP allows the programmer to take a high-level view of parallelism and leave the details of the concurrency to the compiler. With the insertion of OpenMP directives, the programmer can specify what portions of sequential code should be executed in parallel. To a non-OpenMP compiler, the directives look like comments and are ignored. So by observing a few rules, one application can be both sequential and parallel, a quality that is quite useful during development and testing. Another benefit is the ability to incrementally apply OpenMP constructs to create a parallel program from existing sequential codes. Rather than start from scratch, the programmer can insert parallelism into a portion of the code and leave the rest sequential, repeating this process until the desired speedup is realized. This also means that to use OpenMP one need only learn a small set of constructs, not an entirely new language. For a sample of how to use OpenMP, consider this fragment of C code that multiplies two matrices, a and b, and stores the result in a third matrix, c.

initialize_arrays(a,b,c,K,M,N);
#pragma omp parallel private(i,j,k) shared(a,b,c)
{
  #pragma omp for schedule(auto)
  for (i = 0; i < N; i++)
    for (k = 0; k < K; k++)
      for (j = 0; j < M; j++)
        c[i][j] = c[i][j] + a[i][k]*b[k][j];
} /* end omp parallel */
As with any sequential program, execution starts with a single thread of control. The #pragma omp
identifies a line of code to be an OpenMP parallel directive, which specifies a structured block of code that should be treated as a parallel region. In this case the parallel region is enclosed with {}. When the initial thread of control reaches a parallel directive, it creates a team of threads, all of which will then execute the code in the parallel region. The initial thread becomes the team's master thread while the others are the slaves. The entire team will be available to share the work of the parallel region. This parallel construct includes a private clause and a shared clause. These are used to designate how variables are to be treated in the parallel region. In this case, the loop iteration variables will be treated as local to individual threads while the matrices will be shared. Improperly characterized data can lead to incorrect results and errors, so this must be carefully considered. These and other clauses will be discussed later. This need to designate private and shared data is essentially what sets shared-memory programming apart from other methods. The next directive identified by #pragma omp identifies the for construct, which designates that the immediately following loop is to be executed in parallel. Iterations of the loop will be distributed among the threads of the team in the enclosing parallel region according to the specified loop schedule. In this example, the auto schedule will allow the compiler and runtime to decide on the actual scheduling of the loop iterations.
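As a small, hedged illustration of the execution model just described (this program is not part of the matrix-multiply example), the OpenMP runtime library can be used to observe the team that executes a parallel region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel              /* the initial thread forks a team here */
    {
        int id = omp_get_thread_num();   /* 0 for the master thread */
        int n  = omp_get_num_threads();  /* size of the current team */
        printf("thread %d of %d\n", id, n);
    }                                    /* implicit barrier and join */
    return 0;
}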
Overview of Features The OpenMP API relies on directives, library routines, and environment variables for expressing the parallelism desired in a given program. Some of these were mentioned previously. A directive and its accompanying structured block form a construct. Directives may have various associated clauses that provide further information to the OpenMP implementation. As seen above, directives for C and C++ begin with #pragma omp. Directives in Fortran can begin with !$omp, but other options are available. The fundamental parallel directive defines a parallel region, marking that the structured block of code therein should be executed by multiple threads. Without a parallel construct, the code will be executed sequentially. Each parallel construct designates whether data should be treated as private or shared. Data not
explicitly designated with a clause is shared by default. Since a parallel region is formed with a structured block, it must have only one entry point and one exit point. A program is non-conforming if it branches into or out of a parallel region. Various constructs may be used with the parallel construct to designate worksharing, tasks, synchronization, and more. The worksharing constructs include the loop, sections, single, and workshare constructs. Each worksharing construct must bind to an active parallel region before its effects will be realized. Worksharing directives encountered by a single thread are ignored and the accompanying code will be executed sequentially, i.e., outside of a parallel region or inside a parallel region with a team of one thread. Clauses used with a parallel directive may be used to designate conditional use, the number of threads, data sharing properties, and reductions, some of which will be discussed below. Worksharing constructs end with an implicit barrier, causing all threads to wait for the construct to complete. A worksharing construct that appears in a function or subroutine that is invoked from within a parallel region is called an orphan directive. Orphan directives may also be called from outside a parallel region allowing sequential or parallel execution of the function. Worksharing constructs are used to define how the work in a block of code should be distributed across the executing threads. The loop construct (in C/C++ for, in Fortran do) is used to execute iterations of loops concurrently. The sections construct allows multiple blocks of code to be executed concurrently, each by a single thread. The single construct associates with the immediately following structured block for execution by a single, arbitrary thread. The workshare construct is supported only for directing the parallel execution of Fortran programs written using Fortran array syntax. Synchronization of threads in a team is achieved through another set of constructs. A barrier is a point of execution where threads must wait for all other threads in the current team before proceeding. While many constructs have implicit barriers, an explicit barrier is accomplished with the barrier directive. Barriers are used to avoid data race conditions. Access to shared data can be isolated to a critical region and protected using the critical directive or, for single assignment statements, the atomic directive. Explicit locks are also available for more flexible access of shared data.
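A hedged sketch combining several of these constructs follows; expensive_work is an invented routine standing in for real computation.

#include <stdio.h>

double expensive_work(int part);   /* hypothetical work routine */

void demo(void)
{
    double total = 0.0;

    #pragma omp parallel shared(total)
    {
        #pragma omp sections             /* each section is run by one thread */
        {
            #pragma omp section
            {
                double t = expensive_work(0);
                #pragma omp critical     /* serialize updates of the shared sum */
                total += t;
            }
            #pragma omp section
            {
                double t = expensive_work(1);
                #pragma omp critical
                total += t;
            }
        }                                /* implicit barrier at the end of sections */

        #pragma omp single               /* one arbitrary thread reports the result */
        printf("total = %f\n", total);
    }
}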
Data in OpenMP programs is shared by the threads in the team by default. Any modification of shared data is guaranteed to be available to other threads after the next barrier or other synchronization point in the program. However, this is not always appropriate and some data will need to be local, or private, to each thread. A variable designated as private is replicated among the threads as a local copy. If the private variable needs to be initialized by the value of the corresponding variable immediately prior to the construct, the variable should be designated firstprivate. If the final value of a private variable is needed after the construct completes, the variable can be designated as lastprivate, whereby the final value of the variable inside the construct will be saved for the corresponding variable. Variables may also be designated as reduction variables with an associated operator. These variables will be used privately by each thread to perform its share of the work and then combined according to the operator as each thread completes its chunk of work. OpenMP also specifies how a programmer may interact with the runtime environment. The API defines a set of internal control variables (ICVs) for controlling the execution at runtime. These include controlling the number of threads in a thread team, enabling nested parallelism, and specifying the scheduling of iterations in loop constructs. These ICVs may be accessed by setting environment variables or using runtime library routines. In some cases, the ICV may be set with a clause on the relevant directive.
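A hedged illustration of these data-sharing clauses (the loop body is invented):

void scale_and_sum(double *a, int n, double scale)
{
    double sum = 0.0;
    int i, last = 0;

    /* sum is combined across threads; scale is copied into each thread;
       i and last are per-thread, and the value of last from the sequentially
       final iteration is copied back out */
    #pragma omp parallel for firstprivate(scale) lastprivate(last) reduction(+:sum)
    for (i = 0; i < n; i++) {
        a[i] = a[i] * scale;
        sum += a[i];
        last = i;
    }
    /* here sum holds the total and last holds n-1 (if n > 0) */
}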
Loop Parallelism Since a significant amount of work occurs in loops, it is important to exploit any parallelism present. Any work that can be executed without depending on the outcome of other work can be executed concurrently. This kind of concurrency often occurs in the iterations of loops. As long as each iteration does not depend in some way on the outcome of a previous iteration, iterations can be executed concurrently. Existing loop-carried dependences can often be removed with some reorganization to reveal concurrency. The most commonly used worksharing construct is the loop construct, which is marked with

#pragma omp for [clause[[,]clause]…]
  for-loops
in C/C++ and

!$omp do [clause[[,] clause]…]
  do-loops
[!$omp end do [nowait]]

in Fortran. This directs the implementation to distribute loop iterations among available threads. The number of iterations in the loop must be countable with an integer and use a fixed increment. This means only for loops in C/C++ and do loops in Fortran. Some rewriting of the loop may be necessary to expose parallelism. Clauses may be used to prescribe the loop construct's data environment and loop scheduling, among other things. Loop scheduling refers to how loop iterations are assigned to threads in the team. Schedules are easily specified using the clause schedule(kind[, chunk_size]). The kind can be static, dynamic, guided, auto, or runtime. If no loop schedule is specified, a static schedule will be used. This schedule will "chunk" loop iterations together in contiguous nonempty sets. Each chunk will be assigned to a thread in a round-robin fashion. The size of a chunk is set by the chunk_size value. If the chunk size is not specified, all chunks will be approximately the same size and at most one chunk will be assigned to each thread in the team. For varying workloads, the dynamic and guided schedules may be more appropriate. With the dynamic schedule, each thread will continually grab a chunk of iterations until all chunks have been executed, using a chunk size of one if none is specified. The guided schedule is similar, except that the chunk size decreases as threads request subsequent chunks; the chunk size is set to be proportional to the number of remaining iterations. The auto schedule allows the implementation to decide the schedule, which may be any possible distribution of iterations amongst the thread team. The runtime schedule defers the actual scheduling decision until execution, at which time the schedule and chunk size are read from environment variables via a library routine.
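A hedged example of an unbalanced loop where the schedule clause matters (the work routine is invented and the chunk size of 4 is arbitrary):

double work_on_row(int i);    /* hypothetical routine whose cost grows with i */

double triangular(int n)
{
    double sum = 0.0;
    int i;

    /* with a plain static schedule, the threads assigned the last rows would
       receive the most expensive work; dynamic hands out chunks of four rows
       on demand instead */
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += work_on_row(i);

    return sum;
}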
Task Parallelism More flexibility is often needed to exploit parallelism in code, especially where the amount of parallelism is unknown, as with recursive algorithms or processing pointer-based structures. When the task directive is encountered by a thread, the associated block of code will be made ready for execution at some time in the future. There is no guarantee when the task will be executed. When the task is given to a thread for execution, it will be executed sequentially. The implementation is allowed to suspend the execution and resume at a later time. By default, the thread that begins a task must complete the task in its entirety. Any suspension in the execution of the task must be resumed on the thread on which it began. This is called a tied task. A task designated untied may resume suspended execution on any thread. There is no guarantee of the ordering of task execution or completion; completion is enforced with the use of task synchronization constructs. Tasks may be suspended or resumed when a thread reaches a task scheduling point. Task scheduling points occur at the point immediately following explicit task creation, the end of the task construct, any barriers, and in the taskwait region. The taskwait directive causes the current task to wait on all child tasks it has generated.
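A hedged sketch of task parallelism over a recursive structure (the tree type and the per-node work routine are invented):

#include <stddef.h>

typedef struct node { struct node *left, *right; } node_t;

void visit(node_t *n);                 /* hypothetical per-node work */

static void traverse(node_t *n)
{
    if (n == NULL) return;
    /* each subtree becomes a task; the pointer is captured at creation time */
    #pragma omp task firstprivate(n)
    traverse(n->left);
    #pragma omp task firstprivate(n)
    traverse(n->right);
    visit(n);
    #pragma omp taskwait               /* wait for the two child tasks */
}

void traverse_tree(node_t *root)
{
    #pragma omp parallel               /* create the team */
    #pragma omp single                 /* one thread seeds the task pool */
    traverse(root);
}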
Implementation Implementations of OpenMP usually consist of an OpenMP-aware compiler and its runtime library (RTL). The compiler is responsible for recognizing and properly translating the program's OpenMP constructs while the RTL is responsible for thread creation and management at execution time. Most mainstream compilers that implement C, C++, and Fortran are capable of translating OpenMP. Any directives and RTL calls are translated along with the base-language program when the proper compiler options are used. Omission of these options will cause OpenMP directives to be ignored and result in a sequential program. Some constructs are merely replaced with a call to the RTL, like the barrier and taskwait. Implementations typically convert the structured block marked by a parallel directive into a separate procedure in a process called outlining. Any references to shared variables are replaced with pointers to the variables' memory locations and passed to the outlined procedure as arguments. Private variables are local to the outlined procedure. Values for firstprivate variables are also passed as
arguments. A runtime library routine will fork the necessary threads and pass the outlined procedure to each thread with the needed arguments. Threads often sleep when not involved in an active team. For performance reasons, the user can direct whether the threads will sleep or busy-wait when idle. A parallel loop will likely have each thread execute a runtime routine to determine iteration chunks prior to executing the iterations, followed by a call to a barrier routine. A guided or dynamic schedule may have threads carry out more than one set of iterations by invoking a routine to get a remaining set of iterations. These schedules will have slightly more overhead because of this additional coordination. However, these schedules may execute the work fast enough to justify the added overhead. A task can also be outlined to a procedure and variables passed as arguments similar to the parallel construct. Variables in tasks are firstprivate by default and their values must be saved at the time of task generation. Shared variables may be replaced with pointers to their memory locations as with the parallel construct. Each task is saved for execution at some time in the future, likely in a queue. The actual scheduling of the execution of the tasks is implementation dependent and varies greatly. OpenMP allows for work stealing, whereby a thread may take unfinished tasks from other threads. By default, a task must be executed in its entirety by exactly one thread. If a task is defined to be untied, it may finish execution on a different thread, making it a candidate for work stealing. As such, implementations may actually employ more than one schedule.
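As an illustration of outlining, a compiler might translate the matrix-multiply parallel region shown earlier into something of the following shape. This is only a sketch: the entry points __rtl_chunk, __rtl_barrier, and __rtl_fork are invented names, not those of any particular runtime, and a real compiler would generate different code.

extern void __rtl_chunk(int n, int *lo, int *hi);       /* hypothetical RTL routines */
extern void __rtl_barrier(void);
extern void __rtl_fork(void (*fn)(void *), void *arg);

typedef struct { double *a, *b, *c; int N, M, K; } region_args;

/* Outlined form of the parallel region: shared arrays arrive by reference,
   private loop variables become locals of the outlined procedure. */
static void outlined_region(void *argp)
{
    region_args *p = (region_args *) argp;
    int i, j, k, lo, hi;
    __rtl_chunk(p->N, &lo, &hi);         /* this thread's share of the i loop */
    for (i = lo; i < hi; i++)
        for (k = 0; k < p->K; k++)
            for (j = 0; j < p->M; j++)
                p->c[i * p->M + j] += p->a[i * p->K + k] * p->b[k * p->M + j];
    __rtl_barrier();                     /* implicit barrier of the loop construct */
}

/* At the parallel directive the compiler emits, in effect:
   region_args args = { a, b, c, N, M, K };
   __rtl_fork(outlined_region, &args);                                        */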
Performance of OpenMP Programs While OpenMP makes parallel programs easy to write, some extra work is usually necessary to allow complex programs to scale to high numbers of processors. Inherent in parallel programming is the presence of overhead, or work that would not otherwise be done in a sequential execution of the code. Each construct used will require some overhead to set up the parallel execution environment, so constructs should only be used where sufficient parallel work is available. In addition to avoiding unnecessary overhead, the use of the memory hierarchy needs some careful attention. Good performance of an OpenMP program begins with optimizing its sequential counterpart. This often involves reorganizing array accesses in loops and
removal of redundant code which may not be done by the compiler. Once a satisfactory version of the sequential code is obtained, OpenMP constructs may be inserted one at a time, ensuring with each insertion that the resulting execution remains valid. This ability to add parallelism incrementally is one of OpenMP's strong points. One guiding principle in this process should be to keep all threads doing meaningful work as much as possible. Use of the parallel construct is generally quite expensive as it must deal with the management of the thread teams. The number of parallel constructs should be few, enclosing as much code as possible. An if clause is available to create a parallel region only on the condition that sufficient parallel work can be achieved. Synchronization of threads is also expensive, so it is important to consider carefully when it is necessary and avoid it otherwise. Since worksharing constructs have implied barriers, it is easy to have redundant barriers occur in code without being obvious. The nowait clause can be used to remove such implied barriers. Since OpenMP is a shared-memory programming model, some forethought as to how shared variables will be accessed is necessary. It is also prudent to minimize cache misses. Ensuring that array accesses in loops follow the storage order defined by the base-language specification is one easy method. For instance, Fortran arrays are stored in memory in column-major order while C and C++ arrays are row-major. Nested loops should take this into account. Otherwise loops will likely access noncontiguous data on each subsequent iteration, causing a cache miss each time the array is accessed. Cache coherent systems have a side effect known as false sharing, the interference among threads writing to different locations within the same cache line. A write by one thread will cause the system to notify the other caches of this modification, causing them to update their copy of that entire cache line. Other threads accessing different locations in the cache line will have to wait for this update. While the threads in this case are not accessing shared data, the cache line is shared. This is a matter of performance, not correctness, but can prevent a program from scaling to a large number of threads. In the following example, a parallel do construct exhibits false sharing. If N is the number of threads, each thread will execute one iteration of the loop given the chunk size of . When each thread updates one element of b it also invalidates the cache
line containing it. All other threads accessing that cache line will be affected and will need to get a new copy before their respective array elements are even modified.

!$OMP PARALLEL DO shared(b,N) schedule(static,1)
DO i = 1,N,1
  b(i) = 4 * SIN(i*1.0)
END DO

While array padding can help prevent false sharing, a better solution is to avoid the conditions in which false sharing will occur. When multiple threads are modifying shared data in the same cache line(s) in rapid succession, false sharing is likely to considerably impact performance. Ensuring a balanced workload among the threads will also lead to better performance. The time spent in computation between synchronization points should be comparable across threads. Use of the runtime schedule is helpful in experimenting to find the best loop schedule to balance the load in a given program.
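Returning to the false-sharing example above, a hedged C sketch of one way to avoid it: when the problem is that neighboring elements are written by different threads (here because of the chunk size of one), letting the static schedule hand each thread a single contiguous block keeps most of each thread's writes in cache lines no other thread touches (the array and bounds are invented):

#include <math.h>

void fill(double *b, int n)
{
    int i;
    /* the default static schedule gives each thread one contiguous block of
       iterations, so threads write disjoint ranges of b and share cache lines
       only at the block boundaries */
    #pragma omp parallel for shared(b) schedule(static)
    for (i = 0; i < n; i++)
        b[i] = 4.0 * sin(i * 1.0);
}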
Correctness Considerations It is actually quite easy to inadvertently introduce bugs into OpenMP programs. Bugs in shared-memory programming are usually very subtle and difficult to find. One such bug, the data race condition, involves the silent corruption of data during execution; it is hard to detect because it may manifest itself only a small fraction of the time. A data race condition occurs when multiple threads concurrently access the same shared data with at least one of the threads attempting to modify the data. Since there is no guarantee of the order in which the threads access data, this may lead to an incorrect result. Data race conditions may result from the lack of proper synchronization, like using a nowait incorrectly or neglecting to enclose such data access in a critical region. It is usually good practice to minimize the sharing of variables since this can be problematic. By privatizing variables that do not need to be shared, data race conditions and other issues can be avoided. And, since variables are shared by default, overlooked variables may be unintentionally shared. For example, in a Fortran loop, the index variables are always treated as private, but they are only private in a parallel-for loop in C. So in C/C++, the index variable of an inner loop will be shared if not explicitly privatized. Overlooked shared variables are also a source of the data race condition. Another problem is the misunderstanding of which constructs have an implied barrier and which do not. In many cases, the single and the master constructs may be used for the same purpose. However, the former has an implied barrier and the latter does not. When using the master construct, it is important to know if an explicit barrier is needed afterwards. With the single, a nowait may be warranted to remove an unnecessary barrier. Barriers may also lead to deadlock during execution if they are not used properly. An explicit barrier must be executed by all threads in the current thread team. If any thread does not reach the barrier, all the other threads will be waiting at the barrier for a thread that will never arrive. Another source of deadlock is the improper nesting of explicit locks. Deadlocked code will appear to be executing work but will never finish. These conditions are easily avoided with careful use of these constructs.
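As a hedged illustration of the data race condition discussed above, and one common repair, consider the following pair of functions (the names are invented):

/* Racy version: every thread updates the shared variable nfound without
   synchronization, so updates can be lost. */
int count_matches_racy(const int *a, int n, int key)
{
    int i, nfound = 0;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        if (a[i] == key)
            nfound++;              /* data race on the shared nfound */
    return nfound;
}

/* Repaired version: the reduction clause gives each thread a private copy
   that is combined at the end of the loop. */
int count_matches(const int *a, int n, int key)
{
    int i, nfound = 0;
    #pragma omp parallel for reduction(+:nfound)
    for (i = 0; i < n; i++)
        if (a[i] == key)
            nfound++;
    return nfound;
}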
Future Directions The ARB is actively considering new strategies and features to add to or enhance the current API, deliberating on changes frequently being proposed by researchers. The ARB seeks to only make changes in the API that are generally acceptable to all vendors on all platforms. The challenge they face is to keep the API small enough to be relatively easy to use yet robust enough to offer sufficient expressivity for existing parallelism. Proposals under consideration include error handling mechanisms, support for performance tools, enhancing the current task interface, providing the mapping of work and data to threads, and the ability to exploit heterogeneous architectures such as GPUs and accelerators.
Related Entries
Cilk
Code Generation
Loop Nest Parallelization
OpenMP Profiling with ompP
Shared-Memory Multiprocessors
Bibliographic Notes and Further Reading The Parallel Computing Forum was an informal group composed of representatives from industry and academia. Their parallel Fortran extensions were published in as a set of compiler directives for parallelizing loops. Though an official subcommittee for the American National Standards Institute (ANSI) was formed and a new standard drafted based on PCF in , it was never adopted as interest dwindled and other parallel programming models gained attention. In the latter half of that decade, the OpenMP ARB formed and introduced the first version of OpenMP, based on the PCF work, in . Later, a C/C++ specification was developed. These two specifications eventually were merged into a single document. Version . of the specification was ratified in . Plans are underway in for versions . and .. In addition to the current specification of the API, the OpenMP website [] provides news about OpenMP, tutorials, a discussion forum, and links to more information about OpenMP. The International Workshop on OpenMP (www.iwomp.org) has been held annually since and is an excellent source for ongoing research by OpenMP users, developers, and the ARB. Publications are available annually. Recent papers of interest include []. Books by Chapman [] and Chandra [] provide more detailed discussions of OpenMP, its use, and its implementation.
Bibliography . OpenMP Application Program Interface, Ver. ., May . http://www.openmp.org . Ayguadé E et al () A proposal to extend the OpenMP tasking model for heterogeneous architectures. In: Evolving OpenMP in an age of extreme parallelism; th International Workshop on OpenMP, IWOMP , Dresden, Germany, vol /. Lecture Notes in Computer Science, Springer, Berlin/Heidelberg, pp – . Ayguadé E et al () The design of OpenMP tasks. IEEE Trans Parallel Distrib Syst ():– . Chandra R, Dagum L, Kohr D, Maydan D, McDonald J, Menon R () Parallel programming in OpenMP. Morgan Kaufmann, San Francisco
. Chapman B, Huang L, Biscondi E, Stotzer E, Shrivastava A, Gatherer A () Implementing OpenMP on a high performance embedded multicore MPSoC. In: IPDPS ': proceedings of the IEEE international symposium on parallel and distributed processing. IEEE Computer Society, Washington, DC, pp – . Chapman B, Jost G, van der Pas R () Using OpenMP: portable shared memory parallel programming. MIT Press, Cambridge, MA . Compunity () cOMPunity – the community of OpenMP users. http://www.compunity.org/ . Duran A, Corbalán J, Ayguadé E () Evaluation of OpenMP task scheduling strategies. In: OpenMP in a new era of parallelism, vol /. Lecture Notes in Computer Science, Springer, Heidelberg, pp – . Itzkowitz M, Mazurov O, Copty N, Lin Y () An OpenMP runtime API for profiling. Technical report, Sun Microsystems. http://developers.sun.com/solaris/articles/omp-api.html . Parallel Computing Forum () PCF parallel Fortran extensions, V.. ACM SIGPLAN Fortran Forum ():– . Nanjegowda R, Hernandez O, Chapman B, Jin HH () Scalability evaluation of barrier algorithms for OpenMP. In: Evolving OpenMP in an age of extreme parallelism, vol /, IWOMP, Lecture Notes in Computer Science, Springer, Heidelberg, pp – . Su E, Tian X, Girkar M, Haab G, Shah S, Petersen P () Compiler support of the workqueuing execution model for Intel SMP architectures. In: The Fourth European Workshop on OpenMP, Rome, Italy . Terboven C, an Mey D, Sarholz S () OpenMP on multicore architectures. In: IWOMP ': proceedings of the rd international workshop on OpenMP, Springer, Berlin Heidelberg, pp –
OpenMP Profiling with OmpP Karl Fürlinger Ludwig-Maximilians-Universität München, Munich, Germany
Synonyms Collection of summary statistics; Performance analysis of OpenMP applications; Profiling
Definition Profiling is the process of gathering summary statistics about the execution of programs. Metrics of interest for OpenMP applications are the breakdown of time spent by threads in various sections of the program to
focus optimization efforts and the analysis of inefficiencies resulting from the parallel execution such as load imbalance and contention for locks and critical sections. Profiling is in contrast to tracing where individual time-stamped events are recorded and analyzed.
Discussion Introduction OpenMP is a successful approach for writing shared-memory thread-parallel programs. It currently enjoys a renewed interest in high-performance computing and other areas due to the shift toward multicore computing and it is fully supported by every major compiler. Extensions of OpenMP for programming GPU accelerators are also currently being investigated [] and are under consideration for inclusion into a future version of the OpenMP specification. As with any programming model, developers are interested in improving the efficiency of their codes and a number of performance tools are available to help in understanding the execution characteristics of applications. However, the lack of a standardized monitoring interface often renders the performance analysis process more involved, complex, and error-prone than it is for other programming models. OpenMP is a language extension (and not a library, like MPI) and the compiler is responsible for transforming the program into threaded code. As a consequence, even a detailed analysis of the execution of the threads in the underlying low-level threading package is of limited use for understanding the behavior at the OpenMP level. For example, compilers frequently transform parallel regions into outlined routines and the mapping of routine names (the machine's model of the execution) to the original source code (the user's model of the execution) is nontrivial. Vendor tools (paired with a compiler) can overcome this problem because they are aware of the transformations a compiler can employ and the naming scheme used for generating functions. Portable tools cannot have this kind of intricate knowledge for every compiler and OpenMP runtime and have to resort to other mechanisms, discussed later in this entry. The rest of this entry is organized as follows: Section "Profiling Tools Supplied by Vendors" discusses the vendor-provided solutions for OpenMP profiling, while Sect. "Platform-Independent Monitoring"
describes how platform-independent monitoring can be achieved using a source-to-source translator called Opari. Section “OpenMP Profiling with ompP” introduces the profiling tool ompP which is built on Opari to gather a variety of useful execution metrics for OpenMP applications. Sections “Overheads Analysis” and “Scalability Analysis” discuss the analysis of execution overheads and the scalability of applications based on the data delivered by ompP, respectively. Section “Future Directions” concludes with an outlook on the path forward.
Profiling Tools Supplied by Vendors OpenMP is today supported by the compilers of every major computer vendor and the GNU compiler collection has had OpenMP support since version .. Most vendors also offer performance tools, such as the Intel Thread Profiler [], Sun Studio [], or Cray Pat [], supporting users in analyzing the performance of their OpenMP codes. The vendor tools mostly rely on statistical sampling and take advantage of the knowledge of the inner workings of the runtime and the transformations applied by the compiler. Typical performance issues that can be successfully detected by these tools are situations of insufficient parallelism, locking overheads, load imbalances, and in some cases memory performance issues such as false sharing. A proposal for a platform-independent standard interface for profiling was created by SUN []. With this proposal, tools could register for notifications from the runtime and ask it to supply the mapping from the runtime to the user execution model. The proposal is the basis of SUN's own tool suite and is mostly used for statistical profiling there. The proposal is not part of the OpenMP specification but was accepted as an official white paper by the OpenMP Architecture Review Board (ARB). So far, acceptance of the proposal outside of SUN has been limited, however, and independent tool developers have to resort to other methods.
Platform-Independent Monitoring All approaches for monitoring OpenMP code in a platform-independent way in tools such as ompP [], Vampir [, ], TAU [, ], KOJAK [], and Scalasca [] today use source code instrumentation. This approach has some shortcomings, for example, it
POMP_Parallel_fork [master]
#pragma omp parallel
{
  POMP_Parallel_begin [team]

  /* user code in parallel region */

  POMP_Barrier_enter [team]
  #pragma omp barrier
  POMP_Barrier_exit [team]

  POMP_Parallel_end [team]
}
POMP_Parallel_join [master]

(In the original figure, the right-hand part marks the subregions main, body, enter, barr, and exit that ompP associates with these lines.)
OpenMP Profiling with OmpP. Fig. Instrumentation added by Opari for the OpenMP parallel construct. The original code is shown in boldface, the square brackets denote the threads that execute a particular POMP call. The right part shows the subregion nesting used by ompP
requires the user to recompile their code for instrumentation. However, it has been found in practice to work well for many applications, and the use of automated preprocessor scripts eases the burden for developers. The monitoring approach used by all the aforementioned tools is based on a source code preprocessor called Opari []. Opari is a part of the KOJAK and Scalasca toolsets and it adds calls inside and around OpenMP pragmas according to the POMP specification []. An example source code fragment with Opari instrumentation is shown in Fig. : in this example, the user code is a simple parallel region (the lines shown in bold represent the original user code); all other lines are added by Opari. POMP_Parallel_{fork|join} calls are placed on the outside of the parallel construct and POMP_Parallel_{begin|end} calls are placed as the first and last statements inside the parallel region, respectively. An additional explicit barrier and corresponding POMP_Barrier_{enter|exit} calls are placed toward the end of the parallel region as well. Doing so converts the implicit barrier at the end of the parallel region into an explicit barrier that can subsequently be monitored by tools, for example, to detect load imbalances among threads.
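The POMP calls inserted by Opari are implemented by whatever monitoring library a tool links into the application. The following is only a minimal sketch of such a layer: the function and variable names are ours, and the real POMP interface additionally passes a region-descriptor argument to every call, which is omitted here for brevity.

/* Simplified sketch of a POMP-style monitoring layer (not the exact POMP
 * signatures).  Each instrumented event records a per-thread wallclock
 * timestamp; begin/end pairs are accumulated into per-thread execution
 * times, similar in spirit to ompP's execT.  Assumes at most MAX_THREADS
 * OpenMP threads. */
#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 256

static double enter_time[MAX_THREADS];
static double exec_time[MAX_THREADS];

void my_parallel_begin(void)        /* called as first statement inside the region */
{
    enter_time[omp_get_thread_num()] = omp_get_wtime();
}

void my_parallel_end(void)          /* called as last statement inside the region */
{
    int tid = omp_get_thread_num();
    exec_time[tid] += omp_get_wtime() - enter_time[tid];
}

void my_report(int nthreads)        /* called once after the join */
{
    for (int tid = 0; tid < nthreads; tid++)
        printf("TID %d execT %.2f\n", tid, exec_time[tid]);
}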
OpenMP Profiling with ompP ompP [] is a profiling tool for OpenMP based on source-code instrumentation using Opari. ompP implements the POMP interface to monitor the execution of the instrumented application. An application is linked
#pragma omp parallel
{
  #pragma omp critical
  {
    sleep(1.0);
  }
}
OpenMP Profiling with OmpP. Fig. A simple OpenMP code fragment with a critical section
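For reference, the fragment above can be completed into a small self-contained test case; this is only a sketch written for this entry, since the figure shows just the two constructs. Compiling it with OpenMP enabled (for example with -fopenmp) and running it with four threads produces a profile similar to the one discussed next.

/* Self-contained sketch of the fragment in the figure above: every thread
 * enters the critical section in turn and holds it for about one second. */
#include <unistd.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp critical
        {
            sleep(1);   /* each thread occupies the critical section for ~1 s */
        }
    }
    return 0;
}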
with ompP's profiling library, and on program termination a profiling report is written to a file. Environment variables can be used to influence various settings of the data collection and reporting process. Figure shows an example OpenMP program fragment. The fragment contains a critical section inside a parallel region, and the critical section only contains a call to sleep() for 1 s. Figure shows the corresponding profile for the critical section: this critical section was unnamed (for named critical sections the name would appear instead of "default") and it is located at lines 34-37 in a file named "main.c." ompP's internal designation for this region is R00002. The profile represents an execution with four threads. Each line displays data for a particular thread and the last line sums over all threads. A number of columns depict the actual measured profiling values. Column names that end in a capital "C" represent counts while columns ending with a capital "T" represent times. In this example, each thread executed the critical construct once, and the total execution time
R00002 main.c (34-37) (default) CRITICAL
 TID     execT    execC    bodyT   enterT    exitT
   0      3.00        1     1.00     2.00     0.00
   1      1.00        1     1.00     0.00     0.00
   2      2.00        1     1.00     1.00     0.00
   3      4.00        1     1.00     3.00     0.00
 SUM     10.01        4     4.00     6.00     0.00
OpenMP Profiling with OmpP. Fig. An example of ompP’s profiling data for an OpenMP critical section. The body of this critical section contains only a call to sleep(1.0)
(execT) varies from 1.00 s to 4.00 s. bodyT is the time spent inside the body of the critical construct; it is 1.00 s, corresponding to the time in sleep(), while enterT is the time (if any) threads have to wait before they can enter the critical section. In the example in Fig. , thread 1 was the first one to enter, then thread 2 entered after waiting 1.00 s, then thread 0 after waiting 2.00 s, and so on. ompP reports timings and counts in the style shown in Fig. for all OpenMP constructs. Table shows the full table of constructs supported by ompP and the timing and count categories used in the profiling report. execT and execC are always reported and represent the entire construct. An additional breakdown is provided for some constructs, such as startupT and shutdwnT for the time required for starting up and tearing down the threads of a parallel region, respectively. This breakdown is based on the subdivision of regions into subregions, as shown on the right-hand side of Fig. . If applicable, a region is thought of as being composed of a subregion for entering (enter), exiting (exit), a main body (body), and a barrier (barr), and main = enter + body + barr + exit always holds. In addition to the basic flat region profiles such as the ones shown in Fig. , ompP supports a number of more advanced features.
● ompP records the application's call graph and the nesting of the OpenMP regions and computes inclusive and exclusive times from it. This allows performance data for the same region invoked from different execution paths to be analyzed separately.
● It allows the monitoring of hardware performance counter values using PAPI [, ]. If a PAPI installation is available on a machine, ompP will include the necessary functionality to set up, start, and stop the counters for all OpenMP constructs.
● ompP computes the parallel coverage (i.e., the fraction of total execution time spent in parallel regions). According to Amdahl's law [], a high parallel coverage is quintessential to achieve satisfactory speedups and high parallel efficiency as the number of threads increases (a small numerical sketch follows this list).
● Another feature of ompP is continuous and incremental profiling [, ]. Instead of capturing only a single profile at the end of the execution, this technique allows an attribution of profile features on the timeline axis without the overheads typically associated with tracing.
● A further recent technique aimed at providing temporal runtime information without storing full traces is capturing and analyzing the execution control flow of applications. The method described in [, ] shows that this can be achieved with low overhead and small modifications to the data collection process. The resulting flow graphs are designed for interactive exploration together with performance data in the form of timings and hardware performance counter statistics to determine phase-based and iterative behavior [, ].
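To make the parallel coverage argument concrete, the following small sketch evaluates the Amdahl bound S(p) = 1 / ((1 - f) + f/p) for the 66.73% coverage reported in the example overhead analysis report shown later in this entry; the program and its loop bounds are purely illustrative.

/* Illustration of Amdahl's law as used in the parallel-coverage argument:
 * with a fraction f of the runtime spent in parallel regions, the speedup
 * on p threads is bounded by 1 / ((1 - f) + f / p). */
#include <stdio.h>

static double amdahl_bound(double f, int p)
{
    return 1.0 / ((1.0 - f) + f / (double)p);
}

int main(void)
{
    double coverage = 0.6673;   /* 66.73% parallel coverage, as in the example report */
    for (int p = 1; p <= 32; p *= 2)
        printf("p=%2d  maximum speedup %.2f\n", p, amdahl_bound(coverage, p));
    return 0;
}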
Overheads Analysis Any parallel algorithm and the execution of a parallel program will usually involve some overheads that limit scalability and parallel efficiency. Typical sources of overheads are synchronization (the need to serialize access to a shared resource, for example) and the time spent in setting up and destroying the threads of parallel execution. An example of this is also shown in Fig. : enterT is the time threads have to wait to enter the critical section. From the application's standpoint these threads are idle and do no useful work on behalf of the application, and hence enterT constitutes a form of synchronization overhead (it arises because the threads have to synchronize their activity). Considering the timings reported by ompP, a total of four different overhead classes can be identified.
Synchronization: Overheads that arise because threads need to coordinate their activity. An example is the waiting time to enter a critical section or to acquire a lock.
OpenMP Profiling with OmpP. Table All timing and count categories reported by ompP for various OpenMP constructs. (Rows: the constructs MASTER, ATOMIC, BARRIER, FLUSH, USER REGION, CRITICAL, LOCK, LOOP, WORKSHARE, SECTIONS, SINGLE, PARALLEL, PARALLEL LOOP, PARALLEL SECTIONS, and PARALLEL WORKSHARE. Columns, grouped by the subregions main, enter, body, barr, and exit: execT, execC, enterT, startupT, bodyT, sectionT, sectionC, singleT, singleC, exitBarT, shutdwnT, and exitT.)
Imbalance: Overhead due to different amounts of work performed by threads and subsequent idle waiting time, for example, in work-sharing regions.
Limited Parallelism: This category represents overhead resulting from unparallelized or only partly parallelized regions of code. An example is the idle waiting time threads experience while one thread executes a single construct.
Thread Management: Time spent by the runtime system for managing the application's threads, that is, time for the creation and destruction of threads in parallel regions and overhead incurred in critical sections and locks for signaling the lock or critical section as available.
Table shows the classification of the timings reported by ompP into no overhead (●) or one of the four overhead classes: S for Synchronization overhead, I for Imbalance, L for Limited Parallelism, and M for Thread Management overhead. ompP's application profile includes an overhead analysis report which offers various interesting insights into where an application wastefully spends its time. The first part of the overhead analysis report shows the
total runtime (this corresponds to the duration of the overview section of the profiling report). Then the total number of parallel regions is reported (this includes combined worksharing-parallel regions) and then the parallel coverage is listed. The parallel coverage is the amount of execution time spent in parallel regions. A low parallel coverage limits the speedup that can be achieved by parallel execution. The following two sections list overheads according to the classification scheme described in Table and discussed in detail in []. The difference between these two sections is the order in which parallel regions are listed and the way in which percentages are computed. In “Overheads wrt. whole program” the percentages are computed with respect to the total execution time (i.e., duration × number of threads) and this section hence allows the easy identification of regions that cause high overheads for the whole application.
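As a concrete illustration of the percentage computation just described, the following sketch (variable names are ours) reproduces one entry of the example report shown later in this entry: an imbalance overhead of 0.22 s in a 16.38 s run on four threads amounts to roughly 0.34% of the total execution time.

/* Sketch of the "overheads wrt. whole program" computation: the total
 * execution time is the wallclock duration times the number of threads,
 * and each overhead is reported as a percentage of that total. */
#include <stdio.h>

int main(void)
{
    double duration   = 16.38;   /* total runtime (wallclock) in seconds */
    int    nthreads   = 4;
    double imbal_ovhd = 0.22;    /* imbalance overhead of one region, in seconds */

    double total   = duration * nthreads;
    double percent = 100.0 * imbal_ovhd / total;
    printf("imbalance overhead: %.2f s = %.2f%% of %.2f s total\n",
           imbal_ovhd, percent, total);
    return 0;
}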
Scalability Analysis The scalability of an application or of individual parallel regions within an application can be analyzed in detail
OpenMP Profiling with OmpP. Table Overhead classification of the times reported by ompP into one of the four overhead classes. Synchronization (S), Load Imbalance (I), Thread Management (M), Limited Parallelism (L). (Rows: the constructs MASTER, ATOMIC, BARRIER, FLUSH, USER REGION, CRITICAL, LOCK, LOOP, WORKSHARE, SECTIONS, SINGLE, PARALLEL, PARALLEL LOOP, PARALLEL SECTIONS, and PARALLEL WORKSHARE. Columns: the timing categories execT, enterT, startupT, bodyT, sectionT, singleT, exitBarT, shutdwnT, and exitT; each entry is either ● for no overhead or one of S, I, M, L.)

------------------------------------------------------------------------
ompP Overhead Analysis Report
------------------------------------------------------------------------
Total runtime (wallclock)   : 16.38 sec [4 threads]
Number of parallel regions  : 10
Parallel coverage           : 10.93 sec (66.73%)

Parallel regions sorted by wallclock time:
         Type      Location             Wallclock (%)
R00004   PARALLEL  muldeo.F (63-145)     4.01 (24.45)
R00007   PARALLEL  muldoe.F (63-145)     3.99 (24.34)
...

Overheads wrt. whole program:
         Total    Ovhds (%)    =  Synch (%)   + Imbal (%)   + Limpar (%)  + Mgmt (%)
R00007   15.95    0.22 ( 0.34)    0.00 (0.00)   0.22 (0.34)   0.00 (0.00)   0.00 (0.00)
R00004   16.02    0.16 ( 0.25)    0.00 (0.00)   0.16 (0.25)   0.00 (0.00)   0.00 (0.00)
by computing an overhead breakdown while varying the number of OpenMP threads for the execution. Assuming a constant workload and an increasing number of threads (i.e., a strong scaling study), one can plot scalability graphs like the ones shown in Fig. . Here the number of threads is plotted on the horizontal axis and the total aggregated execution time (summed over all threads) is plotted on the vertical axis. The topmost line is derived from the overall wallclock execution time of an application. From this total time, one can subtract the four overhead classes (synchronization, load imbalance, thread management, and limited parallelism) for each thread count. A perfectly scaling code (linear decrease in the execution time as the number of threads is increased) would result in a horizontal line in this type of scalability graph. A line inclining from the horizontal corresponds to imperfect scaling, and a line declining from the horizontal points toward superlinear speedup. Figures a, b, and c are scalability graphs of full applications from the SPEC OpenMP benchmark suite ([, ]), corresponding to the "mgrid," "applu," and
"swim" applications, respectively. Evidently the scaling behavior of these applications is widely different. A detailed discussion of the scaling properties of several of the SPEC OpenMP benchmarks can be found in []. In addition to scalability graphs for the whole application, ompP offers the option to analyze the scaling of individual parallel regions. As the workload characteristics of parallel regions within a program may vary considerably, so can their scalability characteristics, and each region might have a different operating sweet spot in terms of computing throughput or energy efficiency. As an example, Fig. d shows the scalability graph of an individual region of the application "galgel."
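The quantity plotted in such a graph can be derived directly from the profiles: for each thread count, the wallclock time is multiplied by the number of threads and the four overhead classes are subtracted to obtain the work portion. A minimal sketch with made-up numbers:

/* Sketch of the data behind one point of a scalability graph: the
 * accumulated execution time (wallclock x threads) is split into useful
 * work plus the four overhead classes.  All numbers are illustrative. */
#include <stdio.h>

int main(void)
{
    int    nthreads  = 8;
    double wallclock = 210.0;                          /* seconds */
    double sync = 20.0, imbal = 55.0, limpar = 5.0, mgmt = 10.0;

    double accumulated = wallclock * nthreads;
    double overheads   = sync + imbal + limpar + mgmt;
    double work        = accumulated - overheads;
    printf("threads=%d accumulated=%.1f s work=%.1f s overheads=%.1f s\n",
           nthreads, accumulated, work, overheads);
    return 0;
}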
OpenMP Profiling with OmpP. Fig. Scalability graphs for some of the applications of the SPEC OpenMP benchmark suite: (a) mgrid_m, (b) applu_m, (c) swim_l, and (d) galgel_m (region lapack.f90 5081-5092). Suffix _m refers to the medium size benchmark while suffix _l denotes the large size benchmark. The horizontal axis denotes processor (thread) count and the vertical axis is the accumulated execution time (over all threads) in seconds, broken down into Work and the overhead classes Sync, Imbal, Limpar, and Mgmt.

Future Directions OpenMP is a continually evolving standard and new features pose new requirements and challenges for performance profiling. A recent major feature introduced with version . of the OpenMP specification is
tasking – the ability to dynamically specify and enqueue small chunks of work for later execution. A preliminary study [] investigated the feasibility of extending the source code-based instrumentation approach to handle the new tasking-related constructs. The resulting modified version of ompP could successfully handle many common use cases for tasking. The notion of overheads had to be adapted, however, to reflect the fact that with tasking, threads reaching the end of a parallel region are no longer idle but can in fact start fetching and executing tasks from the task pool. One use case, that of untied tasks, proved to be a challenge. With untied tasks, an executing thread can stop and resume task execution at any time, and in fact a different thread can pick up where the previous thread left off. These events cannot be observed purely on the basis of source code instrumentation without a standardized callback mechanism that would notify a tool of such an occurrence. To overcome these issues in particular, and to simplify the usage of ompP in general, tying together instrumentation sources from multiple layers (source code instrumentation, interposition on the native threading library, and the interface proposed in the SUN white paper) seems like a promising path forward.
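For reference, the kind of construct that causes this difficulty looks as follows; this is a generic OpenMP 3.0 tasking sketch written for this entry, not code taken from the cited study.

/* Generic tasking sketch: because the tasks below are "untied", a thread may
 * suspend a task and a different thread may resume it at a task scheduling
 * point, which purely source-based instrumentation cannot observe. */
#include <stdio.h>
#include <omp.h>

static void process(int i)
{
    printf("item %d on thread %d\n", i, omp_get_thread_num());
}

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < 16; i++) {
            #pragma omp task untied firstprivate(i)
            process(i);
        }
        #pragma omp taskwait
    }
    return 0;
}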
Related Entries
Intel Thread Profiler
Scalasca
TAU
Vampir
Bibliography . Amdahl GM () Validity of the single processor approach to achieving large scale computing capabilities. In: Readings in computer architecture. Morgan Kaufmann Publishers, San Francisco, pp – (Reprint of a work originally published in ) . Aslot V, Eigenmann R () Performance characteristics of the SPEC OMP benchmarks. SIGARCH Comput Archit News ():– . Browne S, Dongarra J, Garner N, Ho G, Mucci PJ () A portable programming interface for performance evaluation on modern processors. Int J High Perform Comput Appl (): – . Brunst H, Mohr B () Performance analysis of large-scale OpenMP and hybrid MPI/OpenMP applications with Vampir NG. In: Proceedings of the first international workshop on OpenMP (IWOMP ), Eugene, May
. Using Cray performance analysis tools. http://docs.cray.com/ books/S--/S--.pdf. Accessed April . Fürlinger K, Dongarra J () On using incremental profiling for the performance analysis of shared memory parallel applications. In: Proceedings of the th international euro-par conference on parallel processing (Euro-Par ’). Lecture notes in computer science, vol . Springer, August , pp – . Fürlinger K, Gerndt M () ompP: a profiling tool for OpenMP. In: Proceedings of the first international workshop on OpenMP (IWOMP ), Eugene, May . Fürlinger K, Gerndt M () Analyzing overheads and scalability characteristics of OpenMP applications. In: Proceedings of the seventh international meeting on high performance computing for computational science (VECPAR’). Lecture notes in computer science, vol . Rio de Janeiro, pp – . Fürlinger K, Gerndt M, Dongarra J () Scalability analysis of the SPEC OpenMP benchmarks on large-scale shared memory multiprocessors. In: Proceedings of the international conference on computational science (ICCS ). Beijing, May , pp – . Fürlinger K, Moore S () Continuous runtime profiling of OpenMP applications. In: Proceedings of the conference on parallel computing (PARCO ), volume of Advances in parallel computing, IOS press, Sept , pp – . Fürlinger K, Moore S () Detection and analysis of iterative behavior in parallel applications. In: Proceedings of the international conference on computational science (ICCS ). Lecture notes in computer science, vol . Krakow, June , pp – . Fürlinger K, Moore S () Visualizing the program execution control flow of OpenMP applications. In: Proceedings of the th international workshop on OpenMP (IWOMP ). Lecture notes in computer science, vol . Purdue, May , pp – . Fürlinger K, Moore S () Capturing and analyzing the execution control flow of OpenMP applications. Int J Parallel Program (IJPP) ():– . Fürlinger K, Moore S () Recording the control flow of parallel applications to determine iterative and phase-based behavior. J Future Gener Comput Syst () . Fürlinger K, Skinner D () Performance profiling for OpenMP tasks. In: Proceedings of the th international workshop on OpenMP (IWOMP ), Dresden June . Geimer M, Wolf F, Wylie BJN, Mohr B () Scalable parallel trace-based performance analysis. In: Proceedings of the th European PVM/MPI Users’ group meeting on recent advances in parallel virtual machine and message passing interface (EuroPVM/MPI ). Bonn, pp – . Intel Corporation. Intel Thread Profiler. http://software.intel. com/en-us/articles/intel-vtune-amplifier-xe/. Accessed April . Itzkowitz M, Mazurov O, Copty N, Lin Y An OpenMP runtime API for profiling. () Accepted by the OpenMP ARB as an official ARB White Paper available online at http://www.compunity. org/futures/omp-api.html. Accessed April
. Lee S, Eigenmann R () OpenMPC: Extended OpenMP programming and tuning for GPUs. In: Proceedings of the ACM/IEEE international conference for high performance computing, networking, storage and analysis, SC', IEEE press . Malony AD, Shende SS () Performance technology for complex parallel and distributed systems, Kluwer Academic Publishers, pp – . Mohr B, Malony AD, Hoppe H-C, Schlimbach F, Haab G, Hoeflinger J, Shah S () A performance monitoring interface for OpenMP. In: Proceedings of the fourth workshop on OpenMP (EWOMP ), Rome, Sept . Mohr B, Malony AD, Shende SS, Wolf F () Towards a performance tool interface for OpenMP: An approach based on directive rewriting. In: Proceedings of the third workshop on OpenMP (EWOMP'), Sept . Nagel WE, Arnold A, Weber M, Hoppe H-C, Solchenbach K () VAMPIR: visualization and analysis of MPI resources. Supercomput ():– . Terpstra D et al. PAPI web page: http://icl.cs.utk.edu/papi/. Accessed April . Saito H, Gaertner G, Jones WB, Eigenmann R, Iwashita H, Lieberman R, van Waveren GM, Whitney B () Large system performance of SPEC OMP benchmarks. In: Proceedings of the international symposium on high performance computing (ISHPC ). Springer, London, pp – . Shende SS, Malony AD () The TAU parallel performance system. International Journal of High Performance Computing Applications, ACTS Collection Special Issue . Oracle, Inc. Oracle Solaris Studio web site http://www.oracle.com/technetwork/server-storage/solarisstudio/overview/index.html. Accessed April . Wolf F, Mohr B () Automatic performance analysis of hybrid MPI/OpenMP applications. In: Proceedings of the th Euromicro conference on parallel, distributed and network-based processing (PDP ), IEEE Computer Society Press, Feb , pp –

OpenSHMEM - Toward a Unified RMA Model
Stephen W. Poole, Oscar Hernandez, Jeffery A. Kuehn, Galen M. Shipman, Anthony Curtis, Karl Feind
Oak Ridge National Laboratory, Oak Ridge, TN, USA
University of Houston, Houston, TX
SGI, Eagan, MN, USA

Definition OpenSHMEM is a standards-based partitioned global address space (PGAS) one-sided communications library. There is a long-standing and successful family of SHMEM APIs but there is no standard, and implementations differ from each other in various subtle ways, hindering acceptance, portability, and in some cases, program correctness. We discuss the differences between SHMEM implementations and contrast SHMEM with other extant libraries supporting RMA semantics to provide motivation for a standards-based OpenSHMEM with the requisite breadth of functionality.

The Message Passing Interface (MPI) [] is currently the most widely used communication model for large-scale simulation-based scientific parallel applications. Of these applications, a large number rely on two-sided communication mechanisms. Two-sided communication mechanisms require both sides of the exchange (source and destination) to actively participate, such as in MPI_SEND and MPI_RECV. While many algorithms benefit from the coupling of data transfer and synchronization that two-sided mechanisms provide, there exists a substantial number of algorithms that do not benefit from this coupling, and in fact may be hindered by the induced overhead of the synchronization. The one-sided communication mechanisms are capable of decoupling the data transfer from the synchronization of the communication source and target. Remote memory access (RMA) is a one-sided communication mechanism that allows data to be transferred from one process memory space to another (remote) process memory space. The RMA operation is described entirely by one process (the active side) without the direct intervention of the other process (the passive side). Irregular communication patterns, in which data source and data target are not known a priori (as demonstrated in the GUPs [] benchmark), often benefit from one-sided communication models such as RMA. One of the most widely used one-sided communication models is SHMEM, which stands for Symmetric Hierarchical MEMory access, so named for providing a view of distributed memory through the use of references to symmetric data storage on each sub-domain of a scalable system. While some sources (incorrectly) identify SHMEM as "SHared MEMory access," it should be noted that SHMEM is unrelated to AT&T System V Shared Memory as implemented by the shmat, shmctl, shmget, etc. UNIX system calls. SHMEM [] has a long
history as a parallel programming model, having been used extensively on a number of products since , including Cray TD, Cray XE, the Cray XT/, SGI Origin, SGI Altix, clusters based on the Quadrics interconnect, and to a very limited extent, Infiniband-based clusters.
– History of SHMEM
  ● Cray SHMEM
    * SHMEM first introduced by Cray Research Inc. in  for Cray TD
    * Cray is acquired by SGI in 
    * Cray is acquired by Tera in  (MTA)
    * Platforms: Cray TD, TE, C, J, SV, SV, X, X, XE, XMT, XT
  ● SGI SHMEM
    * SGI purchases Cray Research Inc. and SHMEM was integrated into SGI's Message Passing Toolkit (MPT)
    * SGI currently owns the rights to SHMEM and OpenSHMEM
    * Platforms: Origin, Altix , Altix XE, Altix ICE, Altix UV
    * SGI was purchased by Rackable Systems in 
    * SGI and Open Source Software Solutions, Inc. (OSSS) signed a SHMEM trademark licensing agreement, in 
  ● Other Implementations
    * Quadrics (Vega UK, Ltd.), Hewlett Packard, GPSHMEM, IBM, QLogic, Mellanox
    * University of Houston, University of Florida
Despite being supported by a variety of vendors there is no standard defining the SHMEM memory model or programming interface. Consistencies (where they exist) and extensions across the various implementations have been driven by the needs of an enthusiastic user community. The lack of a SHMEM standard has allowed each implementation to differ in both interface and semantics from vendor to vendor and even product line to product line, which has to this point limited broader acceptance. For an introduction to the SHMEM programming model, please see any of the following: [, ] or [].
The SHMEM API and SHMEM in general encompass the following ideas:
– Single Program/Multiple Data (SPMD) style
– Characteristics: One-sided, data passing, RDMA, RMA, PGAS
– Put, Get, Atomic Updates (AMO) to reference remote memory or memories
– Remote memory/arrays are symmetric
– SHMEM decouples data transfer from synchronization
– Barriers and polling synchronization are used
– Collectives
– Explicit control over data transfer
– Low latency communications
– Library programming interface to PGAS
– SHMEM is a suitable under-layer for some PGAS implementations
– SHMEM is a viable alternative to MPI

In addition to SHMEM, a number of other parallel communication libraries currently support the one-sided communication model via RMA. These RMA APIs exhibit much commonality but also several core differences. In this entry, we will compare the most widely used RMA APIs for high performance computing to the SHMEM model. The following RMA models will be considered in this entry:
SGI SHMEM: SGI SHMEM [] (SGI-SHMEM) has one of the richest sets of RMA operations and is currently supported on the NUMAlink-based and Infiniband-based Altix product lines.
Cray SHMEM - Unicos MP: Cray SHMEM on the Unicos MP [] (MP-SHMEM) is very similar to SGI-SHMEM and is currently supported on the XE supercomputer.
Cray SHMEM - Unicos LC: Cray SHMEM on the Unicos LC [] (LC-SHMEM) is a subset of SHMEM on the Unicos MP but lacks a number of key items such as full data-type support. Unicos LC is currently available on the Cray XT // supercomputers.
Quadrics SHMEM: Quadrics SHMEM [] (Q-SHMEM) supports most of the communication mechanisms available in Unicos MP but lacks a number of key items that enhance usability.
Cyclops- SHMEM: Cyclops- SHMEM [] (C-SHMEM) is a SHMEM API which supports the Cyclops- architecture. Most of the core features of Cray SHMEM are available, with some additional interfaces specific to the Cyclops- architecture.
GASNet: GASNet [] is a lower level interface with minimal RMA support intended to be used as a communication mechanism for other parallel programming models and environments. As such, it is missing many of the usability features of the SHMEM family of RMA APIs. GASNet also differs from the other communication models due to its unique support for active messages.
ARMCI: ARMCI [] is another lower level interface designed to support higher level protocols and programming models such as Global Arrays.
MPI-: MPI- [] provides a one-sided interface that has not been widely adopted for a number of reasons [] but was intended to provide a portable alternative to SHMEM and other one-sided communication models.

In order to effectively compare these RMA APIs, we will define the following areas of functionality:
Symmetric Objects: Symmetric objects provide a mechanism to address remote variables using the address of the corresponding local variable.
Remote Write: Put operations across a number of primitive data-types as well as contiguous, strided, or indexed memory locations
Remote Read: Get operations across a number of primitive data-types as well as contiguous, strided, or indexed memory locations
Synchronization: Two basic types:
  – Task Synchronization – synchronization operations between tasks, such as barriers and locks
  – Memory Synchronization – memory polling, and ordering of remote and local memory reads and writes
Atomic Memory Operations: Compare/Mask Swap, fetch-and-add, increment
Reductions: Reductions across a number of primitive data-types and reduction operations such as and, or, min, and max
Collective Communication: Broadcast, collect, and fcollect operations
Symmetric Objects In SGI-SHMEM, MP-SHMEM, LC-SHMEM, and Q-SHMEM some data objects are said to be “symmetric” across multiple PEs. For a single-executable program, launched on multiple PEs, all statically allocated objects are symmetric across all PEs. Additionally, objects allocated dynamically via symmetric allocation (by calling SHMEM’s shmalloc() collectively) are also symmetric across all PEs. These symmetric data objects greatly simplify RMA operations by allowing the active side to initiate a remote operation on the passive side by specifying the address of the local symmetric data object. The RMA operation proceeds as if the address specified were the address of the passive side’s symmetric data object. In the absence of symmetric data objects, the address of the passive side’s data object must be communicated to the active side for use in the RMA (Remote) operation. In SHMEM, such objects are known as asymmetric objects. Asymmetric objects include stack variables as well as those objects dynamically allocated by nonsymmetric means (e.g., malloc()). Care must be taken in referencing asymmetric objects as the local (active side) address calculation will likely not result in the correct address on the remote (passive) side, though some implementations will not fail if asymmetric addresses strictly refer to local (active side) objects, e.g., the source for a PUT or the target for a GET. Remote access to asymmetric objects is an extension that is not supported by SGI’s SHMEM or OpenSHMEM V. Symmetric data objects are not supported on the other RMA (Remote) APIs. GASNet requires that the active side specify the address of the data object on the remote side. ARMCI provides a collective memory allocation mechanism that allows each process to allocate memory locally and share the memory location with its peers. While convenient, this is no different from allocating memory locally and gathering the addresses from other PEs. SHMEM provides similar functionality via shmalloc with one difference, the memory allocated in shmalloc is symmetric (the Symmetric Heap). This allows RMA (Remote) operations to the remote memory by simply specifying addresses within the local symmetric memory. All SHMEM implementations discussed in this entry provide symmetric memory allocation. MPI- uses a hybrid approach by requiring processes to participate in the creation of a “window” for
RMA (Remote) operations. The active side can then specify the target's location for the RMA (Remote) operation via a displacement in the window. It is important to note that symmetric data objects require either special hardware, compiler, or library support. The use of the address of a local data object in an RMA (Remote) operation to a remote PE indicates that the data object is symmetric. A (pre-)compiler can detect this usage and take appropriate action to ensure that the remote data object is properly addressed. SGI-SHMEM, MP-SHMEM, LC-SHMEM, and Q-SHMEM support symmetric data objects, asymmetric data objects, and symmetric memory allocation. C-SHMEM supports only asymmetric data objects and symmetric memory allocation. Code listing (a slightly modified example from []) illustrates the use of symmetric data objects. Note the declaration of the "static short target" array and its use as the remote destination in shmem_short_put. The use of the "static" keyword results in the target array being symmetric on all PEs in the program. Each PE is able to transfer data to the target array by simply specifying the local address of the symmetric data object which is to receive the data. This aids programmability, as the addresses of the target array on PEs other than the active side need not be exchanged with the active side (PE 0) prior to the RMA (Remote memory operation) – an important reduction in network data flow, particularly for large systems with irregular data flows. Conversely, the declaration of the "short source" array is asymmetric. Because the put handles the references to the source array only on the active (local) side, the asymmetric source object is handled correctly. Note that C-SHMEM does not support this mechanism and relies on the use of symmetric memory allocation. Other code examples are included in the examples section.
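Symmetric heap allocation can be illustrated with a short sketch; this is our own minimal example (error handling omitted, intended to be run with at least two PEs) using the shmalloc/shfree calls discussed above.

/* Sketch of symmetric heap allocation: shmalloc is collective, so the
 * returned buffer is symmetric and its local address can name the remote
 * buffer in a put, just like the static array in the listing below. */
#include <stdio.h>
#include <mpp/shmem.h>

int main(int argc, char *argv[])
{
    start_pes(0);
    int *buf = (int *)shmalloc(sizeof(int));    /* symmetric allocation on every PE */
    *buf = _my_pe();
    shmem_barrier_all();
    if (_my_pe() == 0 && _num_pes() > 1)
        shmem_int_put(buf, buf, 1, 1);          /* write PE 0's value into PE 1's buf */
    shmem_barrier_all();
    if (_my_pe() == 1)
        printf("PE 1 received %d\n", *buf);
    shfree(buf);
    return 0;
}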
Remote Write/Read Remote Read/Write operations are the basic building blocks of any RMA API. Generally, these are referred to as put/get operations, and in the most basic form allow one processing element (PE) to transfer data from a local memory location to a remote memory location belonging to another PE, or vice versa. Remote Write: In a Remote Write operation (PUT), the initiating (active side) PE is the source and the
remote (passive side) PE is the target. The active side PE specifies the local (active side) PE's (source) memory from which the data will be sent, and the remote (passive side) PE's (target) memory to which the data will be written. MP-SHMEM, LC-SHMEM, Q-SHMEM, and C-SHMEM all provide the semantic that after shmem_put returns, the data specified for transfer has been buffered; this local completion on the active side does not equate to remote completion of the put on the passive side. LC-SHMEM, Q-SHMEM, and C-SHMEM also provide a non-blocking PUT operation shmem_put_nb which may return before the data is buffered for transfer. Local completion of non-blocking remote (RMA) writes is guaranteed either when shmem_test_nb returns success or shmem_wait_nb returns. Unlike MPI requests, it is invalid to wait on a non-blocking operation after test returns success. Remote Read: In a Remote Read operation (GET), the initiating (active side) PE is the target and the remote (passive side) PE is the source. The active side PE specifies the remote (passive side) PE's (source) memory from which the data will be retrieved, and the local (active side) PE's (target) memory to which the data will be written. In MP-SHMEM, LC-SHMEM, Q-SHMEM, and C-SHMEM, shmem_get returns after data has been delivered to the initiator PE's memory. LC-SHMEM, Q-SHMEM, and C-SHMEM provide a non-blocking GET operation shmem_get_nb which does not block until the data specified for transfer has been delivered to the active side. Local completion of non-blocking RMA (Remote) Read operations is guaranteed either when shmem_test_nb (or shmem_poll_nb on C-SHMEM) returns success or shmem_wait_nb returns. Table details the primitive data-types supported by Remote Read/Write. Not surprisingly, ARMCI and GASNet only support byte level transfers as they are lower level APIs meant to provide a basis for higher level languages or APIs. The bit, byte, etc., "data-types" are simply syntactic sugar for the generalized byte level data-type supported by all APIs. None of the SHMEM libraries directly support unsigned integers or higher precision types beyond long long and double.
#include <stdio.h>
#include <stdlib.h>
#include <mpp/shmem.h>
#define SIZE 16

int main(int argc, char* argv[])
{
    short source[SIZE];          /* local, not symmetric */
    static short target[SIZE];   /* static makes it symmetric */
    int i;
    int num_pe = atoi(argv[1]);  /* no. of PEs taken from command-line */

    start_pes(num_pe);
    if (_my_pe() == 0) {
        /* initialize array */
        for (i = 0; i < SIZE; i++)
            source[i] = i;
        /* put "size" words into target on each PE */
        for (i = 1; i < num_pe; i++)
            shmem_short_put(target, source, SIZE, i);
    }
    shmem_barrier_all();         /* sync sender and receiver */
    if (_my_pe() != 0) {
        printf("target on PE %d is \t", _my_pe());
        for (i = 0; i < SIZE; i++)
            printf("%hd", target[i]);
        printf("\n");
    }
    shmem_barrier_all();         /* sync before exiting */
    return 0;
}
OpenSHMEM - Toward a Unified RMA Model. Fig. RMA (Remote operation) using a symmetric data object

OpenSHMEM - Toward a Unified RMA Model. Table Symmetric data objects and memory allocation

Type                          SGI   MP    LC    Q     C     GASNet   ARMCI   MPI-
Symmetric Data Objects        Yes   Yes   Yes   Yes   No    No       No      No
Symmetric Memory Allocation   Yes   Yes   Yes   Yes   Yes   No       No      No
Unsigned types can be used in put/get operations by using the appropriate byte size transfer at the cost of convenience. In addition to data-type support, these RMA (Remote) APIs offer different access patterns for PUT/GET operations.
Contiguous Memory Access: Contiguous memory is specified by the active side as both the source and destination for a PUT or GET operation.
Strided: Strided access allows transfers of elements of a memory region with a stride between them. This is a compact representation suitable for regular data layouts such as multi-dimensional arrays.
Indexed: This access pattern provides more flexibility than strided access as there is no requirement for a constant stride. This added flexibility requires that an index array be used to describe the elements within the source and target arrays, and is therefore less compact than the strided data description.
Generalized I/O Vectors: Generalized I/O vectors allow transfers to/from discontinuous regions that are made up of contiguous elements of like length. An array of pointers is specified for both the initiator and the target. Each entry in the array specifies the starting address of a contiguous memory region of length K. Both arrays are of length N.
Arbitrary Discontiguous: This access mode allows transfers to/from discontinuous regions with minimal restrictions. Contiguous elements within the discontinuous regions may be of any byte length. The total byte length of all contiguous elements must be equal at both the initiator and the target. This access mode uses a pair of arrays to describe the memory locations and the length of each element at the memory location. Four arrays are therefore necessary to describe the memory layout for both the initiator and the target.
Note that MPI- supports Indexed, Strided, and Generalized I/O Vectors through the use of derived data-types.

OpenSHMEM - Toward a Unified RMA Model. Table RMA (Remote) PUT/GET data-type support

Type                 SGI   MP    LC    Q     C     GASNet   ARMCI   MPI-
byte                 Yes   Yes   Yes   Yes   Yes   Yes      Yes     Yes
short                Yes   Yes   Yes   Yes   No    No       No      Yes
int                  Yes   Yes   Yes   Yes   No    No       No      Yes
long                 Yes   Yes   Yes   Yes   Yes   No       No      Yes
long long            Yes   Yes   Yes   Yes   Yes   No       No      No
float                Yes   Yes   Yes   Yes   No    No       No      Yes
double               Yes   Yes   Yes   Yes   Yes   No       No      Yes
long double          Yes   Yes   No    Yes   No    No       No      Yes
integer (Fortran)    Yes   Yes   Yes   No    No    No       No      Yes
logical (Fortran)    Yes   Yes   Yes   No    No    No       No      Yes
real (Fortran)       Yes   Yes   Yes   No    No    No       No      Yes
complex (Fortran)    Yes   Yes   Yes   No    No    No       No      Yes
bits                 No    Yes   Yes   No    No    No       No      No
bytes                Yes   Yes   Yes   Yes   No    Yes      Yes     No
bits                 Yes   Yes   Yes   Yes   No    No       No      No
bytes                Yes   Yes   Yes   Yes   No    No       No      No
bits                 Yes   Yes   Yes   Yes   No    No       No      No
bits                 Yes   Yes   Yes   Yes   No    No       No      No

OpenSHMEM - Toward a Unified RMA Model. Table RMA (Remote) PUT/GET access methods

Type                       SGI   MP    LC    Q     C     GASNet   ARMCI   MPI-
Contiguous                 Yes   Yes   Yes   Yes   Yes   Yes      Yes     Yes
Indexed                    No    Yes   Yes   No    No    No       No      Yes
Strided                    Yes   Yes   Yes   Yes   No    No       Yes     Yes
Generalized I/O Vectors    No    No    No    No    No    No       Yes     Yes
Arbitrary Discontinuous    No    No    No    No    No    No       No      Unknown
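As an illustration of the strided access method listed above, the SHMEM interface offers strided put/get variants. The following is only a minimal sketch; the shmem_int_iput signature is assumed to be the SGI-style one (target, source, target stride, source stride, element count, destination PE), and the example is intended to be run with at least two PEs.

/* Sketch of a strided put: copy every second element of a local array into
 * every second slot of a symmetric array on PE 1. */
#include <mpp/shmem.h>

#define N 8
static int target[2 * N];        /* symmetric */

int main(int argc, char *argv[])
{
    int source[2 * N];

    start_pes(0);
    for (int i = 0; i < 2 * N; i++)
        source[i] = i;
    if (_my_pe() == 0 && _num_pes() > 1)
        shmem_int_iput(target, source, 2, 2, N, 1);   /* stride 2 on both sides */
    shmem_barrier_all();
    return 0;
}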
Synchronization While SHMEM provides a barrier semantic which intertwines synchronization with remote completion, it also provides three somewhat more subtle tools for
enforcing ordering semantics on the sequence of memory updates arriving at a PE: fence, quiet, and wait. Fence: SHMEM Fence ensures ordering of PUT operations to a specific PE. In SGI-SHMEM, MP-SHMEM, and LC-SHMEM, the fence operation forces ordering of the completion of PUT operations on a remote PE. Specifically, while the order of posting of PUT operations on the remote PE is not generally guaranteed, the fence does provide a guarantee that on the remote (passive) side the PUTs issued from a particular source PE before the fence will be completed before the PUTs issued by that source PE after the fence are completed.
Note that this does not guarantee that the remote (passive) side will have completed the PUTs when the fence operation returns on the active side. Q-SHMEM and C-SHMEM support shmem_fence via shmem_quiet and add the additional semantic that all outstanding PUT operations will be delivered to the target PE prior to return from the function. Figure illustrates the semantics of shmem_fence. ARMCI provides a fence operation similar to that of Q-SHMEM in that it blocks until all PUTs are delivered to the remote PE. GASNet provides blocking and non-blocking versions of PUT. A blocking PUT will not return until data is delivered to the remote PE. A wait on a non-blocking PUT will not return until data is delivered to the remote PE. Fence (and Quiet) are therefore not necessary in GASNet as both are implicit in the semantics of blocking and non-blocking PUT operations. The GASNet blocking interface combines the semantics of RMA (Remote) initialization, ordering, and remote completion. Applications that do not need the additional semantics may incur additional overhead. GASNet's non-blocking PUT operations separate RMA (Remote operation) initialization from ordering and remote completion, but fail to separate ordering from remote completion.
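A common way to use shmem_fence is the put-then-notify idiom; the sketch below (variable names are ours, at least two PEs assumed) relies only on the ordering guarantee described above, not on remote completion at the time the fence returns.

/* Sketch of shmem_fence usage: the fence guarantees that the payload put in
 * (1) is deposited at PE 1 before the flag put in (3), so PE 1 can safely
 * read "data" once it observes "flag" change. */
#include <mpp/shmem.h>

static int data;        /* symmetric */
static int flag = 0;    /* symmetric */

int main(int argc, char *argv[])
{
    start_pes(0);
    if (_my_pe() == 0) {
        int value = 42, one = 1;
        shmem_int_put(&data, &value, 1, 1);   /* (1) payload to PE 1      */
        shmem_fence();                        /* (2) order the two puts   */
        shmem_int_put(&flag, &one, 1, 1);     /* (3) notification to PE 1 */
    } else if (_my_pe() == 1) {
        shmem_int_wait(&flag, 0);             /* wait until flag != 0     */
        /* data is now guaranteed to hold 42 here */
    }
    shmem_barrier_all();
    return 0;
}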
OpenSHMEM - Toward a Unified RMA Model. Fig. Example of fence (a) SGI-, MP-, and LC-SHMEM (b) Q- and C-SHMEM. In both panels PE 1 issues (1) shmem_int_put(dest, src, size, 2), (2) shmem_fence(), and (3) shmem_int_put(dest, src, size, 2) toward PE 2. In (a) shmem_fence prevents operation (3) from overtaking operation (1); in (b) it additionally guarantees remote completion.

MPI- specifies an MPI_WIN_FENCE operation, but it differs dramatically from Fence in SHMEM. MPI_WIN_FENCE is a collective operation on the group of WIN requiring both the active and passive sides to participate, whereas SHMEM fence is called only on the active side. In addition, MPI_WIN_FENCE ensures that all outstanding RMA calls on a window (regardless of origin) are synchronized. Fence in SHMEM may be thought of strictly as an ordering semantic and not as a PE synchronization semantic as in MPI. Figure illustrates the semantics of MPI_Win_fence.

OpenSHMEM - Toward a Unified RMA Model. Fig. Example of MPI_Win_fence. MPI_Win_fence prevents operation (3) from overtaking operation (1) and guarantees remote completion: PE 1 issues (1) MPI_Put(src, dest, ...), (2) MPI_Win_fence(...), and (3) MPI_Put(src, dest, ...), while PE 2 participates in the collective MPI_Win_fence(...).

Quiet: SHMEM quiet ensures ordering of PUT operations to all PEs. In SGI-SHMEM, MP-SHMEM, and LC-SHMEM, the quiet operation ensures the order of PUT operations. All incoming PUT operations issued by any PE and local load and store operations started prior to the quiet call are guaranteed to be complete at their targets and visible to all other PEs prior to any remote or local access to the corresponding target's memory or any synchronization operations that follow the call to shmem_quiet. Q-SHMEM and C-SHMEM extend the semantic in that data is delivered at the remote PE at the return of shmem_quiet. This additional semantic intertwines remote completion and ordering, which may impose unnecessary overhead on applications that only require the ordering semantic and not remote completion. Figure illustrates the semantics of shmem_quiet. ARMCI provides Quiet through a fence_all operation similar to Q-SHMEM and C-SHMEM in that data is guaranteed to be delivered at the remote PEs when fence_all returns. Again, the ordering semantic is intertwined with remote completion.
Barrier: The SHMEM barrier is a collective synchronization routine in which no PE may leave the barrier prior to all PEs entering the barrier. SGI-SHMEM, MP-SHMEM, LC-SHMEM, and Q-SHMEM additionally require the semantic that all outstanding PUT operations are remotely visible prior to the return of the barrier operation. This additional semantic intertwines PE synchronization and remote completion of outstanding write operations. C-SHMEM does not add this additional semantic.
Wait: SHMEM wait waits for a memory location or variable to change on the local PE. This allows synchronization between two PEs. SGI-SHMEM, MP-SHMEM, LC-SHMEM, Q-SHMEM, and C-SHMEM all support shmem_wait; the other RMA implementations do not provide a similar semantic.
A possible improvement in these semantics would involve separating RMA initialization, RMA ordering, remote completion, and PE synchronization.

Atomic Memory Operations An atomic memory operation (AMO) provides the capability to perform simple operations on the passive side of the RMA operation. The SHMEM model provides five common AMO capabilities:
Swap: Writes the value specified by the initiator PE to the target PE location and returns the target value prior to the update.
Compare and Swap: Compares the conditional value specified by the initiator PE to the target PE location, and if they are equal, it writes the value specified by the initiator PE to the target PE location. The return value is the target PE location's value prior to update.
Mask and Swap: Writes the value specified by the initiator PE to the target PE location, updating only the bits in the target location as specified in the mask value. The return value is the target PE location's value prior to update.
Fetch and Add: Atomically adds the value specified by the initiator PE to the target PE location's value. The return value is the target PE location's value prior to the update.
Fetch and Increment: Special case of Fetch and Add with an implicit add value of 1.
Extended Atomic Operation Support: Cyclops- includes a general-purpose atomic operation, long shmem_atomic_op(long *target, long value, long pe, atomic_t op), which performs the atomic operation specified by atomic_t op on the specified target. The number of atomic operations supported is quite large and corresponds to Cyclops- intrinsic operations.
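A brief usage sketch of two of these AMOs follows, assuming the SGI-style names shmem_int_fadd, shmem_int_cswap, and shmem_int_swap, each of which returns the target's value prior to the update; the counter and lock variables here are our own illustration.

/* Sketch of fetch-and-add and compare-and-swap acting on symmetric
 * variables that "live" on PE 0. */
#include <stdio.h>
#include <mpp/shmem.h>

static int counter  = 0;   /* symmetric, incremented by every PE on PE 0 */
static int lockword = 0;   /* symmetric, 0 = free, 1 = taken, kept on PE 0 */

int main(int argc, char *argv[])
{
    start_pes(0);

    int old = shmem_int_fadd(&counter, 1, 0);         /* atomically add 1 on PE 0 */
    printf("PE %d saw counter=%d before its add\n", _my_pe(), old);

    if (shmem_int_cswap(&lockword, 0, 1, 0) == 0) {   /* try to take the lock */
        /* ... critical work would go here (sketch only) ... */
        shmem_int_swap(&lockword, 0, 0);              /* release: write 0 back on PE 0 */
    }
    shmem_barrier_all();
    return 0;
}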
A more complete list of AMOs along with their definitions which comes from [] is shown in the OpenSHMEM section of this entry. Rather than providing atomics as defined above, MPI- provides MPI_ACCUMULATE. While MPI_ACCUMULATE provides atomic updates of local and remote variables the value of the variable prior to update is not returned to the caller. Atomic swap (and variants) are not supported by MPI-. The following operations are available for MPI_ACCUMULATE:
OpenSHMEM - Toward a Unified RMA Model. Fig. Example of quiet (a) MP-SHMEM, LC-SHMEM (b) Q-SHMEM. In both panels PE 1 issues (1) shmem_int_put(dest, src, size, 3), (2) shmem_int_put(dest, src, size, 2), (3) shmem_quiet(), (4) shmem_int_put(dest, src, size, 2), and (5) shmem_int_put(dest, src, size, 2) toward PE 2 and PE 3. In (a) shmem_quiet prevents operation (1) or (2) from overtaking operation (4) or (5); in (b) it additionally guarantees remote completion of operations (1) and (2).

OpenSHMEM - Toward a Unified RMA Model. Table Synchronization methods

Synchronization   SGI   MP    LC    Q     C     GASNet   ARMCI   MPI-
Fence             Yes   Yes   Yes   Yes   Yes   No       Yes     No
Quiet             Yes   Yes   Yes   Yes   Yes   No       Yes     No
Barrier           Yes   Yes   Yes   Yes   Yes   Yes      Yes     Yes
Wait              Yes   Yes   Yes   Yes   Yes   No       No      No
OpenSHMEM - Toward a Unified RMA Model. Table Atomic operations

Atomic Op             SGI   MP    LC    Q     C     GASNet   ARMCI   MPI-
Swap                  Yes   Yes   Yes   Yes   Yes   No       Yes     No
Compare and Swap      Yes   Yes   Yes   Yes   Yes   No       No      No
Mask and Swap         No    Yes   No    Yes   Yes   No       No      No
Fetch and Add         Yes   Yes   Yes   Yes   Yes   No       Yes     No
Fetch and Increment   Yes   Yes   Yes   No    Yes   No       No      No
OpenSHMEM - Toward a Unified RMA Model. Table Reductions

Reduction   SGI   MP    LC    Q     C     GASNet   ARMCI   MPI-
And         Yes   Yes   Yes   Yes   Yes   No       No      Yes
Max         Yes   Yes   Yes   Yes   Yes   No       Yes     Yes
Min         Yes   Yes   Yes   Yes   Yes   No       Yes     Yes
Or          Yes   Yes   Yes   Yes   Yes   No       No      Yes
Product     Yes   Yes   Yes   Yes   Yes   No       Yes     Yes
Sum         Yes   Yes   Yes   Yes   Yes   No       Yes     Yes
Xor         Yes   Yes   Yes   Yes   Yes   No       No      Yes
OpenSHMEM - Toward a Unified RMA Model. Table Collectives

Collective   SGI   MP    LC    Q     C     GASNet   ARMCI   MPI-
Broadcast    Yes   Yes   Yes   Yes   Yes   No       Yes     Yes
Collect      Yes   Yes   Yes   Yes   No    No       No      Yes
fcollect     Yes   Yes   Yes   Yes   Yes   No       No      Yes
#include <stdio.h>
#include <mpp/shmem.h>

int main(int argc, char* argv[])
{
    int me, my_num_pes;
    /*
    ** Starts/Initializes SHMEM/OpenSHMEM
    */
    start_pes(0);
    /*
    ** Fetch the number of processes
    ** Some implementations use num_pes();
    */
    my_num_pes = _num_pes();
    /*
    ** Assign my process ID to me
    */
    me = _my_pe();
    printf("Hello World from %d of %d\n", me, my_num_pes);
    return 0;
}

OpenSHMEM - Toward a Unified RMA Model. Fig. Trivial hello world in SHMEM

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rank, size;
    /*
    ** Starts/Initialize MPI
    */
    MPI_Init(&argc, &argv);
    /*
    ** Get the current process id
    */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /*
    ** Get the current number of processes
    */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello World from process %d of %d\n", rank, size);
    /*
    ** Clean up shop and exit
    */
    MPI_Finalize();
    return 0;
}

OpenSHMEM - Toward a Unified RMA Model. Fig. Trivial hello world in Message Passing Interface (MPI)

OpenSHMEM - Toward a Unified RMA Model. Fig. Trivial hello world in UPC

/* circular shift bbb into aaa */
#include <mpp/shmem.h>
int aaa, bbb;
int main(int argc, char * argv[])
{
    start_pes(0);
    shmem_int_get(&aaa, &bbb, 1, (_my_pe() + 1) % _num_pes());
    shmem_barrier_all();
}

OpenSHMEM - Toward a Unified RMA Model. Fig. Circular shift in SHMEM
The operations available for MPI_ACCUMULATE (cf. the Atomic Memory Operations discussion above) are:
– Maximum
– Minimum
– Sum
– Product
– Logical and
– Bit-wise and
– Logical or
– Bit-wise or
– Logical xor
– Bit-wise xor
/* circular shift bbb into aaa */
#include <mpi.h>
int aaa, bbb;
int main(int argc, char * argv[])
{
    int numprocs, me, send_to, recv_from;
    int tag = 0;                       /* tag and status were not declared in the original listing */
    MPI_Status status;
    MPI_Init(&argc, &argv);            /* added for completeness */
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    send_to = (me - 1 + numprocs) % numprocs;
    recv_from = (me + 1) % numprocs;
    /* note: MPI_Bsend requires an attached buffer (MPI_Buffer_attach) in a complete program */
    MPI_Bsend(&bbb, 1, MPI_INT, send_to, tag, MPI_COMM_WORLD);
    MPI_Recv(&aaa, 1, MPI_INT, recv_from, tag, MPI_COMM_WORLD, &status);
    MPI_Finalize();                    /* added for completeness */
}

OpenSHMEM - Toward a Unified RMA Model. Fig. Circular shift in MPI

/* circular shift bbb into aaa */
shared int aaa[THREADS], bbb[THREADS];
int main(int argc, char * argv[])
{
    aaa[MYTHREAD] = bbb[(MYTHREAD + 1) % THREADS];
    upc_barrier;
}

OpenSHMEM - Toward a Unified RMA Model. Fig. Circular shift in UPC

Reductions Reductions perform an associative binary operation across a set of values on multiple PEs. MP-SHMEM,
LC-SHMEM, and Q-SHMEM all provide the same reduction operations differing only in the supported data-types as detailed previously. GASNet provides no reduction operations. ARMCI supports some of the reduction operations of SHMEM. MPI- supports all the SHMEM reduction operations as well as reductions based on logical operations.
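A sketch of a SHMEM sum reduction over all PEs follows. The SGI-style interface shmem_int_sum_to_all and the symmetric work/synchronization arrays sized with the header constants _SHMEM_REDUCE_MIN_WRKDATA_SIZE, _SHMEM_REDUCE_SYNC_SIZE, and _SHMEM_SYNC_VALUE are assumed; the exact constant names may differ between implementations.

/* Sketch of a sum reduction: every PE contributes its PE id NRED times,
 * and every PE receives the element-wise sums in dst. */
#include <stdio.h>
#include <mpp/shmem.h>

#define NRED 4

static int  src[NRED], dst[NRED];
static int  pWrk[_SHMEM_REDUCE_MIN_WRKDATA_SIZE];
static long pSync[_SHMEM_REDUCE_SYNC_SIZE];

int main(int argc, char *argv[])
{
    start_pes(0);
    for (int i = 0; i < _SHMEM_REDUCE_SYNC_SIZE; i++)
        pSync[i] = _SHMEM_SYNC_VALUE;
    for (int i = 0; i < NRED; i++)
        src[i] = _my_pe();
    shmem_barrier_all();      /* pSync and src must be initialized on all PEs first */

    shmem_int_sum_to_all(dst, src, NRED, 0, 0, _num_pes(), pWrk, pSync);

    if (_my_pe() == 0)
        printf("dst[0] = %d\n", dst[0]);   /* sum of all PE ids */
    return 0;
}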
Collective Communication Collective communication is uniformly supported across SGI-SHMEM, MP-SHMEM, LC-SHMEM, and Q-SHMEM. C-SHMEM and other communication libraries differ in their support for these (or similarly defined) operations. Broadcast: Transfers data from the root PE to the active set of PEs. The active set of PEs is specified using a starting PE and a log stride between consecutive PEs. Collect: Gathers data from the source array into the target array across all PEs in the active set. Each
PE contributes their source array which is concatenated into the target array on all the PEs in the active set in ascending PE order. Source arrays can differ in size from PE to PE. MPI- provides MPI_ALL_GATHERV which provides similar semantics to Collect. fcollect: A special case of Collect in which all source arrays are of the same size. This allows for an optimized implementation. MPI- provides MPI_ALL_GATHER which provides similar semantics to fcollect.
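The broadcast operation just described can be sketched as follows, assuming the SGI-style shmem_broadcast32 signature (target, source, nelems, PE_root, PE_start, logPE_stride, PE_size, pSync) and the _SHMEM_BCAST_SYNC_SIZE and _SHMEM_SYNC_VALUE constants, whose exact names may differ between implementations; in this interface the target on the root PE itself is not updated.

/* Sketch of a broadcast of one 32-bit word from PE 0 to all PEs. */
#include <stdio.h>
#include <mpp/shmem.h>

static int  src, dst;                          /* symmetric */
static long pSync[_SHMEM_BCAST_SYNC_SIZE];

int main(int argc, char *argv[])
{
    start_pes(0);
    for (int i = 0; i < _SHMEM_BCAST_SYNC_SIZE; i++)
        pSync[i] = _SHMEM_SYNC_VALUE;
    if (_my_pe() == 0)
        src = 123;
    shmem_barrier_all();                       /* pSync and src ready everywhere */

    shmem_broadcast32(&dst, &src, 1, 0, 0, 0, _num_pes(), pSync);

    if (_my_pe() != 0)
        printf("PE %d received %d\n", _my_pe(), dst);
    return 0;
}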
Tool Support Few performance tools have been developed specifically for SHMEM and partitioned global address space languages. The challenge is that traditional two-sided communication analyses do not apply to one-sided data accesses to "symmetric," or shared, variables, as the library calls are decoupled from point-to-point synchronization. An event-based performance analysis tool [] was developed to support both SHMEM and UPC which uses a wrapper function approach to gather SHMEM events. Tools such as CrayPat, TAU, and SCALASCA use similar approaches to profile and trace SHMEM calls. However, the wrapper function approach might not work well for future SHMEM-aware compilers, since some might inline and replace SHMEM calls with direct shared memory accesses on certain architectures. GASP [] has been used as an alternative mechanism to gather performance events for GASNet, which is implicit from the communication library API. None of these approaches have been sanctioned for standardization or officially adopted for SHMEM; however, work is under way. Commercially available debuggers [, ] and compile-time static checkers [, ] have limited support for SHMEM and UPC, let alone tools that optimize them. In the case of OpenSHMEM, the semantics of the library allow compilers or static checkers to report the incorrect use of SHMEM library calls. For example, compiler-based tools can easily check whether arguments to the data access calls meet the "symmetric" storage requirement, check whether variables are accessed within bounds, and detect incorrect type usage in shmem_put/ calls. Such semantic checker tools can be invoked as preprocessors to help the
user reduce the number of errors detected at runtime. Research needs to be done to find proper ways to optimize SHMEM in compilers. Traditional dataflow frameworks such as SSA (static single assignment) have been extended in tools for MPI to handle two-sided communication, and a few handle one-sided communication, but not specifically for OpenSHMEM programs; these technologies still have to be introduced into production compilers. The OpenSHMEM community needs to promote the development of tools that facilitate the usage of SHMEM programs, as few such tools exist for now.
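The wrapper-function approach mentioned above can be illustrated with a small sketch; the wrapper name, counters, and timing helper below are ours, and a real tool would generate such wrappers and interpose them at link time rather than calling them explicitly as done here.

/* Generic sketch of the wrapper-function approach to SHMEM event
 * collection: the call site is redirected to a wrapper that records an
 * event before forwarding to the real library routine. */
#include <stdio.h>
#include <sys/time.h>
#include <mpp/shmem.h>

static double put_time;
static long   put_calls;

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void wrap_shmem_int_put(int *target, const int *source, size_t n, int pe)
{
    double t0 = now();
    shmem_int_put(target, source, n, pe);   /* forward to the real library call */
    put_time += now() - t0;
    put_calls++;
}

static int buf;   /* symmetric */

int main(int argc, char *argv[])
{
    int value = 7;
    start_pes(0);
    if (_my_pe() == 0)
        wrap_shmem_int_put(&buf, &value, 1, 1 % _num_pes());
    shmem_barrier_all();
    printf("PE %d: %ld put(s), %.6f s in puts\n", _my_pe(), put_calls, put_time);
    return 0;
}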
Toward an Open SHMEM standard When comparing the various implementations of SHMEM and other RMA APIs, we see that semantics vary dramatically. Not only do SHMEM RMA semantics differ from other RMA implementations (as expected) but different SHMEM implementations differ from each other. Support for primitive data types varies. Discontiguous RMA operations and Atomic RMA operations are not uniformly supported. Synchronization and completion semantics differ substantially which can cause valid programs on one architecture to be completely invalid on another. Symmetric data objects that dramatically aid the programmer are unique to the SHMEM model but are not uniformly supported. There are a number of capabilities that are available from implementations that differ from the SGI SHMEM baseline version. Below is a brief list of potential future additions to OpenSHMEM:
– Support for Multiple Host Channel Adapters (Rails)
– Support for non-Blocking Transfers
– Support for Events
– Additional Atomic Memory Operations and Collectives
– Potential options for NUMA/Hybrid architectures
– Additional communications transport mechanisms
– OpenSHMEM I/O library enhancements
– OpenSHMEM tools and Compiler enhancements
– Enabling exa-scale applications
An OpenSHMEM standard can address the lack of uniformity across the available SHMEM implementations. Standardization levels can be established to provide a base level that all SHMEM implementations must support in order to meet the standard, with higher levels
available for additional functionality and platform specific features. In , an initial dialogue was started between SGI and a small not-for-profit company called Open Source Software Solutions, Inc.(OSSS). The purpose of the dialogue was to determine if an agreement could be reached for SGI’s approval of an open source version of SHMEM that would serve as an umbrella standard under which to unify all of the already existing implementations into one cohesive API. As part of this discussion, OSSS held a BOF on OpenSHMEM at SC to discuss this plan. The BOF was well attended by all of the vendors interested in supporting OpenSHMEM/SHMEM, and many of the interested parties in the SHMEM user community. The unanimous opinion of the attendees favored continuing the process toward developing OpenSHMEM as a community standard. The final agreement between SGI and OSSS was signed in . The agreement allows OSSS to use the name OpenSHMEM and directly reference SHMEM going forward. The base version of OpenSHMEM V. is based on the SGI man pages. The “look and feel” of OpenSHMEM needs to preserve the original “look and feel” of the SGI SHMEM. OpenSHMEM version . has been released and input for version . is being actively solicited. There are a number of enhancements under discussion for version . and future offerings. Some of the features specific to one implementation or another as listed in the preceding tables and elements will be incorporated into later versions of OpenSHMEM. OpenSHMEM will be supported on a variety of commodity networks (including Infiniband from both Mellanox and QLogic) as well as several proprietary networks (Cray, SGI, HP, and IBM). The OpenSHMEM mail reflector is hosted at ORNL and can be joined by sending a request via the OpenSHMEM list server at https://email.ornl.gov/mailman/listinfo/ openshmem. Future enhancements and RFIs will be sent to developers and other interested parties via this mechanism. Source code examples, Validation and Verification suites, performance analysis, and OpenSHMEM compliance will be hosted at the OpenSHMEM Web site at http://www.openshmem.org, and the OpenSHMEM standard will be owned and maintained by OSSS.
With the inexorable march toward exa-scale, programming methodologies such as OpenSHMEM will certainly find their place in enabling extreme scale architectures. By virtue of decoupling data motion from synchronization, OpenSHMEM exposes the potential for synergistic application improvements by scaling more readily than two-sided models, and by minimizing data motion, thus affording the possibility of concomitant savings in power consumption by applications. These gains, along with portability, programmability, productivity, and adoption, will secure OpenSHMEM a place for future extreme scale systems.
Acknowledgments This work was supported by the United States Department of Defense and used resources of the Extreme Scale Systems Center located at Oak Ridge National Laboratory.
Operating System Strategies Rolf Riesen , Arthur B. Maccabe IBM Research, Dublin, Ireland Oak Ridge National Laboratory, Oak Ridge, TN, USA
Synonyms Parallel operating system; Resource management for parallel computers; Runtime system
Definition An operating system manages the processors and other resources of a parallel computing system. Multiple instances of individual operating systems and the runtime system form a parallel operating system, which manages the resources of the entire machine and provides services for users and system administrators to obtain information and control various aspects of the machine and the application jobs that run on it.
Discussion
Introduction
A parallel computing system employs two or more processing elements (PE) which can be single or multicore
CPUs plus attached Graphics Processing Units (GPUs) or other accelerators. Usually the PEs are general-purpose microprocessors, but specialized CPUs have been used. These PEs and other resources in the system need to be managed so they can run parallel jobs and be used effectively. Some of the tasks of an OS for a parallel system are the same as the tasks a traditional OS performs for a desktop computer or a server. These include starting a job (or process), memory management, and I/O to disks and other peripherals. There are additional tasks an OS for a parallel system must perform or support. For example, in a distributed-memory system where not every CPU has access to all memory in the system, data must be transferred between memories. The data is transmitted inside messages, and the OS must provide functions for sending and receiving these messages, or provide functions for applications to access the network interface directly in a safe manner. Larger parallel machines often use a batch system to schedule jobs. Users submit their requests for a certain number of nodes, how long they will need them, and what application they want to run. The batch system then schedules job launches based on node availability and job priority. The batch system interacts with the OS to allocate nodes, launch jobs, and terminate jobs that have exceeded their requested execution time. Yet another task that an OS for a parallel system must perform, one that is not needed on personal computers, is process coordination across the machine. Very often the processes that are part of a parallel application need to communicate and synchronize with each other. Some parallel systems provide hardware or special networks for such operations, and the OS provides primitives to access these. Parallel OSs also face constraints that general-purpose OSs do not have. On some large parallel machines, there is relatively little memory for each PE. The more of that memory is consumed by the OS, the less is available for the applications. Since every megabyte wasted is multiplied by the number of PEs in the system, this can be a significant cost. General-purpose OSs do a lot of housekeeping tasks, such as checking for email or the status of an attached printer, while waiting for a keystroke or mouse click from the user. While these administrative tasks do not take a lot of time, they can slow down a parallel
application in the following way. When the individual copies of an OS running on multiple PEs, or the application itself, exchange messages or synchronize activities, they need to interact with each other. When one of them is busy with administrative tasks, it delays all other processes that are part of the synchronization. When the busy OS finds time to service a request and then needs an acknowledgement or more information from another PE, it may itself be delayed because the other OS is now doing administrative tasks instead of responding to requests. This cascading effect is called OS noise and can slow down parallel applications. The main strategy to avoid this problem in a parallel OS is to disable services such as email and printing, which are not needed on the PEs of a parallel system. Other strategies include synchronizing timer interrupts and OS services in general and dedicating specific processor cores to OS work. In addition to the tasks a parallel OS must perform and the constraints it faces, there are also a few things that make a parallel OS simpler. Usually, the PEs in a system are all of the same type and have very few peripherals. That means that a (possibly customized) parallel OS needs to support only a few hardware devices, and a lot of functionality, for example driving a display, is not needed within the OS. Often it is enough to manage the CPU, memory, and the network interface, since no other peripherals are attached or used. Many large-scale systems have no local disks, and the necessary device drivers and file systems need not be present in the OS. In such a case, I/O is done using messages to an external parallel file system running on a disk farm separate from the parallel compute system. This leads to opportunities for simplifications and helps reduce the memory footprint of the OS and the noise it introduces into the system.
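The effect can be made concrete with a small measurement sketch (illustrative only, written against MPI): every rank executes an identical, fixed quantum of work between synchronizations, so any rank that is interrupted by background OS activity stretches the entire step for all ranks.

#include <mpi.h>
#include <stdio.h>

/* Fixed work quantum: the same small amount of computation on every rank. */
static double work(int n)
{
    double s = 0.0;
    for (int i = 1; i <= n; i++)
        s += 1.0 / (double)i;
    return s;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double sink = 0.0, worst = 0.0;
    for (int step = 0; step < 1000; step++) {
        MPI_Barrier(MPI_COMM_WORLD);          /* start the step together   */
        double t0 = MPI_Wtime();
        sink += work(100000);                 /* identical work everywhere */
        double dt = MPI_Wtime() - t0;

        /* The step is only as fast as its slowest rank; the maximum of dt
         * across all ranks shows how much OS noise stretched the step.   */
        double slowest;
        MPI_Allreduce(&dt, &slowest, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        if (slowest > worst)
            worst = slowest;
    }
    if (rank == 0)
        printf("worst step time: %g s (sink=%g)\n", worst, sink);

    MPI_Finalize();
    return 0;
}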
Parallel Computing Systems
There are two main categories of parallel computing systems: message-passing and shared-memory systems. In a message-passing system, each PE has local memory that it can access directly, using load and store operations. Each PE is also connected to a network that allows data to move from the memory of one PE to the memory of another PE. In order to access data stored in memory attached to a remote PE, the data has to be moved from
one memory to the other using explicit message passing. Large-scale supercomputers by companies like Cray and IBM are of this type. Clusters, which are built out of commodity computers (desktops or servers), are also message-passing systems. The individual computers in a cluster are connected with a commodity network such as Ethernet or InfiniBand, while supercomputers often employ vendor-specific networks. Systems where the PEs do not share memory are called multicomputers. Large enough clusters and message-passing systems are sometimes called Massively Parallel Processors (MPP). Figure shows an example of a multicomputer. In a shared-memory system, all PEs have access to all of the memory in the system. Figure shows an example. The diagram shows a crossbar switch where hardware sets the switches and enables access to memory based on the address the PE uses in its load or store instruction. Smaller and less expensive systems use a bus instead of a crossbar switch. Systems that use a crossbar switch or a shared memory bus have Uniform Memory Access (UMA), where the cost of accessing any memory location is roughly the same. UMA systems are often called Symmetric Multiprocessors (SMP) and sometimes multiprocessors. Larger systems, with many PEs, are Non-Uniform Memory Access (NUMA) machines. Accessing local memory is much faster than accessing the memory that is attached to a remote PE. Although memory access in a shared-memory system may not be uniform, the process of moving data is done by hardware and largely hidden from the program doing the access. A program simply issues load and store instructions to global addresses, which the system resolves, if necessary, by moving data from a remote memory to the PE doing the request. This is in stark contrast to explicit message passing and changes how these systems are programmed and the services an OS must provide. The OS in a shared-memory system is responsible for enforcing memory access restrictions by configuring the underlying hardware accordingly and providing abstractions for sharing data among processes when needed. In a message-passing system, the OS must provide mechanisms for explicitly passing messages from one PE to another and enforce access restrictions to local memory and the network interface to protect other parts of the system from malicious messages. Coherency (when a data change in remote memory becomes visible to a local PE) and the handling of write access conflicts depend on the hardware available in a shared-memory system. The OS may provide synchronization and access coordination primitives. The OS is also responsible for initializing the hardware, for example, to set caches to write-through or write-back.
Operating System Strategies. Fig. A multicomputer consists of PEs that each have their own local memory. Data is moved from one PE to another using explicit message passing. The Network Interface (NI) connects the system bus of a PE to the network
Operating System Strategies. Fig. In a shared-memory system, all PEs have access to all memory. A bus or, as in this picture, a crossbar switch connects the PEs to the available memory modules
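The difference between the two models can be made concrete with a short, illustrative sketch (not part of the original text): on a shared-memory machine the transfer below would be a single store through a global address, whereas on a multicomputer it is an explicit send/receive pair, written here with MPI.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Shared-memory system: moving a value is a plain store through a
     * global address, e.g.  shared_buf[i] = value;  the hardware moves
     * the data and the OS only configures address translation.
     *
     * Message-passing system: the same transfer is an explicit
     * send/receive pair, and the OS (or the network interface it
     * programs) moves the message between the two memories.           */
    int value = 42, received = 0;
    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("PE 1 received %d from PE 0\n", received);
    }

    MPI_Finalize();
    return 0;
}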
The networks connecting the PEs in a message-passing system range from commodity networks used in clusters to custom hardware used in some supercomputers. A network consists of routers that are connected to each other and form a topology such as a 2-D or 3-D mesh, a hypercube, or a tree. The PEs connect to the routers via a network interface. Some network interfaces are simple, and the OS is responsible for triggering data movement in or out of the network. High-end network interfaces have processors and memory of their own and can be programmed to process incoming data and make decisions where in memory the data needs to be deposited. For systems like that, the OS is responsible for restricting unprivileged access and for providing abstractions to control and program the network interface.
Parallel computing systems are often partitioned so that each node in the system serves a specific purpose. Figure shows an example. Each node is an individual computer of the many that make up a multicomputer. A node in a multicomputer consists of memory, one or more PEs, and a connection to the network. Some nodes have additional network connections that are used by users to log into the system. When they do, they land on one of the service nodes from where they launch and control parallel applications. Often, development and compiling of parallel applications is done on the service nodes as well. The service partition in Fig. consists of the light-colored spheres at the top left. The majority of nodes in a parallel system are reserved for computation. The cubes in Fig. represent those nodes. Users or a batch processing system launch jobs from the service partition. The runtime system allocates nodes from the compute partition and instructs the OSs on these nodes to start the processes of that application. Many larger systems have a parallel file system that is external to the parallel machine itself. In the partitioning model, dedicated I/O nodes, the dark spheres on the right of Fig. , are used to coordinate traffic to these external disks. The compute nodes send messages containing file data to the I/O nodes. The file system running on the I/O nodes then deposits the data on the disks. The compute nodes themselves do not need to run a file system themselves. Some systems may be connected to other, wide-area, networks. Additional I/O nodes are sometimes used to handle the traffic to and from these networks (bottom right in Fig. ). Some clusters have local disks on each node since they consist of commercial, off-the-shelf computers that often come with built-in disks. The usage of these local disks is somewhat problematic on compute nodes for anything other than temporary data during the run of a job. Once the job finishes and the compute nodes are reallocated to another job, it becomes difficult to locate and retrieve the data from the earlier job run. Sometimes service nodes in clusters also serve as compute nodes, but better performance can be achieved using a strict separation of compute and service nodes.
Operating System Strategies. Fig. Partitioning of a system by dedicating nodes to usage functions such as compute, service, and I/O []
Overview of OS Approaches
Most clusters and supercomputers today run an OS that is a version of UNIX, such as various versions of Linux, IBM’s AIX, or Sun Microsystems’ Solaris. These are usually slightly modified or tuned versions of the corresponding desktop and server OSs. An alternative approach is to run a lightweight kernel on the compute nodes and a full-featured OS on the service and I/O nodes. Systems employing this approach include IBM’s Blue Gene, which runs the Compute Node Kernel (CNK) [], Cray’s XT-3, which runs Catamount [], Intel’s ASCI Red at Sandia National Laboratories, which ran Cougar (a port of Puma []), and the Intel Paragon, which had the option of running the Sandia/University of New Mexico OS (SUNMOS) [].
In the late 1980s and early 1990s, lightweight kernels were required because the supercomputers of that time had very limited amounts of memory available and every last bit of performance needed to be available to the parallel applications. That meant the OS had to have a very small memory footprint and very low processing overhead. With the advent of clusters in the late 1990s, the situation began to change. Users began to demand features such as multithreading and shared libraries on parallel systems. Because the processors had become more powerful, users were often willing to sacrifice some performance in return for the additional features and convenience of programming. Today most parallel computers run full-featured OSs, while only a few high-end systems use lightweight kernels to ensure performance and scalability. Desktop and server OSs are designed to allow multiple users or multiple tasks to share the available resources. Memory is partitioned and allocated to the currently running processes. The OS is responsible for memory allocation and protection, making sure that a process has access to only its portion of the memory. In addition, the OS provides control mechanisms to share memory among processes when that is needed for data exchanges. CPU cycles are bundled into time slices and allocated to processes which are currently ready to run. This happens many times per second and provides the illusion to a user that multiple processes are making progress at the same time, even when there is only a single CPU present in the computer. For the most part, these traditional OS services are not needed in a parallel computer. Instead of timesharing individual CPUs, whole sets of PEs (CPUs) are allocated to a single application that uses the PEs in parallel to solve a computationally intensive problem. This method of allocating resources is called space-sharing. Multiple applications may be running simultaneously on a parallel system, but each application has been allocated memory and PEs for its exclusive use. The main reason for space-sharing rather than timesharing a parallel machine is throughput: the number of jobs a parallel system can complete in a given amount of time. This number depends on how long each job takes to complete. If the PEs of a parallel system were to run multiple jobs concurrently, completion time of a job would increase with the number of PEs it uses.
Using Fig. as an example, let us assume there are multiple jobs, j0 . . . j3, running concurrently on PEs p0 and p1. Since the activities on the individual PEs are not synchronized, it will happen that in a given time slice j0 is active on p0, while a different job, for example j2, is active on p1. If j0 needs to frequently transfer data between p0 and p1, it will have to wait for its process to become active on p1 every time it requests an answer. Only active processes can send and reply to messages. When the OS on p1 makes j0 active there through a context switch, j0 will be able to send a reply from p1. By the time the reply arrives at p0, j0 may not be active on that PE anymore, now causing wasted time for j0 on p0 waiting for an answer. On some systems, the process schedulers can be synchronized, leading to a mode of operation called gang scheduling. Gang scheduling is not very common and difficult to achieve on large machines due to signal delays from one end of the machine to the other.
Operating System Strategies. Fig. The two wheels represent the process schedules on p0 and p1. Each process on these two PEs gets its turn to run in a round-robin fashion. Only a running process can send and reply to messages. If the rotation of the wheels (process scheduling) is not synchronized, delays occur because communicating processes have to wait for each other
Parallel OS Functionality
In most cases, each node of a parallel system runs one copy of the OS which manages all the resources local to that node. The collection of individual OSs on all the nodes of a parallel system forms the parallel OS and runtime system that controls the machine. In addition to the local OS services, this collection of system software must provide the ability to allocate nodes, launch and control parallel applications, for example, stopping and terminating all processes of a parallel job. In addition, services for debugging, determining system status, and parallel file I/O are usually available.
Parallel jobs can be launched by users interactively from the service partition or through a batch scheduling system. In the latter case, users submit scripts to the batch system that contain information about what application the user wishes to run, how much time it needs to complete, and how many nodes it uses. The batch system queues these requests and, when compute nodes become available, selects the highest-priority jobs and those that will best utilize the system, and launches them. This is done in conjunction with the local OSs, but is usually a separate utility. Whether a job is launched interactively by a user or by the batch system, the job launcher must interact with a node allocator to select the compute nodes that will be used to run the job. A node allocator can be simple and select from the available node list in a sequential fashion. More sophisticated allocators attempt to group the individual processes of an application physically close together on nodes that have one or only a few network hops between them. When the processes of a parallel application are scattered throughout a parallel system, their network traffic can interfere with the traffic from other applications, or traffic to and from the I/O nodes. Message latency also increases when application processes are further apart from each other. Once nodes are allocated to a job, the launcher must interact with the OS instances on the selected nodes and initiate the start of the processes on each node. This involves coordination and sometimes synchronization among the selected nodes. Once the application is running, the job launcher serves as a collection point for standard input and output for each process of the application and as a control center. The job launcher can terminate the application and often serves as a conduit for debugging and signaling the application processes. Although not strictly part of the OS, the job launcher interacts with the OS on the nodes to control the processes. It also interacts with the node allocator, which is often a separate process, to allocate and release compute nodes. The node allocator often has an interface that allows users and system administrators to obtain status information, such as how many nodes are available for allocation, down for servicing, or currently running interactive or batch jobs.
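As a purely illustrative sketch (the names and data structures are hypothetical and not taken from any real system), a sequential node allocator of the simple kind just described can be written as follows; a more sophisticated allocator would, in addition, take the network topology into account.

#include <stdbool.h>

/* Illustrative only: a trivial allocator that hands out compute nodes
 * in sequential order from a free list.                               */
#define NNODES 1024

static bool node_free[NNODES];

/* Mark every node as available at system start-up. */
void allocator_init(void)
{
    for (int i = 0; i < NNODES; i++)
        node_free[i] = true;
}

/* Allocate 'n' nodes; returns the number actually found and stores
 * their indices in 'out'. A smarter allocator would pick nodes that
 * are few network hops apart instead of the first free ones.          */
int allocate_nodes(int n, int *out)
{
    int found = 0;
    for (int i = 0; i < NNODES && found < n; i++) {
        if (node_free[i]) {
            node_free[i] = false;
            out[found++] = i;
        }
    }
    return found;
}

/* Return a job's nodes to the free pool when the job terminates. */
void release_nodes(int n, const int *nodes)
{
    for (int i = 0; i < n; i++)
        node_free[nodes[i]] = true;
}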
Memory in early parallel systems was a scarce commodity and the OS had to be designed to consume as little as possible of that resource. For example, an early version of OSF/1 AD, the OS for the Intel Paragon, consumed more than half of the memory installed on each node for its own code, message buffers, and other data structures. SUNMOS, another OS available for that machine, consumed only a few hundred kilobytes. Cost of memory for that machine was a factor, and wasting half of it on administrative services was not cost effective and limited the scope of scientific simulations that could be run. Memory has become a lot cheaper since then and is now available in abundance, but conserving it may still be important. For example, some members of IBM’s Blue Gene family of computers have a large number of PEs but a modest amount of memory per PE. The future will increase the number of compute cores per processor, which in many systems may not be matched by an equal increase in memory size per core. Some systems may employ new types of memory, for example, FLASH memory, which retains its content after a loss of power. Such new memory technologies may be, initially, more expensive and smaller in capacity, leading to a need to conserve memory. While a desktop OS should not waste memory either, it is not quite as critical there. Desktop systems and servers employ a technique called demand paging to provide the illusion that a system has much more memory than is physically available. When the physical memory becomes full, the OS will migrate less frequently used data to the swap area on a local disk and bring it back into memory later when it is needed again. In parallel systems, even on clusters with local disks, demand paging is usually disabled. The long delays of swapping and the unpredictable timing of it harm a parallel system much more because all processes of a single application must make progress together. Each time one of them is delayed, the whole application is delayed. The individual instances of an OS on each PE run mostly independently of each other. This makes it more difficult to gather information about the system as a whole and exert control over it. For example, it might be interesting to see how busy the individual CPUs in a system are and how many processes are currently running on each. Another interesting statistic is the amount of available memory on each PE. Information like that can
be used to diagnose the health of a system and pinpoint problems. On a UNIX desktop system or server, there are many different utilities to gather such information. The goal of a Single System Image (SSI) is to provide similar functionality, but for all PEs in a system. Instead of issuing commands to all PEs in a system, which can deliver a daunting amount of information, SSI usually allows commands to be restricted to a subset of PEs, for example, the ones currently running a given job. Examples of SSI systems include BProc, MOSIX, and OpenSSI. In addition to process control, some SSIs provide functionality such as process migration, process check-pointing, a shared file system with a single root, and inter-process communications (IPC) between processes independent of their location. Implementing an SSI is easier on a shared-memory system where processes have access to all of memory and sharing data is automated by the hardware. An SSI for a multicomputer using message passing is much more difficult. Typically, full-featured SSIs scale to a few tens of nodes, while those with a more restrictive set of features have been shown to run on up to a few hundred nodes. An SSI is integrated into the OS of a parallel computer, or, if SSI is a separate module, needs to make use of services the OS provides. File I/O, and I/O in general, is a quandary on a parallel system. One problem in large systems is that not enough pathways are available to reach a shared storage device, which leads to network congestion. A second problem is the sharing of files and metadata, which contains information about the status of individual files and directories. For example, if many thousands of processes start creating files in a single directory, the information for that directory must be continuously updated, which creates an access bottleneck due to locking, unlocking, and synchronizing of these data structures. Parallel file systems, not usually found on desktop systems or servers, attempt to alleviate the problem. A parallel file system often runs on the dedicated nodes of an I/O partition. These nodes, not burdened with compute tasks, can aggregate the data flow in and out of a parallel storage system and consolidate many of the updates and synchronizations that would otherwise have to happen individually. Parallel file systems in use on clusters and other large parallel computers include PVFS and Lustre.
Lightweight Versus Full-Featured Operating Systems
Lightweight kernels are used on the compute nodes of some very large parallel systems. Often they are criticized for not providing enough functionality for today’s highly complex and multi-layered parallel applications. Some argue that lightweight kernels are difficult to maintain since few people are familiar with them. This section briefly explores why lightweight kernels are a good fit for large message-passing systems. In a partitioned system like the one in Fig. , instances of a lightweight kernel run on the compute nodes, while the service and I/O nodes run a full-featured OS. This allows a lightweight kernel to be tailored to the hardware of a compute node, which is often much simpler than the hardware found in service and I/O nodes. For example, sometimes specialized or proprietary drivers are needed on I/O nodes to access the attached storage devices. Compute nodes, on the other hand, are often kept simple to reduce cost and power consumption. Their main components are often limited to one or more CPUs, local memory, and a network interface. OS functionality and device drivers for anything beyond that are not needed and often consume resources, such as memory and CPU cycles, if present. Large parallel systems usually do not have local disks attached to compute nodes. That means the OSs on these nodes do not need to provide any file systems or device drivers for disks. File I/O requests by the application are converted into messages and sent off to the I/O nodes and external storage devices that handle them. A large functional block in a modern OS deals with multiprocess management and scheduling. Algorithms to prioritize processes and to ensure fair scheduling are complex and must be able to deal with hundreds of concurrent processes. Since the nodes of a message-passing system are space-shared, only one or a handful of processes run on each CPU. Some early lightweight kernels went so far as to allow only a single active process at a time and used cooperative scheduling to switch from one process to another. While this may seem like a throwback to an earlier time in OS design, it has several advantages: expensive context switches are few, frequent and disrupting timer interrupts are not needed, and the code section for process control inside the kernel becomes much simpler. A desktop OS interrupts
running applications from 100 to 1,000 times per second to check whether there are any administrative tasks for it to do and to potentially context switch to another process. A typical lightweight kernel receives ten interrupts per second and is much quicker in determining whether a context switch is needed. Other than possibly sending a heartbeat signal to a Reliability, Availability, and Serviceability (RAS) system, a lightweight kernel has no other administrative tasks while it is running an application. This greatly reduces the noise signature of a lightweight kernel and gives more CPU cycles to the application. Another large functional block in a full-featured OS deals with demand paging. Demand paging, even if local disks are available, is bad for the performance of a parallel system. While this feature can be turned off, a lot of the underlying data structures and assumptions remain, even when not in use. For example, page sizes are small to allow the OS more flexibility in allocating and releasing memory, and in demand paging. Furthermore, pages that are adjacent to each other in the virtual address space are usually scattered all across physical memory. This increases the number of Translation Look-aside Buffer (TLB) entries needed and limits data transfers to and from most network interfaces to a single page at a time. A lightweight kernel does not support demand paging and can manage memory as a set of physically contiguous large pages. This leads to performance improvements because the TLB entries are never exhausted, since the few page table entries that are needed usually fit in the TLB. Further performance improvements come from larger data transfers between the network interface and the memory. Large transfers are more efficient and lead to higher bandwidth. A somewhat related problem has to do with shared and dynamically loaded libraries. Many lightweight kernels require statically linked applications. This is counter to modern software development strategies and is one of the main complaints about lightweight kernels. The processes of a parallel application are fairly well synchronized and reach a given section of code at about the same time. If at that point a shared library needs to be loaded, thousands or hundreds of thousands of PEs request that particular library. This leads to a massive bottleneck when all these requests converge at a single point in the file system. Because of that, and because of
the simplified way of dealing with memory, lightweight kernels are usually prevented from offering shared and dynamically loaded libraries. In a message-passing system, moving data from the network into memory has to be done with as little overhead as possible. Traditionally, network interfaces deposit incoming data into buffers managed by the OS. When the application requests the data, the OS copies it from the buffer into the memory location the application requested. This technique works well when the network is much slower than the memory system, and when data packets are small. Modern parallel computers have networks that are often as fast as or faster than the memory can copy data, and many parallel applications transmit large messages. Moving data from the network interface to memory and then again from one memory location to another causes delays in message delivery and consumes valuable memory bandwidth. It is therefore advantageous for the network interface to move incoming messages directly into the memory location where the application expects the data. There are several requirements to make this possible. The Application Programming Interface (API) must allow pre-posting of receives. This allows an application to inform the OS or the network interface where future messages can be deposited. The traditional socket API used for TCP/IP programming does not provide this functionality; data is copied after it has arrived and the application has issued a receive request. The Message Passing Interface (MPI) API was designed for parallel computing and enables pre-posting of receive requests. Sophisticated network interfaces used in message-passing systems have the ability to deliver data anywhere in local memory and can filter and direct incoming messages according to criteria such as which process the message is destined for, which process sent the message, its length, and a set of tag bits that can be used by the application to route messages to specific memory locations. To make optimal use of such network interface capabilities, the OS must allow incoming messages to be handled by the hardware without expensive OS interrupts. This is called OS bypass. The goal is to avoid invoking the OS on standard message deliveries. However, the OS is still responsible for protection and must be invoked during initial setup, perhaps while the application is initializing. Even with OS bypass it must not be
possible for unrelated processes to harm each other or access each other’s private data. Because a lightweight kernel can arrange physical pages in sequential order, the hardware in a network interface that is required to deliver messages larger than a page size can be much simpler; the network interface does not need its own virtual memory controller. In such a system, there is a one-to-one mapping of virtual to physical addresses, which makes it easier to designate a memory region for delivery of messages: only a starting address and a length are required. This further reduces the hardware complexity of the network interface and makes interaction of an application directly with the network interface more efficient and simpler. The goal of a lightweight kernel is to turn over most hardware resources of a node to the application, while maintaining protection barriers between unrelated processes and protecting the rest of the system from errant or malicious processes. It is possible to design, implement, and maintain a lightweight kernel with a small group of programmers in a short amount of time. However, the talent pool to perform this kind of work is limited, and lightweight kernels do not offer some features that modern parallel applications demand. Whether these demands make sense on extremely large-scale systems is debatable.
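The pre-posting of receives described above can be sketched with MPI (an illustrative example, not part of the original text): the receive is posted before the matching message arrives, so a network interface with OS-bypass support knows in advance where the incoming data belongs and can deposit it there directly.

#include <mpi.h>
#include <stdio.h>

#define COUNT (1 << 20)

static double buf[COUNT];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* Pre-post the receive: the library (and, with OS bypass, the
         * network interface) now knows where the message should land. */
        MPI_Request req;
        MPI_Irecv(buf, COUNT, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, &req);

        /* ... other work; the sender may start at any time ...        */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* message is now in buf  */
        printf("received %g ... %g\n", buf[0], buf[COUNT - 1]);
    } else if (rank == 0) {
        for (int i = 0; i < COUNT; i++)
            buf[i] = (double)i;
        MPI_Send(buf, COUNT, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}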
The Next Big Challenge: Resilience
Future high-end systems are expected to grow in size and part count. Because of that, it is also expected that failures in software and hardware will become more common. For large applications that use a significant number of nodes in such a machine, that means they will almost certainly be interrupted by a failing component before the computation has a chance to complete. To nonetheless make progress, applications save their state and intermediate results frequently to disk. During a restart after an interruption, that data is read back from the disk and the application continues where it left off. This is called checkpointing or checkpoint/restart. Writing a checkpoint is time-consuming, especially when many nodes do it at the same time. Delaying the writing of checkpoints risks that too much work is lost and has to be redone after an interruption. Current models predict that, even with optimal checkpoint intervals, when job sizes approach 100,000 nodes, applications will spend more than 50% of their
total execution time writing checkpoints, restarting, and redoing lost computations. Clearly, that is not an efficient use of such large and expensive machines. Research into making applications more resilient is under way. Checkpoint/restart can be done automatically by the OS since the OS knows what portions of memory a process is using. There might be data structures the application uses which it can reconstruct easily without having to save and restore them. Without hints from the application, the OS will have to checkpoint these data structures, making automatic checkpointing somewhat less efficient than application checkpointing or application-directed checkpointing. Before a checkpoint can be written, the application must quiesce to ensure data does not change while checkpointing is in progress. That means that no messages may be in transit during that time. Again, this can be done at the user level or with the help of the OS. In the future, OSs may provide additional features to aggregate checkpoint data and to trigger checkpoints when a failure is imminent. Aggregating checkpoint data from multiple nodes on intermediate nodes before it is sent off to disk storage may allow an OS to schedule massive writes to storage and coordinate them with I/O from other applications currently running. Additional functions, such as data consolidation and compression, may also be feasible. Sometimes the onset of a failure can be predicted. For example, a sharp increase in single-bit errors in an Error Correction Code (ECC) protected memory module indicates that an uncorrectable double-bit error is not far off. It makes sense to create a checkpoint at that time and restart the process on the failing node somewhere else in the system.
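An illustrative sketch of application-directed checkpointing (the file name, format, and interval are hypothetical): the application saves only the state it cannot cheaply reconstruct and, on restart, resumes from the step recorded in the last checkpoint.

#include <stdio.h>

#define N 1000000
#define CHECKPOINT_EVERY 100

/* Write the state the application cannot cheaply reconstruct.
 * Real code would write to a temporary file and rename it so that a
 * crash during the write cannot corrupt the previous checkpoint.     */
static void write_checkpoint(const char *path, int step,
                             const double *state, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (!f) return;                      /* real code would retry/report */
    fwrite(&step, sizeof step, 1, f);
    fwrite(state, sizeof *state, n, f);
    fclose(f);
}

/* Returns the step to resume from, or 0 if no usable checkpoint exists. */
static int read_checkpoint(const char *path, double *state, size_t n)
{
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    int step = 0;
    if (fread(&step, sizeof step, 1, f) != 1) step = 0;
    if (fread(state, sizeof *state, n, f) != n) step = 0;
    fclose(f);
    return step;
}

int main(void)
{
    static double state[N];
    int step = read_checkpoint("ckpt.dat", state, N);

    for (; step < 10000; step++) {
        /* ... one step of the (quiesced) computation ... */
        state[step % N] += 1.0;

        if (step % CHECKPOINT_EVERY == 0)
            write_checkpoint("ckpt.dat", step + 1, state, N);
    }
    printf("done at step %d\n", step);
    return 0;
}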
Future Directions
The largest computing systems in the world have reached the petascale, and exascale systems are on the horizon. These are systems that will deliver performance in the thousands of petaflops. Early arrivals will look similar to today’s large-scale systems. They will consist of thousands of nodes connected by a message-passing network. Each node will have multiple CPUs, each with tens or hundreds of cores. No one knows how these extreme scale systems will evolve after that. Furthermore, application developers are hoping for a new programming model beyond the traditional MPI
and C/C++/Fortran interface. But, again, there is no consensus yet what that will be. Depending on the architecture of these future systems and how they will be used, OS technology needs to evolve as well. It is possible that some of the protection mechanisms of today’s OSs will be implemented in hardware. Whole sections of a machine can then be turned over to an application, to be used in the fashion that provides the most benefit. Commonly used functions will be in libraries, and system control utilities interact with the hardware directly. Since the future is so uncertain, it is more likely that OSs and runtime systems take on an even bigger role to support the new hardware and provide new usage models to applications. For that, it may be possible that whole cores or CPUs will be dedicated to running portions of the OS and to accelerate message passing. Some applications can make use of the compute capabilities a GPU provides. In the future, more systems will be delivered with GPUs and other specialized processors to accelerate computing. Again, programming and usage models for such systems are still emerging and the role OSs will play to make these accelerators accessible is unclear at the moment.
Related Entries
Checkpointing
Clusters
Distributed-Memory Multiprocessor
Flynn’s Taxonomy
Linda
MIMD (Multiple Instruction, Multiple Data) Machines
Metrics
NUMA Caches
Nonuniform Memory Access (NUMA) Machines
File Systems
Race Conditions
Scheduling Algorithms
Shared-Memory Multiprocessors
Single System Image
Bibliographic Notes and Further Reading
Many of the functions and services an OS for a parallel system provides are the same as the ones in most modern OSs. Standard textbooks, such as [, , ], provide good introductions to the various general OS topics mentioned in this article. Understanding some of the hardware and the architecture of computers in general and parallel systems in particular is also necessary to understand OSs. The best book for that is []. A good introduction to the issues involved in designing and evaluating a parallel file system is []. The evolution of a family of lightweight kernels and the experience from developing and using them is described in []. The Compute Node Kernel (CNK) in use on IBM’s Blue Gene machines is discussed in []. Puma [] and its predecessor, the Sandia/University of New Mexico OS (SUNMOS) [], were lightweight kernels for early massively parallel machines such as the nCUBE 2, the Intel Paragon, and Intel’s ASCI Red at Sandia National Laboratories. Performance comparisons between a full-featured OS and a lightweight kernel can be found in [, ]. Another comparison of an earlier system is in []. The experiences designing and deploying one of the earliest large clusters, including the system software required to do it, are chronicled in []. An explanation of the partitioning of a system into service, compute, and I/O partitions is also in []. A brief overview of SSI is presented in [] and a more thorough treatment is in Chap. of [].
Bibliography . Brightwell R, Fisk LA, Greenberg DS, Hudson T, Levenhagen M, Maccabe AB, Riesen R () Massively parallel computing using commodity components. Parallel Comput (–):– . Brightwell R, Maccabe AB, Riesen R () On the appropriateness of commodity operating systems for large-scale, balanced computing systems. In: International parallel and distributed processing symposium (IPDPS ’), Nice. IEEE Computer Society, Washington, DC . Brightwell R, Riesen R, Underwood K, Bridges PG, Maccabe AB, Hudson T () A performance comparison of Linux and a lightweight kernel. In: IEEE international conference on cluster computing, Hong Kong, pp – . Buyya R, Cortes T, Jin H () Single system image. Int J High Perform Comput Appl ():– . Hennessy JL, Patterson DA () Computer architecture, th edn. Morgan Kaufmann, Boston . May JM () Parallel I/O for high performance computing. Morgan Kaufmann, San Francisco . Moreira JE, Almási G, Archer C, Bellofatto R, Bergner P, Brunheroto JR, Brutman M, Castaños JG, Crumley PG, Gupta M, Inglett T, Lieber D, Limpert D, McCarthy P, Megerian M, Mendell M, Mundy M, Reed D, Sahoo RK, Sanomiya A, Shok R, Smith B,
Stewart GG () Blue Gene/L programming and operating environment. IBM J Res Dev :– Nutt G () Operating systems, rd edn. Addison Wesley Pfister GF () In search of clusters: the ongoing battle in lowly parallel computing, nd edn. Prentice-Hall, Upper Saddle River Riesen R, Brightwell R, Bridges PG, Hudson T, Maccabe AB, Widener PM, Ferreira K () Designing and implementing lightweight kernels for capability computing. Concurr Comput Pract Exp ():– Saini S, Simon HD () Applications performance under OSF/ AD and SUNMOS on Intel Paragon XP/S-. In: Proceedings of the ACM/IEEE conference on supercomputing, Supercomputing ’, Washington, DC. ACM, New York, pp – Silberschatz A, Galvina PB, Gagne G () Operating system concepts, th edn. Wiley Tanenbaum AS () Modern operating systems, rd edn. Prentice Hall Wheat SR, Maccabe AB, Riesen R, van Dresser DW, Stallcup TM () PUMA: an operating system for massively parallel systems. Sci Program :–
Optimistic Loop Parallelization Speculative Parallelization of Loops
Orthogonal Factorization Linear Least Squares and Orthogonal Factorization
OS Jitter Cray T3E
OS, Light-Weight Operating System Strategies
Out-of-Order Execution Processors Superscalar Processors
Overdetermined Systems Linear Least Squares and Orthogonal Factorization
Overlay Network Peer-to-Peer
Owicki-Gries Method of Axiomatic Verification Christian Lengauer University of Passau, Passau, Germany
Definition Axiomatic verification, although a general term, usually refers to the method, introduced by Bob Floyd [] and Tony Hoare [] for sequential programs, of proving that a program satisfies a specification formulated as a predicate pair of a pre- and a postcondition. Precondition P states the assumptions made of the program variables before program S executes. Postcondition R characterizes the state of the program variables after program S terminates, if it does (partial correctness). Sometimes, termination is part of the proof obligation (total correctness). The proof obligation is usually denoted by the so-called Hoare triple {P} S {R}. The verification method proceeds by inserting intermediate assertions between the program’s statements and proving the resulting Hoare triples, based on a number of axiomatic inference rules. Susan Owicki and David Gries extended the axiom system to parallel programs with shared memory [, ].
Discussion
Overview
Hoare’s axiom system, also called Hoare logic, provides one axiomatic inference rule for each of the following language constructs: empty statement, assignment (:=), sequential composition (;), binary deterministic choice (if), and iteration (while). Two further rules are
needed to make propositional logic work properly. An axiomatic inference rule is written as follows:
Obligations to be proved
─────────────────────────────────────────────
Conclusion that can be inferred without proof
The reader of this entry is expected to have a basic knowledge of Hoare logic. The extension to parallel programs comes in two flavors: a general setting [] and a restricted setting []. The general setting allows for more concurrency but burdens the programmer with, possibly, complex proof obligations. The restricted setting takes some of the burden of proof away from the programmer and relegates it to the compiler, at the expense of a possible loss of concurrency. In each of the two settings, there is one language construct that introduces parallelism (parallel composition) and one that curtails it (synchronization and exclusion).
The General Setting
The Axioms
● Parallel composition:
{P1} S1 {R1}, ⋯, {Pn} Sn {Rn} are interference-free
───────────────────────────────────────────────────────
{P1 ∧ ⋯ ∧ Pn} cobegin S1 // ⋯ // Sn coend {R1 ∧ ⋯ ∧ Rn}
The Si are called processes and are executed in parallel. The n proof obligations concern the respective processes in isolation. Interference freedom is explained below.
● Synchronization and exclusion:
{P ∧ B} S {R}
──────────────────────────
{P} await B then S end {R}
This rule rests on the assumption that the await statement is implemented as a test-and-set operation, i.e., the successful test of B is followed directly by the execution of S and all of this must proceed without interference. A comparison with the rule for the if statement [] reveals that await does not require a poststate satisfying R if B is false, while if does. Instead, await suspends execution until some parallel statement makes B true. The await statement has recently received support via the concept of transactional memory (see the corresponding encyclopedia entry).
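For comparison, the standard Hoare-logic rule for if can be set beside the await rule (added here for illustration); the difference is precisely the missing premise for the case in which B is false:

% Hoare's rule for binary choice: both the B and the not-B case must
% establish R.
\frac{\{P \wedge B\}\ S_1\ \{R\} \qquad \{P \wedge \neg B\}\ S_2\ \{R\}}
     {\{P\}\ \mathbf{if}\ B\ \mathbf{then}\ S_1\ \mathbf{else}\ S_2\ \{R\}}
\qquad
% The await rule has no premise for not-B: if B is false, execution is
% suspended until another process makes B true.
\frac{\{P \wedge B\}\ S\ \{R\}}
     {\{P\}\ \mathbf{await}\ B\ \mathbf{then}\ S\ \mathbf{end}\ \{R\}}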
Interference Freedom
Shared variables are protected only inside an await statement but can also appear elsewhere in the program. To ensure that unprotected shared variables are being treated properly, one must prove that an assignment to a shared variable in one process does not affect the assertions made in the proof of any other parallel process. Thus, if execution of Si falsifies neither any precondition of a statement in Sj nor the postcondition of Sj, then execution of Si cannot interfere with the proof of Sj, and the proof of Sj is valid even in the face of parallel execution of Si with Sj. More formally, the proof obligation is as follows:
● Consider every process Si of the parallel statement and every statement S of Si that is not inside the body of an await.
● For every assertion Q in the proof of some process other than Si, but not in the body of an await, prove {Q ∧ pre(S)} S {Q}.
This property is called interference freedom. The concept of interference freedom brought the complexity of correctness proofs of parallel programs down to a manageable size. For a program with n processes, each with m statements, the computational complexity of the proof obligation is in O(n²m²): each of the roughly n·m statements must be checked against the assertions of the other processes, of which there are on the order of (n−1)·m. Further, in many cases, the proof of each process can be organized such that many of the assertions can be written in terms of an invariant, which reduces the proof obligation further to showing that this invariant is not interfered with. The validity of the noninterference argument rests on two further assumptions:
● At-most-once property: Every expression e in the program accesses at most one nonlocal, mutable variable and does so at most once. In every assignment x := e, expression e is at-most-once and x is local, or x is a simple variable (i.e., not a data aggregation like an array) and e accesses no nonlocal, mutable variable. For example, if x is nonlocal, x := x+1 is not at-most-once, since x is once read and once written. However, with a local variable t, the equivalent program t := x+1; x := t consists of two assignments that are both at-most-once. In this way, any program violating the at-most-once property can be refined to one satisfying it.
● Memory interlock: Parallel accesses to a fixed memory cell are serialized by hardware.
The await statement is a device to exclude portions of the program from the obligation of a noninterference proof. Assertions placed inside the body of an await are presumed to be shielded from interference. The shorthand [S] stands for the atomic statement: await true then S end.
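One conventional realization of await (an illustrative sketch; the entry treats await as a primitive implemented, e.g., by test-and-set or transactional memory) guards the test of B and the execution of S with a mutex and a condition variable:

#include <pthread.h>

/* Shared state protected by the lock; 'b' plays the role of the
 * condition B, and 'x' stands for the variables updated by S.       */
static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  changed = PTHREAD_COND_INITIALIZER;
static int b = 0;
static int x = 0;

/* await b then x := x+1 end */
void await_then_increment(void)
{
    pthread_mutex_lock(&lock);
    while (!b)                       /* suspended until B holds        */
        pthread_cond_wait(&changed, &lock);
    x = x + 1;                       /* S runs with B true, atomically
                                        with respect to the lock       */
    pthread_mutex_unlock(&lock);
}

/* A process that makes B true must signal the waiting processes. */
void make_b_true(void)
{
    pthread_mutex_lock(&lock);
    b = 1;
    pthread_cond_broadcast(&changed);
    pthread_mutex_unlock(&lock);
}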
Auxiliary Variables
The above rules do not guarantee the relative completeness of the method, i.e., that every correct program is provable (modulo the incompleteness of integer arithmetic due to Gödel). For example, one cannot prove that cobegin [x := x+1] // [x := x+1] coend increases variable x by 2, while one can prove that cobegin [x := x+1] // [x := x+2] coend increases x by 3 (see the example proofs below). The reason is that the two processes in the former parallel statement are indistinguishable, so the individual progress made by each process cannot be accounted for. In order to make it recognizable, one must introduce auxiliary variables (also called history variables) that log the individual progress but do not influence the flow of data or control of the program that is subject to proof. Then, one must assume by axiom that the proof, which holds for the program with auxiliary variables, also holds for the program without auxiliary variables.
● Auxiliary variable: A set AV of auxiliary variables of a program contains only variables that appear solely in assignments x := e with x ∈ AV, where e may contain any kind of variable (auxiliary or not).
● Auxiliary variable axiom: Let S be a program with auxiliary variables and S′ the corresponding program with all assignments to auxiliary variables removed.
{P} S {R}
──────────
{P} S′ {R}
There is an algorithm that introduces into a program the necessary auxiliary variables to conduct the
proof []. It floods the program with auxiliary variables. Typically, the programmer takes up the intellectual challenge instead and inserts a few, well-chosen variables. The auxiliary variable axiom is very powerful. It also supports the removal of await statements, should they become unnecessary after the removal of auxiliary variables.
Total Correctness
For total correctness, i.e., termination (or at least progress for a nonterminating program), three extra steps are necessary. First, termination functions must be provided for loops in order to prove that the individual processes terminate. Second, it must be shown that the parallel composition does not cause any interference with these termination functions. Third, it must be proved that the await statements do not cause deadlock in the parallel composition. Owicki and Gries provide a sufficient, static proof condition for the absence of deadlocks [].
Examples
Claim: {x = 0} cobegin [x := x+1] // [x := x+2] coend {x = 3}
Proof outline:
{x = 0}
{(x = 0 ∨ x = 2) ∧ (x = 0 ∨ x = 1)}
cobegin
   {x = 0 ∨ x = 2}
   [x := x+1]
   {x = 1 ∨ x = 3}
//
   {x = 0 ∨ x = 1}
   [x := x+2]
   {x = 2 ∨ x = 3}
coend
{(x = 1 ∨ x = 3) ∧ (x = 2 ∨ x = 3)}
{x = 3}
Noninterference argument:
{(x = 0 ∨ x = 2) ∧ (x = 0 ∨ x = 1)} [x := x+2] {x = 0 ∨ x = 2}
{(x = 1 ∨ x = 3) ∧ (x = 0 ∨ x = 1)} [x := x+2] {x = 1 ∨ x = 3}
{(x = 0 ∨ x = 1) ∧ (x = 0 ∨ x = 2)} [x := x+1] {x = 0 ∨ x = 1}
{(x = 2 ∨ x = 3) ∧ (x = 0 ∨ x = 2)} [x := x+1] {x = 2 ∨ x = 3}
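To spell out the first of these noninterference checks (added here for illustration): the combined precondition forces x = 0, and the assignment then reestablishes the assertion being protected:

(x{=}0 \vee x{=}2) \wedge (x{=}0 \vee x{=}1) \;\Rightarrow\; x = 0,
\qquad
\{x = 0\}\ x := x{+}2\ \{x = 2\},
\qquad
x = 2 \;\Rightarrow\; x = 0 \vee x = 2 .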
Note that some noninterference must be proved, even though the two processes are fully exclusive: their respective pre- and postconditions must still be checked. To prove the following program, auxiliary variables and an invariant to relate them are necessary.
Claim: {x = 0} cobegin [x := x+1] // [x := x+1] coend {x = 2}
Proof outline: The invariant is I : x = y+z
{x = 0}
y := 0; z := 0
{y = 0 ∧ z = 0 ∧ I}
cobegin
   {y = 0 ∧ I}
   [x := x+1; y := 1]
   {y = 1 ∧ I}
//
   {z = 0 ∧ I}
   [x := x+1; z := 1]
   {z = 1 ∧ I}
coend
{y = 1 ∧ z = 1 ∧ I}
{x = 2}
Noninterference argument:
{(z = 0 ∧ I) ∧ (y = 0 ∧ I)} [x := x+1; y := 1] {z = 0 ∧ I}
{(z = 1 ∧ I) ∧ (y = 0 ∧ I)} [x := x+1; y := 1] {z = 1 ∧ I}
{(y = 0 ∧ I) ∧ (z = 0 ∧ I)} [x := x+1; z := 1] {y = 0 ∧ I}
{(y = 1 ∧ I) ∧ (z = 0 ∧ I)} [x := x+1; z := 1] {y = 1 ∧ I}
Our final example is again the two processes x := x+1 and x := x+2 but with the exclusion lifted. Thus, they must be made at-most-once. The proof of partial correctness succeeds, except that the proof obligation of noninterference cannot be met.
Claim: {x = 0} cobegin t := x+1; x := t // u := x+2; x := u coend {x = 3}
Proof outline:
{x = 0}
{(x = 0 ∨ x = 2) ∧ (x = 0 ∨ x = 1)}
cobegin
   {x = 0 ∨ x = 2}
   t := x+1
   {t = 1 ∨ t = 3}
   x := t
   {x = 1 ∨ x = 3}
//
   {x = 0 ∨ x = 1}
   u := x+2
   {u = 2 ∨ u = 3}
   x := u
   {x = 2 ∨ x = 3}
coend
{(x = 1 ∨ x = 3) ∧ (x = 2 ∨ x = 3)}
{x = 3}
Noninterference argument: x := u affects the postcondition of x := t, i.e., the following proof obligation is not a valid Hoare triple:
{(x = 1 ∨ x = 3) ∧ (u = 2 ∨ u = 3)} x := u {x = 1 ∨ x = 3}
The same holds vice versa. There is interference: the program is not correct.
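A concrete interleaving (added here for illustration) exhibits the interference: both processes read x before either writes it back, so the first increment is lost and the final state violates the postcondition.

\begin{array}{ll}
t := x+1 & (x = 0,\ t = 1)\\
u := x+2 & (x = 0,\ u = 2)\\
x := t   & (x = 1)\\
x := u   & (x = 2)
\end{array}
\qquad\Rightarrow\qquad x = 2 \;\neq\; 3 .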
The Restricted Setting
The Axioms
● Parallel composition: Let r be a set of program variables.
{P1} S1 {R1} ∧ ⋯ ∧ {Pn} Sn {Rn} ∧ FREE ∧ RES
────────────────────────────────────────────────────────────────────────────────
{P1 ∧ ⋯ ∧ Pn ∧ I(r)} resource r : cobegin S1 // ⋯ // Sn coend {R1 ∧ ⋯ ∧ Rn ∧ I(r)}
FREE: no variable free in Pi or Ri is being changed by Sj (j ≠ i)
RES: invariant I(r) refers only to variables in r
● Synchronization and exclusion:
{P ∧ B ∧ I(r)} S {R ∧ I(r)}
──────────────────────────────
{P} with r when B do S end {R}
The with-when statement is a conditional critical region, i.e., it is executed under mutual exclusion. The rules in this setting work only if the following syntactic restrictions are obeyed:
●
Any process may access a resource variable only inside a critical region protecting this variable. ● Any variable changed by process Si may be accessed by process Sj ( j≠i) only if it is a resource variable. Thus, while shared variables must be in the resource, local variables need be in the resource only if the invariant refers to them.
These restrictions can be checked by a compiler. This check takes the place of the proof of interference freedom necessary in the general setting.
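In a modern shared-memory language, a conditional critical region of the form with r when B do S end can be approximated with a lock and a condition variable. The following sketch is only an illustration of the construct, not part of the original formalism; the class and method names are invented for the example:

  import threading

  class Resource:
      # Protects the variables of resource r; with_when(B, S) plays the role of
      # "with r when B do S end": S runs under mutual exclusion, only when B holds.
      def __init__(self, **variables):
          self.vars = dict(variables)
          self._changed = threading.Condition()

      def with_when(self, guard, body):
          with self._changed:
              while not guard(self.vars):      # wait until the guard B is true
                  self._changed.wait()
              body(self.vars)                  # the critical region S
              self._changed.notify_all()       # let waiting regions re-check guards

  r = Resource(x=0)
  # [x := x+1] is shorthand for "with r when true do x := x+1 end"
  r.with_when(lambda v: True, lambda v: v.update(x=v["x"] + 1))
  print(r.vars["x"])   # 1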
Example
In the following claim, the shorthand [S] stands for with r when true do S end.
Claim:
{x = 0} resource r(x) : cobegin [x := x+1] // [x := x+1] coend {x = 2}
Proof outline: The invariant is I(r) : x = y+z
{x = 0} y := 0; z := 0 {y = 0 ∧ z = 0 ∧ I(r)}
resource r(x, y, z) :
cobegin
{y = 0} with r when true do {y = 0 ∧ I(r)} x := x+1; y := 1 {y = 1 ∧ I(r)} end {y = 1}
//
{z = 0} with r when true do {z = 0 ∧ I(r)} x := x+1; z := 1 {z = 1 ∧ I(r)} end {z = 1}
coend
{y = 1 ∧ z = 1 ∧ I(r)}
{x = 2}
Variables y and z are accessed locally but, because the invariant refers to them, they must be part of the resource.
Related Entries
Monitors, Axiomatic Verification of
Formal Methods-Based Tools for Race, Deadlock, and Other Errors
Bibliographic Notes and Further Reading The Owicki–Gries verification method received much attention in the late s and remains useful to this day. It provided the first rigorous treatment of the semantics of imperative shared-memory programs, and its notion of interference freedom is at the core of later
methods. The seminal idea was that the execution of one process should not interfere with the proof of correctness of another process, rather than with its execution. The necessity of the auxiliary variable axiom demonstrated the price that one has to pay for the ability to reason first about processes individually and then about their parallel composition. If the parallel program is considered as a whole, there is no need for auxiliary variables [, ]. The Owicki–Gries method demonstrated that proofs of parallel programs are feasible but require an effort that is, in the worst case, quadratic in the number of statements across all processes that are being composed in parallel. This cannot be avoided if shared updates are allowed to happen without synchronization. Further, proofs can be engineered to use invariants in order to reduce the proof effort. Soundararajan [] was able to do without auxiliary variables and without the noninterference proof by introducing instead history variables into the assertions (so-called hidden variables []). Inherently, their manipulation is of similar complexity but involves no intuition, similarly to Owicki’s algorithm for “flooding” the program with auxiliary variables []. The technical complexity of the general setting revealed the intricacies of unbridled parallelism. Still, it played a significant role in helping to prove implementations of intricate algorithms like Dekker’s algorithm for mutual exclusion or the concurrent garbage collector for LISP data structures []. The restricted setting was laid out before Owicki and Gries by Hoare []. It is relevant in practice today for the verification of monitors. A monitor is an abstract data structure whose operations are conditional critical regions. See the corresponding encyclopedia entry. The Owicki–Gries method was extended in at least two ways to address distributed-memory parallelism [, ]. The lack of shared updates in distributed programs raised the hope for simpler proofs. Unfortunately, the axiom system turned out to be more complex rather than simpler. First, an additional proof step is needed (the so-called cooperation proof [] or satisfaction proof []): the communication operations send and receive must be matched to mutual pairs whose semantics is essentially the same as that of an assignment. Second, interference freedom may still have to be proved as well. All program variables are local but, in general, there is still the need for shared auxiliary variables.
A methodology for the development of parallel programs, which is similar to the one proposed for sequential programs by Dijkstra [] and Gries [], has been proposed by Feijen and van Gasteren []. It is based on the Owicki-Gries verification method.
Bibliography . Apt KR, Francez N, de Roever WP () A proof system for communicating sequential processes. ACM Trans Program Lang Syst ():– . Dijkstra EW () A discipline of programming. Series in automatic computation. Prentice-Hall, Englewood Cliffs . Feijen WHJ, van Gasteren AJM () On a method of multiprogramming. Springer, New York . Floyd R () Assigning meanings to programs. In: Schwartz JT (ed) Proceedings of symposium on applied mathematics, vol , American Mathematical Society, Providence, pp – . Gries D () An exercise in proving parallel programs correct. Commun ACM ():– . Gries D () The science of programming. Texts and monographs in computer science. Springer, Berlin
. Hoare CAR () An axiomatic basis for computer programming. Commun ACM, ():–, . Hoare CAR () Towards a theory of parallel programming. In: Hoare CAR, Perrott RH (eds) Operating systems techniques, A.P.I.C. studies in parallel processing , Academic, London, pp – . Lengauer C () A methodology for programming with concurrency: the formalism. Sci Comput Program ():– . Lengauer C, Hehner ECR () A methodology for programming with concurrency: an informal presentation. Sci Comput Program ():– . Levin GM, Gries D () A proof technique for communicating sequential processes. Acta Informatica ():– . Owicki SS () A consistent and complete deductive system for the verification of parallel programs. In: Proceedings of the Eighth Annual ACM Symposium on Theory of Computing (STOC’), ACM, , pp – . Owicki SS, Gries D () An axiomatic proof technique for parallel programs. Acta Informatica ():– . Owicki SS, Gries D () Verifying properties of parallel programs. Commun ACM ():– . Soundararajan N () A proof technique for parallel programs. Theor Comput Sci (–):–
Parafrase
Bruce Leasure
Saint Paul, MN, USA
Synonyms
Parallelization
Discussion
Parafrase was a successful project allowing experimentation in source-to-source translation of FORTRAN programs. Parafrase was the work of David J. Kuck and his students at the University of Illinois at Urbana-Champaign. It was supported largely by NSF. The name of the system is due to Stott Parker. A wide variety of optimization strategies were developed by researchers, and the performance achieved was measured using a broad cross section of applications, across a variety of theoretical and actual machines. Parafrase was successful because it reduced the effort required of the researcher to implement an optimization strategy, and to perform experiments. Parafrase was structured much like a multi-pass compiler of the same era. The FORTRAN program being compiled was processed one routine at a time. A scanner consumed the source code for a FORTRAN routine and built an equivalent representation in the Internal Language (IL). Then a series of passes that transformed the IL in various ways was run. The researcher would select the passes based on the goals of the research. Then lastly, a FORTRAN code regenerator pass was run. This entire process was repeated for each of the routines in the input program. The FORTRAN program output by Parafrase could be compiled by a FORTRAN compiler and run on any machine. Because the ultimate goal of Parafrase was to regenerate optimized FORTRAN code, the IL used by Parafrase was considerably different than the IL found in a compiler. The IL used by Parafrase was a relatively
straightforward representation of FORTRAN source, with some additional look-aside tables to hold summary information about variables and loop structures. Each pass was required to accept and generate this common IL. Having an IL that was very close to FORTRAN enabled Parafrase to provide an easily understandable IL dump by simply pretty printing the FORTRAN program that was represented in the data structures. The pretty printer could be invoked by Parafrase at the end of a pass, or called by the pass itself to display intermediate results. This made it easy for the researcher to see what an optimization was doing because the researcher only had to understand FORTRAN, not the details of some unusual IL. Parafrase also included an IL verification phase that could be enabled to run after any pass to ensure that that pass was following the rules. This was especially useful when identifying which pass was incorrectly processing a FORTRAN routine. Because all of the passes were required to accept and generate the common IL, the passes could be run in any order. This flexibility allowed the researcher to construct a specialized ordering of the passes to achieve specific goals. Very early in Parafrase’s life, the ability to specify the order of the passes in a text file was added. This enabled the researcher to adjust the order of the passes without having to rebuild the executable image of Parafrase. Parafrase was implemented in PL/I and run on an IBM System/ before virtual memory. With all of the passes and the limited amount of physical memory available, the executable had to be built using the technique of overlays. By placing each pass in its own overlay, the passes would all share the same memory address space, thus reducing the memory footprint. Of course, it was a little more complicated than that, and the obscure interface to IBM’s link editor to build the overlays was understood by only a few people. A build tool was created for Parafrase to automatically
identify an appropriate overlay structure, to create the appropriate link editor commands, and to build the overlay executable, thus relieving the researcher from understanding the messy details of this process. When researchers started to use Parafrase on more than just toy programs, it was quickly discovered that a scanner handling ANSI FORTRAN was not nearly strong enough. The various extensions to FORTRAN promoted by computer manufacturers had changed common use. The only solution was to add these extensions to the Parafrase scanner, and to extend the IL where needed to include them. Eventually, VAX, CDC, HP, IBM, and other manufacturers’ extensions were implemented. Some syntax extensions were just syntactic sugar, and were translated away by the scanner. Others required unique extensions to the IL. Another unexpected discovery when processing non-toy programs was that most FORTRAN programs contained hand-optimized code targeting a particular machine. For most of the research performed with Parafrase, the ideal input program was one that was not hand optimized. Consider the case of exploring the impact of vectorization on a program. If the loops in the program that could be vectorized were unrolled by hand in the original code, then the vector length and the reference pattern for each vector operand would be different than if the loops were not unrolled in the original program. Vector lengths would be shorter, and operands
would be less likely to be contiguous. On most vector capable machines, these changes would seriously impact performance. Consequently, a collection of de-optimization passes were written for Parafrase. It was impossible to write de-optimization passes for every hand optimization. There were just too many. Instead, we kept track of how many important loops exhibited a particular hand optimization, and whenever a sufficiently large number of loops exhibited a particular hand optimization, we wrote a de-optimization pass to address the issue. Parafrase ended up having de-optimization passes and optimization passes that performed inverses of each other for many of the common optimizations. For example: Loop rerolling and loop unrolling, forward substitution and code floating, and loop distribution and loop fusion. The research efforts supported by Parafrase dealt with optimization techniques for supporting new types of hardware acceleration devices: vector processors, streaming memory systems, cache memory systems, multiple functional units, parallel processors, and memory banks. At the time, the impact of any one of these acceleration devices on the performance of ordinary programs was not well understood. The optimization techniques to automatically exploit these acceleration devices were not well understood. Parafrase provided a reasonable vehicle to explore optimization techniques.
Parafrase. Table Theses related to Parafrase
Year | Author | Topics
 | Yoichi Muraoka | Arithmetic expressions, loop dependence testing, wave fronts, scheduling
 | Steve S.C. Chen | m-th order linear recurrences
 | Ross A. Towle | Dependence testing, loops with branching, parallel parsing time
 | Bruce Leasure | Design of Parafrase
 | Walid Abu Sufah | Virtual memory optimization, name partitioning, loop reindexing
 | Utpal Banerjee | Data dependence tests for multi-loop programs
 | David A. Padua | Clustered system loop transforms, loop pipelining, scheduling
 | Robert H. Kuhn | Vector optimization, decision tree optimization
 | Michael J. Wolfe | Direction-vector based optimization, recurrences, while loops
 | Ron G. Cytron | Doacross loop optimization and scheduling
 | Alex Veidenbaum | Blocks of assignment statements, coarse grain optimization
 | Constantine Polychronopoulos | Loop coalescing, subscript blocking, static and dynamic scheduling (GSS)
The history of research can be traced from the early s to the mid s through thesis topics, shown in Table , and published papers listed in the References. Twelve theses and papers are listed from a larger collection of work on the Parafrase system.
Bibliography
. Kuck D, Muraoka Y, Chen SC () On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup. IEEE Trans Comput C-():– . Kuck D, Budnik P, Chen SC, Davis E Jr, Han J, Kraska P, Lawrie D, Muraoka Y, Strebendt R, Towle R () Measurements of parallelism in ordinary FORTRAN programs. Computer ():– . Kuck DJ () Parallel processing of ordinary programs. Adv Comput :–, Rubinoff M, Yovits MC (eds). Academic Press, New York . Abu-Sufah W, Kuck D, Lawrie D () Automatic program transformations for virtual memory computers. Proceedings of the national computer conference, AFIPS Press, June , pp – . Kuck DJ, Padua DA () High-speed multiprocessors and their compilers. Proceedings of the international conference on parallel processing, Aug , pp – . Padua DA, Kuck DJ, Lawrie DH () High-speed multiprocessors and compilation techniques, special issue on parallel processing. IEEE Trans Comput C-():– . Kuck DJ, Kuhn RH, Leasure B, Wolfe M () The structure of an advanced vectorizer for pipelined processors. Proceedings of COMPSAC , The th international computer software and applications Conference, Chicago, IL, Oct , pp – . Kuck DJ, Kuhn RH, Padua DA, Leasure B, Wolfe M. Dependence graphs and compiler optimizations. Proceedings of the th ACM symposium on principles of programming languages (POPL), Williamsburg, VA, Jan , pp – . Cytron R, Kuck DJ, Veidenbaum AV () The effect of restructuring compilers on program performance for high-speed computers. In: Duff IS, Reid JK (eds) Special issue of computer physics communications devoted to the proceedings of the conference on vector and parallel processors in computational science II, vol Elsevier Science Publishers B V (North-Holland Physics Publ.), Oxford, England, pp – . Lee G, Kruskal CP, Kuck DJ () An empirical study of automatic restructuring of nonnumerical programs for parallel processors. Special issue on parallel processing. IEEE Trans Comput C-():– . Constantine Polychronopoulos, Kuck D, Padua D () Utilizing multidimensional loop parallelism on large-scale parallel processor systems. IEEE Trans Comput ():–
Parallel Communication Models
Bandwidth-Latency Models (BSP, LogP)

Parallel Computing
David J. Kuck
Intel Corporation, Champaign, IL, USA

Definition
Parallel computing covers a broad range of topics, including algorithms and applications, programming languages, operating systems, and computer architecture. Each of these must be specialized to support parallel computing, and all must be designed and implemented coherently to provide highly efficient parallel computations.
Discussion Introduction and History All computations transform data – logically and arithmetically – following a series of algorithms. The broad goals of parallel computing are to improve the speed or functional ease with which this is done. Speed and functionality goals can usually be met, but in many practical cases, difficulties arise that present far more complexity than does traditional sequential computing. The following gives an overview of background issues and the current state of parallel computing. The long and diverse history of computing is intertwined with parallelism. There will be no attempt here to formulate a rigorous definition of parallel computing. Instead, an introduction to parallel computing can be provided by sketching the many uses for parallelism over time. Computer system performance has always been the key driver of hardware parallelism. On the other hand, software parallelism has been driven by both performance and application functionality. The baseline of parallelism is serial computing, and in the beginning, some computers operated in bit-serial fashion. One bit, the minimal information unit in a computer, was processed at a time. To improve speed, multiple bits were fetched from memory and transformed in parallel, a process that continued up to wordlevel computation ( or bits). This required making all data paths in a system – memory, processor, and interconnection – one word wide. Fast parallel algorithms for individual arithmetic operations occupied the research agenda in early computing, into the s.
Some early machines used one hardware (HW) technology to implement all parts of a computer, from processor to memory. Over time, as technology advanced and more performance was demanded, distinct technologies were used in different parts of machines – e.g., transistors for processing and magnetic cores for memory. As processor technology speeds increased more quickly than memory, parallel memory units were introduced to match the data rates (i.e., bandwidths measured in bits per second) of processors. As one part of the processor became a performance bottleneck relative to others, multiple units (e.g., adders) were introduced within the processor. The idea of performing more than one arithmetic operation in parallel is an obvious way to improve performance, and was mentioned by the first computer designer, Charles Babbage (who designed a mechanical digital computer in the s [, ]). Eventually, reasonably balanced systems for various types of computation became well understood, and parallelism as above was employed to produce very efficient uniprocessor systems. Simultaneous operation on whole words and overlap of various operations are the basic principles that support parallelism. After sufficient HW has been invested to produce a good architecture using various types of low-level parallelism, the clock speed (which determines the time of a computer’s most basic steps) can be increased to speed up a system. From the beginning, new device technology allowed faster clocks; from the s on, this and architectural refinements drove system performance. But at the high end of performance demands, the ability to change the internal structure of a system by simply adding low-level parallelism became increasingly difficult and yielded slower growth. Architectural refinements were added to keep systems balanced (e.g., cache memory hierarchies) as technology advances drove processor performance faster than offchip memories.
Parallel Architectures
System clock speed determines the bandwidth (bits/sec) of various computer elements, and the clock speed depends on the basic technology available. Physical issues underlie technology, including:
● Transistor density in an integrated circuit limits clock speed
● The speed of light limits data transmission time or latency (measured in seconds)
● Manufacturing costs of HW limit the complexity that can be built into a system
To avoid limits imposed by the first and last of these, parallel processors can be used, each carrying out part of an overall computation. In other words, transistor density increases can be frozen as can the complexity of faster uniprocessor manufacturing, by moving to parallel processing. Although this forces the system size to grow, which in turn exacerbates latency issues, two of the three physical issues are controlled. The number of parallel units being used is often expressed as the degree of parallelism, n, see Fig. . Parallelism also raises a new architectural issue: How does one interconnect the processors and the processors to memory? Ideally one would choose direct connection of each unit to all others (a crossbar switch), but this raises severe cost and design issues as the degree of parallelism grows. At the other extreme, the bus can be extended from within each processor to all others, a simple but low-performance idea. Between these extremes, one can choose a fast parallel ring, torus, or shuffle interconnect network pattern, each with many parallel data transmission paths and various performance and manufacturing trade-offs. For details and more references see [, ]. From the s through the s, a host of research and commercially available parallel systems were built, exploring a wide variety of architectural trade-offs. At the same time, beginning in the s, several related approaches to computing emerged. Multiprocessing, the connection of whole computers to increase system throughput, began. The distinction can be captured by viewing parallel computing as turnaround computing – whose goal is the fast turnaround of one job at a time. In contrast, multiprocessing can be viewed as throughput computing – the goal being to complete a large volume of similar transactions quickly. The third approach, distributed computing, arose for a totally different reason []. Large companies bought multiple computer systems at different sites, and wanted to exchange results among them – e.g., financial information for the whole enterprise – via telephone line communications. Distributed computing in general puts less demand on communication than parallel computing or multiprocessing. By the late s, this idea
Parallel Computing. Fig. Key parallel computation parameters: degree of parallelism (processors p1, p2, . . . , pi, . . . , pn), parallel-phase operations per processor, load imbalance, and nonoverlapped communication time.
grew into the ARPANET among government contractors and agencies, and was followed by new concepts that led to the Internet. At the HW technology level, transistor speeds increased continually and casualties developed – by the early s ECL (emitter coupled logic) could not be pushed faster, and was supplanted by traditionally slower CMOS transistors. By the first decade of the twenty-first century, it became clear that silicon transistors, whose size continued to decrease (Moore’s Law []) and whose power dissipation density concomitantly increased with clock speed, were hitting a “power wall.” The result was that clock speeds could no longer increase, even though densities could continue to grow, allowing more processors on a single semiconductor chip. Thus, parallel computing reached the mainstream, and by has become ubiquitous.
Parallel Software Concepts Over the decades of parallel architecture development, research into algorithms and parallel programming tools enjoyed broad growth. However, software (SW) application development is a very diverse subject, and parallel programming has not matured at the rate that parallel HW has arrived in the mainstream of computing. Roughly speaking, parallel application development suffers from all the problems inherent in sequential programming,
plus a number of additional problems arising from parallelism. As is obvious from the release delays, security weaknesses, and other flaws in sequential applications, SW development methods are still maturing. Some decades after sequential computing HW reached its eventual complexity (except perhaps for cache hierarchy management), SW development remains labor intensive, with results that are error prone, and hard to test. Improving parallel SW development tools remains an important activity.
Application Software Development In , various programming languages are able to capture any given algorithm in many forms. Some algorithms and applications are naturally parallel. For example, little dependence exists among tasks in multiprogramming and distributed computing, and any language that suits their problem domain will do. Other applications contain data and control dependences that appear to force sequential execution of program phases. While a continuum of ideas exists to deal with parallel programming, the following breakdown captures some key points, concluding with the unexploited parallelism remaining in useful parallel applications. This section is an introduction; more details appear in the following sections.
Parallelization Methods . For some apparently nonparallel algorithms and programs, effective SW procedures lead to good parallel performance: (a) Compilers can extract some parallelism from sequential constructs by language-level transforms []. (b) Parallel libraries allow developers to express parts of programs in parallel terms at the language level, without the need to understand the underlying parallel algorithms []. (c) For other program constructs, compilation is too hard and libraries are not available, so a language or algorithm change is necessary for compiler success, forcing the application developer to think through parallelism detail. Three programming models will be discussed in detail below as found in [–]. Thus, there are several ways to adjust some programs to be suitable for efficient parallel execution, but in other cases parallelism is harder to obtain.
Parallelization Impediments . For the most difficult applications, algorithms and their programs, there are inherent, nonparallelizable issues: (a) Data and control dependences arise that cannot practically be removed by a combination of compiler and run-time HW checks. For example, too much branching, or too many pointers or run-time parameter tests exist. (b) Architecture-related clashes arise, e.g., there is too much nonoverlapped interprocessor communication or imbalance between the sizes of parallel tasks that waste parallel computation time, see Fig. . In case , the best approach is to rethink the algorithms and program, and restructure the program or use entirely different algorithms. These steps can be very difficult and nonintuitive, unless a developer is specifically driven by the parallel architectural constraints imposed, understands parallel languages, and has a deep understanding of the underlying problem and various algorithms for solving it.
Unexploited Parallelism in “Well-Parallelized” Applications . Consider the amount of “unexploited” parallelism remaining in an application that is already regarded as well parallelized for a given system. In other words, how far do parallel applications in use deviate from the perfect exploitation of parallelism in their algorithms? This question exposes language and compiler issues. Assume that an algorithm and program contain reduction operations in the form of Eq. , which sums the elements aᵢ of array a into scalar x. If the compiler cannot parallelize this,
x = 0;  for all i (x = x + aᵢ)    ()
then unexploited parallelism can be recovered by using a language (extension) that contains a parallel reduction operator. Much language research is directed toward filling such language gaps in ways that are easy to use in specific algorithms or application domains, e.g., games, image processing, and transaction processing. Figure summarizes several details that determine parallel system performance. The shaded areas represent useful work (their unequal lengths show that not all processor steps take the same amount of time), and the trees represent communication or synchronization, e.g., join/fork or message-passing patterns. Performance increases with the degree of parallelism and parallel-phase operations per processor, which increase the use of parallel resources. Performance decreases due to load imbalance and nonoverlapped communication, which waste resources relative to decreasing elapsed time in Eq. .
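As an illustration of what such a parallel reduction operator provides, the following sketch (an invented example, not taken from the entry; with CPython threads it illustrates the structure rather than real speedup) recovers the parallelism of the loop above by reducing disjoint blocks concurrently and then combining the partial results; any associative operator could replace addition:

  from concurrent.futures import ThreadPoolExecutor

  def parallel_reduce(a, op, identity, workers=4):
      # Split the array into one block per worker, reduce the blocks in parallel,
      # then combine the partial results (associativity of op is required).
      n = len(a)
      bounds = [(i * n // workers, (i + 1) * n // workers) for i in range(workers)]

      def reduce_block(lo_hi):
          lo, hi = lo_hi
          acc = identity
          for v in a[lo:hi]:
              acc = op(acc, v)
          return acc

      with ThreadPoolExecutor(max_workers=workers) as pool:
          partials = list(pool.map(reduce_block, bounds))
      result = identity
      for p in partials:
          result = op(result, p)
      return result

  a = list(range(1, 101))
  print(parallel_reduce(a, lambda x, y: x + y, 0))   # 5050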
System Architecture Performance Criteria
Let computation time be defined as a measure of the useful processing steps in a source program, i.e., corresponding to necessary steps in the original algorithms and their program counterparts. Communication time includes the total HW interconnect time and system SW overhead required in a program and its resulting computation to support the desired computation time. Our overall objective is to minimize total elapsed time, in Eq. (cf. Fig. ).
Total elapsed time = Computation time + Nonoverlapped communication time    ()
In general, the diversity of parallel architectures, especially issues arising from major increases in processor count, changes some of the inherent needs of algorithms to satisfy the constraints outlined above. Furthermore, each data size to which a given algorithm applies may force algorithm adjustments for machine size. Thus, for a given algorithm type, adaptations to machine parameters may be necessary. As outlined above, algorithms, languages, run-time libraries, and architecture are closely linked in obtaining top performance. Good parallel performance across all types of problems can usually be obtained by an approach that includes all of these factors. A universally important performance criterion that follows from Eq. is shown in Eq. (where communication time includes overlapped and nonoverlapped time).
Computation time ≥ Communication time    ()
Maintaining this inequality, averaged over time windows throughout a computation, means that the HW design and application SW running may be brought into balance by overlapping communication with meaningful computation. The ratio can be used as a guideline in system and SW design. If the ratio of Eq. is less than 1, increasing the numerator while decreasing the denominator can reduce overall time and bring the HW use into balance. For example, if the ratio is less than 1, performance may be improved by using an algorithm with redundant operations that reduces the total data communication time. Such a trade-off causes no problems because the key ratio was too low to begin with.
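A small numerical illustration (the figures below are invented for the example) shows how the two equations interact. Suppose a parallel phase performs 60 ms of computation and requires 100 ms of communication, of which 50 ms can be overlapped with the computation:

  computation = 60          # ms of useful work in the phase (assumed)
  communication = 100       # ms of total communication (assumed)
  overlapped = 50           # ms of communication hidden under computation (assumed)

  elapsed = computation + (communication - overlapped)   # total elapsed time
  ratio = computation / communication                    # computation/communication

  print(elapsed)   # 110 ms
  print(ratio)     # 0.6 < 1: communication-bound; an algorithm with redundant
                   # operations that shrinks communication can lower elapsed time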
Parallel Performance
A successfully performing parallel computation must satisfy four basic requirements:
1. Algorithmic parallelism
2. Program parallelism
3. Data size and parallelism
4. Architecture balance and parallelism
These success requirements have many interpretations and interrelations, which are sketched below. They include mathematical issues at the algorithmic level, programming languages that well-suit important application areas, accommodating the data structures and
sizes that occur naturally, and then synthesizing these requirements into an architectural design that is well suited to efficient solution of the problems defined by the original applications. Many books discuss various aspects of performance for a range of algorithms and applications, ranging from broad coverage [] to emphases on nonnumerical [] to numerical [] computations.
1. Algorithmic parallelism: The algorithms used must avoid dependences among operations that force one step to follow another. The essence of parallelism is to be able to do large numbers of operations simultaneously, and this must begin at the algorithm and data structure level. A simple case is an interactive program that has several sequential algorithms, each of which can be run simultaneously – e.g., a game may use three processors working independently on processing user input, readying the next frame, and displaying the current frame. A more scalable algorithm that may require any number of processors is one that operates on arrays of many numbers, or files of many records. For example, if computing xᵢ requires aᵢ and bᵢ, 1 ≤ i ≤ n, as in Eq. , the process can be carried out for any number of values at once. If it requires xᵢ₋₁, as in Eq. , this is a linear recurrence, which
for all i (xᵢ = aᵢ + bᵢ)    ()
x₀ = 0;  for all i (xᵢ = xᵢ₋₁ + aᵢ)    ()
appears to have sequential dependences but can be transformed and computed by reasonable parallel algorithms in O(log n) steps; a sketch of such a prefix computation appears at the end of this subsection. On the other hand, nonlinear recurrences, e.g., Eq.  (1 ≤ i ≤ n), in which each xᵢ is a quotient built from the preceding values xᵢ₋₁, xᵢ₋₂, . . . and coefficients aᵢ and bᵢ, usually present insurmountable mathematical problems
for parallelization, so a different algorithm must be found. . Program parallelism: Given a set of parallel algorithms, expressing them effectively in an appropriate parallel language is relatively straightforward. However, the structure of an algorithm and architecture
will force the choice of language, as the following example illustrates. Game Example: Consider the many algorithms necessary to support an interactive computer game. At the highest level, communication from the players must be coordinated with the internals of the game. These include a changing scene driven by the game logic as well as modifications made by players’ actions. Highquality graphics depends on computationally intense solutions of complex physics equations for mechanical structures, fluid flow, cloth and hair motion, etc. An important consideration for game writers is how to use all of the computing power available to create appealing graphics and game experiences. Finally, each new frame must be composed and displayed. All of this must occur in real time. Three programming models form the core of most programming languages and methods (cf. Fig. ). (a) Message passing among communicating sequential (or parallel) processes may be the appropriate programming model at the highest level of game SW. Each process runs under independent control of the operating system on one or more processors, with the added feature that processes occasionally may send data to, and receive data from, one another. To keep the program logic valid, the system SW must provide for processes interrupting one another, and later expecting acknowledgements back from other processes. Communication delays are typical of performance problems in message-passing programs. A good exposition of MPI, a popular messagepassing language for scalable systems, is found in []. (b) The fork-join parallel model matches well with the execution of high-level tasks internal to a game. Any number of tasks can be dispatched with fork statements to separate processors, in the form of threads, and when their independent work is finished, they combine in join statements. Threads share memory and may be controlled by each application without OS intervention in scheduling. The degree of parallelism is limited by the number of parallel tasks. For a description of OpenMP and its use in shared memory programming, see [].
(c) Data parallel languages, embody a third type of program parallelism. They normally express a great deal of low-level parallelism in the data, e.g., application-specific array oriented languages can be used to express parallelism directly in the form required by an application algorithm. Equation is an example, and may occur in rendering an image for display, where each pixel is computed independently of all others. Another example is movie making, where each frame can be rendered independently of the others. An introduction to Cuda and OpenCL, two recent data parallel languages, is given in []. . Data size and parallelism: The degree of parallelism is also limited by data dependence and data size. The number of independent data structures and the size of each are indicators of the degree of available parallelism in a computation. A successful parallel computation requires data locality in that program references stay relatively confined to the data available in each processor – otherwise too much overhead time can be consumed, ruining parallel performance. The number of parallel-phase operations per processor must be sufficient to dominate the communication time per phase, see Fig. . Equation can be used to explain the reason why. In parallel algorithms, communication among processors (or processor-memory communication) occurs with some frequency; in the easy case, the frequency is low. But as the frequency grows, so does the communication in any time window. Equation says that as the communication time exceeds computation time, performance can suffer. If there is little data per processor, then the computation time window without communication is expected to be small. Thus a small numerator leads to a large denominator, driving the ratio to violate the inequality. . Architecture balance and parallelism: The architecture must have a sufficient number of processors, sufficiently fast global memory access, and interprocessor communication of data and control information that allows parallel scalability. As systems grow in degree of parallelism, the memory and communication systems define the success of a design.
Of course, these qualitative statements imply many difficult technical design choices (mentioned earlier) that must provide a good match to the programs and data sets to be run. In systems with low degrees of parallelism, including modern multicore chips, all of the processors generate addresses for a single, shared address space. This shared-memory parallelism (SMP) allows them to communicate quickly and simply via memory. But as the degree of parallelism increases and parallelism grows, distributed-memory parallel (DMP) systems are used. Whether DMP clusters of smaller SMP systems occupy a single large computer room, or span different cities, large systems have nonuniform memory access time (NUMA), so called because of the very large range of overheads involved in communicating with local (SMP) and global (DMP) memory. These systems present greater programming challenges, in order to minimize and overlap communication and computation time (recall Eq. ) and to schedule tasks properly (load balancing) to minimize wasted resources (cf. Fig. ). The great variety in real-world computations has led designers down many successful paths from the s to the present [, ].
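Returning to the linear recurrence of requirement 1, the O(log n) claim can be made concrete with a parallel prefix (scan) formulation. The sketch below is illustrative only: it uses the Hillis–Steele inclusive scan, whose roughly log₂ n rounds could each be executed in parallel, although here the rounds are written as ordinary loops:

  def inclusive_scan(values, op):
      # Hillis-Steele scan: after round k every element holds the combination of
      # itself with the 2^k elements before it; within a round all updates are
      # independent, which is where the parallelism lies (O(log n) rounds).
      x = list(values)
      n = len(x)
      offset = 1
      while offset < n:
          x = [x[i] if i < offset else op(x[i - offset], x[i]) for i in range(n)]
          offset *= 2
      return x

  # The recurrence x_i = x_{i-1} + a_i with x_0 = 0 is just the prefix sum of a.
  a = [3, 1, 4, 1, 5, 9, 2, 6]
  print(inclusive_scan(a, lambda u, v: u + v))   # [3, 4, 8, 9, 14, 23, 25, 31]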
Parallel Program Correctness Correctness is harder to determine for a parallel program than for a sequential program, and fixing bugs is concomitantly much harder. Whole new classes of bugs arise in parallel programs, for various reasons. A major problem that arises at the hardware level is non-determinism of execution. Because the system is large and some clocks may be out of sync, slight timing variations may exist from run to run of a given program. Even if in sync, different interactions with the OS (e.g., on which processor a process is scheduled) or with other applications, or slightly different data-dependent execution paths can cause a program to execute in different ways each time it runs. These nondeterministic issues make error states nearly impossible to capture or recreate. This is obvious for reactive or interactive programs, whose inputs are difficult to recreate exactly (precisely when did a keystroke that triggered a process occur?). In some parallel programs, synchronization points occur to coordinate the work of distinct processors, e.g.,
at join points and at the end of data parallel operations. These are necessary to ensure that the computations on all processors are complete for a given parallel phase of the overall program, before continuing. In some programming languages, message passing is used to send data and synchronize control of various processors. This is common in distributed memory machines. Software developers often find it hard to get the logic of such programs exactly right, given the variety of inputs some programs have, and the fact that multiple developers may have worked on a given section of code. The logic of some programs may require critical sections, in which a data structure is accessed by only one computational process at a time. For example, if work tasks are being chosen from a queue, to assure that each task is assigned to only one processor, the task assignment algorithm would include a critical section. Software locks may be used to allow just one process at a time to execute a critical section. When combinations of the above occur, various anomalies may arise, including program deadlock, in which, e.g., two processors halt, each waiting for the other. In what is perhaps the worst case, poorly synchronized programs actually do not halt, but spend so much time waiting for one another and occasionally proceeding, that the effect is to lead naïve developers to confuse a complex correctness bug with a performance problem. The result may be fruitless developer time spent using performance tools and techniques to find a correctness problem, which needs other approaches.
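The deadlock scenario mentioned above is easy to reproduce. In the sketch below (a deliberately broken toy program with invented lock names), two threads acquire the same pair of locks in opposite orders; with unlucky timing each ends up holding one lock and waiting forever for the other:

  import threading, time

  lock_a = threading.Lock()
  lock_b = threading.Lock()

  def worker(first, second, name):
      with first:
          time.sleep(0.1)            # widen the window for the bad interleaving
          print(name, "waiting for its second lock")
          with second:               # may block forever if the other thread
              print(name, "done")    # already holds it

  # Opposite acquisition orders: a classic recipe for deadlock.
  t1 = threading.Thread(target=worker, args=(lock_a, lock_b, "t1"), daemon=True)
  t2 = threading.Thread(target=worker, args=(lock_b, lock_a, "t2"), daemon=True)
  t1.start(); t2.start()
  t1.join(timeout=2); t2.join(timeout=2)
  print("deadlocked:", t1.is_alive() and t2.is_alive())   # usually True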
Future The era of ubiquitous parallel processing hardware has arrived, but the development of parallel application SW is lagging. In , single chips contain eight shared memory processors (or “cores”). Distributed memory clusters of these, containing from to over K processors are the state of the art in server computers. The use of clustered systems ranges from scientific research and engineering design, to Internet search engine support. While their clock speeds are frozen, the core count in future chips will continue to provide more performance, based on better architecture, parallel applications algorithms, and SW. Computers based on new technologies and principles (e.g., quantum or biological
principles) could change all of this in some application areas, but it is most likely that parallel SW and architectural issues will continue to dominate much of system design for many decades. There is a seldom mentioned issue in parallelism. After the challenges of getting all current applications well parallelized are met, how does performance continue to grow? There are only two ways that more parallel processors can produce more performance: larger data sets, or more complexity in algorithms for today’s applications. Both of these factors have driven much of computing in the past, but clearly there are limitations for some applications in the future. Those that cannot grow, will be frozen at a performance level, which is fine in some areas. However, application developers will be forced to work harder in the future to exploit parallelism than they did in the past, and new parallel applications will continue to arise. This leads to a host of new and continuing problems to be solved in order to allow parallelism to continue to provide performance benefits. Parallel algorithm research will continue to provide important basic parallelism. SW researchers will design new development tools to aid the above, work on languages that prevent the commission of serious bug errors, and drive toward hardware support for easier ways to write correct programs or tools to aid in debugging. Architecture and HW research will provide faster systems in accordance with Eqs. and . All of these efforts tend to increase the degree of parallelism while reducing parallel computation time; increasing the width and shrinking the length of Fig. .
Bibliography
. Kuck DJ () The structure of computers and computations. Wiley, New York . Randell B () The origins of digital computers. Springer, New York . Hennessy J, Patterson DA () Computer architecture: a quantitative approach, nd edn. Morgan Kaufmann, San Francisco . Jia W, Zhou W () Distributed network systems: from concepts to implementations. Springer, New York . Noyce RN () Microelectronics. Sci Am ():– . Kennedy K, Allen R () Optimizing compilers for modern architectures. Morgan-Kaufmann, San Francisco . Reinders J () Intel threading building blocks. O’Reilly Media, Sebastopol . Chapman B, Jost G, van der Pas R () Using OpenMP: portable shared memory parallel programming. O’Reilly Media, Sebastopol . Gropp W, Lusk E, Skjellum A () Using MPI: portable parallel programming with the message-passing interface, nd edn. MIT, Cambridge . Kirk D, Hwu W-m () Programming massively parallel processors. Morgan Kaufmann, San Francisco . Akl SG () The design and analysis of parallel algorithms. Prentice Hall, Englewood Cliffs . Grama A () Introduction to parallel computing. Addison Wesley, Reading . Trobec R, Vajtersic M, Zinterhof P (eds) () Parallel computing: numerics, applications, and trends. Springer, New York . Kuck DJ () High performance computing. Oxford University Press, New York . Culler DE, Singh JP, Gupta A () Parallel computer architecture. Morgan Kaufmann, San Francisco
Parallel I/O Library (PIO)
Community Climate System Model

Parallel Ocean Program (POP)
Community Climate System Model

Parallel Operating System
Operating System Strategies
Parallel Prefix Algorithms
Reduce and Scan
Scan for Distributed Memory, Message-Passing Systems

Parallel Prefix Sums
Reduce and Scan
Scan for Distributed Memory, Message-Passing Systems

Parallel Random Access Machines (PRAM)
PRAM (Parallel Random Access Machines)
Parallel Skeletons
Sergei Gorlatch, Murray Cole
Westfälische Wilhelms-Universität Münster, Münster, Germany
University of Edinburgh, Edinburgh, UK
Synonyms Algorithmic skeletons
Definition A parallel skeleton is a programming construct (or a function in a library), which abstracts a pattern of parallel computation and interaction. To use a skeleton, the programmer must provide the code and type definitions for various application-specific operations, usually expressed sequentially. The skeleton implementation takes responsibility for composing these operations with control and interaction code in order to effect the specified computation, in parallel, as efficiently as possible. Abstracting from parallelism in this way can greatly simplify and systematize the development of parallel programs and assist in their costmodeling, transformation, and optimization. Because of their high level of abstraction, skeletons have a natural affinity with the concepts of higher-order function from functional programming and of templates and generics from object-oriented programming, and many concrete skeleton systems exploit these mechanisms.
Discussion Traditional tools for parallel programming have adopted an approach in which the programmer is required to micromanage the interactions between concurrent activities, using mechanisms appropriate to the underlying model. For example, the core of MPI is based around point-to-point exchange of data between processes, with a variety of synchronization modes. Similarly, threading libraries are based around primitives for locking, condition synchronization, atomicity,
and so on. This approach is highly flexible, allowing the most intricate of interactions to be expressed. However, in reality many parallel algorithms and applications follow well-understood patterns, either monolithically or in composition. For example, image processing algorithms can often be described as pipelines. Similarly, parameter sweep applications present a parallel scheduling challenge which is orthogonal to the detail of the application and parameter space.
Programming with Skeletons The term algorithmic skeleton stems from the observation that many parallel applications share common internal interaction patterns. The parallel skeleton approach proposes that such patterns be abstracted as programming language constructs or library operations, in which the implementation is responsible for implicitly providing the control “skeleton,” leaving the programmer to describe the application-specific operations that specialize its behavior to solve a particular problem. For example, a pipeline skeleton would require the programmer to describe the computation performed by each stage, while the skeleton would be responsible for scheduling stages to processors, communication, stage replication, and so on. Similarly, in a parameter sweep skeleton, the programmer would be required to provide code for the individual experiment and the required parameter range. The skeleton would decide upon (and perhaps dynamically adjust) the number of workers to use, the granularity of distributed work packages, communication mechanisms, and fault-tolerance. At a more abstract level, a divideand-conquer skeleton would require the programmer to specify the operations used to divide a problem, to combine subsolutions, to decide whether a problem is (appropriately) divisible, and to solve indivisible problems directly. The skeleton would take on all other responsibilities, from whether to use parallelism at all, to details of dynamic scheduling, granularity, and interaction. Thus, in contrast to the micromanagement of traditional approaches, skeletons offer the possibility of macromanagement – by selection of skeletons, the programmer conveys macro-properties of the intended computation. This is clearly attractive, provided that the skeletons offered are sufficiently comprehensive collectively, while being sufficiently optimizable individually and in composition. Sometimes a skeleton may
have several implementations, each geared to a particular parallel architecture, for example, distributed- or shared-memory, multithreaded, etc. Such customization has the potential for achieving high performance, portable across various target machines. Application programmers gain from abstraction, which hides much of the complexity of managing massive parallelism. They are provided with a set of basic abstract skeletons, whose parallel implementations have a well-understood behavior and predictable efficiency. To express an application in terms of skeletons is usually simpler than developing a low-level parallel program for it. This high-level approach changes the program design process in several ways. First, it liberates the user from the practically unmanageable task of making the right design decisions based on numerous, mutually influencing low-level details of a particular application and a particular machine. Second, by providing standard implementations, it increases confidence in the correctness of the target programs, for which traditional debugging is too hard to be practical on massively parallel machines. Third, it offers predictability instead of an a posteriori approach to performance evaluation, in which a laboriously developed parallel program may have to be abandoned because of inadequate efficiency. Fourth, it provides semantically sound methods for program composition and refinement, which open up new perspectives in software engineering (in particular, for reusability). And last but not least, abstraction, that is, going from the specific to the general, gives new insights into the basic principles of parallel programming. An important feature of the skeleton-based methodology is that the underlying formal framework remains largely invisible to application programmers. The programmers are given a set of methods for instantiating, composing, and implementing diverse skeletons, but the development of these methods is delegated to the community of implementers. In order to understand the spectrum of skeleton programming research, it is helpful to distinguish between skeletons that are predominantly data-parallel in nature (with an emphasis on transformational approaches), skeletons that are predominantly taskparallel or related to algorithmic classes, and the concrete skeleton-based systems that have embedded these concepts in real frameworks.
Data-Parallel Skeletons and Transformational Programming
The formal background for data-parallel skeletons can be built in the functional setting of the Bird–Meertens formalism (BMF), in which skeletons are viewed as higher-order functions (functionals) on regular bulk data structures such as lists, arrays, and trees. The simplest – and at the same time the “most parallel” – functional is map, which applies a unary function f to each element of a list, that is,
map f [x₁, x₂, . . . , xₙ] = [f x₁, f x₂, . . . , f xₙ]    ()
Map has the following natural data-parallel interpretation: each processor of a parallel machine computes function f on the piece of data residing in that processor, in parallel with the computations performed in all other processors. There are also the functionals red (reduction) and scan (parallel prefix), each with an associative operator ⊕ as parameter:
red (⊕) [x₁, x₂, . . . , xₙ] = x₁ ⊕ x₂ ⊕ . . . ⊕ xₙ    ()
scan (⊕) [x₁, x₂, . . . , xₙ] = [x₁, x₁ ⊕ x₂, . . . , x₁ ⊕ . . . ⊕ xₙ]    ()
Reduction can be computed in parallel in a tree-like manner with logarithmic time complexity, owing to the associativity of the base operation. There are also parallel algorithms for computing the scan functional with logarithmic time complexity, despite an apparently sequential data dependence between the elements of the resulting list. Individual functions are composed in BMF by means of backward functional composition ○, such that ( f ○ g) x = f (g x), which represents the sequential execution order on (parallel) stages. Special functions, called homomorphisms, possess the common property of being well-parallelizable in a data-parallel manner. Definition (List Homomorphism) A function h on lists is called a homomorphism with combine operation ⊛, iff for arbitrary lists x, y: h (x ++ y) = (h x) ⊛ (h y)
()
Definition describes a class of functions, operation ⊛ being a parameter, which is why it can be viewed as
defining a skeleton. Both map and reduction can obviously be obtained by an appropriate instantiation of this skeleton. The key property of homomorphisms is given by the following theorem: Theorem (Factorization) A function h on lists is a homomorphism with combine operation ⊛, iff it can be factorised as follows: h = red(⊛) ○ map ϕ
()
where ϕ a = h [a]. In this theorem, homomorphism has one more parameter beside ⊛, namely function ϕ. The practical importance of the theorem lies in the fact that the right-hand side of the equation () is a good candidate for parallel implementation. This term consists of two stages. In the first stage, function ϕ is applied in parallel on each processor (map functional). The second stage constructs the end result from the partial results in the processors by applying the red functional. Therefore, if a given problem can be expressed as a homomorphism instance then this problem can be solved in a standard manner as two consecutive parallel stages – map and reduction. The standard two-stage implementation () may be time-optimal, but only under an assumption that makes it impractical: the required number of processors must grow linearly with the size of the data. A more practical approach is to consider a bounded number p of processors, with a data block assigned to each of them. Let [α]p denote the type of lists of length p, and subscript functions defined on such lists with p, for example, mapp. The partitioning of an arbitrary list into p sublists, called blocks, is done by the distribution function, dist (p) : [α] → [ [α] ]p. The following obvious equality relates distribution to its inverse, flattening: red (++) ○ dist (p) = id.
Theorem (Promotion) If h is a homomorphism w.r.t. ⊛, then
h ○ red (++) = red (⊛) ○ map h    ()
This general result about homomorphisms is useful for parallelisation via data partitioning: from (), a
standard distributed implementation of a homomorphism h on p processors follows: h = red (⊛) ○ mapp h ○ dist (p)
()
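A minimal sketch of this block-wise scheme, with summation as the homomorphism (so ⊛ is instantiated as +), might look as follows; the Python realizations of dist and red are only stand-ins for the functions named in the text, and the thread pool merely plays the role of the p processors:

  from concurrent.futures import ThreadPoolExecutor

  def dist(xs, p):
      # Partition the list into p blocks (the dist(p) function of the text).
      n = len(xs)
      return [xs[i * n // p:(i + 1) * n // p] for i in range(p)]

  def red(op, xs, unit):
      acc = unit
      for x in xs:
          acc = op(acc, x)
      return acc

  def homomorphism_on_p_processors(h, combine, unit, xs, p=4):
      # h = red(⊛) ∘ map_p h ∘ dist(p): apply h to each block in parallel,
      # then combine the per-block results with ⊛.
      blocks = dist(xs, p)
      with ThreadPoolExecutor(max_workers=p) as pool:
          partials = list(pool.map(h, blocks))
      return red(combine, partials, unit)

  # Example instance: h = sum is a homomorphism with ⊛ = +.
  data = list(range(1, 1001))
  print(homomorphism_on_p_processors(sum, lambda a, b: a + b, 0, data))  # 500500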
Sometimes, it can be assumed that data is distributed in advance: either the distribution is taken care of by the operating system, or the distributed data are produced and consumed by other stages of a larger application.

The development of programs using skeletons differs fundamentally from the traditional process of parallel programming. Skeletons are amenable to formal transformation, that is, the rewriting of programs in the course of development, while ensuring preservation of the program’s semantics. The transformational design process starts by formulating an initial version of the program in terms of the available set of skeletons. This initial version is usually relatively simple and its correctness is obvious, but its performance may be far from optimal. Program transformation rules are then applied to improve performance or other desirable properties of the program. The rules applied are semantics-preserving, guaranteeing the correctness of the improved program with respect to the initial version. Once rules for skeletons have been established (and proved by, say, induction), they can be reused in the different contexts where the skeletons appear, without having to be reproved.

For example, in the SAT programming methodology [] based on skeletons and collective operations of MPI, a program is a sequence of stages that are either a computation or a collective communication. The developer can estimate the impact of every single transformation on the target program’s performance. The approach is based on reasoning about how individual stages can be composed into a complete program, with the ultimate goal of systematically finding the best composition. The following example of a transformation rule from [] is expressed in a simplified C+MPI notation. It states that, if the binary operators ⊗ and ⊕ are associative and ⊗ distributes over ⊕, then the following transformation of a composition of the collective operations scan and reduction is applicable:

    MPI_Scan(⊗);
    MPI_Reduce(⊕);
        ⇒
    Make_pair;
    MPI_Reduce(f(⊗, ⊕));
    if my_pid == ROOT then Take_first;

Here, the functions Make_pair and Take_first implement simple data arrangements that are executed locally in the processes, that is, without interprocess communication. The binary operator f(⊗, ⊕) on the right-hand side is built using the operators from the left-hand side of the transformation. This rule and other, similar transformation rules have the following important properties: (1) they are formulated and proved formally as mathematical theorems; (2) they are parameterized in one or more operators, for example, ⊕ and ⊗, and are therefore usable for a wide variety of applications; (3) they are valid for all possible implementations of the collective operations involved; (4) they can be applied independently of the parallel target architecture.
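To see informally why the rule is sound, the following Python sketch models both sides of the transformation with an ordinary list standing in for the values held by the MPI processes, instantiating ⊗ with multiplication and ⊕ with addition (so that ⊗ distributes over ⊕). The pair layout and the helper names are assumptions made for this illustration; Make_pair and Take_first correspond to building the pairs and projecting the result component at the root.

    from functools import reduce
    from itertools import accumulate
    import operator

    def scan_then_reduce(xs, otimes, oplus):
        # Left-hand side of the rule: scan with otimes, then reduce with oplus.
        prefixes = list(accumulate(xs, otimes))   # models MPI_Scan(otimes)
        return reduce(oplus, prefixes)            # models MPI_Reduce(oplus)

    def make_pair(x):
        # models Make_pair: (running oplus-result, running otimes-product)
        return (x, x)

    def f(otimes, oplus):
        # The combined operator f(otimes, oplus) on pairs.
        def combine(p1, p2):
            s1, x1 = p1
            s2, x2 = p2
            return (oplus(s1, otimes(x1, s2)), otimes(x1, x2))
        return combine

    def fused_reduce(xs, otimes, oplus):
        # Right-hand side of the rule: a single reduction over pairs.
        pairs = map(make_pair, xs)
        s, _ = reduce(f(otimes, oplus), pairs)    # models MPI_Reduce(f(...))
        return s                                  # models Take_first at ROOT

    xs = [3, 1, 4, 1, 5]
    assert scan_then_reduce(xs, operator.mul, operator.add) == \
           fused_reduce(xs, operator.mul, operator.add)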
Task- and Algorithm-Oriented Skeletons In contrast to the data-parallel style discussed in the preceding section, there are many instances of skeletons in which the abstraction is best understood by reference to an encapsulated parallel control structure and/or algorithmic paradigm. These can be examined along a number of dimensions, including the linguistic framework within which the skeletons are embedded, the degree of flexibility and control provided through the API, the complexity of the underlying implementation framework and the range of intended target architectures. While map is the simplest data-parallel skeleton, farm can be viewed as the simplest task-parallel skeleton. Indeed, in its most straightforward form, farm is effectively equivalent to map, calling for some operation to be applied independently to each component of a bulk data structure. More subtly, the use of farm often carries the implication that the execution cost of these applications is unpredictable and variable, and therefore that some form of dynamic scheduling will be appropriate – the programmer is providing a high-level, application-specific hint to assist the implementation. Typical farm implementations will employ centralized
or decentralized master–worker approaches, with internal optimizations which try to find an appropriate number of workers and an appropriate granularity of task distribution (trading interaction overhead against load balance). From the programmer’s perspective, all that must be provided is code for the operation to be applied to each task, and a source of tasks, which could be a data structure such as an array or a stream emerging from some other part of the program. The bag-of-tasks skeleton extends the simple farm with a facility for generating new tasks dynamically.

Pipeline skeletons capture the pattern in which a stream of data is processed by a sequence of “stages,” with performance derived from parallelism both between stages, and where applicable, within stages. Even such a simple structure allows considerable flexibility in both API design and internal optimization. For example, the simplest pipeline specification might dictate that each item in the stream is processed by each stage, that each such operation produces exactly one result, and that all stages are stateless. For such a pipeline, the programmer is required to provide a sequential function corresponding to each stage. More flexible APIs may admit the possibility of stateful stages, of stages in which the relationship between inputs and outputs is no longer one-for-one (e.g., filter stages, source, and sink stages), of bypassing stages under certain circumstances, and even of controlled sharing of state between stages. From the implementation perspective, internal decisions include the selection of the number of implementing processes or threads, allocation of operations to processes or threads, whether statically or dynamically, and correct choice of synchronization mechanism.

The divide-and-conquer paradigm underpins many algorithms: the initial problem is divided into a number of sub-instances of the same problem to be solved recursively, with subsolutions finally combined to “conquer” the original problem. For the situation in which the sub-instances may be solved independently the opportunities for parallelism are clear. Within this context, there is considerable scope to explore a skeleton design space in which constraints in the API are traded against performance optimizations within the implementation. A very generic API would simply require the programmer to provide operations for “divide” and “conquer,” together with a test determining whether an
instance should be solved recursively, and a direct solution method for those instances failing this test. Less flexible APIs could, for example, require the degree of division to be fixed (every call of divide returns the same number of subproblems), the depth of recursion to be uniform across all branches of the divide tree, or even for some aspects of the “divide” or “conquer” operations to be structurally constrained. Each such constraint provides useful information, which can be exploited within the implementation.

Wavefront computations are to be found at the core of a diverse set of applications, ranging from numerical solvers in particle physics to dynamic programming approaches in bioinformatics. From the application perspective, a multidimensional table is computed from initial boundary information and a simple stencil that defines each entry as a function of neighboring entries. Constraints on the form of the stencil allow the table to be generated with a “wavefront” of computations, flowing from the known entries to the unknown, with consequent scope for parallelism. A wavefront skeleton may require the programmer to describe the stencil and the operation performed within it, leaving the implementation to determine and schedule an appropriate level of parallelism. As with divide-and-conquer, specific wavefront skeleton APIs may impose further constraints on the form of these components.

Branch-and-bound is an algorithmic technique for searching optimization spaces, with built-in pruning heuristics, and scope for parallelization of the search. Efficient parallelization is difficult, since although results are ultimately deterministic, the success of pruning is highly sensitive to the order in which points in the space are examined. This makes it both promising and challenging to the skeleton designer. With respect to the API, a branch-and-bound skeleton can be characterized by a small number of parameters, capturing operations for generating new points in the search space, bounding the value of such a point, comparing bounds, and determining whether a point corresponds to a solution.
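The following Python sketch illustrates the shape of such APIs: a minimal farm built on a thread pool, whose internal scheduling plays the master–worker role, and a generic divide-and-conquer skeleton parameterized by the four operations just described. It is an illustrative, largely sequential sketch rather than the interface of any particular skeleton system, and all names are invented for the example.

    from concurrent.futures import ThreadPoolExecutor

    def farm(worker, tasks, workers=4):
        # Simplest task farm: apply `worker` independently to each task; the
        # executor's internal scheduling distributes tasks over the workers.
        with ThreadPoolExecutor(max_workers=workers) as ex:
            return list(ex.map(worker, tasks))

    def divide_and_conquer(instance, is_basic, solve, divide, combine):
        # Generic divide-and-conquer skeleton; a parallel implementation would
        # evaluate the independent sub-instances concurrently instead of in
        # this sequential loop.
        if is_basic(instance):
            return solve(instance)
        subs = [divide_and_conquer(s, is_basic, solve, divide, combine)
                for s in divide(instance)]
        return combine(subs)

    # Example use: mergesort expressed through the skeleton.
    def merge(parts):
        left, right, out = list(parts[0]), list(parts[1]), []
        while left and right:
            out.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
        return out + left + right

    print(farm(sum, [[1, 2, 3], [4, 5], [6]]))        # [6, 9, 6]
    print(divide_and_conquer(
        [5, 2, 9, 1, 7, 3],
        is_basic=lambda xs: len(xs) <= 1,
        solve=lambda xs: xs,
        divide=lambda xs: [xs[:len(xs) // 2], xs[len(xs) // 2:]],
        combine=merge))                               # [1, 2, 3, 5, 7, 9]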
Skeleton-Based Systems Past and ongoing research projects have embedded selections of the above skeletons into a range of programming frameworks, targeting a range of platforms. The P3L language [] was a notable early example in which skeletons became first-class language constructs,
together with sequential “skeletons” for iteration and composition. Subsequent projects within the Pisa group have taken similar skeletons into object-oriented contexts, with a focus on distributed and Grid implementations. In contrast, Muesli [] allows data-parallel skeletons to be composed and called from within a layer which itself composes task-parallel skeletons, all within a C++ template-based library. Several systems have taken a functional approach. Herrmann’s HDC [] focuses exclusively on a collection of divide-and-conquer skeletons within Haskell, and is implemented on top of MPI, while the Eden skeleton library [] and related work [] have implemented skeletons on top of both implicitly and explicitly parallel functional languages. The COPS project [] presents a layered interface in which expert programmers can also have access to the implementation of the provided “templates,” all embedded in Java with both threaded and distributed RMI-based implementations. The SkeTo project offers a BMF-based collection of data-parallel skeletons for lists, matrices, and trees, implemented in C with MPI.

Domain-specific skeletons are represented within the Mallba project [] (focusing on combinatorial search) and the Skipper and QUAFF projects (image processing). Higher-Order Components (HOCs) [], Enhance [], and Aspara [] extend the idea of skeletons toward the area of distributed computing and Grids. HOCs implement generic parallel and distributed processing patterns, together with the required middleware support, and are offered to the user via a high-level service interface. Users only have to provide the application-specific pieces of their programs as parameters, while low-level implementation details such as transfer of data across the grid are handled by the HOC implementations. HOCs have become an optional extension of the popular Globus middleware for Grids.

Skeleton principles are also evident in a number of other emerging parallel programming tools. Most notably, the MapReduce paradigm (related to, but distinct from the similarly named BMF skeletons) represents a pattern common to many applications in Google’s infrastructure. Emphasis in the original implementation was placed on load balancing and fault-tolerance within an unreliable massively parallel computational resource, a task strongly facilitated by the structurally constraining nature of the skeleton. Thain’s Cloud Computing Abstractions [] implement
a range of distributed patterns with applications in Biometrics and Genomics. As with MapReduce, these are trivial to implement sequentially, and relatively straightforward on a reliable, homogeneous parallel architecture. The strength of the approach becomes apparent when ported to less predictable (or reliable) targets, with no additional effort on the part of the programmer. Finally, skeletal approaches can also be discerned in MPI’s collective operations [], where MPI_Reduce and MPI_Scan are parameterized by operations as well as data, and Intel’s Threading Building Blocks [], which in particular includes a pipeline skeleton.
Related Entries
Collective Communication
Eden
Glasgow Parallel Haskell (GpH)
NESL
Reduce and Scan
Scan for Distributed Memory, Message-Passing Systems
Bibliographic Notes and Further Reading The term “algorithmic skeleton” was originally introduced by Cole []. A considerable body of work now exists. Helpful snapshots can be obtained by consulting the book edited by Rabhi and Gorlatch [], the September special edition of the journal Parallel Computing [], and the recent survey by Gonzalez-Velez and Leyton [].
Bibliography
Alba E, Almeida F, Blesa MJ, Cabeza J, Cotta C, Díaz M, Dorta I, Gabarró J, León C, Luna J, Moreno LM, Pablos C, Petit J, Rojas A, Xhafa F () MALLBA: a library of skeletons for combinatorial optimisation. In: Proceedings of the International Euro-Par Conference on Parallel Processing, Springer, London, pp –
Anvik J, Schaeffer J, Szafron D, Tan K () Why not use a pattern-based parallel programming system? Euro-Par, Klagenfurt, Austria, pp –
Ciechanowicz P, Poldner M, Kuchen H () The Münster skeleton library Muesli – a comprehensive overview. Technical report, University of Münster
Cole M () Algorithmic skeletons: structured management of parallel computation. MIT Press, Cambridge
Dünnweber J, Gorlatch S () Higher-order components for grid programming: making grids more usable. Springer, New York
Gonzalez-Velez H, Cole M () Adaptive structured parallelism for distributed heterogeneous architectures: a methodological approach with pipelines and farms. Concurrency Comput Pract Exp ():–
Gonzalez-Velez H, Leyton M () A survey of algorithmic skeleton frameworks: high-level structured parallel programming enablers. Software Pract Exp :–
Gorlatch S () Towards formally-based design of message passing programs. IEEE Trans Softw Eng ():–
Gorlatch S () Send-receive considered harmful: myths and realities of message passing. ACM TOPLAS ():–
Gorlatch S, Lengauer C () Abstraction and performance in the design of parallel programs: an overview of the SAT approach. Acta Inf (/):–
Hammond K, Berthold J, Loogen R () Automatic skeletons in template Haskell. Parallel Process Lett ():–
Herrmann CA, Lengauer C () HDC: a higher-order language for divide-and-conquer. Parallel Process Lett (/):–
Kuchen H, Cole M (eds) () Algorithmic skeletons. Parallel Comput (/):–
Loogen R, Ortega Y, Pena R, Priebe S, Rubio F () Parallelism abstractions in Eden. In: Rabhi F, Gorlatch S (eds) Patterns and skeletons for parallel and distributed computing. Springer, London, pp –
Pelagatti S () Structured development of parallel programs. Taylor & Francis, London
Rabhi FA, Gorlatch S (eds) Patterns and skeletons for parallel and distributed computing. Springer, London
Reinders J () Intel threading building blocks. O’Reilly, Sebastopol, CA
Thain D, Moretti C () Abstractions for cloud computing with Condor. In: Ahson S, Ilyas M (eds) Cloud computing and software services. CRC Press, Boca Raton
Yaikhom G, Cole M, Gilmore S, Hillston J () A structural approach for modelling performance of systems using skeletons. Electr Notes Theor Comput Sci ():–
Parallel Tools Platform
Gregory R. Watson
IBM, Yorktown Heights, NY, USA
Synonyms PTP
Definition The Parallel Tools Platform (PTP) is an integrated development environment (IDE) for developing parallel programs using the C, C++, Fortran, and Unified Parallel C (UPC) languages. PTP builds on the Eclipse platform
by adding features such as advanced help, static analysis, parallel debugging, remote projects, and remote launching and monitoring. It also provides a framework for integrating other non-Eclipse tools into the Eclipse platform. The Parallel Tools Platform is not restricted to any particular programming model, but most tools to date are designed to support coarse-grained parallelism.
Discussion
Challenges

The challenges facing a developer writing parallel programs largely fall into the following three areas:

Coding: The programmer must translate the mathematical model of a problem into an algorithm that is implemented using a particular programming language and parallel programming model. This process of translation involves a large degree of effort, since there is often a significant mismatch between the model and an implementation that fully exploits the available parallelism.

Testing and debugging: Once an application program has been created, its correctness and accuracy must be validated. This involves a cycle of comparing the program output to known values using a variety of input data sets in order to verify that the results are correct. Coding errors, logic errors, and nondeterministic behavior all have to be addressed during this process, typically by employing an interactive debugger that allows program state to be inspected at key points during execution.

Performance optimization: A correctly functioning program may still not optimally utilize the available hardware resources. A variety of tools are typically employed to examine program behavior in order to identify bottlenecks and other performance related issues. These tools usually collect relevant information about the program execution, either directly, or in an aggregated form, and provide a means of analyzing and interpreting the data in order to identify actions that can be taken to improve performance. This is important for many situations when the speed of execution is time critical, such as in weather applications.

These activities are made significantly more difficult when applications are being developed for architectures that employ concurrent execution (or parallelism) in order to achieve extremely high levels of performance. Parallel programming introduces many additional complexities that impact on all aspects of the development process. The nature of these complexities depends on the particular programming model employed, but includes issues such as deciding how the algorithm or data is to be partitioned, the need to deal with communication in addition to computation, and the inherent nondeterminism introduced by concurrent execution.

Productivity

The quest to maximize productivity has received a much greater emphasis in recent years, where productivity is defined as both the speed and ease at which a correctly functioning application can be developed combined with the level of performance achieved by fully exploiting the available parallelism. Improving productivity is therefore dependent on a wide range of factors. One approach for improving the application development process, and that has seen wide success across a range of computing disciplines, is the use of an integrated development environment (IDE). An IDE combines a core set of tools into a single package that provides a consistent user interface across a variety of systems and architectures. Many IDEs also provide a plug-in capability so that the core set of tools can be extended as new tools are developed or extra functionality is required. IDEs tend to improve developer productivity because the required tools are always available regardless of the environment, and also because they are integrated together, tools can share data and other information in order to streamline the developer workflow. Scientific computing is one of the few areas that have largely shunned the use of IDEs. Recently, however, the benefits of using an IDE for developing scientific programs have become more apparent to the community. This is probably because the scope of the programming task for the new generation of supercomputers is looking increasingly daunting, combined with an influx of new developers who have already had exposure to, and understand the advantages of, IDEs.
Productive Parallel Programming The goal of the Parallel Tools Platform is to address the challenges facing developers of parallel programs in a
manner that leads to improved productivity. The strategy employed by PTP to achieve this is to combine the productivity enhancements inherent in an integrated development environment with a number of key tools that specifically target the parallel application developers’ capacity to deliver correct applications that are optimized for their target platforms.
C, C++, Fortran, and UPC Programming in Eclipse The Eclipse platform is an open source, vendor neutral, portable, extensible platform for tool integration. It employs a plug-in architecture that allows tools to be integrated directly with the platform, and provides a range of core functionality that is suited to a variety of application development activities. This core functionality includes support for multi-developer projects, an integrated help system, multiple language support, project resource management, advanced editing features, incremental builds, and a range of other features. Eclipse functionality is extensible using the plug-in
mechanism, and there is a rich community of developers and a large number of both commercial and open source plug-ins available. The IDE is also highly portable, so it is available on a wide variety of hardware platforms. The main interface provided by Eclipse is the workbench shown in Fig. . The workbench arranges views (text editors, project explorers, consoles, and other user interface components) into groupings known as perspectives. There is typically a perspective available for particular types of activities (coding, debugging, etc.) Eclipse also permits views to be shared between different perspectives. Along with a traditional menu bar, the workbench provides a toolbar containing a variety of actions that are associated with the currently active perspective. There is also a status bar and a range of other user interface decorations. Individual views may also have toolbars and menus that vary depending on context. Eclipse provides a number of advanced coding and editing features that are common across any language
Parallel Tools Platform. Fig. The Eclipse workbench showing the C/C++ development perspective
that it supports. Many of these features rely on core functionality that is provided as standard in Eclipse, but is typically not available in other development environments. Such features include:

Context-sensitive help: In order to provide user assistance that is targeted to a particular context, this feature can deliver help documentation on demand (for example, when the user hits a key) or automatically display information such as prototype definitions or code snippets that pop up when the cursor is placed over a particular piece of text.

Searching: In addition to the usual string or pattern matching searches available in many editors, Eclipse allows searching on language elements (such as types, variables, functions, etc.), the ability to limit searches to contexts (such as declarations, references, etc.), and to specify the scope of the search (such as a subset of the files in a project).

Content assistance: Reducing the amount of typing that a developer is required to do can be a powerful means of improving productivity. By inferring what the developer will type based on context and scope information, Eclipse’s code completion makes suggestions using the first few letters of the language element currently being entered (e.g., a function name, variable name, etc.), via either manual or automatic activation. Frequently used snippets of code can also be saved as code templates. These templates can then be inserted into the text according to the current scope in the same manner as other code completions.

Refactoring: Improving code by changing its structure without changing behavior is a common, but error-prone, task in programming. By providing a range of common but sophisticated refactorings that can be automatically performed, Eclipse is able to prevent many of the errors that are introduced by manual changes. Typical refactorings include the ability to rename language elements, such as variables or functions, extract sections of code into a method or function, and automatically declare an interface, as well as many others.

Wizards: Another method of reducing the amount of repetitive work is to automate many of the common tasks that a developer must undertake. Eclipse provides a variety of wizards for activities such as creating new projects, classes, or files, importing projects into
Eclipse, exporting projects from Eclipse, and project conversion. In addition to these powerful features, Eclipse provides support for application building, either using project-supplied build scripts (for example Makefiles), or using an internal build system to track file dependencies and manage external tools such as compilers and linkers. The build support is also able to process errors and warnings generated by external tools, and map these back to messages displayed in the user interface, and markers that are inserted into the editor views. The Eclipse C/C++ Development Tools (CDT) provide C, C++, and UPC language support, while the Parallel Tools Platform Photran feature adds Fortran support to Eclipse. Support for a wide range of other languages is also available through a variety of Eclipse and third-party plug-ins.
Code and Static Analysis for Parallel Programs In addition to the advanced editing features that are a standard part of Eclipse, PTP adds a range of enhanced features that are specifically targeted toward parallel programming. The first of these provides additional high-level help documentation for language elements used by common programming models, such as the message passing interface (MPI), the OpenMP Application Programming Interface, IBM’s Low-level messaging Application Programming Interface (LAPI), and UPC. This documentation can be accessed using Eclipse’s standard context sensitive help feature during coding, and will display prototype information along with a detailed description of the element. Another feature provided by PTP is cataloging and navigation of MPI application programming interfaces and OpenMP pragma statements in a parallel program. This enables the developer to see the elements that have been used in the program at a single glance, and to navigate to the source code line that contains a selected element. PTP also includes a Barrier Analysis tool that provides a more sophisticated analysis of MPI programs. This tool locates all calls to MPI barrier functions (synchronization points) in a program, and computes logical paths through the program to determine if it
is possible for a deadlock to occur. This can happen because all tasks must synchronize on an MPI barrier, but logic errors may cause one or more tasks to miss execution of the barrier. The advantage of this type of analysis is that it can be performed without the need to execute the program.
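The kind of defect targeted by such an analysis can be sketched with a few lines of mpi4py (this is only an illustration of a mismatched barrier, not of PTP's implementation; the file name and process count are arbitrary):

    # Run with, e.g.: mpiexec -n 4 python barrier_bug.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    data = rank * 10
    if rank != 0:
        # Every rank except 0 waits here; rank 0 never reaches a matching
        # barrier, so the program hangs. A path-based barrier analysis can
        # report this mismatch without running the program.
        comm.Barrier()
    print("rank", rank, "done", data)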
Launching and Monitoring Parallel Programs One hindrance to developer productivity is the seemingly basic ability to easily launch an application on a target parallel system. Most high-performance computing systems employ complex environments to monitor and control application launching. Many of these systems are also highly centralized assets that are tightly controlled, and they will typically restrict direct user access using a batch scheduler system. A developer’s ability to interact with these systems will therefore be limited by their ability to deal with the additional complexity that this introduces. This is particularly the case where programs are being ported from one system to another, or where the developer has access to multiple systems and architectures. PTP’s approach to this issue is to provide a single, uniform interface that allows the developer to launch and monitor applications on a wide range of systems, without needing to understand the intricacies of each particular system. This is achieved through two features:
Parallel runtime perspective: This perspective provides a number of views that give the developer a snapshot of system activity at any particular time. Importantly, the views are independent of the type of system that is being targeted, so the developer does not need to be concerned with the underlying system details. The perspective is comprised of the Resource Manager View, which lists the systems that are available to the developer for launching and debugging applications, the Machines View, which shows the status of the system and its computing elements, and the Jobs View, which lists pending, running, and completed jobs on the system. A console is also available to display the program-generated output. Parallel application run configuration: This utilizes the standard Eclipse Run Configuration dialogs, which are
used to enter all the parameters necessary for a successful program launch. Run configurations are automatically saved so they can be easily used to relaunch an application during testing or debugging activities. PTP provides an extended run configuration for parallel programs that allows system-specific information about the parallel run to be supplied.
Parallel Debugging Developing complex parallel programs is a difficult task, and it is particularly challenging to ensure that they operate correctly. Finding errors in a parallel program is complicated because the many concurrent threads of execution make it difficult to observe the program operation in a deterministic manner. Managing large applications running on many processors may present a problem to the developer as well since the sheer volume of information may be overwhelming. The very act of debugging the program may also disturb its operation enough to make identifying the cause of an error impossible. The PTP debugger attempts to address some of these issues. In particular, the debugger provides the basic functionality needed to step through the execution of the program examining variables and program state along the way, and an easy to use interface that enables the developer to quickly obtain pertinent information about an executing application. The debugger is invoked in the same manner as any other application launch, by clicking a button. Eclipse will automatically switch to the Parallel Debug Perspective when the debugging session is ready. The Parallel Debugger Perspective comprises a number of views to facilitate the debugging task. The Parallel Debug View provides a high level view of the application, showing all the tasks currently executing. Debug operations on groups of processes, such as single stepping, can be performed using this view. The Debug View provides information about individual threads of execution, including a stack frame showing the current location. The Variables View shows the local variables from the currently selected stack frame in the Debug View, and can be used to inspect variables from a range of tasks. Other views are also available to inspect different aspects of the program state.
Utilizing External Tools Although PTP provides a range of tools for developing parallel programs, there are many other tools available that could be used to the developer’s advantage if they were accessible from within the Eclipse environment. To facilitate this, PTP provides an external tools framework that allows non-Eclipse tools to be integrated in a manner that preserves the developer’s workflow. The framework defines three main integration points during the application development workflow: compile, execute, and analyze. The compile integration point specifies how the normal build commands are to be modified during the build process to perform any tool-specific actions. An example of this might be to instrument the application to collect tracing or profiling information. The execute integration point specifies how the command used for launching of the application executable can be modified to perform any tool-specific actions. An example of this might be to pass the application executable to a tool that performs data collection during the execution. The analyze integration point allows an external tool to be launched once the program execution is complete, such as a tool to analyze performance data generated during execution. In addition to these integration points, the external tools framework provides a Feedback View that enables externally generated information to be mapped back to source files and lines within the Eclipse environment. Eclipse can then display this information in the form of sortable tables, or by annotating the source code directly with markers or other forms of highlighting.
Remote Development To be an effective enhancement to conventional development processes, PTP needs to support a broad range of environments and systems. Many of these systems are scarce resources, so access is often tightly controlled from remote locations. Ideally, application developers for these systems need to be able to access these resources for testing, debugging, and performance optimization as if they were local. This enables the development process to be significantly streamlined, since the developer does not need to be concerned with copying executables and/or data files from system to system. PTP addresses this requirement by adding a Remote Development Tools (RDT) feature. This enables a project
to be physically located on a remote machine, but allows access to the project and its source files for editing and building as if they were local. RDT takes care of all the activities necessary to make this process appear as transparent as possible to the developer. This can also be combined with PTP’s ability to launch and debug programs on a remote target system, so the developer is able to take advantage of a fully remote enabled environment.
The Parallel Tools Platform Today PTP is an evolving project. Started in , it has an active and growing developer community that spans a range of government, academic, and commercial organizations. The number of developers contributing to the project continues to increase, including the renowned National Center for Supercomputing Applications, which has recently begun to actively participate in PTP development. Another promising development is the contribution of value-added components that enhance the functionality of PTP. Contributions have been made by the University of Oregon to add functionality to support their performance analysis tool, called Tuning and Analysis Utilities (TAU), the University of Utah to support their tool for formal verification of MPI programs, called In-situ Partial Order (ISP), and the University of Florida to support their tool for performance analysis of partitioned global-address-space (PGAS) programming models, called Parallel Performance Wizard (PPW). PTP’s main goal to date has been to demonstrate that Eclipse is a viable alternative for the development of scientific applications for large-scale computer systems. The list of features that are now available, and the growing developer and contributor community, demonstrate that this goal has been met.
Future Directions A comprehensive IDE for developing applications for a broad range of parallel platforms is a huge undertaking. PTP has begun by addressing many of the most pressing issues, such as programmer assistance, uniform access to remote systems, parallel debugging, and external tool integration. However there are still a large number of
issues that remain to be dealt with, and these will form the basis of much of the ongoing development over the coming years. Some of these issues include:

Scalability: Both application and system sizes are increasing dramatically. The next generation of petascale systems will have hundreds of thousands of processing elements, and executions that exceed one million threads will be likely. The applications themselves are growing continuously, with millions of lines of code now becoming increasingly common. To continue to be effective, PTP will need to support such scales without unduly impacting developer productivity.

Debugging paradigms: Conventional debugging approaches have been successfully applied to parallel programs for some years. However, as program sizes increase, it is likely that these techniques will break down. One reason for this is that the sheer volume of information (for example, a million executing threads) is likely to overwhelm the developer. New debugging paradigms will need to be introduced that address this information overload, yet still allow the developer to accurately pinpoint the location of errors within a program.

Static analysis tools: Eclipse has an enormously powerful infrastructure available to the tool developer, and PTP has only really touched the surface of possible tools to exploit this. There is significant scope for much more powerful static analysis and refactoring tools that could have a very beneficial impact on developer productivity, and work is actively underway to enhance this capability.

Remote development: To be a truly effective remote development platform, PTP needs to support additional paradigms for developing applications remotely. In particular, the ability to work off-line and support for synchronizing with a remote repository are two areas that would enhance developer productivity significantly.

Multi-core architectures: The computing industry is beginning to see a real convergence between multicore systems, which have already reached hundreds or even thousands of cores, and parallel architectures, which already employ multi-core programming elements. Many of the techniques used for parallel programming may also be beneficial for applications that wish to exploit multi-core technology. PTP is the logical home for exploring this convergence, and work is beginning to take place in this area.
Related Entries
Compilers
Debugging
Fortran and its Successors
MPI (Message Passing Interface)
OpenMP
PGAS (Partitioned Global Address Space) Languages
Performance Analysis Tools
Scheduling
TAU
UPC
Bibliographic Notes and Further Reading The idea of using computers to aid in the software engineering process extends back to the late s; however, it was not until the s that the market for computer-aided software engineering or CASE-based tools became significant (the term CASE was not coined until ). CASE covers a very broad area of software engineering practices, whereas integrated development environments, which are one class of CASE tools, tend to focus on the relatively small set of software engineering practices associated with the edit-build-test-debug development lifecycle. There have been a number of papers describing the relationship between productivity and the use of IDEs, and the productivity and quality improvements that can be obtained by such environments are well documented [, , , ]. Their success is also reflected in the best practice use of IDEs for most commercial software development today. Van De Vanter et al. [] have asserted that there are many reasons why existing developer tools are not going to meet the productivity requirements of high performance computing: they are difficult to learn, they may not scale as machine sizes increase, they are different across different platforms, they are hard to develop or port to new platforms, they are often poorly supported, and they can be expensive. The use of IDEs for parallel computing has also been explored before [–, ], but all of these appear to have suffered from low levels of use, were academic research projects, or were too expensive to develop and maintain, and so have gradually died out. PTP is the first time that an open-source IDE has been tailored specifically for scientific and parallel computing [].
The not-for-profit Eclipse Foundation was created in to oversee the stewardship of the open-source Eclipse platform. Since it was established, the Foundation membership has grown to exceed organizations. Many of these companies use Eclipse as the core of a commercial product offering, after having abandoned their own proprietary IDE technology and switched to the Eclipse platform. Eclipse is widely used across many computing fields, and arguably has the largest number of users of any IDE in history. It is also now one of the largest open-source projects and there have been over two million downloads of the latest version alone [].
Bibliography
Bemmerl T () TOPSYS for programming distributed multiprocessor computing environments. In: Proceedings of computer systems and software engineering, The Hague, pp –
Brode BQ, Warber CR () DEEP: a development environment for parallel programs. In: Proceedings of the international parallel processing symposium, Orlando, pp –
Callahan D, Cooper K, Hood R, Kennedy K, Torczon L () ParaScope: a parallel programming environment. In: Proceedings of the first international conference on supercomputing, Athens, Greece, June
Cownie J, Dunlop A, Hellberg S, Hey AJG, Pritchard D () Portable parallel programming environments – the ESPRIT PPPE project, massively parallel processing applications and development, The Netherlands, June
Frazer A () CASE and its contribution to quality. The Institution of Electrical Engineers, London
Granger MJ, Pick RA () Computer-aided software engineering’s impact on the software development process: an experiment. In: Proceedings of the Hawaii international conference on system sciences, Jan , pp –
http://eclipse.org/downloads
Kacsuk P, Cunha JC, Dózsa G, Lourenço J et al () A graphical development and debugging environment for parallel programs. Parallel Comput ():–
Luckey PH, Pittman RM () Improving software quality utilizing an integrated CASE environment. In: Proceedings of the IEEE national aerospace and electronics conference, Dayton, May , pp –
Norman RJ, Nunamaker JF Jr () Integrated development environments: technological and behavioral productivity perceptions. In: Proceedings of the annual Hawaii international conference on system sciences, Kailua-Kona, Jan , pp –
Van De Vanter ML, Post DE, Zosel ME () HPC needs a tool strategy. In: Proceedings of the second international workshop on software engineering for high performance computing system applications, ACM, May , pp –
Watson GR, DeBardeleben NA () Developing scientific applications using eclipse. Comput Sci Eng ():–
Parallelism Detection in Nested Loops, Optimal
Alain Darte
École Normale Supérieure de Lyon, Lyon, France
Synonyms Detection of DOALL loops; Nested loops scheduling; Parallelization
Definition Loops are a fundamental control structure in imperative programming languages. Being able to analyze, transform, and optimize loops is a key feature for compilers to handle repetitive schemes with a complexity proportional to the program size and not to the number of operations it describes. This is true for the generation of optimized software as well as for the generation of hardware, for both sequential and parallel execution. Exploiting parallelism is a difficult task that depends, among others, on the target architecture, on the structure of computations in the program, and on the way data are mapped, used, and communicated. In contrast, detecting parallelism that can be expressed as loops, i.e., transforming Fortran-like DO loops into DOALL loops (loops whose iterations are all independent) depends only on the dependences between computations in the program being analyzed. Intuitively, an algorithm is optimal for parallelism detection in loops, if it transforms the loops of a program so that each statement is surrounded, after transformation, by a maximal number of parallel (DOALL) loops. However, such a notion of optimality needs to be defined with care (see the discussion hereafter) to avoid inconsistent or inaccurate claims.
Discussion Optimal Parallelism Detection in Loops How to define optimal parallelism detection in loops? An optimality criterion solely based on the number of
parallel loops – thus not on the number of iterations – has no meaning for loops with constant bounds as they can always be unrolled or strip-mined. For example, it is true that hyperplane scheduling can always be applied on n perfectly nested loops with constant bounds, resulting in a code with one outer sequential loop and (n − 1) parallel loops. However, it does not mean that it exhibits any parallelism as, for a purely sequential code, each parallel loop has then a single iteration. Similarly, an optimality criterion should not lead to inconsistencies due to the way loops are generated, e.g., n nested loops with constant loop bounds can be fully unrolled leading to a code with no loop while applying loop tiling leads to 2n loops. One way to define optimality is with respect to the execution time of the parallelized code on an ideal PRAM-like machine, with an unlimited number of computation units, where any instruction takes a single unit of time. An algorithm for parallelism detection is then optimal if the corresponding code with DOALL loops has an optimal execution time for this ideal machine, which is the maximal length of a path in the dependence DAG (directed acyclic graph) obtained by fully unrolling the program. However, a full unroll of the code is too costly, often undesirable as it loses the loop structure, and sometimes impossible, e.g., when the loop bounds are parameterized. A more practical definition of optimality is to assume that each loop has a parameterized number of iterations, of order N, and to say that an algorithm is optimal if the execution time of the parallelized program on the ideal machine is O(N^d) where d is minimal. Optimality can then be proved by exhibiting a dependence path in the unrolled program whose length is not O(N^(d−1)). A more accurate definition of optimality can be given by applying the same reasoning for each statement, considering that any other statement takes no time on the ideal machine, i.e., does not count in the length of a dependence path. In general, algorithms for parallelism detection transform the code so that each statement is surrounded by the same number of loops before and after transformation. This is true, in particular, for algorithms based on unimodular transformations. In this case, one retrieves the intuitive notion of optimality which states that parallelism detection is optimal if each statement is surrounded, after transformation, by a maximal number of parallel loops. The only constraint that a
parallelism detection algorithm must respect is that the partial order of operations defined by the dependences in the program is preserved. How these dependences are computed and abstracted is not part of the algorithm; it is its input. In other words, the optimality of an algorithm can only be defined with respect to the dependence abstraction it uses. To show that it is optimal, one needs to exhibit a dependence path of the required length, not in the original program, but in its abstraction, i.e., a dependence path in an overapproximation of the actual dependence graph. A more complete discussion on the definition of optimality for parallelism detection algorithms is provided in [, , ]. The rest of this essay, with large parts borrowed from [], shows that it is possible to give, for each classic dependence abstraction (dependence level, uniform dependence vector, direction vector, dependence cone, dependence polyhedron), an algorithm which is optimal with respect to this abstraction. Moreover, this algorithm is a specialization of a more generic algorithm [], inspired by the decomposition of Karp, Miller, and Winograd for checking the computability of a system of uniform recurrence equations []. Notice that the previous optimality criterion makes no distinction between an outer parallel loop and an inner parallel loop. While a parallel loop can always be pushed down, i.e., interchanged with inner loops, the converse is not true. However, detecting parallel loops that contain sequential loops and detecting permutable loops (the base for loop tiling) are two similar problems that can be achieved by variations of the algorithms mentioned hereafter. But optimality for these goals is more difficult to define. Here is a summary of the main optimality results:

Dependence level: The algorithm of Kennedy and Allen [] is optimal, which implies that loop distribution is sufficient to detect maximal parallelism for dependence levels.

Uniform dependences: Hyperplane scheduling (as defined by Lamport []) is optimal for perfectly nested loops, which implies that unimodular transformations with one outermost sequential loop are sufficient for uniform dependences.

Direction vectors: An adaptation of the algorithm of Wolf and Lam [] (which detects permutable loops) is optimal, which implies that unimodular
transformations (loop interchange, loop reversal, loop skewing) are sufficient for direction vectors.

Dependence polyhedron: The algorithm of Darte and Vivien [] is optimal, which implies that unimodular transformations, combined with loop distribution and loop shifting, are sufficient for dependence polyhedra.

Affine dependences: The algorithm of Feautrier [] is optimal among all affine transformations, but these transformations are not sufficient to detect maximal parallelism for affine dependences.

This essay is organized as follows. The first section recalls the model of system of uniform recurrence equations (SURE) and the link between the computability and scheduling problems of such a system. The second section shows how, thanks to a uniformization of dependence distance abstractions, the detection of parallelism in nested DO loops can be transferred to the model of SURE so as to derive optimality results. The last section illustrates yet another connection: the multi-dimensional scheduling techniques used to analyze a SURE and to detect parallelism in nested DO loops can also be used to prove the termination of some WHILE loops.
The Organization of Computations in a System of Uniform Recurrence Equations In , Karp, Miller, and Winograd introduced a model (system of uniform recurrence equations or SURE) to describe a set of regular computations as mathematical equations []. The goal of their paper, entitled “The Organization of Computations for Uniform Recurrence Equations,” was to study the parallelism that such a description contains implicitly, motivated by “the recent [at this time] development of computers capable of performing many operations concurrently,” in particular for regular applications such as solving partial differential equations by finite-difference methods. This work was purely theoretical: the paper does not even have a conclusion section, which would possibly mention some applications. Nevertheless, it turned out to be very prophetic, when considering the large number of developments for which it served as foundations. For example, it has some connections with accessibility problems in vector addition systems and Petri nets. The developments of systolic arrays and of high-level synthesis from
P
systems of recurrence equations are directly inspired from it. Most scheduling methods developed for automatically parallelizing DO loops are based on it, directly or indirectly. As a by-product, the theory of unimodular transformations and the different parallelism detection algorithms [, , , , ] can be seen as extensions and/or specializations of the technique of Karp, Miller, and Winograd, adapted to specific dependence abstractions and objectives. Last but not least, it has also some connections with techniques to prove the termination of imperative programs, either with the insertion of counters or by the derivation of affine ranking functions [, , , ].
Definition of a SURE A SURE is a finite set of m equations of the form ai (p) = fi (ai (p − di ,i ), . . . , aimi (p − dimi ,i ))
()
where p ∈ Zn is called an iteration vector, di,j ∈ Zn is called a dependence vector, fi is a strict function with mi arguments whose properties are not considered (only the structure of computations is analyzed, not what they do). The value ai (p) has to be computed for all integral points p in a subset P of Zn , called the evaluation region. Values are supposed to be given at p − di,j ∉ P (input variables) wherever required for the evaluation at p ∈ P. The value ai (p) is said to depend on aij (p − dij ,i ), for ≤ j ≤ im . These dependence relations define a graph Γ, called the expanded dependence graph (EDG). Its set of vertices is {, . . . , m} × P and there is an edge from (j, q) to (i, p) if ai (p) depends on aj (q). A SURE can be equivalently defined by a weighted-directed multigraph G = (V, E, w), called the reduced dependence graph (RDG), as follows: ● ●
For each ai , there is a vertex vi in V. If ai (p) depends on aj (p − dj,i ), G has an edge e = (vj , vi ) in E from vj to vi with weight w(e) = dj,i .
A SURE is then defined by an RDG G = (V, E, w) and an evaluation region P. For example, the following system, to be evaluated for P = {(i, j) ∣ ≤ i, j ≤ N}, ⎧ ⎪ ⎪ a(i, j) = c(i, j − ) ⎪ ⎪ ⎪ ⎪ ⎨ b(i, j) = a(i − , j) + b(i, j + ) ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ c(i, j) = a(i, j) + b(i, j)
P
Parallelism Detection in Nested Loops, Optimal. Fig. Corresponding RDG for the SURE example
defines a SURE with three equations on a square of size N, corresponding to the RDG of Fig. .
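A small demand-driven evaluation of this system illustrates what computability means in practice: requesting a value recursively triggers the values it depends on, and the process terminates because no value (transitively) depends on itself. The Python sketch below does this for a small N, with all input values outside P arbitrarily set to 1 (an assumption made only for the illustration).

    from functools import lru_cache

    N = 6   # evaluation region P = {(i, j) | 1 <= i, j <= N}

    def inside(i, j):
        return 1 <= i <= N and 1 <= j <= N

    # Values requested outside P are inputs; they are arbitrarily set to 1 here.
    @lru_cache(maxsize=None)
    def a(i, j):
        return c(i, j - 1) if inside(i, j) else 1

    @lru_cache(maxsize=None)
    def b(i, j):
        return a(i - 1, j) + b(i, j + 1) if inside(i, j) else 1

    @lru_cache(maxsize=None)
    def c(i, j):
        return a(i, j) + b(i, j) if inside(i, j) else 1

    # The recursion terminates for every point of P because no cycle of the
    # RDG has zero weight; a zero-weight cycle would make some value depend
    # on itself and the evaluation would never end.
    print(c(N, N))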
Computability: Definition and Properties The definition of a SURE, either by Equation (), or by the reduced dependence graph, is implicit: to compute a value a_i(p), one must first compute all values that appear in the right-hand side of the definition of a_i(p). These values are computed the same way, by evaluating the right-hand side of their definition, and so on. Thus, two questions arise:

Computability problem: Does this recursive process end, i.e., are all values a_i(p) computable?

Scheduling problem: If the system is computable, how to organize the computations?

A schedule is a function θ from the vertices of the EDG Γ to N such that θ(j, q) < θ(i, p) whenever (i, p) depends on (j, q). A SURE is computable (or explicitly defined) if there exists a schedule. It is not computable if there exists a vertex (i, p) in the EDG Γ such that the length of a dependence path leading to it is unbounded. Then, either a_i(p) depends on itself (possibly through other computations) or it needs to wait “infinite” time before the complete evaluation of its right-hand side. In this case, because the in-degree in the EDG Γ is finite, there is an infinite path leading to (i, p). To make the analysis simpler, the following discussion focuses on (parametric) bounded evaluation regions. An RDG G is said to be computable if all SUREs defined from G on bounded evaluation regions are computable.

Theorem An RDG is computable if and only if (iff) it has no cycle of zero weight.
More precisely, the fact that G has no cycle of zero weight is a sufficient condition for a SURE defined from G to be computable, whatever the bounded evaluation region. However, it is a necessary condition only if the evaluation region is “sufficiently large” (see [, ]).

The case of a single equation
A uniform recurrence equation (URE) is defined by an evaluation region P ⊆ Z^n and an RDG G with a single vertex, i.e., by a single equation a(p) = f(a(p − d_1), . . . , a(p − d_m)). Let D be the n × m matrix whose columns are the dependence vectors. Consider the following sets:

    T(D) = {t ∈ Z^n ∣ tD ≥ 1}
    Q(D) = {q ∈ Z^m ∣ q ≥ 0, q ≠ 0, Dq = 0}

and T_Q(D) and Q_Q(D) their rational relaxations. G is computable iff it has no cycle of zero weight, i.e., Q(D) is empty. Farkas’ lemma [] shows the following result:

Theorem Q(D) = ∅ ⇔ Q_Q(D) = ∅ ⇔ T_Q(D) ≠ ∅ ⇔ T(D) ≠ ∅.

This property shows that checking if an RDG is computable can be done in polynomial time by checking that T_Q(D) is nonempty, i.e., that the cone generated by the dependence vectors is strictly included in a half-space. Each t ∈ T(D) corresponds to a separating hyperplane and a schedule θ_t defined by θ_t(p) = t.p + K, where t.p is the vector product and K = − min_{p∈P} t.p. Indeed, if q depends on p, q = p + d_i, then θ_t(q) = t.q + K = t.p + t.d_i + K > θ_t(p). Thus, an RDG is computable iff there is an affine schedule, i.e., a way of computing the URE by a regular schedule, whose definition does not depend on the evaluation region P.

Given a vector t ∈ T_Q(D), the latency of the schedule θ_t, i.e., the total number of sequential steps it induces, can be rounded to L(θ_t) = max_{p,q∈P} t.(p − q). Finding the “fastest” linear schedule means solving a min-max optimization problem L_min = min{L(θ_t) ∣ t ∈ T_Q(D)}. If P is a polytope {p ∣ Ap ≤ b}, then L(θ_t) = max{t.(p − q) ∣ Ap ≤ b, Aq ≤ b}. The duality theorem of linear programming [] leads to:

    L(θ_t) = max{t.(p − q) ∣ Ap ≤ b, Aq ≤ b} = min{(t_1 + t_2).b ∣ t_1, t_2 ≥ 0, t_1 A = t, t_2 A = −t}
    L_min = min{(t_1 + t_2).b ∣ t_1, t_2 ≥ 0, t_1 A = −t_2 A = t, tD ≥ 1}
Solving the previous linear program gives a way to produce a vector t in T_Q(D) from which a fast schedule can be built. Its performance can be characterized as follows:

Theorem If the evaluation region P is sufficiently large, the difference between L_min (the latency of the fastest affine schedule) and the longest dependence path in Γ (the latency of the fastest schedule) is bounded by a constant that does not depend on the domain size.

Together, the two previous theorems show that only two cases can occur for a URE defined on bounded regions: either the URE is not computable as soon as the evaluation region is large enough, or there is an affine schedule. In the latter case, for a URE defined on polyhedra {Ap ≤ Nb}, the length of the longest path is kN + O(1) for some positive rational k, and an affine schedule with latency kN + O(1) can be derived. See more details in [].
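As a concrete (if naive) illustration of the condition tD ≥ 1, the following sketch searches small integer vectors by brute force instead of solving the linear program; any vector found yields a valid affine schedule θ_t(p) = t.p + K. The dependence vectors are made up for the example.

    from itertools import product

    # Dependence vectors of a URE (the columns of D); invented for this example.
    D = [(0, 1), (1, 0), (1, -1)]

    def separates(t):
        # t belongs to T(D) iff t.d >= 1 for every dependence vector d.
        return all(sum(ti * di for ti, di in zip(t, d)) >= 1 for d in D)

    # Brute-force search over small integer vectors, standing in for the LP.
    candidates = [t for t in product(range(-3, 4), repeat=2) if separates(t)]
    print(candidates)   # [(2, 1), (3, 1), (3, 2)]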
The case of several equations
For one equation, a vector q ∈ Q(D) can be interpreted directly as a cycle in the RDG since all edges are connected to the same vertex: they can be used in any order. For several equations, it is more difficult to express a cycle by linear constraints and to ensure that edges are traversed in a specific order. To detect cycles of zero weight in G = (V, E, w), the key is to consider G′, the subgraph of zero-weight multicycles (unions of cycles), i.e., the subgraph of G generated by the edges that belong to a union of cycles whose total weight is zero. The following theorem identifies the links between G and G′.

Theorem (a) G contains a zero-weight cycle iff its subgraph G′ does. (b) G contains a zero-weight cycle iff one of its strongly connected components (SCCs) does. (c) If G′ is strongly connected, G has a zero-weight cycle.

These properties give the hint for solving the problem with a recursive search. To detect a zero-weight cycle in G, it is sufficient to consider each SCC of G′. If G′ is empty, G has no zero-weight multicycle, thus no zero-weight cycle. If G′ has more than one SCC, then G has a zero-weight cycle iff at least one SCC has a zero-weight cycle. It remains to solve the terminating case, i.e., when G′ is strongly connected, in which case G has a zero-weight cycle. Finally, this leads to the decomposition of Karp, Miller, and Winograd.
P
(Decomposition of Karp, Miller, and Winograd) Boolean KMW(G): Build the subgraph G′ of zero-weight multicycles in G. ● Compute G′ , …, G′s , the s SCCs of G′ . – If s = , G′ is empty, return TRUE. – If s = , G′ is strongly connected, return FALSE. – Otherwise return ∧i KMW(G′i ) (logical AND). ●
Then, G is computable iff KMW(G) returns TRUE.
Consider the SURE whose RDG is depicted in Fig. . The two edges (a, b) and (b, c) cannot belong to a zero-weight multicycle, since the weight of any multicycle that traverses them has a positive first component. The self-dependence on b and the cycle formed by the edges (a, c) and (c, a) define a zero-weight multicycle, thus G′ is the subgraph of G obtained by deleting the two edges from (a, b) and (b, c) (see Fig. ). It has two SCCs. For both, the subgraph of zero-weight multicycle is empty, thus the decomposition stops: both SCCs are computable, thus G′ is computable, and finally G is computable too. Let us now focus on the construction of the subgraph G′ . The cycle vector associated to a cycle in G is a vector q, with ∣E∣ components, such that qe is the number of times the edge e is traversed in the cycle. The connection matrix is a ∣V∣ × ∣E∣ matrix C such that Cv,e = (resp. Cv,e = −) if the edge e leaves (resp. enters) vertex v, and Cv,e = otherwise. A vector q ≥ such that Cq = represents a union of cycles, which is a cycle if the subgraph of G generated by the edges e such that qe > is connected. Let W be the n × ∣E∣ weight
0 0 a
c 0 1 1 0
b
0 0
0 −1
Parallelism Detection in Nested Loops, Optimal. Fig. As dotted lines, edges that do not belong to G′
P
P
Parallelism Detection in Nested Loops, Optimal
matrix whose columns are the edges weights of G. Then, the edges of G′ are exactly the edges e for which ve = in any optimal solution of the following linear program: min{ ∑e ve ∣ q ≥ , v ≥ , q + v ≥ , Cq = , Wq = }
Now, to better understand what is behind this linear program, let us interpret its dual. After algebraic manipulations, it can be written as: max{ ∑e ze ∣ ≤ z ≤ , t. w(e) + ρ y e − ρxe ≥ ze , ∀e ∈ E }
where e goes from xe to ye . The complementary slackness theorem [] shows interesting properties in the dual. Theorem For any optimal solution (z, t, ρ) of the dual: e ∈ G′ ⇔ t. w(e) + ρ y e − ρ xe = ′
e ∉ G ⇔ t. w(e) + ρ ye − ρ xe ≥
() ()
Theorem shows how the constraints t. D ≥ obtained for linear scheduling in the case of a URE (remember the set T(D) in Theorem ) can be generalized to the case of a SURE. For a URE, t was interpreted as a vector normal to a hyperplane that separates the space into two half-spaces, all dependence vectors being strictly in the same half-space. Here, the vector t defines (up to a translation given by the constants ρ) a hyperplane which is a strictly separating hyperplane for the edges not in G′ , see Inequality (), and a weakly separating hyperplane for the edges in G′ , see Equality (). Furthermore, for each subgraph G that appears in the decomposition, t defines a hyperplane that is the “most often strict”: the number of edges, for which such a hyperplane is strict, is maximal (since ∑e ze is maximal). See more details in []. Define the depth d of G = (V, E, w) as the maximal number of recursive calls generated by the initial call KMW(G) (counting the first one), except if G is acyclic, in which case d = . The depth d is a measure of the parallelism described by G: it is related both to the length of the longest paths (intrinsic sequentiality) and to the minimal latency of particular schedules, called
shifted-linear multi-dimensional schedules, i.e., mappings from V × Zn to Zd . The execution order is the lexicographic order ≤lex on vectors of dimension d: the components of the schedule can be interpreted as hours, minutes, seconds, …, described by nested loops starting from the outermost one. For each v ∈ V, involved in dv recursive calls, a sequence of vectors tv , …, tvd v and of constants ρ v , …, ρ vdv can be built by considering the dual program during the decomposition algorithm. The two sequences can be completed with zeros, if needed, to get sequences of length d. Theorem Let G = (V, E, w) be a computable, strongly connected RDG of depth d. If the evaluation region is a n-dimensional cube of size N, the mapping defined by θ(v, p) = (tv . p + ρ v , . . . , tvd v . p + ρ vdv , , . . . , ) defines a multi-dimensional schedule with latency O(N d ). Furthermore, the associated EDG Γ contains a dependence path of length Ω(N d ), whose projection onto G visits Ω(N d v ) times each vertex v ∈ V. A dependence path of length Ω(N d ) can be built in Γ, following the hierarchical structure of G′ , by traversing order N times a cycle that visits all vertices of G and by plugging, during this traversal, each SCC of G′ order of N times, in a recursive manner. Consider the example of Fig. again. Go from a(, ) to a(, N −), following (N − ) times the cycle between a and c. Then, go to b(, N − ) following the edge (a, b), and to b(, ) following (N − ) times the self-loop on b. Finally, go to c(, ) and a(, ) following the edges (b, c) and (c, a) once. This makes a path of length (N −)++(N-)+ = N. This pattern can be repeated (N − ) more times, leading to a(N, N), for a path of length N . In terms of scheduling, solving the constraints of Theorem shows that a(i, j) can be computed at time step (i + , j), b(i, j) at step (i, −j), and c(i, j) at step (i + , j + ). Indeed, for the first level of the decomposition, the constraints amount to look for a vector t such that t. (, ) = t. (, −) = (for the two cycles of G′ ) and t. (, ) ≥ (for the cycle (a, b, c, a) with two edges not in G′ ), which leads, e.g., to t = (, ) and suitable constants ρ a , ρ b , and ρ c . The second level leads to t = (, ) for a and c, and t = (, −) for b. This explicit schedule
Parallelism Detection in Nested Loops, Optimal
corresponds to the code below. It is purely sequential (dimensional schedule, for a dependence of length order N , in a square of size N). In general, if a statement is surrounded in the initial code by n loops and scheduled with a d-dimensional schedule, it will be surrounded by d sequential loops and (n − d) parallel loops in the resulting code. DO i=, N DO j=N, , - b(i,j) = a(i-,j) + b(i,j+) ENDDO DO j=, N a(i,j) = c(i,j-) c(i,j) = a(i,j) + b(i,j) ENDDO ENDDO
Loop Transformations and Automatic Loop Parallelization Linear programming methods, optimizations on polytopes, manipulations of integral matrices, are now commonly used in the field of automatic parallelization and program transformations, in particular for imperative codes with DO loops. The key is to represent, analyze, and transform loops without unrolling them, in an abstract way, thanks to polyhedral representations and transformations. This way, compilation methods can be developed with a complexity that depends on the (textual) size of the program and not on the number of operations it describes. DO loops are indeed, as SUREs, a condensed way for representing repetitive computations. Such an approach started in when Lamport introduced the hyperplane method [] to parallelize perfectly nested loops. Similar techniques were applied for the automatic synthesis of systolic arrays from uniform recurrence equations. In , Feautrier introduced PIP [], a software tool for parametric (integer) linear programming, and he demonstrated its interest for dependence analysis [], loop parallelization [], code generation [], etc. This work initiated many more developments based on polytopes for detecting dependences, scheduling computations,
P
mapping data and communications, generating code for computations and communications, etc. The following presentation, borrowed from [], recalls the link between the decomposition of Karp, Miller, and Winograd, and different algorithms for transforming (sequential) DO loops into DOALL loops, i.e., loops whose iterations can be computed in any order, in particular in parallel. These algorithms perform high-level source-to-source transformations, in the same way a user inserts parallelization directives in OpenMP. Exploiting the parallelism is another story, which requires data and communication optimizations, depending on the target architecture.
Representation of DO Loops Loop transformations apply to codes defined by nested loops, for which the control structure is simple enough to be captured with polytopes. Each loop has its own loop counter that takes, if the loop step is , any integer value from the lower to the upper bound, which are defined by affine expressions of the surrounding loop counters. The iterations of n perfectly nested loops can thus be represented by an iteration domain, set of all integer vectors in a polyhedron. When running the program, each statement S is executed for each value of the surrounding loop counters, represented by an iteration vector p. Such an execution is an operation, denoted S(p). Thus, as for SUREs, nested loops have an evaluation region, the iteration domain. However, unlike SUREs, the schedule is explicit and defines the semantics, while the dependences are implicit and must be precomputed. Indeed, because each inner loop is scanned for each iteration of an outer loop, the operations S(p) are carried out in the predefined sequential order <seq , given by the lexicographic order defined on iteration vectors plus the textual order: S(p) <seq T(q) ⇔ (̃ q) or (̃ p =̃ q and S
P
P
Parallelism Detection in Nested Loops, Optimal
dependence graph (EDG). To keep the program semantics, code transformations should preserve this partial order. In general, instead of representing all pairs (S(p), T(q)) for which S(p) ⇒ T(q), the dependences are approximated by a reduced dependence graph (RDG), with one vertex per statement, where the weight w(e) of each edge e describes a set De of dependence distances ̃ q−̃ p, in a conservative way: if S(p) ⇒ T(q) in the EDG, then there exists e = (S, T) in the RDG such that ̃ q −̃ p ∈ De . In other words, the RDG describes a superset of the EDG called apparent dependence graph (ADG). All loop transformations algorithms have to respect the dependences in the ADG. Thus, their properties, in particular their optimality for detecting parallelism, need to be analyzed with respect to the ADG (and not the EDG, which is not provided), i.e., with respect to the dependence abstraction used.
Approximations of Distances: Dependence Level and Direction Vector The simplest way to represent a dependence distance is to use the abstraction by dependence level. A dependence between S(p) and T(q) is loop-independent if it occurs for a fixed iteration of each loop surrounding both S and T (i.e., ̃ q=̃ p). Otherwise, it is loop-carried and its level is the index of the first nonzero component of ̃ q −̃ p. Then, all iterations of a loop L at depth k can be executed in any order, i.e., L is parallel, if there is no dependence at level k in the RDG that corresponds to the code surrounded by L. The main idea of Allen and Kennedy’s parallelization algorithm [] is to use loop distribution to reduce the number of statements within a loop, and thus the number of potential dependences. Briefly speaking, loop distribution separates, in different loops, the statements of the different SCCs of the RDG. Then, each SCC is treated separately and, according to the dependence levels, the outermost loop is marked as a DOALL or DOSEQ loop. Inner loops are treated the same way, recursively. Here is a sketch of the algorithm (different but equivalent to the original formulation). The initial call is AK(G, ), where G is the RDG with dependence levels. A careful analysis [] reveals that this algorithm is nothing but the decomposition of Karp, Miller, and Winograd, applied to an RDG with levels, except that parallel loops are generated at the outermost level,
(Algorithm of Allen and Kennedy) AK(G, k): ● ● ●
Remove from G all edges of level < k. Compute the SCCs of G. For each SCC C in topological order, do: – If C is reduced to a single statement S, with no edge, generate DOALL loops in all remaining dimensions, and generate code for S. – Else ● Let l be the minimal dependence level in C. ● Generate DOALL loops from level k to level l − , and a DOSEQ loop for level l. ● Call AK(C, l + ).
when possible. Indeed, if schedules are searched as in Theorem , considering that w(e) corresponds to any distance vector represented by the dependence level of the edge e, then the only valid schedules are the elementary schedules that correspond to loop distribution/parallelization. This can also be understood with the uniformization principle mentioned in the next section. Moreover, using a proof technique similar to the one used in Theorem , one can even prove the following optimality result. Theorem Allen and Kennedy’s algorithm is optimal for parallelism extraction in an RDG labeled with dependence levels. Here, the optimality means the following. Let dS be the number of DOSEQ loops generated around a statement S. Assume that each loop has order N iterations. Then, it is possible to build, in the ADG corresponding to the RDG, a dependence path that visits Ω(N dS ) times the statement S. It is even possible to build a code, with the same RDG, whose EDG contains such a path. In other words, without any other information, there is no way to extract more parallelism. Another popular dependence abstraction is the direction vector whose components belong to Z ∪ {∗, +, −}∪(Z×{+, −}). Its ith component is an approximation of the ith components of all possible distance vectors: it is equal to z+ (resp. z−) if all ith components are at least (resp. at most) z ∈ Z. It is equal to ∗ if the ith component may take any value and to z ∈ Z if it takes the unique value z. The notation + (resp. −) is a shortcut for + (resp. (−)−. A dependence of level k corresponds
Parallelism Detection in Nested Loops, Optimal
to a direction vector (, . . . , , +, ∗, . . . , ∗) where + is the k-th component. Unlike the dependence level, the direction vector gives information on all dimensions of the distance vectors. But, it is still not powerful enough to express relations between different components. When loops are perfectly nested and direction vectors are constant, loops are called uniform. Since dependences are directed according to the sequential order, the direction vector is always lexicopositive, its first nonzero component is positive, and Theorem applies: there is a linear schedule, and the code can always be rewritten with one outer sequential loop (which carries all dependences) surrounding parallel loops. Lamport’s hyperplane method [] is just a different way to build a linear schedule, without requiring linear programming. To reduce the number of sequential steps, Theorem can be applied too. Variants in which not all dependences are carried by the outermost loop lead, among others, to more subtle NP-complete retiming problems (see []). Finally, a more powerful abstraction of dependence distances is to represent them by a dependence polyhedron, defined by a set of vertices, a set of rays, and a set of lines. A direction vector is a particular polyhedral representation. For example, the direction vector (+, ∗, (−)−, ) defines the polyhedron with one vertex (, , −, ), two rays (, , , ) and (, , −, ), and one line (, , , ). Thanks to a uniformization principle, an RDG with dependence polyhedra can be reinterpreted in terms of SUREs as recalled in the next section.
Uniformization Principle: From Dependence Polyhedra to SUREs Consider the following code example: DO i=, N DO j=, N a(i,j) = a(i,j-) + a(j,i) ENDDO ENDDO
It has three dependences: a uniform flow dependence of distance (, ) from S(i, j) to S(i, j + ), a flow dependence from S(i, j) to S(j, i) if i < j, and an anti-dependence from S(j, i) to S(i, j) if j < i. The latter two dependences correspond to the same dependence
P
distances and can be combined. The uniform dependence has level and the combined dependence has level , thus the algorithm of Allen and Kennedy cannot find any parallelism. The corresponding direction vectors are equal to (, ) and (+, −). In the second dimension, the “” and the “−” are incompatible and prevent the detection of parallelism. However, there is a linear schedule θ(i, j) = i + j, which leads to a transformed code with one parallel loop: DO j = , N DOALL i=max(,⌈ j−N ⌉), min(N, ⌊ j− ⌋) a(i,j-i) = a(j-i,i) + a(i,j-i-) ENDDOALL ENDDO
To find this program transformation, one can notice that the set of distance vectors {(j − i, i − j) ∣ ≤ j − i ≤ N − } can be (over)-approximated by D = {(, −) + λ(, −) ∣ λ ≥ }, i.e., a polyhedron with one vertex v = (, −) and one ray r = (, −). Now, as for Theorem , consider t such that t. d ≥ for any dependence vector d. Thus, t. (, ) ≥ and t. d ≥ for all d ∈ D. The latter inequality is equal to t. (, −) + λt. (, −) ≥ with λ ≥ , which is equivalent to t. (, −) ≥ and t. (, −) ≥ , i.e., t. v ≥ and t. r ≥ . Therefore, a valid linear schedule is defined by a vector t that satisfies the three inequalities t. u ≥ , t. v ≥ , t. r ≥ , which leads, as desired, to t = (, ) and θ(i, j) = i + j. What is important here is the “uniformization” principle, which transforms an inequality on D into uniform inequalities on v and r. In terms of dependence path, this amounts to consider an edge e, labeled by the distance vector p = v + λr, as a path that uses once the “uniform” vector v and λ times the “uniform” vector r. Now, the dependence vectors are not necessarily lexicopositive anymore (e.g., a ray can be equal to (, −)). Thus, the uniformized dependence graph looks more like the RDG of a SURE than the RDG of a uniform loop nest. However, the constraint imposed on a ray r is weaker, it is t. r ≥ instead of t. r ≥ , and t. l = for a line l. This freedom must be taken into account in the parallelization algorithm. This leads to the algorithm of Darte and Vivien, devoted to an RDG with dependence polyhedra []. The technique is simple, it consists in uniformizing the dependences so as to transform the initial RDG into the RDG of a SURE, with some new vertices emulating rays and lines. Despite the
P
P
Parallelism Detection in Nested Loops, Optimal
fact that these new vertices must be considered in a special way, all results obtained for SUREs, such as Theorems and , and the structure of the subgraph G′ , can then be transferred to provide an optimal algorithm for parallelism detection in an RDG with dependence polyhedra. Furthermore, thanks to this uniformization principle, the algorithm can be specialized to a particular dependence abstraction, such as dependence level or direction vector, while keeping optimality with respect to the dependence abstraction. For the dependence level abstraction, it is the algorithm of Allen and Kennedy. For direction vectors and a single instruction (or block of instructions considered atomically), it leads to a variant of the algorithm of Wolf and Lam that would generate parallel loops instead of permutable loops (to find permutable loops, one needs to analyze the cone generated by the cycle weights, those in G′ form the vector space of this cone). Figures and depict the uniformized RDGs for the previous example when the non-uniform dependence is represented by a direction vector (+, −) and a dependence polyhedron along (, −). In the first case, two self-loops on the new vertex b, of weight (, ) and (, −), are introduced, resulting in a nonempty subgraph G′ , and no parallelism. In the second case, the self-loop on b has weight (, −) and the RDG has no multicycle of zero weight, and thus contains some parallelism. Similarly, the graph of Fig. is the uniformized
0 1
a
0 0
1 0 b 0 −1
1 −1
Parallelism Detection in Nested Loops, Optimal. Fig. Uniformized graph for direction vectors
0 1
a
0 0 1 −1
b
1 −1
Parallelism Detection in Nested Loops, Optimal. Fig. Uniformized graph for polyhedron vector
RDG for the following code, where b is the vertex introduced by the uniformization: DO i=, N DO j=, N a(i,j) = c(i,j-) c(i,j) = a(i,j) + a(i-,N) ENDDO ENDDO
Going Beyond, with the Affine Form of Farkas Lemma So far, the discussion focused, as in Theorem , on loop transformations based on multi-dimensional scheduling functions (with the lexicographic order) called shifted-linear, i.e., of the form θ(v, p) = (tv . p + ρ v , . . . , tvdv . p + ρ vdv , , . . . , ) where, for all vertices of the same SCC encountered at depth i of the decomposition, the linear part tvi is the same. Different constants (similar to retiming []) can be used for different statements however. The fact that the linear part is the same made life easier. Indeed, to get valid scheduling functions, one had to solve inequalities of the form θ(T, q)−θ(S, p) ≥ є (є = or as in Theorem ), i.e., t. (q − p) + ρ T − ρ S ≥ є. If the dependence distance q−p is constant, one directly ends up with a system of linear inequalities. Otherwise, as just showed, the set of all q − p can be approximated by a polyhedron and linear inequalities involving the vertices, rays, and lines of this polyhedron are obtained. Now, what if even more general functions are searched, i.e., affine functions with a different linear part for each statement? This is the approach of Feautrier []. With θ(S, p) = tS . p + ρ S , the constraints that need to be solved are then of the form tT . q − tS . p + ρ T − ρ S ≥ є for all p and q such that S(p) ⇒ T(q). The number of inequalities depends on the number of p and q, which is not practical. However, if the set of pairs (p, q) such that S(p) ⇒ T(q) can be described by a polyhedron, the affine form of Farkas lemma can be used to simplify the inequalities. This lemma states that c. p ≤ δ for all vectors p in a polyhedron {p ∣ Ap ≤ b} iff c = y. A for some vector y ≥ such that y. b ≤ δ. With this mechanism, an inequality involving all (p, q) in a polyhedron can be replaced by a finite set of inequalities. The affine form of Farkas lemma is the key tool for writing inequalities in Feautrier’s algorithm, which generates general multi-dimensional affine scheduling
Parallelism Detection in Nested Loops, Optimal
functions. The skeleton of the algorithm itself is similar to the decomposition of Karp, Miller, and Winograd: trying to find a function for which as many dependences as possible are satisfied. As for previous algorithms, some optimality result can be formulated, but of a different nature. For affine dependences, i.e., if p is expressed as an affine function of q, for q in a polyhedron, whenever S(p) ⇒ T(q), optimal parallelism detection requires more than affine functions, in particular index splitting, i.e., piecewise affine functions. Thus Feautrier’s algorithm cannot be optimal with respect to the dependence abstraction it was designed for. However, among all affine functions, it can find one with the “right” parallelism extraction, in other words, the algorithm is optimal with respect to the class of functions it considers []. Extensions to detect outer parallel loops, to derive permutable loops for tiling, and to generate codes where not all dependences are carried have also been proposed, see, e.g., [, ].
Multi-dimensional Affine Ranking Functions and Program Termination Recall the graph of Fig. . It is the RDG of a SURE with three equations, a, b, c. Starting from such a description of repetitive computations, with explicit evaluation region, implicit schedule, and explicit dependences, an explicit schedule for it was derived. As seen in the previous section, this RDG can also be interpreted as the uniformized RDG of two Fortran-like nested loops where b is a dummy vertex added to emulate the (, −) direction vector. In this case, from a description of repetitive computations, with explicit iteration domain and explicit initial schedule (the sequential order), dependences that were implicit are first computed and abstracted. Then, another schedule is derived that respects the dependences and expresses, possibly, some parallelism. Now, consider the following C-like code example: y = ; x = ; while (x ≤ N and y ≤ N) { if (unknown) { x = x + ; while (y ≥ and unknown) y = y − ; } y = y + ; }
P
This code is yet another description of repetitive computations. Here, the schedule is explicit, it is the sequential schedule. However, loop counters are not specified, no iteration domain is specified, and the program may not terminate. The program is controlled by a parameter N and the integer variables x and y whose values are modified by the program. Now, the RDG of Fig. depicts, not the dependences between computations, but how integer variables, implied in the program control, evolve. Each vertex corresponds to a state (program point + values of variables), edges represent transitions, i.e., modifications of variables. Again, this leads to a model of computations similar to the model of SUREs, for which the techniques and results previously exposed can be useful. In this example, the program can be proved to terminate, after performing O(N ) operations.
Integer Interpreted Automata and Invariants The following presentation is borrowed from []. To prove the termination of an imperative program, a standard approach is to transform it into an affine integer interpreted automaton (K, n, kinit , T ) defined by () a finite set K of control points, () n integer variables represented by a vector x of size n, () an initial control point kinit ∈ K, and () a finite set T of -tuples (k, g, a, k′ ), called transitions, where k ∈ K (resp. k′ ∈ K) is the source (resp. target) control point. The guard g : Zn ↦ B = {true, false} is a logical formula expressed with affine inequalities Gx + g ≥ and the action a : Zn ↦ Zn assigns, to each variable valuation x, a vector x′ of size n, expressed by an affine expression x′ = Ax + a. The guard g in the transition t = (k, g, a, k′ ) gives a necessary condition on variables x to traverse the transition t from k to k′ , and to apply its corresponding action a. If, for two transitions going out of k, the guards describe non-disjoint conditions, the automaton expresses some non-determinism. To approximate non-affine or non-analyzable assignments in the program, the link between x and x′ can be described by affine relations instead of functions, which introduces another form of non-determinism. Unlike for SUREs, this non-determinism can be unbounded, i.e., a single transition can give rise to an unbounded number of “successors” x′ for a given x.
P
P
Parallelism Detection in Nested Loops, Optimal
Unlike for DO loops and SUREs where the range of iteration vectors is explicitly defined, with the iteration domain and the evaluation region respectively, here, the set of all possible values for x at control point k, denoted Rk , is implicit and hard to compute exactly. However, it is possible to over-approximate Rk by an invariant at control point k, i.e., a formula true for all reachable states (k, x). Polyhedral invariants can be computed with abstract interpretation techniques, widely studied since the seminal paper of Cousot and Halbwachs []. In this case, Rk is over-approximated by the integer points within a polyhedron Pk , which represents all the information on the variables at control point k that can be deduced from the program by state-of-the-art analysis techniques.
Termination and Ranking Functions Invariants can only prove partial correctness of a program. The standard technique for proving termination is to consider a ranking function to a well-founded set, i.e., a set W with a (possibly partial) order ⪯ (the notation a ≺ b means a ⪯ b and a ≠ b) with no infinite descending chain, i.e., no infinite sequence (xi )i∈N with xi ∈ W and xi+ ≺ xi for all i ∈ N. More precisely, a ranking is a function ρ : K × Zn → W, from the automaton states to a well-founded set (W, ⪯), whose values decrease at each transition t = (k, g, a, k′ ): (x ∈ Rk ) ∧ (g(x) = true) ∧ (x′ = a(x)) ⇒ ρ(k′ , x′ ) ≺ ρ(k, x)
()
The ranking is said to be affine if it is affine in the second parameter (the variables). It is one-dimensional if its co-domain is (N, ≤) and d-dimensional (or multidimensional of dimension d) if its co-domain is (Nd , ⪯d ), where the order ⪯d is the standard lexicographic order on integer vectors. Obviously, the existence of a ranking function implies program termination for any valuation v at the initial control point kinit . A well-known property is that an integer interpreted automaton terminates for any initial valuation if it has a ranking function. Furthermore, if it terminates and has bounded non-determinism, there is a one-dimensional ranking function (but it is not necessarily affine). The problem is now very similar to the scheduling problem described for SUREs and for DO loops, except that dependences are considered in the opposite direction. In the same way, affine multi-dimensional ranking functions can be derived.
Considering rankings with d > is mandatory to be able to prove the termination of programs that induce a number of transitions, i.e., a trace length, more than linear in the program parameters. Considering rankings with a different affine function for each control point also extends the set of programs whose termination can be determined, compared, e.g., to shifted-linear rankings or to the technique of [].
A Greedy Complete Polynomial-Time Procedure A ranking function ρ of dimension d needs to satisfy two properties. First, as ρ has co-domain Nd , it should assign a nonnegative integer vector to each relevant state: x ∈ Pk ⇒ ρ(k, x) ≥ (component-wise)
()
Second, it should decrease on transitions. Let Qt be the polyhedron giving the constraints of a transition t = (k, g, a, k′ ), i.e., x ∈ Pk , g(x) is true, and x′ = a(x). Qt is built from the matrices A and G, and the vectors a and g. For an automaton whose actions are general affine relations, Qt is directly given by the action definitions. With Δ t (ρ, x, x′ ) = ρ(k, x) − ρ(k′ , x′ ), Inequality () then becomes: (x, x′ ) ∈ Qt ⇒ Δ t (ρ, x, x′ ) ≻d
()
which means Δ t (ρ, x, x′ ) ≠ and its first nonzero component is positive. If this component is the i-th, the level of Δ t (ρ, x, x′) is i. A transition t is said to be (fully) satisfied by the i-th component of ρ (or at dimension i) if the maximal level of all Δ t (ρ, x, x′ ) is i. To build a ranking ρ, the same greedy mechanism as in [, , ] can be used. The components of ρ, functions from K×Zn to N, are built from the first one to the last one. For a component σ of ρ and a transition t not yet satisfied by one of the previous components of ρ, the following constraint is considered: (x, x′ ) ∈ Qt ⇒ Δt (σ , x, x′ ) ≥ є t with ≤ є t ≤
()
and a ranking is selected for which as many transitions as possible have є t = , i.e., are now satisfied. Again, inequalities such as () are captured thanks to the affine form of Farkas lemma. The algorithm itself has the same structure as the decomposition of Karp, Miller, and Winograd, and of Feautrier’s algorithm.
Parallelism Detection in Nested Loops, Optimal
(Generation of a multi-dimensional affine ranking) : : :
: : : : :
i = ; T = T ; ▷ Initialize T to all transitions while T is not empty do Find a D affine function σ and values є t such that all inequalities () and () are satisfied and as many є t as possible are equal to ; ▷ This means maximizing ∑t∈T є t Let ρ i = σ ; i = i + ; ▷ σ defines the i-th component of ρ If no transition t with є t = , return false ▷ No multi-dimensional affine ranking. Remove from T all transitions t such that є t = ; ▷ The transitions have level i end while; d = i; return true; ▷ d-dimensional ranking found
Since nonterminating programs exist, there is no hope of proving that a ranking function always exists. Moreover, there are terminating affine interpreted automata with no multi-dimensional affine ranking. Thus, what can be proved is only that, if a multi-dimensional affine ranking exists, the algorithm finds one, i.e., it is complete for the class of multidimensional affine rankings. Also, as the sets Rk are over-approximated by the invariants Pk , completeness has to be understood with respect to these invariants, which means that if the algorithm fails when an affine ranking exists, it is because invariants are not accurate enough. Theorem If an affine interpreted automaton, with associated invariants, has a multi-dimensional affine ranking function, then the greedy algorithm finds one. Moreover, the dimension of the generated ranking is minimal.
Conclusion In [], Karp, Miller, and Winograd introduced several new concepts and techniques that gave rise to important developments in the context of loop transformations and program analysis. The key is to represent a repetitive scheme of computations, even infinite, by a finite structure, the reduced dependence graph (and its variants). This allows a compiler to manipulate a program in a time that depends on the structure of the code but not on the number of operations that it describes. In other words, there is no need to unroll loops to understand what they do. Linear programming
P
techniques and polyhedral representations can be used to analyze and optimize such programs, in a parametric way. This essay mentioned three related problems: determining if a system of uniform recurrence equations is computable, transforming DO loops so as to reveal parallel loops, and proving the termination of programs with IFs and WHILE loops, thanks to affine ranking functions. The link with program termination is still to be explored. Indeed, if the derivation of affine rankings is similar to the derivation of affine schedules, there are many subtle differences, in particular concerning the underlying iteration domains (invariants). One of the most challenging problems is to derive piecewise affine rankings to prove the termination of many more programs. For the detection of parallelism, deriving parallel codes with more data reuse and a better handling of memory transfers is still a challenge.
Related Entries Dependence Abstractions Dependence Analysis Dependences Loop Nest Parallelization Loops, Parallel Parallelization, Automatic Polyhedron Model Scheduling Algorithms Tiling Unimodular Transformations
Bibliography . Alias C, Darte A, Feautrier P, Gonnord L () Multidimensional rankings, program termination, and complexity bounds of flowchart programs. In th International Static Analysis Symposium (SAS’). Lecture notes in computer science, vol . Springer Verlag, Perpignan, pp – . Allen JR, Kennedy K () Automatic translation of Fortran programs to vector form. ACM Trans Program Lang Syst (): – . Bondhugula U, Baskaran MM, Krishnamoorthy S, Ramanujam J, Rountev A, Sadayappan P () Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In Compiler Construction (CC’). Lecture notes in computer science, vol . Springer Verlag, pp – . Collard J-F, Feautrier P, Risset T () Construction of DO loops from systems of affine constraints. Parallel Process Lett (): –
P
P
Parallelization
. Colón MA, Sipma HB () Practical methods for proving program termination. In th International Conference on Computer Aided Verification (CAV). Lecture notes in computer science, vol . Springer Verlag, pp – . Cousot P, Halbwachs N () Automatic discovery of linear restraints among variables of a program. In th ACM Symposium on Principles of Programming Languages (POPL’). ACM, Tucson, pp – . Darte A () Understanding loops: The influence of the decomposition of Karp, Miller, and Winograd. In th ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE’). IEEE Computer Society, Grenoble, pp – . Darte A, Huard G () Complexity of multi-dimensional loop alignment. In th International Symposium on Theoretical Aspects of Computer Science (STACS’), vol . Springer Verlag, pp – . Darte A, Khachiyan L, Robert Y () Linear scheduling is nearly optimal. Parallel Process Lett ():– . Darte A, Robert Y, Vivien F () Scheduling and Automatic Parallelization. Birkhauser. ISBN --- . Darte A, Vivien F () Revisiting the decomposition of Karp, Miller, and Winograd. Parallel Process Lett ():– . Darte A, Vivien F () On the optimality of Allen and Kennedy’s algorithm for parallelism extraction in nested loops. J Parallel Algorithms Appl (–):– . Darte A, Vivien F () Optimal fine and medium grain parallelism detection in polyhedral reduced dependence graphs. Int J Parallel Program ():– . Feautrier P () Parametric integer programming. RAIRO Rech Opérationnelle :– . Feautrier P () Dataflow analysis of array and scalar references. Int J Parallel Program ():– . Feautrier P () Some efficient solutions to the affine scheduling problem, part II: Multi-dimensional time. Int J Parallel Program ():– . Gulwani S, Mehra KK, Chilimbi T () SPEED: Precise and efficient static estimation of program computational complexity. In th ACM Symposium on Principles of Programming Languages (POPL’). ACM, Savannah, pp – . Karp RM, Miller RE, Winograd S () The organization of computations for uniform recurrence equations. J ACM ():– . Lamport L () The parallel execution of DO loops. Commun ACM ():– . Leiserson CE, Saxe JB () Retiming synchronous circuitry. Algorithmica ():– . Lim AW, Lam MS () Maximizing parallelism and minimizing synchronization with affine transforms. In th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL’). ACM, New York, pp – . Podelski A, Rybalchenko A () A complete method for the synthesis of linear ranking functions. In Verification, Model Checking, and Abstract Interpretation (VMCAI’). Lecture notes in computer science, vol . Springer Verlag, pp – . Schrijver A () Theory of Linear and Integer Programming. Wiley, New York
. Vivien F () On the optimality of Feautrier’s scheduling algorithm. Concurr Comput (–):– . Wolf ME, Lam MS () A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans Parallel Distributed Syst ():–
Parallelization FORGE Loop Nest Parallelization Parafrase Parallelism Detection in Nested Loops, Optimal Parallelization, Automatic Parallelization, Basic Block Polaris Run Time Parallelization Speculative Parallelization of Loops
Parallelization, Automatic David Padua University of Illinois at Urbana-Champaign, Urbana, IL, USA
Synonyms Parallelization
Definition Autoparallelization is the translation of a sequential program by a compiler into a parallel form that outputs the same values as the original program. For some authors, autoparallelization means only translation for multiprocessors. However, this definition is more general and includes translation for instruction level, vector, or any other form of parallelism.
Discussion Introduction The compilers of most parallel machines are autoparallelizers, and this has been the case since the earliest parallel supercomputers, the Illiac IV and the TI ASC, were introduced in the s. Today, there are autoparallelizers for vector processors, VLIW processors, microprocessor vector extensions, and multiprocessors.
Parallelization, Automatic
Autoparallelization is for productivity. Autoparallelizers, when they succeed, enable the programming of parallel machines with conventional languages such as Fortran or C. In this programming paradigm, code is not complicated by parallel constructs and the obfuscation typical of manual tuning. Inserting explicit parallel constructs and tuning is not only time-consuming but also produces nonportable, machine-dependent code. For example, codes written for multiprocessors and those for SIMD machines have different syntax and organization. On the other hand, with the support of autoparallelization, conventional codes could be portable across machine classes. Explicit parallelism introduces opportunities for program defects that do not arise in sequential programming. With autoparallelization, the code has sequential semantics. There is no possibility of deadlock and programs are determinate. The downside is that it is not possible to implement asynchronous algorithms, although this limitation does not affect the vast majority of applications.
Requirements for Autoparallelization A parallelizing compiler must analyze the program to detect implicit parallelism and identify opportunities for restructuring transformations, and then apply a sequence of transformations. Detection of implicit parallelism can be accomplished by () computing the dependences to determine where the sequential order of the source program can be relaxed, and () analyzing the semantics of code segments to enable the selection of alternative parallel algorithms. The transformation process is restricted by the information provided by this analysis and is guided by heuristics supported by static prediction of execution time or program profiling.
Dependence Analysis The dependence relation is a partial order between operations in the program that is computed by analyzing variable and array element accesses. Executing the program following this partial order guarantees that the program will produce the same output as the original code. For example, in for (i=0; i < n; i++) {a[i] += 1;} for (j=0; j < n; j++) {b[j] = a[j]*2;}
P
corresponding iterations of the first and the second loop must be executed in the specified order. However, these pairs of iterations do not interact with other pairs and therefore do not have to execute in the original order to produce the intended result. Only corresponding iterations of these two loops are ordered. By determining what orders must be enforced, dependence analysis tells us which reordering is valid and what can be done in parallel: two operations that are not related by the partial order resulting from the dependence analysis can be reordered or executed in parallel with each other. Dependence analysis can be done statically, by a compiler; or dynamically, during program execution. Static analysis is discussed next, while dynamic analysis is discussed below under the heading of “Runtime Resolution”. How close the dependences generated by static dependence analysis are to the minimum number of ordered pairs required for correctness depends on the information available at compile time and the algorithms used for the analysis. The loops above are examples of loops that can be analyzed statically with total accuracy because () all the information needed for an accurate analysis is available statically, and () the subscript expressions are simple, so that most analysis algorithms can analyze them accurately. Accuracy is tremendously important because when the set of dependences computed by a test is not accurate, spurious dependences must be assumed and this may preclude valid transformations including conversion into parallel form. There are numerous algorithms for dependence analysis that have been developed through the years. They typically trade off accuracy for speed of analysis. For example, some fast tests do not make use of information about the values of the loop indices, while others require them. Ignoring the loop limits works well in some cases. The loops above are an example of this situation. The value of n in these loops is not required to do an accurate analysis. However, in other cases, knowledge of the loop limits is needed. Consider the loop for (i=10; i<15; i++ ) {a[i]+=a[i-8];}
The loop limits, and , are necessary to determine that no ordering needs to be enforced between loop iterations since, for these values, the iterations do not
P
P
Parallelization, Automatic
interact with each other. A test that ignores the loop limits will report that (some) iterations in this loop must be executed in order. Some of the most popular dependence tests require for accuracy that the subscript expressions be affine expression of the loop indices and the values of the coefficients and the constant be known at compile time. For example, a test that requires knowledge of the numerical values of coefficients would have to assume that iterations of the loop if (m >0) { for (i=0; i < n; i+=2){ a[m*i]+=a[m*i+1]; } }
valid. But this will only be known to the compiler, if it can propagate array values and these values are available in the source code. Otherwise, dependences must be assumed or the analysis postponed to execution time.
Semantic Analysis Semantic analysis identifies operators or code sequences that have a parallel implementation. A good example is the analysis of array operations. For example, in the Fortran statement a(1:n) = sin(a(2:n+1))
must be executed in order, while a test with symbolic capabilities would be able to determine that the iterations do not have to be executed in any particular order to obtain correct results. Table presents the main characteristics of a few dependence tests. When the needed information is not available at compile time or the analysis algorithm is inaccurate, the decision can be postponed to execution time (see “Transformations for Runtime Resolution” below). For example, the loop for (i=0; i
can be transformed into an array operation as long as k is negative, but the compiler will not know that this is the case if k happens to be a function of the input to the program or if the value propagation analysis conducted by the compiler cannot decide that k is negative. A similar situation arises in the loop for (i=0; i < n; i++) {a[m[i]]+=a[i];}
where m[i] must be ≤ i or ≥ n and all the m[i]’s be different for a transformation into vector operation to be
the n evaluations of sin can proceed in parallel since their parameters do not depend on each other. Although array operations like this can be interpreted as parallel operations, most Fortran compilers at the time of the writing of this entry do not parallelize directly array operations, but instead translate them into loops which are analyzed by later passes for parallelization. So, in effect, they rely on semantic analysis. The compiler can also apply semantic analysis to sequence of statements with the help of a database of patterns. For example, for (i=0; i
cannot be parallelized by relying exclusively on dependence analysis, because this analysis will only state the obvious: that each iteration requires the result of the previous one (the values of sum) to proceed. However, accumulations like this can be parallelized, if assuming that + is associative is acceptable, and are frequently found in real programs. Therefore, this pattern is a natural candidate for inclusion in this database. Other frequently found patterns include: finding the minimum or maximum of an array, and linear recurrences such as x[i]=a[i]*x[i-1]+b[i]
Parallelization, Automatic. Table Characteristics of a few dependence tests Test name
# of loop indices in subscript Subscript expressions must be affine? Uses loop bounds? Ref.
ZIV
(constant)
Y
N/A
[]
SIV
Y
Y
[]
GCD
Any
Y
N
[]
Banerjee
Any
Y
Y
[]
Access Region Any
N
Y
[]
Parallelization, Automatic
Some compilers have been known to recognize more complex patterns such a matrix–matrix multiplication. Once the compiler knows the type of operation, it can choose to replace the code sequence with a parallel version of the operation.
Program Transformations Program transformations are used to . Reduce the number of dependences . Generate code for runtime resolution, that is, code that at runtime decides whether to execute in parallel . Schedule operations to improve locality or parallelism Transformations for Reducing the Number of Dependences
This class of transformations aims at reducing the number of ordered pairs to improve parallelism and enable reordering. Induction variable substitution and privatization are two of the most important examples in this class. Induction variables are those that assume values that form an arithmetic sequence. Their computation creates a linear order that must be enforced. In addition, using induction variables in subscripts hinders the dependence analysis of other computations. For example, the loop for (i=0; i
cannot be parallelized in this form since j+= must be executed in order. Furthermore, dependence analysis cannot know that each iteration of the loop accesses a different element unless it knows that j takes a different value in each iteration. Fortunately, in this example, as in most cases, the induction variable can be eliminated to increase parallelism and improve accuracy of analysis. Thus, here j may be represented in terms of the loop index and forward substituted: for (i=0; i
The effect of this transformation is that the chain of dependences resulting from the j++ statement goes away with the statement. Also, the removal of the increment makes j a loop invariant and this enables an accurate dependence analysis at compile time. The identification of induction variables was originally developed for strength reduction, which replaces
P
operations with less expensive ones. A typical strength reduction is to replace multiplications with additions. For parallelism, the replacement goes in the opposite direction. For example, additions are replaced by multiplications as shown in the last example. Induction variable identification relies on conventional compiler data-flow analysis. Privatization can be applied when the a variable is used to carry values from one statement to another within the one iteration of the loop. For example, in for(i=0;i
the use of a single variable, a, in all iterations demands that the iterations be executed in order to guarantee correct results because, a should not be reassigned until its value has been obtained by the second statement of the loop body. The privatization transformation simply makes a private to the loop iteration and thus eliminates a reason to execute the iterations in order. An alternative to privatization is expansion. This transformation converts the scalar into an array and has the same effect on the dependence as privatization. For the previous loop, this would be the result: for (i=0;i
Privatization is applied when generating code for multiprocessors, and expansion is necessary for vectorization. The main difficulty with expansion is the increase in memory requirements. While privatization increases the memory requirements proportionally to the number of processors, expansion does so proportionally to the number of iterations, a number that is typically much higher. However, expansion can be applied together with a transformation called stripmining to reduce the amount of additional memory. Privatization and expansion require analysis to determine that the variable being privatized or expanded is never used to pass information across iterations of the loop. This analysis can be done using conventional data flow analysis techniques.
P
P
Parallelization, Automatic
Transformations for Runtime Resolution
In its simplest form, runtime resolution transformations generate if statements to select between a parallel or serial version of the code. For example, do i=m,n a(i+k)=a(i)*2 end do
as discussed above, can be vectorized if k ≤ . The compiler may then generate a two-version code if (k<=0) then a(k+m:k+n)+=a(m:n) else do i=m,n a(i+k)=a(i)*2 end do end if
If it was not, the execution is undone and the components executed at a later time either in the right order or again speculatively, in parallel. Run time resolution is also used to check for profitability, i.e., that parallel execution will make execution faster. For example, if the number of iterations of a parallel loop is not known at compile time, runtime resolution can be used to decide whether to execute a loop in parallel as a function of the number of iterations. Also, runtime resolution can be used to guarantee that vector operations are only executed if the operands are or can be properly aligned in memory when this is required for performance. For example, SSE vector operations sometimes perform better when the operands are aligned on double word boundaries. Scheduling Transformations
Two-version code can be also be used in other situations. Thus, if the loop contains an assignment statement that accesses memory through pointers in the right- and left-hand sides, such as the loop for (i=0; i
the if statement should check that address a is either less than address b or greater than address (b+n-). More complex runtime resolution would be needed for loops like
An important class are the transformations that schedule the execution of program operations or partition these operations into groups. To enforce the order, the compiler typically uses the barriers implicit in array operations or multiprocessor synchronization instructions. One such transformation is stripmining. It partitions the iterations of a loop into blocks by augmenting the increment of the loop index and adding an inner loop as follows: for (i=0; i
for (i=0; i
↓
where the m[i]’s must be ≤ i or ≥ n and all distinct for vectorization to be possible, or m[i] either = i or outside the values in the iteration space and all distinct for transformation into a parallel loop. In
for (i=0; i < (n/q)*q; i+=q) for(j=i; j< i+q, j++) { a[j]=a[j]+1; }
for (i=0; i
for (i=(n/q)*q; i< n, i++) {a[i]=a[i]+1;}
the m[i]’s and q[i]’s must be such that m[i] ≤ q[i] and the m[i]’s all distinct for vectorization, or m[i] ≠ q[j] whenever i ≠ j for parallelization. Two-version loops can be generated also in this case, but the if condition is somewhat more complex as it must analyze a collection of addresses. In this last case, the technique is called inspector-executor. Another approach to runtime resolution is speculation, which attempts to execute in parallel and optimistically expects that there will be no conflicts between the different components executing in parallel. During the execution of the speculative parallel code or at the end, the memory references are checked to make sure that the parallel execution was correct.
Stripmining is useful to enhance locality and reduce the amount of memory required by the program. In particular, it can be used to reduce the memory consumed by expansion. If the goal is vectorization and the size of the vector register is q, this transformation will not reduce the amount of parallelism. Another type of loop partitioning transformation is that developed for a class of autoparallelizing compilers targeting distributed memory operations. These compilers, including High-Performance Fortran and Vienna Fortran, flourished in the s but are no longer used. The goal of partitioning was to organize loop iterations groups so that each group could
Parallelization, Automatic
be scheduled in the node containing the data to be manipulated. An important sequencing transformation is loop interchange, which changes the order of execution by exchanging loop headers. This transformation can be useful to reduce the overhead when compiling for multiprocessors and to enhance memory behavior by reducing the number of cache misses. For example, the loop for (i=0; i < (n/q)*q; i+=q) for(j=i; j< i+q, j++) { a[j]=a[j]+1; }
can be correctly transformed by loop interchange into for (j=0; j
The outer loop of the original nest cannot be executed in parallel. If nothing else is done, the only option of the compiler targeting a multiprocessor is to transform the inner loop into parallel form and while this could lead to speedups, the result would suffer of the parallel loop initiation overhead once per iteration of the outer loop. Exchanging the loop headers makes the iteration of the outer loop independent so that the outer loop can be executed in parallel and the overhead is only paid once per execution of the whole loop. Furthermore, the resulting loop has a better locality since the array is traverse in the order it is stored, so that the elements of the array in a cache line are accessed in consecutive order, improving in this way spatial locality. A third example of sequencing transformation is instruction level parallelization. Consider, for example, a VLIW machine with a fixed point and a floating point unit. The sequence r1=r2+r3 r4=r4+r5 f1=f1+f2 f3=f4+f5
contains two fixed point operations (those operating on the r registers) and two floating point operations. Exchanging the second and the third operation is necessary to enable the creation of two (VLIW) instructions each making use of both computational units.
P
In some cases, the partitioning and sequencing of the operations is not completely determined at compile time. For example, the sum reduction

for (i=0; i<n; i++)
    sum += a[i];

once identified as such by semantic analysis, may be transformed into a form in which subsets of iterations are executed by different threads and the elements of a are accumulated into different variables, one per thread of execution. These partial sums are then added to obtain the final sum. The number of threads can be left undefined until execution time. In OpenMP notation, this can be represented as follows:

#pragma omp parallel
{
    float sp = 0;
    #pragma omp for
    for (i=0; i<n; i++)
        sp += a[i];
    #pragma omp critical
    sum += sp;
}

or, more simply,

#pragma omp parallel for reduction(+: sum)
for (i=0; i<n; i++)
    sum += a[i];
It should be pointed out that this transformation assumes that floating-point addition is associative which, because of the finite precision of machine arithmetic, it is not. In some cases it is correct to do this transformation even if the result obtained is not exactly the same as that of the original program. However, this is not always the case, and transformations like this require authorization from the programmer. The table below contains a list of important transformations not discussed above.
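A small demonstration of the non-associativity of floating-point addition (example invented for illustration):

#include <stdio.h>

int main(void)
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    /* (a + b) + c evaluates to 1, but a + (b + c) evaluates to 0,
       because c is lost when added to the much larger b. */
    printf("%g %g\n", (a + b) + c, a + (b + c));
    return 0;
}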
Autoparallelization Today
Most of today’s compilers that target parallel machines are autoparallelizers. They can generate code for multiprocessors as well as vector code. Although autoparallelization techniques have become the norm, the few empirical studies that exist, as well as anecdotal evidence, indicate that these compilers often fail to generate high-quality parallel code. There are two reasons for this. First, sometimes the compiler fails to find parallelism due to limitations of its dependence/semantic analysis or transformation modules. In other cases, it is unable to generate good-quality code because of limitations in its profitability analysis; that is, the compiler incorrectly assumes that transforming into parallel form would slow the program down.
Parallelization, Automatic. Table  An incomplete list of transformations for autoparallelization

Name | Description | Example of use
Alignment | Reorganizes computation so that values produced in one iteration are consumed by the same iteration | Reduce synchronization costs
Distribution | Partitions a loop into multiple loops | Separates sequential from parallel parts
Fusion | Merges two loops | Reduce parallel loop initiation overhead
Skewing | Partitions the set of iterations into groups that are not related by dependences (i.e., are not ordered) | Enhance parallelism
Node splitting | Breaks a statement into two | Reduce dependence cycles and thus enable transformations
Software pipelining | Reorders and partitions the executions of operations in a loop into groups that are independent from each other | Enhance instruction-level parallelism
Tiling | Partitions the set of iterations of a multiply nested loop into blocks or tiles | Enhance locality
Trace scheduling | Reorders and partitions the executions of operations in a loop into groups that are independent from each other | Enhance instruction-level parallelism
Unroll and jam | Partitions the set of iterations of a multiply nested loop into blocks or tiles with reuse of values | Enhance locality
To circumvent these limitations, compilers accept directives from programmers to help the analysis or to guide the transformation and code-generation process. A few vectorization directives for the Intel C++ and IBM XLC compilers are shown in the table of vectorization directives below. The programmer can also influence the result by rewriting the program into a form that the compiler can recognize. Despite their limitations, autoparallelizers today contribute to productivity by:
1. Saving labor. As mentioned, manual intervention in the form of directives or rewriting is typically necessary, but programmers can often rely on the autoparallelizing compiler for some sections of code and, in some cases, for the whole program.
2. Portability. Sequential code complemented with directives is portable across classes of machines with the support of compilers. Portability is, after all, one of the purposes of compilers, and autoparallelization brings this capability to the parallel realm.
3. Serving as a training mechanism. Programmers can learn what can and cannot be parallelized by interacting with an autoparallelizer. Thus, the compiler’s report to the programmer is useful not only for manual intervention but also for learning.
Future Directions
Autoparallelization has been only partially successful. As previously mentioned, in many cases today’s compilers fail to recognize the existence of parallelism or, having recognized the parallelism, incorrectly assume that transforming into parallel form is not profitable. Although autoparallelization is useful and effective when guided by user directives, there is clearly much room for improvement.
Parallelization, Automatic. Table  Vectorization directives for the IBM (XLC) and Intel (ICC) compilers

Vectorization directive | Purpose
#pragma vector always (ICC) | Vectorize the following loop whenever dependences allow it, disregarding the profitability analysis
#pragma nosimd (XLC); #pragma novector (ICC) | Preclude vectorization of the following loop
__assume_aligned(A, ...); (ICC); __alignx(..., A); (XLC) | The compiler is told to assume that the vector (A in the examples) starts at addresses that are a multiple of a given constant (the second argument of __assume_aligned and the first argument of __alignx)
Research in the area has decreased notably in the recent past, but more work is likely due to the renewed interest in parallelism that multicores have initiated. Two promising lines of future study are:
1. Empirical evaluation of compilers to improve parallelism detection, code generation, compiler feedback, and parallelization directives. Evaluating compilers using real applications is necessary to make advances in the autoparallelization of conventional languages. Although some work has been done in this area, much more needs to be done. This type of work is labor intensive, since the best, and perhaps the only, way to do it is for an expert programmer to compare what the compiler does with the best code that the programmer can produce. This process is likely to converge, since code patterns repeat across applications []. These costs and risks are worthwhile given the importance of the topic and the potential for an immense impact on productivity.
2. Study of programming notations and their impact on autoparallelization. Higher-level notations, such as those used for array operations, tend to facilitate the task of a compiler while at the same time improving productivity. Language-compiler codesign is an important and promising direction, not only for autoparallelization but for compiler optimization in general.
Related Entries
Banerjee’s Dependence Test
Code Generation
Dependence Analysis
Dependences
GCD Test
HPF (High Performance Fortran)
Loop Nest Parallelization
Modulo Scheduling and Loop Pipelining
Omega Test
Parallelization, Basic Block
Run Time Parallelization
Scheduling Algorithms
Semantic Independence
Speculative Parallelization of Loops
Speculation, Thread-Level
Trace Scheduling
Unimodular Transformations
Bibliographic Notes and Further Reading
As mentioned in the introduction, work on autoparallelization started in the s with the introduction of the Illiac IV and the Texas Instruments Advanced Scientific Computer (ASC). The Paralyzer, an autoparallelizer for the Illiac IV developed by Massachusetts Computer Associates, is discussed in []. This is the earliest description of a commercial autoparallelizer in the literature. Since then, there have been numerous papers and books describing commercial autoparallelizers. For example, [] describes an IBM vectorizer of the s, [] discusses Intel’s vectorizer for their multimedia extension, and [] describes the autoparallelization features of the IBM XLC compiler. Many of the autoparallelization techniques were developed at universities. Pioneering work was done by David Kuck and his students at the University of Illinois [, ]. The field has benefited from the contributions of numerous researchers. The contributions of Ken Kennedy and his coworkers [] at Rice University have been particularly influential.
There have been only a few papers evaluating the effectiveness of autoparallelizers. In [], different vectorizing compilers are compared using a collection of snippets, and in [] the effectiveness of parallelizing compilers is discussed using the Perfect Benchmarks. More information on autoparallelization can be found in the related entries or in books devoted to this subject [, , , ]. Reference [] contains a discussion of compiler techniques for High Performance Fortran.
Bibliography
1. Allen R, Kennedy K () Automatic translation of FORTRAN programs to vector form. ACM Trans Program Lang Syst ():–
2. Banerjee UK () Loop transformations for restructuring compilers: dependence analysis. Kluwer Academic, Norwell
3. Bik AJC (May ) The software vectorization handbook. Intel, Hillsboro
4. Eigenmann R, Hoeflinger J, Padua D (Jan ) On the automatic parallelization of the Perfect Benchmarks. IEEE Trans Parallel Distrib Syst ():–
5. Kennedy K, Allen JR () Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann, San Francisco
6. Kuck DJ () Parallel processing of ordinary programs. Adv Comput :–
7. Kuck DJ, Kuhn RH, Padua DA, Leasure B, Wolfe M () Dependence graphs and compiler optimizations. In: Proceedings of the ACM SIGPLAN-SIGACT symposium on principles of programming languages (POPL), Williamsburg. ACM, New York, pp –
8. Levine D, Callahan D, Dongarra J () A comparative study of automatic vectorizing compilers. Parallel Comput :–
9. Paek Y, Hoeflinger J, Padua D (Jan ) Efficient and precise array access analysis. ACM Trans Program Lang Syst ():–
10. Presberg DL () The Paralyzer: Ivtran’s parallelism analyzer and synthesizer. In: Proceedings of the conference on programming languages and compilers for parallel and vector machines, New York, pp –
11. Scarborough RG, Kolsky HG (March ) A vectorizing Fortran compiler. IBM J Res Dev ():–
12. Wolfe M () High performance compilers for parallel computing. Addison-Wesley, Reading
13. Zhang G, Unnikrishnan P, Ren J () Experiments with auto-parallelizing SPECFP benchmarks. LCPC, pp –
14. Zima H, Chapman B () Supercompilers for parallel and vector computers. ACM, New York

Parallelization, Basic Block
Utpal Banerjee
University of California at Irvine, Irvine, CA, USA

Synonyms
Parallelization

Definition
A basic block in a program is a sequence of consecutive operations, such that control flow enters at the beginning and leaves at the end without halt. Basic block parallelization consists of techniques that allow execution of operations in a basic block in an overlapped manner without changing the final results.
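A small example of a basic block (invented for illustration): three straight-line assignment operations with a single entry at the top and a single exit at the bottom.

t1 = a + b;     /* A1                                      */
t2 = t1 * c;    /* A2: depends on A1 (reads t1)            */
d  = t2 + t1;   /* A3: depends on A1 and A2                */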
Discussion Introduction The operations in a basic block are to be executed in the prescribed sequential order. This execution order imposes a dependence structure on the set of operations, based on how they access different memory locations. A new order of execution is valid if whenever an operation B depends on an operation A in the block, execution of B in the new order does not start until after the execution of A has ended. The basic assumption is that executing the operations in any valid order will not change the final results expected from the basic block. To reduce the total execution time of the block, one needs to find a new valid order where operations are overlapped. Among all such orders, one must choose only those that are compatible with the physical resources of the given machine. Even when the simultaneous processing of two or more operations is permissible by dependence considerations alone, there may not be enough resources available to process them simultaneously. There are many algorithms for basic block parallelization. This essay presents four of them: two for a hypothetical machine with unlimited resources, and two for a machine with limited resources. It starts with a section on basic concepts, and after developing the algorithms, ends with a simple example that compares
the actions of all four on a given basic block. References to more algorithms are given in the bibliographic notes.
Basic Concepts
By an operation one means an atomic operation that a machine can perform. An assignment operation reads one or more memory locations and writes one location. It has the general form

A:  x = E

where A is a label, x a variable, and E an expression. Such an operation reads the memory locations specified in E, and writes the location x. A basic block in a program is a sequence of assignment operations, where flow of control enters at the top and leaves at the bottom. There are no entry points except at the beginning, and no branches, except possibly at the end. The object of study in this essay is a basic block of n operations. The set of those operations is denoted by B. There is a mapping c : B → {1, 2, 3, . . .} that gives the cycle times of the operations. An operation B in the block depends on another operation A, and one writes A δ B, if A is executed before B, and one of the following holds:
1. B reads the memory location written by A.
2. B writes a location read by A.
3. A and B both write the same location.
An operation B is indirectly dependent on an operation A, and one writes A δ̄ B, if there exists a sequence of operations A1, A2, . . . , Ak, such that A = A1, A1 δ A2, . . . , Ak−1 δ Ak, Ak = B. Two operations A and B are mutually independent if A δ̄ B and B δ̄ A are both false. The dependence graph of the basic block is a directed acyclic graph, such that the nodes correspond to the operations, and there is an edge from a node A to a node B if and only if A δ B. Thus, A δ̄ B means there is a directed path from the node A to the node B in the dependence graph. A new execution order for the operations in B is valid if whenever A, B in B are such that A δ B, B is executed after A in the new order. If the prescribed sequential order of execution for B is changed to any valid order, the results would still be the same. The goal is to find a valid execution order where operations
are overlapped as much as possible. However, any such order must also be compatible with the resources of the given machine. Control steps for execution of the basic block are numbered 1, 2, 3, . . . . Scheduling an operation in the block means assigning a control step to it where it can start executing. An instruction for a given machine is a set of operations that the machine can perform simultaneously. An instruction may be empty. Scheduling the basic block means creating a sequence of m instructions (I1, I2, . . . , Im), where Ik starts in control step k, such that:
1. Each instruction consists of operations in B, and each operation in B appears in exactly one instruction.
2. The operations in each instruction are pairwise mutually independent.
3. If an operation B in an instruction Ik depends on an operation A in an instruction Ij, then j + c(A) ≤ k.
4. In any control step, the given machine has enough resources to process simultaneously all operations being executed.
Such a sequence of instructions is a schedule for the basic block. A schedule for the block can be specified indirectly by scheduling each individual operation. Then all operations starting in control step k constitute the instruction Ik. The weight of a path (A1, A2, . . . , Ak) in the dependence graph for the basic block B is the sum c(A1) + c(A2) + ⋯ + c(Ak). A path is critical if it has the greatest possible weight among all paths in the graph. Let T denote the weight of a critical path. Then, any schedule for B will need at least T cycles to finish. For each operation A ∈ B, the set of all immediate predecessors (in the dependence graph) is denoted by Pred(A) and the set of all immediate successors by Succ(A): Pred(A) = {B ∈ B : B δ A}, Succ(A) = {B ∈ B : A δ B}. The number of members of a set S is denoted by |S|.
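A minimal sketch of the dependence test between two assignment operations, following the three conditions above (the representation of an operation by one written location and a set of read locations is an assumption of this illustration):

#include <stdbool.h>

/* An operation writes one location and reads nreads locations. */
typedef struct {
    int write;          /* location written */
    const int *reads;   /* locations read   */
    int nreads;
} Operation;

static bool reads_location(const Operation *op, int loc)
{
    for (int i = 0; i < op->nreads; i++)
        if (op->reads[i] == loc) return true;
    return false;
}

/* True if B depends on A (A δ B), assuming A is executed before B. */
bool depends(const Operation *A, const Operation *B)
{
    return reads_location(B, A->write)   /* B reads what A writes     */
        || reads_location(A, B->write)   /* B writes what A reads     */
        || A->write == B->write;         /* both write the same place */
}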
Unlimited Resources In this section, it is assumed that the given machine has an unlimited supply of resources (functional units, registers, etc.). Consequently, operations in the basic block
can be scheduled subject only to the dependence constraints between them. The two algorithms considered in this section complete the execution of B in exactly T cycles. The ASAP algorithm schedules an operation as soon as possible so that the basic block can be processed in the shortest possible time. It creates a simple function ℓ : B → {1, 2, . . .} such that ℓ(A) is the earliest possible step when A can start executing. The ALAP algorithm schedules an operation as late as possible within the constraint of executing B in the shortest possible time. It creates a function L : B → {1, 2, . . .} such that L(A) is the latest possible step when A can start executing. The ASAP label of A is ℓ(A) and its ALAP label is L(A). It is clear that ℓ(A) ≤ L(A) for each operation A. The range of consecutive integers from which the control step for A may be chosen is {ℓ(A), ℓ(A) + 1, . . . , L(A)}.
ASAP Algorithm
The goal of the ASAP algorithm (see the figure below) is to compute the ASAP label ℓ for each operation in the given basic block B. If A and B are two operations such that A δ B, then after starting A, one must wait at least until A has finished before starting B. This means one must have ℓ(B) ≥ ℓ(A) + c(A).

Algorithm 1. Given a basic block B, its dependence graph, and a machine with unlimited resources, this algorithm computes the ASAP label ℓ of each operation, and the total number m of instructions needed to replace B. For each A ∈ B, the sets Pred(A) and Succ(A) are assumed to be known.

m ← 1
V ← B
for each operation A ∈ V do
    Pcount(A) ← |Pred(A)|
    ℓ(A) ← 1
endfor
while V ≠ ∅ do
    for each operation A ∈ V do
        if Pcount(A) = 0 then
            for each B ∈ Succ(A) do
                Pcount(B) ← Pcount(B) − 1
                ℓ(B) ← max{ℓ(B), ℓ(A) + c(A)}
            endfor
            V ← V − {A}
            m ← max{m, ℓ(A)}
        endif
    endfor
endwhile
Parallelization, Basic Block. Fig. The ASAP algorithm
An operation is scheduled only after all its predecessors have been scheduled. An integer-valued function Pcount on B is defined as follows: at any point in the algorithm, Pcount(A) is the number of immediate predecessors of an operation A that have not been scheduled yet. Initially, all operations are assigned the control step 1, that is, ℓ(A) is initialized to 1 for each A ∈ B. Operations A for which Pcount(A) = 0 keep this value of ℓ(A); they have been scheduled. If Pcount(A) = 0 and B is a successor of A, then reduce Pcount(B) by 1, and increase ℓ(B) to [ℓ(A) + c(A)] if it is smaller. The operations whose Pcount is now zero keep their ASAP label; they have been scheduled. This process continues until all operations in B have been scheduled. The earliest control step where an operation B can start is given by

ℓ(B) = max{ℓ(A) + c(A) : A ∈ Pred(B)}.

The total number of cycles needed to execute the block B is

max{ℓ(A) + c(A) : A ∈ B} − 1,

which is clearly equal to the weight T of a critical path in the dependence graph. The total number of instructions needed to replace the basic block is given by m = max{ℓ(A) : A ∈ B}.
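A compact C sketch of the ASAP labeling is given below; the graph representation by predecessor counts and successor lists is an implementation choice made for this illustration, and the pseudocode above remains the authoritative description.

#include <stdlib.h>

/* Operations are numbered 0..n-1.  c[v] is the cycle time of operation v,
   succ[v] lists its immediate successors, pcount[v] = |Pred(v)| and is
   consumed by the routine.  On return, asap[v] holds the ASAP label ℓ(v);
   the return value is m = max ℓ(v). */
int asap_labels(int n, const int *c,
                int **succ, const int *nsucc, int *pcount, int *asap)
{
    int m = 1;
    int *ready = malloc(n * sizeof(int));
    int head = 0, tail = 0;

    for (int v = 0; v < n; v++) {
        asap[v] = 1;
        if (pcount[v] == 0) ready[tail++] = v;    /* no predecessors */
    }
    while (head < tail) {
        int a = ready[head++];                    /* label of a is final */
        if (asap[a] > m) m = asap[a];
        for (int i = 0; i < nsucc[a]; i++) {
            int b = succ[a][i];
            if (asap[a] + c[a] > asap[b]) asap[b] = asap[a] + c[a];
            if (--pcount[b] == 0) ready[tail++] = b;
        }
    }
    free(ready);
    return m;
}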
ALAP Algorithm
The goal of the ALAP algorithm (see the figure below) is to compute the ALAP label L for each operation in the given basic block B. The entire block has to be completed in the shortest possible time in such a way that each operation starts as late as possible. For each operation A, let f(A) denote the number of cycles from the point when A is scheduled to start to the point when execution of the entire block has been completed. The idea then is to minimize f(A) for each A. When operation A starts, [L(A) − 1] cycles have already elapsed. Hence, the total number of cycles T needed to complete B is [L(A) − 1 + f(A)], so that

L(A) = T + 1 − f(A).     (1)

The ALAP algorithm first computes T, and f(A) for each A, and then evaluates L(A) from this equation. If A and B are two operations such that A δ B, then after starting A, one must wait at least until A has finished
Algorithm 2. Given a basic block B, its dependence graph, and a machine with unlimited resources, this algorithm computes the ALAP label L of each operation, and the total number m of instructions needed to replace B. For each A ∈ B, the sets Pred(A) and Succ(A) are assumed to be known.

T ← 0
V ← B
for each operation A ∈ V do
    Scount(A) ← |Succ(A)|
    f(A) ← 0
endfor
while V ≠ ∅ do
    for each operation A ∈ V do
        if Scount(A) = 0 then
            f(A) ← f(A) + c(A)
            T ← max{T, f(A)}
            for each B ∈ Pred(A) do
                Scount(B) ← Scount(B) − 1
                f(B) ← max{f(B), f(A)}
            endfor
            V ← V − {A}
        endif
    endfor
endwhile
for each operation A ∈ B do
    L(A) ← T + 1 − f(A)
endfor
m ← max{L(A) : A ∈ B}
Parallelization, Basic Block. Fig. The ALAP algorithm
before starting B. This means L(B) ≥ L(A) + c(A), or f(A) ≥ f(B) + c(A) by (1). Thus, the minimum possible value for f(A) is

f(A) = c(A) + max{f(B) : B ∈ Succ(A)}.

An operation is scheduled only after all its successors have been scheduled. An integer-valued function Scount on B is defined as follows: at any point in the algorithm, Scount(B) is the number of immediate successors of an operation B that have not been scheduled yet. Initialize T to 0, and f(A) to 0 for each A ∈ B. If an operation A has Scount(A) = 0, then f(A) is increased by c(A) to reach f(A) = 0 + c(A) = c(A). This operation has now been scheduled. The value of T is also increased to f(A) if T < f(A). If Scount(A) = 0 and B is a predecessor of A, then reduce Scount(B) by 1, and increase f(B) to f(A) if f(B) < f(A). The operations A for which Scount(A) is now zero are handled next. This process continues until all operations in B have been
processed. When the final value of f(A) for each operation A and the final value of T are known, the ALAP labels are found from the equation L(A) = T + 1 − f(A). The total number of cycles needed to execute the block is T = max{f(A) : A ∈ B}, which is equal to the weight of a critical path in the dependence graph. The total number of instructions needed to replace the basic block is given by m = max{L(A) : A ∈ B}.
Limited Resources
In this section, the reality is acknowledged that any given machine has a limited amount of functional resources. While scheduling the operations in the basic block B, one now needs to worry about potential resource conflicts between two operations, in addition to the dependence constraints that may exist between them. For simplicity, a pipelined implementation is assumed for each multi-cycle operation. Register allocation is not treated here; it should be done either before or after scheduling.
List Scheduling
List scheduling employs a greedy approach to schedule as many operations as possible among those whose predecessors have been scheduled. Each operation is assigned a priority. Operations that are ready to be scheduled are placed on a ready list ordered by their priorities. At each control step, the operation with the highest priority is scheduled first. If there are two or more operations with the same priority, then a selection is made at random. List scheduling encompasses a family of different algorithms based on the choice of the priority function. In the algorithm described here (see the figure below), the priority of an operation A is defined by the difference μ(A) = L(A) − ℓ(A) between its ALAP and ASAP labels, called the mobility of the operation. An operation with a lower mobility has a higher priority. First, Algorithm 1 and Algorithm 2 are used to find the ASAP and ALAP labels of each operation in the given basic block. Operations without predecessors are placed on a ready list S and arranged in the order of increasing mobility. They are taken from the ready list and scheduled one by one subject to the availability of machine resources. After one round, if there is still an operation A left over in S, then its ASAP label is increased by 1 without exceeding its ALAP label. This will reduce the mobility of A, if it is not already zero.
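For example, the ordering of the ready list by increasing mobility could be done as follows (the Op structure and its field names are invented for this sketch):

#include <stdlib.h>

typedef struct { int id; int asap; int alap; } Op;    /* ℓ(A), L(A) */

static int by_mobility(const void *x, const void *y)
{
    const Op *a = x, *b = y;
    return (a->alap - a->asap) - (b->alap - b->asap); /* μ = L − ℓ */
}

/* usage:  qsort(ready, nready, sizeof(Op), by_mobility); */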
Algorithm (List Scheduling). Given a basic block B, its dependence graph, and a machine with limited resources, this algorithm finds a schedule for B. For each A ∈ B, the sets Pred(A) and Succ(A) are assumed to be known.

V ← B
k ← 1
for each operation A ∈ V do
    Compute the labels ℓ(A) and L(A) by Algorithm 1 and Algorithm 2
endfor
while V ≠ ∅ do
    S ← all operations in V whose predecessors have finished executing before control step k
    for each A ∈ S do
        μ(A) ← L(A) − ℓ(A)
    endfor
    Arrange the operations of S in a sequence (Ar1, Ar2, . . . , Ar|S|) in increasing order of their mobilities
    Create an empty instruction Ik
    for i = 1 to |S| do
        if Ari does not have resource conflicts with operations in Ik then
            Put Ari in Ik
            V ← V − {Ari}
            S ← S − {Ari}
        endif
    endfor
    k ← k + 1
    for each A ∈ S do
        ℓ(A) ← min{ℓ(A) + 1, L(A)}
    endfor
endwhile
Parallelization, Basic Block. Fig. List scheduling algorithm
Linear Analysis
The linear algorithm (see the figure below) checks the operations of the basic block linearly in their order of appearance, and puts them into instructions observing dependence constraints and avoiding resource conflicts. Arrange the operations in the block in their prescribed sequential order: A1, A2, . . . , An. For an easy description of the algorithm, it is convenient to assume that a control step could be any integer. Start with a sequence of empty instructions {Ik : −∞ < k < ∞} arranged in an imaginary vertical column. The operations A1, A2, . . . , An are taken in this order and put in instructions one by one. At any point, all operations already scheduled are in a range of instructions (It, It+1, . . . , Ib), where t ≤ 1 and b ≥ 1. The value of t keeps decreasing and the value of b keeps increasing. At the end of the algorithm, the final sequence (It, It+1, . . . , Ib) can be renumbered to get a sequence of instructions (I′1, I′2, . . . , I′m), where m = b − t + 1.
Both t and b are initialized to 1. The first operation A1 is put in instruction I1. Suppose operations A1, A2, . . . , Ai−1 have already been scheduled, and the time has come to schedule Ai, where 2 ≤ i ≤ n. Let k denote the smallest integer ≥ t such that all predecessors of Ai finish executing before control step k. If Ai has no predecessors, then k = t. Otherwise, if a predecessor Ar is in an instruction Ij, where t ≤ j ≤ b, then k ≥ j + c(Ar). The exact value of k is found by taking the maximum of all such expressions. So, operation Ai can be put in instruction Ik without violating any dependence constraints. If k ≤ b, start checking the instructions Ik, Ik+1, . . . , Ib, in this order, for potential resource conflicts. If there is an instruction in this range with which Ai does not conflict, then put Ai in the first such instruction. Otherwise, Ai conflicts with all instructions in the range and k = b + 1. If k > b was true before the checking could start, then its value did not change. At this point, if Ai had no predecessors
Algorithm (Linear Analysis). Given a basic block (A1, A2, . . . , An) of n operations, its dependence graph, and a machine with limited resources, this algorithm finds a schedule for the block. At any point in the algorithm, all scheduled operations lie in the sequence of instructions (It, It+1, . . . , Ib), where t ≤ 1 and b ≥ 1.

t ← 1
b ← 1
Put the operation A1 in instruction I1
for i = 2 to n do
    k ← earliest control step ≥ t before which all predecessors of Ai finish executing
    while k ≤ b and Ai has a resource conflict with operations in Ik do
        k ← k + 1
    endwhile
    if k ≤ b then
        Put Ai in Ik
    else
        if Pred(Ai) = ∅ then
            Put Ai in It−1
            t ← t − 1
        else
            Put Ai in Ik
            b ← k
        endif
    endif
endfor
Parallelization, Basic Block. Fig. The linear algorithm
in the first place, then put it at the top in instruction It−1 and decrease t to t − 1. If Ai had predecessors, then put it in instruction Ik and increase b to k. The process ends when all operations in the given basic block have been scheduled.
An Example
Consider a basic block B consisting of eight operations A1, A2, . . . , A8, arranged in this order. The dependence graph of B is given in the figure below. The type, cycle time, immediate predecessors, and immediate successors of each operation are listed in the table below. Fixed cycle times are assumed for additions and for multiplications. The four algorithms described in this essay are applied to B one by one. The list scheduling and linear algorithms are customized for a machine with one adder and one multiplier. Note the critical path in the dependence graph; since its weight is T, the total number of cycles taken by each algorithm to process B must be at least T.
Parallelization, Basic Block. Fig. Dependence graph of basic block B (nodes A1 through A8)
A and A are successors of A , their ASAP values are increased as follows: ℓ(A ) ← max{ℓ(A ), ℓ(A ) + c(A )} = max{, } = ℓ(A ) ← max{ℓ(A ), ℓ(A ) + c(A )} = max{, } = ℓ(A ) ← max{ℓ(A ), ℓ(A ) + c(A )} = max{, } = ℓ(A ) ← max{ℓ(A ), ℓ(A ) + c(A )} = max{, } = .
Parallelization, Basic Block. Table  Details of the basic block of the example: for each operation A it lists the type (+ or ∗), the cycle time c(A), the immediate predecessors Pred(A), the immediate successors Succ(A), and the values f(A), ℓ(A), L(A), and μ(A)
After A and A have been scheduled, operations A , A , and A do not have any predecessors remaining to be scheduled. So, they can be scheduled in steps determined by their current ASAP values. Since A is a successor of A and A a successor of A , their ASAP values are increased as follows: ℓ(A ) ← max{ℓ(A ), ℓ(A ) + c(A )} = max{, } = ℓ(A ) ← max{ℓ(A ), ℓ(A ) + c(A )} = max{, } = . After A and A have been scheduled, operations A and A do not have any predecessors remaining to be scheduled. So, they can be scheduled in steps determined by their current ASAP values. Since A is a successor of A , its ASAP value is increased as follows:
and A have no successors, f (A ), f (A ), and f (A ) are increased as follows: f (A ) ← f (A ) + c(A ) = + = f (A ) ← f (A ) + c(A ) = + = f (A ) ← f (A ) + c(A ) = + = . These operations have now been scheduled. The value of T is increased to . Since A is a predecessor of A , A of A , and A of A , their f values are increased as follows: f (A ) ← max{ f (A ), f (A )} = max{, } = f (A ) ← max{ f (A ), f (A )} = max{, } = f (A ) ← max{ f (A ), f (A )} = max{, } = .
ℓ(A ) ← max{ℓ(A ), ℓ(A ) + c(A )} = max{, } = .
Since A has no successors remaining to be scheduled, f (A ) is changed as follows:
After A has been scheduled, operation A does not have any predecessors remaining to be scheduled. So, A keeps its current ASAP value, and is scheduled to start in step . All the ASAP values are shown in Table . The total number of control steps taken to execute B is
f (A ) ← f (A ) + c(A ) = + = .
max[ℓ(Ai ) + c(Ai )] − = . ≤i≤
The total number of instructions in the schedule is maxi ℓ(Ai ) = . The instructions are (I , I , . . . , I ), where I is empty, and I = {A , A }, I = {A }, I = {A }, I = {A , A }, I = {A }, I = {A }. ALAP Algorithm. At the beginning, T is initialized to , and f (Ai ) is initialized to for each Ai . Since A , A ,
A is now scheduled. The value of T remains unchanged at . Since A is a predecessor of A , f (A ) is increased as follows: f (A ) ← max{ f (A ), f (A )} = max{, } = . Since A has no successors remaining to be scheduled, f (A ) is changed as follows: f (A ) ← f (A ) + c(A ) = + = . A is now scheduled. The value of T is increased from to . Since A and A are predecessors of A , their f values are increased as follows: f (A ) ← max{ f (A ), f (A )} = max{, } = f (A ) ← max{ f (A ), f (A )} = max{, } = .
Now A and A do not have any successors remaining to be scheduled. Their f values are increased again as follows: f (A ) ← f (A ) + c(A ) = + = f (A ) ← f (A ) + c(A ) = + = . These two operations are now scheduled. The value of T is increased from to . Since A is a predecessor of A , f (A ) is increased from to f (A ) or . Since A now has no successors remaining to be scheduled, f (A ) is increased again by c(A ) to + , or . The value of T is also increased from to . Now that the value of T (total number of cycles), and the value of f (Ai ) for each Ai are known, the ALAP labels are computed from the equation L(A) = T +−f (A). The values of f (Ai ) and L(Ai ) for each Ai are shown in Table . The total number of instructions in the schedule is maxi L(Ai ) = . The instructions are (I , I , . . . , I ), where I , I , I , I are empty, and
I = {A , A }.
List Scheduling Algorithm. Initially, V has all operations. The initial value of the mobility of each operation is listed in Table . At step , S = {A , A }. The operations are arranged in the order (A , A ), since μ(A ) < μ(A ). Both operations can be put in instruction I , since there is no resource conflict between them. At step , S = {A }. So, A goes into I . Similarly, A goes into I . At step , S = /. At step , S = {A , A }. The operations are arranged in this order, since μ(A ) < μ(A ). Only A can be put into I . Take A out of S and decrease μ(A ) to . At step , S = {A , A }. The operations are arranged in the order (A , A ), since μ(A ) < μ(A ). Only A can be put into I . Decrease μ(A ) to . At step , S = {A , A }. The operations are arranged in the order (A , A ), since μ(A ) < μ(A ). Both operations can be put in instruction I since there is no resource conflict between them. The total number of instructions in the schedule for B is . Those instructions are (I , I , . . . , I ), where I = {A , A }, I = {A }, I = {A }, I = {A },
I = {A }, I = {A , A }.
Total number of cycles = [ + max{c(A ), c(A )} − ] = . Linear Algorithm. Initially, t = b = . Put operation A in instruction I . Since A has no predecessors and does not conflict with A , put A also in I . Operation A has only one predecessor, namely, A . Since c(A ) = , A can be put in instruction I . Increase b to . The sole predecessor of A is A and c(A ) = . Hence, A can go into instruction I . Increase b to . Operation A also has A as its only predecessor. But, A cannot go into I , since it conflicts with A . So, put A in I and increase b to . Operation A has two predecessors: A and A . The earliest instruction that can take A without violating dependence constraints is I . But I already has A and it conflicts with A . So, put A in I and increase b To . Now it is easy to see that A can go into I and A into I . Increase b to . The total number of instructions in the schedule for B is . Those instructions are (I , I , . . . , I ), where I = {A , A }, I = {A }, I = {A }, I = /,
I = {A }, I = {A , A }, I = {A }, I = {A }, I = {A },
I = /,
I = {A },
I = {A }, I = {A }, I = {A }.
Total number of cycles = [ + c(A ) − ] = .
Related Entries
Code Generation
Loop Nest Parallelization
Parallelization, Automatic
Unimodular Transformations
Bibliographic Notes and Further Reading Basic block parallelization was first studied in the context of microprogramming. The book by Agerwala and Rauscher [] gives a good introduction to microprogramming. The theory of job scheduling in operations research turned out to be a rich source of algorithms for the microprogramming research community [, , ]. For linear analysis and list scheduling as covered in this entry, a good place to start is the paper by Landskov and others []. (The algorithms presented above try to factor in explicitly the cycle times of operations.) Other standard references are [, , , ]. Only two algorithms are given in this entry for the limited resource case; there are many more. For example, for Force-directed Scheduling, see [].
The ten references listed here constitute a small percentage of what is available.
Bibliography
1. Agerwala AK, Rauscher TG () Foundations of microprogramming architecture, software, and applications. Academic, New York
2. Agerwala T (Oct ) Microprogram optimization: a survey. IEEE Trans Comput C-():–
3. Coffman EG Jr () Computer and job-shop scheduling theory. Wiley, New York
4. Conway RW, Maxwell WL, Miller LW () Theory of scheduling. Addison-Wesley, Reading
5. Dasgupta S, Tartar J (Oct ) The identification of maximal parallelism in straight-line microprograms. IEEE Trans Comput C-():–
6. Gonzalez MJ Jr (Sep ) Deterministic processor scheduling. ACM Comput Surv ():–
7. Hu TC (Nov-Dec ) Parallel sequencing and assembly line problems. Oper Res ():–
8. Landskov D, Davidson S, Shriver B, Mallett PW (Sep ) Local microcode compaction techniques. Comput Surv ():–
9. Paulin PG, Knight JP (June ) Force-directed scheduling for the behavioral synthesis of ASICs. IEEE Trans CAD Integr Circuits Syst ():–
10. Ramamoorthy CV, Chandy KM, Gonzalez MJ (Feb ) Optimal scheduling strategies in a multiprocessor system. IEEE Trans Comput C-():–
Parallelization, Loop Nest Loop Nest Parallelization
ParaMETIS METIS and ParMETIS
PARDISO
Olaf Schenk, Klaus Gärtner
University of Basel, Basel, Switzerland
Weierstrass Institute for Applied Analysis and Stochastics, Berlin, Germany
Definition PARDISO, short for “PARallel DIrect SOlver,” is a thread-safe software library for the solution of large
sparse linear systems of equations on shared-memory multicore architectures. It is written in Fortran and C and it is available at www.pardiso-project.org. The solver implements an efficient supernodal method, which is a version of Gaussian elimination for large sparse systems of equations, especially those arising, for example, from the finite element method or in nonlinear optimization. It is the only sparse solver package that supports all kinds of matrices such as complex, real, symmetric, nonsymmetric, or indefinite. PARDISO can be called from various environments including MATLAB (via MEX), Python (via pypardiso), C/C++, and Fortran. PARDISO version .. was released in October .
Discussion
Introduction
The solution of large sparse linear systems lies at the heart of many calculations in computational science and engineering and is also of increasing importance in computations in the medical imaging and financial sectors. Today, systems of equations with millions to hundreds of millions of unknowns are solved. To do this within reasonable time requires efficient use of powerful parallel computers and advanced combinatorial and numerical algorithms based on direct or approximate direct factorizations. To date, only very limited software for such large systems is generally available. The PARDISO software addresses this issue. In this chapter, some important combinatorial aspects and the main algorithmic features for solving sparse systems are reviewed. The algorithmic improvements of the past years have reduced the time required to factor general sparse matrices by almost three orders of magnitude. Combined with significant advances in the performance-to-cost ratio of computing hardware during this period, current sparse solver technology makes it possible to solve quickly and easily problems that might have been considered far too large until recently. This chapter discusses both the basic and the latest developments in sparse direct solution methods that have led to modern LU decomposition techniques. The PARDISO development started in the context of a PhD project [] at ETH Zurich in Switzerland.
The first aim was to improve parallel sparse factorization methods for highly ill-conditioned matrices arising in the semiconductor device simulation area. The close collaboration with the simulation community resulted in a large number of test cases and in industrial use of the solver. The library version of PARDISO has been publicly available since March . Ongoing research projects resulted in the current version, available since October . This new version also includes novel state-of-the-art incomplete factorization-type preconditioners. The results of the research have appeared in several scientific journals, including the paper “On Large Scale Diagonalization Techniques for the Anderson Model of Localization” in the SIGEST section of the SIAM Review []. It is the purpose of this chapter to describe the main combinatorial and numerical algorithms used in an efficient parallel solver.
Sparse Gaussian Elimination in PARDISO
To introduce the notation, the LU factorization of a nonsymmetric matrix A with pivoting is described. A simple description of the algorithm for solving sparse linear equations by sparse Gaussian elimination in PARDISO is as follows:
● Compute the triangular factorization Pr Dr Pf A Pf^T Dc Pc = LU. Here Dr and Dc are diagonal matrices to equilibrate the system, Pf is a permutation matrix that minimizes the number of nonzeros in the factor, and Pr and Pc are permutation matrices. Premultiplying A by Pr reorders the rows of A, and postmultiplying A by Pc reorders the columns of A. Pr and Pc are chosen to enhance sparsity, numerical stability, and parallelism. L is a unit lower triangular matrix with Lii = 1, and U is an upper triangular matrix.
● Solve AX = B by evaluating

X = A^(-1) B = (Pf^(-1) Dr^(-1) Pr^(-1) L U Pc^(-1) Dc^(-1) Pf^(-T))^(-1) B,
X = Pf^T Dc Pc U^(-1) L^(-1) Pr Dr Pf B.

This is done efficiently by multiplying from right to left in the last expression: the rows of B are permuted with Pf and scaled by Dr. Multiplying by Pr permutes the rows of Dr Pf B. Multiplying by L^(-1), that is, forming L^(-1)(Pr Dr Pf B), means solving triangular systems of equations with
nr right-hand sides with the matrix L by substitution. Similarly, multiplying by U^(-1), that is, forming U^(-1)(L^(-1)(Pr Dr Pf B)), means solving triangular systems with U. For symmetric indefinite matrices, PARDISO performs a numerical factorization into LDL^T, in which the factorization is also stabilized by extended numerical pivoting techniques. In addition to the complete factorization, advanced support for incomplete inverse-based factorization preconditioners, in which the factors are computed approximately, is also included in PARDISO. The efficient computation of the LU decomposition is of utmost importance in Gaussian elimination. Typically, a compact representation of the elimination tree can be used to derive all information concerning fill-in and numerical dependencies. In particular, the fact that pivoting and the factorization must be interlaced requires a completely different treatment than in the case without pivoting. Consider the situation when computing the LU decomposition column by column.
for k = 1, . . . , n
    update column k of L and U via the equation
        A(1:n, k) = L(1:n, 1:k−1) U(1:k−1, k) + L(1:n, k) u(k,k)
    step 1. solve L(1:k−1, 1:k−1) U(1:k−1, k) = A(1:k−1, k)
    step 2. L(k:n, k) := A(k:n, k)
            for i < k such that u(i,k) ≠ 0:
                L(k:n, k) := L(k:n, k) − L(k:n, i) u(i,k)
    step 3. u(k,k) = l(k,k)
            L(k:n, k) := L(k:n, k) / u(k,k)
Note that it is easy to add pivoting after step 2 by interchanging l(k,k) with some sufficiently large |l(m,k)|, where m ≥ k, before u(k,k) is defined. The art of efficiently computing column k of L and U lies in how the sparse forward solve L(1:k−1, 1:k−1) U(1:k−1, k) = A(1:k−1, k) in step 1 is implemented, and in a fast update of L(k:n, k) in step 2. The dense LU decomposition as described so far can be implemented in a way that makes heavy use of level- BLAS.
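For illustration, a plain scalar version of the three steps on a dense column-major matrix might look as follows. This is only a sketch of the column-by-column scheme with partial pivoting added after step 2; it is not PARDISO's sparse, supernodal, BLAS-based implementation.

#include <math.h>

/* a is an n-by-n column-major matrix, overwritten by the factors:
   strict lower triangle = L (unit diagonal implied), upper triangle = U.
   piv[k] records the row interchanged with row k. */
void dense_column_lu(double *a, int *piv, int n)
{
    for (int k = 0; k < n; k++) {
        /* step 1: forward solve L(0:k-1,0:k-1) U(0:k-1,k) = A(0:k-1,k) */
        for (int i = 0; i < k; i++)
            for (int j = 0; j < i; j++)
                a[i + k*n] -= a[i + j*n] * a[j + k*n];

        /* step 2: update A(k:n-1,k) with the columns of L computed so far */
        for (int i = 0; i < k; i++)
            for (int r = k; r < n; r++)
                a[r + k*n] -= a[r + i*n] * a[i + k*n];

        /* pivoting after step 2: bring a large |l(m,k)|, m >= k, to the
           diagonal before u(k,k) is defined */
        int m = k;
        for (int r = k + 1; r < n; r++)
            if (fabs(a[r + k*n]) > fabs(a[m + k*n])) m = r;
        piv[k] = m;
        if (m != k)
            for (int c = 0; c < n; c++) {
                double t = a[k + c*n]; a[k + c*n] = a[m + c*n]; a[m + c*n] = t;
            }

        /* step 3: u(k,k) is the current diagonal entry; scale column of L */
        for (int r = k + 1; r < n; r++)
            a[r + k*n] /= a[k + k*n];
    }
}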
One important aspect that allows raising efficiency and speeding up the sparse numerical factorization is the recognition of an underlying block structure with dense submatrices caused by the factorization and the fill. The block structure makes it possible to collect parts of the matrix in dense blocks and to treat them together using higher levels of BLAS. As a consequence of the LU decomposition, parts of the triangular factors can be encountered that are dense or that become dense through the factorization. This key structure in sparse Gaussian elimination is based on a supernodal representation of the columns and rows in the factors L and U. A sequence k, k + 1, . . . , k + s − 1 of s consecutive columns in L that form a dense lower triangular matrix is called a supernode. The figures below illustrate the underlying dense block structure. Supernodes in PARDISO are stored in rectangular dense matrices. Besides the storage scheme as dense matrices, the nonzero row indices for these blocks need only be stored once. Next, the use of dense submatrices allows the use of level- BLAS routines. To understand this, one can easily verify that the update process

L(k:n, k) := L(k:n, k) − L(k:n, i) u(i,k)
PARDISO. Fig. Supernodes in L, symmetric case (“○” denotes fill-in)

PARDISO. Fig. Supernodes in L + U, general case (“○” denotes fill-in)
that computes L(k:n, k) in the algorithm can easily be extended to the block case. Assume that the columns 1, . . . , k + s − 1 are collected in p supernodes with column index sets K1, . . . , Kp, the last of which is Kp = {k, . . . , k + s − 1}. Then updating L(k:n, Kp) can be rewritten as

L(k:n, Kp) := L(k:n, Kp) − Σi L(k:n, Ki) U(Ki, Kp),

where the sum has to be taken over all i in the block version of the elimination tree. Depending on whether one would like to compute the diagonal block L(Kp, Kp) as a full or as a lower triangular matrix, one is able to use level- BLAS or even level- BLAS subroutines. This allows exploiting machine-specific properties, such as caches, to accelerate the computation. As discussed above, dynamic pivoting has been a central tool by which nonsymmetric sparse linear solvers gain stability. Therefore, improvements in speeding up direct factorization methods were limited by the uncertainties that arise from pivoting. Certain techniques, like the column elimination tree [], have been useful for predicting the sparsity pattern despite pivoting. However, in the symmetric case, the situation becomes more complicated, since only symmetric reorderings, applied to both columns and rows, are required, and no a priori choice of pivots is given. This makes it almost impossible to predict the elimination tree in a sensible manner, and the use of cache-oriented level- BLAS is impossible. With the introduction of symmetric maximum weighted matchings [] as an alternative to complete pivoting [], it is now possible to treat symmetric indefinite systems similarly to symmetric positive definite systems. This makes it possible to predict fill using the elimination tree, and thus to set up the data structures that are required to predict dense submatrices (also known as supernodes). This in turn means that one is able to exploit level- BLAS applied to the supernodes. Consequently, the classical Bunch–Kaufman pivoting approach needs to be performed only inside the supernodes. This approach has recently been successfully implemented in the symmetric indefinite version of PARDISO []. As a major consequence of this novel approach, the sparse indefinite solver has been improved to become almost as efficient as its symmetric positive definite counterpart.
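The block update above maps directly onto a single level-3 BLAS call; the following sketch assumes the supernodes are packed as dense column-major blocks and uses the CBLAS interface for illustration, which is not how PARDISO's internals are organized.

#include <cblas.h>

/* Perform  L(k:n, Kp) -= L(k:n, Ki) * U(Ki, Kp)  for one supernode pair:
     Lki  : rows x wi  = L(k:n, Ki)
     Ukip : wi   x wp  = U(Ki, Kp)
     Lkp  : rows x wp  = L(k:n, Kp), updated in place */
void supernode_update(int rows, int wi, int wp,
                      const double *Lki, int ldLki,
                      const double *Ukip, int ldUkip,
                      double *Lkp, int ldLkp)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                rows, wp, wi,
                -1.0, Lki, ldLki, Ukip, ldUkip,
                 1.0, Lkp, ldLkp);
}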
Finally, iterative refinement is the last option to test and enhance the precision of the solution. To encourage the use of L and U as preconditioner in solving continuously dependent families of problems, PARDISO also offers a CG and CGS branch.
Reordering Algorithms and Software in PARDISO
The efficiency of a direct solver depends strongly on the order in which the variables of the matrix are eliminated. This largely determines the amount of computational work executed during the numerical factorization and hence the overall time for the solution. Furthermore, the elimination order of the variables also determines the number of entries in the computed factors and the amount of main storage required. Experiments have shown that the multiple minimum degree (MMD) algorithm of J. Liu [] is very well suited for 2D problems, but nested-dissection-type orderings do a much better job on larger problems from 3D discretizations. Thus, the METIS ordering package from Karypis and Kumar [] and constrained minimum fill-in orderings developed by Schenk and Gärtner [] have
been added to PARDISO. In a somewhat more loosely coupled way, orderings from other packages such as approximate minimum degree [] can also be used.
Parallelization Strategies in PARDISO
Parallelism can be expressed by the elimination tree: different branches are independent (see the figure below). The length of each vertical line is proportional to the size of a dense diagonal block, and the height counted in levels characterizes the longest chain of sequentially dependent steps. Memory writes should be minimized, and asynchronous scheduling is possible in PARDISO on SMP and NUMA architectures; it is used to reduce load-balancing requirements. Hence, the small amount of synchronization data is passed toward the root (or to the right in L, U), while the numerical data is read from the left of the current supernode. Especially large supernodes close to the root of the elimination tree are split into panels to increase the number of parallel tasks. PARDISO uses METIS by default to generate nested dissection permutations and maps all sparse computations onto dense matrix operations to achieve level- BLAS performance for the numerical factorization. Scheduling
PARDISO. Fig. Typical elimination tree with a MMD reordering
distinguishes two levels in the elimination tree to reduce synchronization events: updates within the lower level are conflict free and hence need not be synchronized. In PARDISO, complete lower-level subtrees of the elimination tree are mapped onto a single core of the multiprocessing architecture. The columns and rows associated with such a complete subtree are factorized independently of the other subtrees. The nodes near the root of the elimination tree normally involve more computation than supernodes farther away from the root. In practical examples, most of the floating-point computations are performed on the supernodes close to the root of the tree. Unfortunately, the number of independent supernodes near the root of the tree is small, and so there is less parallelism to exploit. For example, the root in the elimination-tree figure has only two neighboring nodes, and hence only two processors would perform the corresponding work while the other processors remain idle. The introduction of panels and the execution of partial updates as tasks for supernodes close to the root increase the parallelism. An additional constraint that has to be taken into account is that PARDISO is designed to solve a wide range of application problems including symmetric and nonsymmetric matrices, and numerical pivoting is performed within the numerical factorization. This means that a purely static analysis of the sparsity pattern and a static allocation of supernodes to cores could be very inefficient if numerical pivoting is required. As a solution, a novel left-right looking factorization has been implemented that uses a dynamic allocation of threads during the numerical factorization. This has the benefit of enabling the code to scale nearly linearly on multicore architectures. The parallel execution in PARDISO does not change the constant in the leading order of the complexity bound compared with that of the sequential algorithm.
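A schematic of tree-level parallelism is sketched below. It uses OpenMP tasks purely for illustration; PARDISO's actual scheduler, panel splitting, and two-level scheme are more elaborate, and the TreeNode structure is an assumption of this sketch.

#include <stddef.h>

typedef struct TreeNode {
    struct TreeNode **child;   /* independent subtrees */
    int nchild;
    /* ... supernode data ... */
} TreeNode;

/* numerical kernel for one supernode; body omitted in this sketch */
static void factor_supernode(TreeNode *node) { (void)node; }

/* Children (independent branches) are factored as parallel tasks;
   a node is processed only after all of its children have finished. */
void factor_subtree(TreeNode *node)
{
    for (int i = 0; i < node->nchild; i++) {
        TreeNode *c = node->child[i];
        #pragma omp task
        factor_subtree(c);
    }
    #pragma omp taskwait
    factor_supernode(node);
}

/* launched once as:
   #pragma omp parallel
   #pragma omp single
   factor_subtree(root);
*/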
Approximate Sparse Gaussian Factorization in PARDISO The solver also includes a novel preconditioning solver []. The preconditioning approach for symmetric indefinite linear system is based on maximum weighted matchings and algebraic multilevel incomplete LDLT factorization. These techniques can be seen as a complement to the alternative idea of using more complete
pivoting techniques for the highly ill-conditioned symmetric indefinite matrices. In considering how to solve the linear systems in a manner that exploits sparsity as well as symmetry, a diverse range of algorithms is used that includes preprocessing the matrix with symmetric weighted matchings, solving the linear system with Krylov subspace methods, and accelerating the linear system solution with multilevel preconditioners based upon incomplete inverse-based factorizations.
General Software Issues in PARDISO
The PARDISO package was originally designed for structurally symmetric matrices arising in the semiconductor device simulation area but, in the newer versions, all other types of matrices, such as real and complex symmetric, real or complex nonsymmetric, or complex Hermitian systems, are permitted. The PARDISO software is written in C and Fortran. However, in recognition that some users prefer a MATLAB or Python programming environment, an appropriate interface has been developed for these programming languages. It requires OpenMP threading capabilities and makes heavy use of level- BLAS and LAPACK subroutines. For the parallel version of PARDISO, a single-threaded version of the level- BLAS and LAPACK routines is used. PARDISO has been ported to a wide range of computers, from previous generations of vector computers such as Cray or NEC, to all kinds of multicore architectures including Intel, AMD, IBM, SUN, and SGI [], and also to recent throughput manycore processors such as GPUs from NVIDIA []. The PARDISO package has excellent performance relative to other parallel sparse solvers. An extensive evaluation of the performance of the numerical factorization in comparison to a wide range of other sparse direct linear solver packages is given by Gould, Hu, and Scott []. Other independent comparisons can be found in []. PARDISO is currently heavily used in linear and nonlinear optimization solvers. Recent results can be found in []. The current version of the solver and a manual including several examples are available at www.pardiso-project.org.
Example The example illustrates the complexity gap between numerical factorization and solution in the sparse direct
solution process. As shown in the table below, the numerical factorization is the most time-consuming phase. This is demonstrated on a 3D eigenvalue problem. The symmetric eigenvalue problem Ax = λBx is chosen with a number of unknowns close to the size limit of today’s large-memory SMP machines. The eigenvalue problem is solved by inverse iteration, polynomial acceleration, a few different spectral shifts, and the same number of factorizations. On average, five eigenvalues are computed per factorization. The spectral shifts can be chosen to result in
PARDISO. Table  The serial computational complexity of the various phases of solving a sparse system of linear equations arising from 2D and 3D constant node-degree graphs with n vertices

Phase | Dense | 2D complexity | 3D complexity
Reordering | – | O(n) | O(n)
Symbolic factorization | – | O(n log n) | O(n^(4/3))
Numerical factorization | O(n^3) | O(n^(3/2)) | O(n^2)
Triangular solution | O(n^2) | O(n log n) | O(n^(4/3))
highly ill-conditioned but sufficiently regular sparse linear systems. The symmetric indefinite linear systems are solved by the Bunch-Kaufman pivoting. The quadratic complexity of the sparse direct factorization is shown in Fig. .
Future Research Directions
Clearly, the quadratic complexity bound limits the problem size in 3D for sparse direct methods. However, from the user perspective, the robustness and the nearly black-box behavior are very convenient. The emergence of multicore architectures and scalable petascale architectures does not change either point. Instead, it motivates the development of novel algorithms and techniques that emphasize both concurrency and robustness for the solution of sparse linear systems. Since direct solvers do not generally scale to the largest problems and machine configurations, efficient application of preconditioned iterative solvers is warranted but involves more a priori knowledge. These hybrid solvers must optimize parallel performance, processor (serial) performance, as well as memory requirements, while being robust across specific classes of applications and systems.
[Figure: log-log plot of CPU-time-factor(n), operations(n), CPU-time-solve(n), and fill_in(n) against log(n/23494), with the reference curves c n^2 and c n^(4/3).]
PARDISO. Fig. Unstructured tetrahedral meshes (a sequence of six self-similar meshes with increasing numbers of tetrahedra) for a discrete Laplace problem. The left graphic shows the measured times for the numerical factorization and the solution, the number of floating-point operations, and the complexity bounds. The first eigenvector is shown in the right graphic. The self-similar tetrahedral meshes are generated by TetGen, http://www.tetgen.org
The research groups at Purdue University and the University of Basel are currently developing a new parallel solver, PSPIKE [], that combines the desirable characteristics of direct methods (robustness) and of effective iterative solvers (low computational cost), while alleviating their drawbacks (limited scalability, memory requirements, lack of robustness). The hybrid solver is based on the general sparse solver PARDISO and the SPIKE family of hybrid solvers []. The resulting PSPIKE algorithm is an extension of PARDISO to distributed-memory architectures. Results can be found in, for example, [].
Related Entries
BLAS (Basic Linear Algebra Subprograms)
Graph Partitioning
Load Balancing, Distributed Memory
Linear Algebra Software
Nonuniform Memory Access (NUMA) Machines
Shared-Memory Multiprocessors
LAPACK
Further Reading . Davis T () Direct methods for sparse linear systems. Society for industrial mathematics, ISBN:
Bibliography . Amestoy R, Davis TA, Duff IS () An approximate minimum degree ordering algorithm. SIAM J Matrix Anal Appl :– . Demmel JW, Eisenstat SC, Gilbert JR, Li XS, Liu JWH () A supernodal approach to sparse partial pivoting. SIAM J Matrix Anal Appl :– . Duff IS, Koster J () The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J Matrix Anal Appl ():– . Gould NIM, Hu Y, Scott JA () A numerical evaluation of sparse direct solvers for the solution of large sparse, symmetric linear systems of equations. ACM Trans Math Software (TOMS) ():– . Grasedyck L, Hackbusch W, Kriemann R () Performance of H-LU preconditioning for sparse matrices. Comput Methods Appl Math ():– . Hagemann M, Schenk O () Weighted matchings for preconditioning of symmetric indefinite linear systems. SIAM J Sci Comput :– . Karypis G, Kumar V () A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput :– . Liu J () Modification of the minimum-degree algorithm by multiple elimination. ACM Trans Math Software ():–
. Ng E, Peyton B () Block sparse Cholesky algorithms on advanced uniprocessor computers. SIAM J Sci Comput : – . Manguoglu M, Sameh A, Schenk O () PSPIKE Parallel sparse linear system solver. In: Proceedings of the th international Euro-Par conference on parallel processing. Lecture Notes in Computer Science, vol , pp –, DOI ./--- (vol , pp –) . Polizzi E, Sameh AH () A parallel hybrid banded system solver: the SPIKE algorithm. Parallel Comput ():– . Schenk O () Scalable parallel sparse LU factorization methods on shared memory multiprocessors. PhD thesis, ETH Zürich . Schenk O, Bollhöfer M, Römer RA () On large-scale diagonalization techniques for the Anderson model of localization. SIAM Rev :– . Schenk O, Christen M, Burkhart H () Algorithmic performance studies on graphics processing unit. J Parallel Distrib Comput :– . Schenk O, Gärtner K () Solving unsymmetric sparse systems of linear equations with PARDISO. J Future Gener Comput Syst ():– . Schenk O, Gärtner K () On fast factorization pivoting methods for symmetric indefinite systems. Electron Trans Numer Anal :– . Schenk O, Wächter A, Hagemann M () Matching-based preprocessing algorithms to the solution of saddle-point problems in large-scale nonconvex interior-point optimization. Comput Optim Appl (–):–
PARSEC Benchmarks The PARSEC benchmarks [] are multithreaded codes that represent important applications of multicores. PARSEC stands for Princeton Application Repository for Shared-Memory Computers. Currently, the benchmark suite contains 13 programs: blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, and x264.
Bibliography . Bienia C () Benchmarking modern multiprocessors. Ph.D. thesis, Princeton University
Partial Computation Trace Theory
Particle Dynamics N-Body Computational Methods
Particle Methods N-Body Computational Methods
Partitioned Global Address Space (PGAS) Languages PGAS (Partitioned Global Address Space) Languages
PASM Parallel Processing System Howard Jay Siegel, Bobby Dalton Young Colorado State University, Fort Collins, CO, USA
Definition PASM was a partitionable mixed-mode parallel system designed and prototyped in the s at Purdue University to study three dimensions of dynamic reorganization: mixed-mode parallelism, partitionability, and flexible interprocessor communications.
Discussion Introduction PASM was a partitionable mixed-mode parallel system designed and prototyped in the s at Purdue University [, ]. Research was conducted about numerous software, hardware, parallel algorithm, and application aspects of PASM. In the s, two of the dominant organizations for parallel machines were “SIMD” and “MIMD” []. Be forewarned that our use of the term “SIMD” is more general than the way it is used with current multicore systems, where it refers to operating on subfields of a long data word within a single processor. We will use the following definition: An SIMD (single instruction stream, multiple data stream) machine consists
of N processor-memory pairs, an interconnection network, and a control unit. The control unit broadcasts a sequence of instructions to the N processors that execute these instructions in lockstep (this is the “single instruction stream”). All “enabled” (active) processors execute the same instruction at the same time, but each processor does it on its own data. The operands for these instructions are fetched from the local memory associated with each processor (the fetching of operands from the collection of local memories form the “multiple data streams”). The interconnection network supports interprocessor communication. Thus, the SIMD definition used here assumes a collection of separate processors operating in lockstep, with all enabled processors following the same single instruction stream, but each processor operating on its own local data, resulting in multiple data streams. Examples of SIMD systems that have been constructed are Illiac IV [], MasPar MP- and MP- [], and MPP []. A multiple-SIMD system can be structured as a single SIMD machine or as two or more independent SIMD machines of various sizes. The Thinking Machines CM- [] and original design of the Illiac IV [] are examples of MSIMD systems. The MIMD (multiple instruction stream, multiple data stream) mode of parallelism [] uses N independent processor-memory pairs that can communicate via an interconnection network (i.e., a multicomputer). Each processor fetches its own instructions and its own operands from its local memory; thus, there are “multiple instruction streams” and “multiple data streams.” The nCUBE [] and IBM RP [] are examples of MIMD systems that have been constructed. One mode of operation possible for MIMD is SPMD (single program, multiple data stream), where all processors independently execute the same program on different data sets []. A mixed-mode system can dynamically change between the SIMD and MIMD modes of parallelism at instruction-level granularity with negligible overhead. This allows different modes of parallelism to be used to execute various portions of an algorithm. Because the mode of parallelism has an impact on performance (see Fig. ), a mixed-mode system may outperform a singlemode machine with the same number of processors for a given algorithm. A partitionable mixed-mode system can dynamically restructure to form independent
PASM Parallel Processing System. Fig. Trade-offs between the SIMD and MIMD modes of parallelism []. Processing Elements (PEs) are processor-memory pairs. Examples of variable-time instructions are floating-point operations on the prototype’s Motorola MC processors, and function calls (such as floating-point trigonometric operations) on current GPUs
or communicating submachines of various sizes, where each submachine can independently perform mixed-mode parallelism (e.g., TRAC []). PASM is a PArtitionable-SIMD/MIMD system concept that was developed at Purdue University as a design for a large-scale partitionable mixed-mode machine based on commodity microprocessors []. PASM used a flexible multistage interconnection network for interprocessor communication. Thus, PASM could be dynamically reorganized along these three dimensions: partitionability, mode of parallelism, and connections among processors. This ability to be reorganized
allowed the system to match the computational structure of a problem, and also provided for fault tolerance in many situations. This fault tolerance is a form of robustness [, ], where the robust behavior requirement is continued system operation given the uncertainty of any single component failure, and quantified as the portion of the system that can still be used after a fault occurs. A small-scale prototype was completed in as a proof-of-concept machine for the PASM design ideas, and was still running in , with over , hours of execution time logged (Fig. ) []. It was used
PASM Parallel Processing System. Fig. A photograph of the completed PASM prototype circa and some of the PASM team. Left to right: Pierre Perrot (Technician), Tom Casavant (Professor), Wayne Nation (PhD Student), H.J. Siegel (Professor, team leader)
for research and in a parallel programming course at Purdue. The goal of the PASM research team was to design, develop, and build a unique research tool for studying the combined three dimensions of dynamic reorganization mentioned above: mixed-mode parallelism, partitionability, and flexible interprocessor communications. An overview of the organization of PASM is given in section “The Overall PASM Organization.” Section “The Parallel Computation Unit” describes the processing elements, memory, and interconnection network in the Parallel Computation Unit. The Memory Storage and Memory Management Systems are discussed in section “The Memory Storage and Management Systems.” Section “Using the PASM System” mentions some PASM prototype software tools and application studies. We conclude in section “Conclusions” with a list of advantages of the PASM approach. A list of PASM-related publications is available at http://hdl.handle.net//.
The Overall PASM Organization The PASM concept was a distributed-memory machine with a computational engine consisting of processormemory pairs referred to as PEs (Processing Elements). The PASM design concepts could support at least , PEs. The small-scale prototype built at Purdue University ( Motorola MC , processors, in
the computational engine) supported experimentation with all three dimensions of reconfigurability mentioned earlier, and produced insights not observed from earlier simulations and theoretical studies. The PASM system concept consists of six basic components shown in Fig. . The System Control Unit is the overall system coordinator and is the part of PASM with which the user directly interacts. The Q PE Controllers (PECs) serve as the PE control units in SIMD mode and may coordinate the PEs in MIMD mode. The Parallel Computation Unit contains the N = n PEs, physically numbered to N − , and the interconnection network used by the PEs. The Memory Storage System provides secondary storage for the PEs, and consists of N/Q secondary storage devices, each with an associated processor for file management. It is used to store data files for SIMD mode and both program and data files for MIMD mode. The Memory Management System contains multiple processors for controlling the transferring of files between the Memory Storage System and the PEs. Control Storage consists of a secondary storage device and an associated file server processor. It supports the PECs and the System Control Unit by holding all code and data used by these components, including the instructions to be broadcast to the PEs from the PECs in SIMD mode. The PASM prototype had a total of processors: N = PEs, Q = PECs, N/Q = Memory Storage System processors, four Memory Management
[Figure: block diagram of the PASM design concept, showing the System Control Unit, Control Storage, Processing Element Controllers, Memory Management System, Memory Storage System, and Parallel Computation Unit.]
PASM Parallel Processing System. Fig. The high-level architectural organization of the PASM design concept
System processors, the Control Storage processor, and the System Control Unit. The tasks to be performed by the System Control Unit include support for program development, job scheduling, general system coordination, management of system configuration and partitioning, assignment of user jobs to submachines, and connection to the host computer network. The hardware needed to combine and synchronize the PECs and PEs to form SIMD submachines of various sizes resides in the System Control Unit. Its functions include combining information from multiple PECs when collective conditionals, discussed in the next section, are performed. It also is responsible for coordinating the loading of the PE memories from the Memory Storage System with the loading of the PEC memories from the Control Storage. We now describe how the PECs are connected to the PEs, and how this connection scheme is used to form independent mixed-mode submachines. The organization we use provides an efficient PEC/PE interface and supports the partitioning of the PE Interconnection Network. The PECs, shown in Fig. , are the multiple control units required to form a multiple-SIMD system. There are Q = 2^q PECs, physically numbered from 0 to Q − 1. Each PEC controls a fixed group of N/Q PEs in the Parallel Computation Unit. A PEC and its associated PEs form a PEC Group. PEC i is connected to the N/Q PEs whose low-order q bits of their PE physical number have the value i (for reasons discussed in section “The Parallel Computation Unit”). In an N = , system, Q may be ; for the N = prototype, Q = . Figure shows the composition of the physical number of a PE. The memory units in each PEC are double-buffered so that computation and memory I/O can be overlapped (see Fig. ). For example, the PEC processor can execute a job in one memory unit while the next job is preloaded into the other memory unit from PEC secondary storage (the Control Storage). In MIMD mode, a PEC fetches from its memory the instructions and data used to coordinate the operation of its PEs. In SIMD mode, a PEC fetches the instructions and common PE data from its memory units. In general, in SIMD mode, control-flow instructions are executed in the PEC, and data processing instructions are broadcast to the PEC's group of PEs. Submachines are formed by one or more PEC Groups. The partitioning rule in PASM is that the numbers of all PEs in a submachine of size 2^p must agree in their n − p low-order bit positions (for the reasons discussed in section “The Parallel Computation Unit”). The p high-order bits of the physical number of a PE correspond to the PE's logical number within the partition. Thus, a submachine containing R = 2^r PEC Groups (R∗N/Q = 2^p PEs), where 0 ≤ r ≤ q, is formed by combining the PEs connected to the R PECs whose addresses agree in their q − r low-order bits. The PECs within a submachine of R∗N/Q PEs are logically numbered from 0 to R − 1. For R > 1, the logical number of a PEC is the high-order r bits of its physical number. Similarly, the PEs assigned to a submachine are logically numbered from 0 to (R∗N/Q) − 1 (0 to 2^p − 1). The logical number of a PE is the high-order r + n − q = p bits of its physical number (see Fig. ). There is a maximum of Q submachines, each of size N/Q. Each submachine can operate as an independent mixed-mode system. PEs can switch between modes as often as desired; however, all PEs in a submachine must be in the same mode (SIMD or MIMD) at any point in time. When a submachine is operating in SIMD mode, the R PECs must execute and broadcast the same instructions, and all PEs in the submachine must be synchronized. The PECs are given the same instruction simply by loading the memory units of the R PECs with the same program. The PEs are synchronized by providing a small amount of special circuitry in the System Control Unit that coordinates instruction broadcasts among these PECs. Some advantages of this fixed PE to PEC mapping as compared with a dynamic PE to PEC interconnection (e.g., a crossbar switch []) include reducing
PASM Parallel Processing System. Fig. PASM Processing Element Controllers (PECs), and how they are connected to the Parallel Computation Unit (PCU) Processing Elements (PEs)
PEC/PE interface hardware, eliminating the overhead of maintaining a record of PE to PEC assignments, scheduling only Q PECs instead of N PEs, allowing the partitioning of the PE interconnection network into independent subnetworks (section “ The Parallel Computation Unit”), and supporting the structure of the efficient parallel primary to secondary memory connections (section “The Memory Storage and Management Systems”). The main disadvantage to this approach is that the size of each submachine must be a power of two, with a minimum of N/Q PEs. However, for PASM’s intended experimental environment, flexibility at a “reasonable” cost is the goal, not maximum PE utilization.
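The PE-numbering conventions just described reduce to simple bit operations. The following C sketch is only an illustration (the example values of n and q and the helper names pec_of, logical_number, and same_submachine are invented here, not part of PASM): it shows how a PE's PEC is given by the low-order q bits of its physical number, how membership in a submachine of 2^p PEs is checked on the n − p low-order bits, and how the logical number is given by the p high-order bits.

#include <stdio.h>

/* Illustrative constants only: N = 2^n PEs, Q = 2^q PECs. */
enum { n = 6, q = 2 };               /* example: N = 64 PEs, Q = 4 PECs */

/* PEC that controls a PE: the low-order q bits of the PE's physical number. */
int pec_of(int pe) { return pe & ((1 << q) - 1); }

/* Two PEs belong to the same submachine of 2^p PEs iff their physical
   numbers agree in the n - p low-order bit positions.                       */
int same_submachine(int pe1, int pe2, int p)
{
    int low_mask = (1 << (n - p)) - 1;
    return (pe1 & low_mask) == (pe2 & low_mask);
}

/* Logical number of a PE within a submachine of 2^p PEs:
   the p high-order bits of its physical number.                             */
int logical_number(int pe, int p) { return pe >> (n - p); }

int main(void)
{
    int pe = 37, p = 4;              /* a 16-PE submachine in a 64-PE system */
    printf("PE %d: PEC %d, logical number %d, same submachine as PE 5? %d\n",
           pe, pec_of(pe), logical_number(pe, p), same_submachine(pe, 5, p));
    return 0;
}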
The Parallel Computation Unit The Parallel Computation Unit, shown in Fig. , contains N = 2^n PEs and an interconnection network. Similar to the double-buffering for the PECs, two memory units are used in each PE so that computation and memory I/O can be overlapped. For example, the PE processor can execute a job in one memory unit while the next job is preloaded into the other memory unit from PE secondary storage (the Memory Storage System). These memory units compose the primary memory of the system. In SIMD or MIMD mode, each PE can perform indirect memory addressing using its own PE logical number or local data. Consider how PASM switches dynamically between the SIMD and MIMD modes of parallelism in our implemented prototype (Fig. ). In MIMD mode, a PE processor fetches instructions from its local memory by
PASM Parallel Processing System. Fig. Physical number of a Processing Element (PE) for a system where N = = PEs, Q = = Processing Element Controllers (PECs), and N/Q = = Memory Storage Units (MSUs). The PE is in a submachine of size R∗ N/Q = PEC Groups (corresponding to = p = PEs). For this example, the physical numbers of all PEs in this submachine agree in bit position
placing its program counter value on the address bus and latching the data placed on the data bus by the memory device holding that instruction word. A PE is forced into SIMD mode by executing a jump to the “logical” SIMD instruction space, which is not a physical space. Figure illustrates the address space of a PE, showing the logical SIMD instruction space, physical RAM, and I/O space. Logic in the PE detects read accesses to the logical SIMD instruction space, and any such access sends a SIMD instruction request to the PE's PEC. The SIMD instruction is issued by the submachine's PECs only after all the PEs of a submachine have requested the instruction (recall the System Control Unit has special hardware to synchronize the PECs that are part of the same submachine). The instructions are sent by the submachine PECs' Fetch Units to the PEs (Fig. ). All enabled PEs in the submachine
PASM Parallel Processing System. Fig. The PASM Parallel Computation Unit
PASM Parallel Processing System. Fig. How the Processing Element (PE) logical address space in the PASM prototype is used to support mixed-mode computation
then execute the instruction simultaneously, each PE on its own data. Thus, all the PEs of the submachine are synchronized when a SIMD instruction is broadcast. The PE continues to access instructions from the SIMD instruction space until an instruction is broadcast that causes each PE to jump to an instruction in its physical RAM (Fig. ), changing to MIMD mode. Such flexibility in mode switching allows mixed-mode programs to be written that change modes at instruction-level granularity with negligible overhead. The SIMD instruction fetch mechanism also can be used to support a form of MIMD “barrier synchronization” []. Masking schemes are used in SIMD mode to enable and disable PEs. The PASM system uses PE-address
masks (originated by the PASM project []) and data-conditional masks. A PE-address mask enables a set of PEs based solely on the logical addresses (numbers) of PEs; it does not depend on local PE data. A PE-address mask has n positions, where each position contains either a 0, a 1, or an X (“don't care”). A PE is enabled only if the binary representation of its number matches the mask. For example, for N = , [{X}{0}] is equivalent to [XXXXX0] and enables all even-numbered PEs. A negative PE-address mask enables all PEs whose addresses do not match the mask. For example, for N = , [{X}{00}] = [XXXX00] enables all PEs whose numbers are not multiples of four. PE-address masks are a convenient notation for enabling PEs in a large-scale parallel machine, and proved very useful in image processing applications. Data-conditional masks enable PEs based on some condition that is dependent on local PE data. The resulting data condition may be true in some PEs and false in other PEs. An example use of data-conditional masking is the where statement, the SIMD counterpart to the if-then-else statement. When such a where segment, with a <then-part> and an <else-part>, is executed, each PE independently evaluates the condition on its own local data. The PEs where the condition evaluates true execute the <then-part> while the remaining PEs are idle; then the PEs where the condition evaluates false
execute the <else-part>, and the PEs where the condition evaluates true are idle. This type of masking is used in many SIMD machines (e.g., CM-, MP-). Data-conditional and PE-address masks can be used together. There are situations in which the combined conditional status of all enabled PEs in a SIMD submachine is needed, e.g., if-none, if-any, and if-all. For example, if the PEs of a submachine are each examining a portion of an image to find certain objects, it may be necessary to know whether any of the PEs have found such an object. Because PASM is a partitionable system, operations such as these require conditional results to be communicated from the PEs to their PECs and subsequently combined among the PECs comprising a SIMD submachine. These operations are efficiently supported by providing a small amount of additional circuitry in the System Control Unit that combines PEC Group results according to the current system partitioning. The Interconnection Network allows PEs to communicate with each other. As described earlier, the partitioning rule in PASM requires that the physical numbers of all PEs in a submachine of 2^p PEs agree in their n − p low-order bit positions (see Fig. ). Thus, the p high-order bits of a PE's physical number form its logical number within a submachine of 2^p PEs. The low-order partitioning rule was chosen for PASM so that either the multistage cube [] or the augmented data manipulator (ADM) [] networks could be used. However, the multistage cube was selected because of its comparative cost-effectiveness. The multistage cube network has N input ports, N output ports, and contains n = log₂ N stages of N/2 two-input/two-output interchange boxes. Each interchange box can be set to one of the four states shown in Fig. . The network can be used in both the SIMD and MIMD modes of parallelism.
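The PE-address masking rule described above can also be stated compactly in code. The following C sketch is illustrative only (the function mask_enables and the mask-string encoding are assumptions for the example, not PASM notation): a PE is enabled if every non-X mask position matches the corresponding bit of its address, and a negative mask simply negates the result.

#include <stdio.h>

/* A PE-address mask has one position per address bit, each '0', '1', or 'X'
   ("don't care"); a PE is enabled only if its binary address matches.
   Mask strings are written most significant bit first (illustration only). */
int mask_enables(const char *mask, unsigned pe, int n_bits)
{
    for (int bit = n_bits - 1; bit >= 0; bit--, mask++) {
        int pe_bit = (pe >> bit) & 1;
        if (*mask == 'X') continue;            /* don't care            */
        if (*mask - '0' != pe_bit) return 0;   /* fixed bit must match  */
    }
    return 1;
}

int main(void)
{
    /* [XXXXX0] enables the even-numbered PEs of a 64-PE machine; the
       negated mask [XXXX00] enables PEs that are NOT multiples of four. */
    for (unsigned pe = 0; pe < 8; pe++)
        printf("PE %u: even mask %d, negative multiple-of-4 mask %d\n",
               pe, mask_enables("XXXXX0", pe, 6), !mask_enables("XXXX00", pe, 6));
    return 0;
}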
PASM Parallel Processing System. Fig. The four valid states of a multistage cube network interchange box
Figure shows the multistage cube network for N = . PE i is connected to network input port i and output port i. The stages are numbered consecutively from n − 1 at the input stage to 0 at the output stage. The upper interchange box output label is the same as the upper input, and the lower interchange box output label is the same as the lower input. The interconnection pattern between stages is such that at stage j the two links whose labels differ only in bit j are connected to the same interchange box. Figure shows the network set for a “permutation” connection (input i → output (i + 1) mod N) for an N = multistage cube network. The network can be controlled in a distributed fashion using routing tags for specifying permutations, one-to-one connections (e.g., PE a to PE b), one PE to many multicast connections, and combinations of these []. A network is partitionable if it can be divided into independent subnetworks of smaller sizes that have all the properties of the original network []. In general, a size N multistage cube network can be partitioned into multiple subnetworks of different sizes, where the size of each subnetwork is a power of two. For example, Fig. shows an N = network partitioned into two subnetworks by setting the interchange boxes in stage 0 to straight. Because stage 0 is straight, even-numbered inputs cannot reach odd-numbered outputs, and odd
PASM Parallel Processing System. Fig. A multistage cube network for N = Processing Elements (PEs) set to perform the “permutation” of sending data from PE i to PE (i + 1) mod N for 0 ≤ i < N
PASM Parallel Processing System. Fig. A multistage cube network for N = Processing Elements (PEs) partitioned into two independent subnetworks, each of size eight. Subnetwork A consists of the even-numbered input and output ports, and subnetwork B the odd
inputs cannot reach even outputs. The two resulting subnetworks are subnetwork A, which contains physical ports 0, 2, . . . , 14, and subnetwork B, which contains physical ports 1, 3, . . . , 15. Because each of these subnetworks is an independent multistage cube network (of size N/2 = 8), either or both subnetworks may be further partitioned by forcing all interchange boxes in stage 1 of the subnetwork to be straight. This process can be applied recursively. The Extra Stage Cube network [] (designed as a part of the PASM project) is a single-fault-tolerant variation of the multistage cube network. There is an extra stage of interchange boxes at the input, labeled stage n. Links whose labels differ in the 0th bit position are paired at these extra stage boxes in the same way that they are at stage 0. This Extra Stage Cube network fault-tolerant variation has all of the useful properties described above for the multistage cube, and was constructed in the PASM prototype. It is robust in the sense
that it can still provide all of the capabilities of a fully functional multistage network despite the failure of any single network component.
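The stage-j pairing property quoted above also explains both routing and partitioning: at stage j a message can keep or complement bit j of its link label, so a route from input s to output d fixes one destination bit per stage, and forcing a stage's boxes straight freezes that bit, splitting the network. The following C sketch is an illustration of that property only, not PASM's routing-tag hardware; the function name route and the example sizes are invented here.

#include <stdio.h>

/* At stage j the two links whose labels differ only in bit j meet at the same
   interchange box, so bit j of the destination can be fixed there.  Routing
   from input s to output d therefore adjusts one bit per stage, from stage
   n-1 down to stage 0.  Illustrative sketch only.                            */
void route(unsigned s, unsigned d, int n_stages)
{
    unsigned label = s;
    printf("%u", label);
    for (int j = n_stages - 1; j >= 0; j--) {
        /* set bit j of the current link label to bit j of the destination */
        label = (label & ~(1u << j)) | (d & (1u << j));
        printf(" -> %u", label);
    }
    printf("\n");
}

int main(void)
{
    /* N = 16: forcing every stage-0 box straight (never changing bit 0)
       keeps even inputs on even outputs and odd on odd, yielding the two
       independent size-8 subnetworks described above.                     */
    route(3, 10, 4);   /* prints 3 -> 11 -> 11 -> 11 -> 10 */
    return 0;
}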
The Memory Storage and Management Systems The Memory Storage System (Fig. ) is the secondary storage for the Parallel Computation Unit, storing data files in SIMD mode and both program and data files in MIMD mode (programs for SIMD mode are stored in the Control Storage). The Memory Storage System comprises N/Q independent Memory Storage Units (MSUs), numbered from 0 to (N/Q) − 1. Each of the N/Q MSUs is connected to the memory modules of Q PEs. An example for the prototype size of N = and Q = is shown in Fig. . The benefit of this MSU to PE connection scheme is that the memories of all N/Q PEs connected to any one PEC can be loaded or unloaded in one parallel block transfer. For
PASM Parallel Processing System. Fig. The organization of the PASM Memory Storage System for the prototype size of N = Processing Elements (PEs), Q = Processing Element Controllers (PECs), and N/Q Memory Storage Units (MSUs)
example, in Fig. , PEC ’s group of PEs (PEs , , , and ) are loaded by having MSU load PE , MSU load PE , MSU load PE , and MSU load PE , all simultaneously. In the general case, MSU i is connected to and stores files for the Q PEs whose n − q high-order bits of their PE physical numbers are equal to i (see Fig. ). This high-order mapping is used so that each of the N/Q PEs connected to a given PEC is connected to a different MSU. Recall that the low-order q bits of the physical number of a PE correspond to the number of the PEC to which it is attached (see Fig. ). Thus, the full bandwidth of the Memory Storage System can be used when transferring files between the Memory Storage System and the Parallel Computation Unit. If there are only N/(Q∗ D) distinct MSUs, where ≤ D ≤ N/Q, then this scheme can be scaled so that D parallel block transfers are required to load or unload the memories of the PEs connected to any one PEC. Each MSU contains a mass storage unit and a processor to manage the file system and to transfer files to and from its associated PE memory units. Similarly, a submachine formed by combining the (R∗ N)/Q PEs controlled by R PECs can be loaded or unloaded in R parallel block transfers if there are N/Q MSUs. For example, in Fig. , a submachine comprised of the PEs of PEC and PEC can be loaded in R = parallel block loads as follows: First PEC ’s PEs are loaded in one parallel block transfer, and then PEC
’s PEs are loaded. If there are only N/(Q∗ D) distinct MSUs, only R∗ D parallel block transfers are required. Figure demonstrates which MSU connects to which PE. MSU i is connected to the PE whose highorder n − q bits of its logical number are i (which equal the n − q high-order bits of its physical number). As a result, no matter which PECs are assigned to a task, the data will be in the appropriate MSUs. For example, in Fig. , for any submachine of size N/Q = , MSU is connected to logical PE (which could be physical PE , , , or ). The Memory Management System (Fig. ) is a set of processors that send file system requests from the System Control Unit, PECs, and PEs to the appropriate MSUs; control data transfers between the Memory Storage System and the PEs; supervise input/output operations involving peripheral devices; and enforce consistent file naming and placement across the multiple Memory Storage System disks.
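The MSU and PEC mappings described in this section are again simple bit-field selections. The following C sketch (example sizes and helper names only; not prototype code) prints, for one PEC group, the distinct MSU that loads each of its PEs in a single parallel block transfer.

#include <stdio.h>

/* Illustrative sketch (example sizes): N = 2^n PEs, Q = 2^q PECs, N/Q MSUs.
   MSU i serves the Q PEs whose n-q high-order address bits equal i, while
   PEC j controls the N/Q PEs whose q low-order bits equal j, so the PEs of
   one PEC group all land on different MSUs.                                 */
enum { n = 4, q = 2 };                        /* N = 16, Q = 4, N/Q = 4 MSUs */

int msu_of(int pe) { return pe >> q; }                 /* high-order n-q bits */
int pec_of(int pe) { return pe & ((1 << q) - 1); }     /* low-order  q   bits */

int main(void)
{
    int pec = 1;                               /* one PEC group of PEs        */
    for (int pe = 0; pe < (1 << n); pe++)
        if (pec_of(pe) == pec)
            printf("PE %2d is loaded by MSU %d\n", pe, msu_of(pe));
    return 0;
}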
Using the PASM System ELP (Explicit Language for Parallelism) was a C-based language and associated compiler designed as part of the PASM project for programming mixed-mode parallel machines []. ELP provided constructs for both SIMD and MIMD parallelism, and an ELP application program could perform computations that use these parallelism modes in an instruction-level interleaved fashion for mixed-mode operation. ELP provided a vehicle for the exploration of and experimentation with mixed-mode parallelism on the PASM prototype. CAPS (Coding Aid for the PASM System) was designed to assist in the development and evaluation of application and system software for the PASM prototype []. CAPS integrated hardware support and software tools to provide a remote execution and program debugging/monitoring environment for the PASM prototype. The PASM prototype was accessible over the Internet using CAPS, and multiple windows could be displayed to monitor different processors simultaneously. The programming challenges for partitionable mixed-mode systems are a superset of those for SIMD and MIMD single-mode systems. Application and algorithm research activities to explore PASM’s three dimensions of flexibility included theoretical analyses,
simulations, and experiments on the PASM prototype []. These studies examined issues such as mapping tasks onto multistage cube network–based parallel processing systems; trade-offs among the SIMD, MIMD, and mixed-mode classes of parallelism; PEC/PE computational overlap in SIMD mode; impact of increasing the number of PEs used; and partitioning for improved performance. Applications considered include edgeguided thresholding, FFTs, global histogramming, image correlation, image smoothing, matrix multiplication, and range-image segmentation (references are given in Armstrong, Watson, and Siegel []). Here, we will summarize three application studies that utilized the PASM prototype. Fineberg, Casavant, and Siegel [] used the PASM prototype to study the benefits of mixed-mode parallel architectures for bitonic sequence sorting. Four variations of the bitonic sequence sorting algorithm were developed: a SIMD version, a MIMD version, a MIMD version with hardware barrier synchronizations, and a mixed-mode version that allowed switching between SIMD and MIMD modes during execution. The algorithm variations were then coded and profiled on the PASM prototype using varying problem and partition sizes. The research demonstrated that, by supporting hardware barrier synchronization and mode switching, a machine can achieve better performance than with only SIMD or MIMD parallelism on problems that are computationally similar to bitonic sequence sorting. The authors also proposed a modification to the PASM prototype condition code logic that would increase the performance of the SIMD version. Saghi, Siegel, and Gray [] programmed cyclic reduction (a method for solving a general tridiagonal set of irreducible linear algebraic equations) on three parallel systems, including the PASM prototype, to study the trade-offs between SIMD and MIMD parallelism, the effects of increasing the number of processors used on execution time, the impact of the interconnection network on performance, and the advantages of a partitionable system. The cyclic reduction algorithm was implemented using SIMD parallelism on the MasPar MP- [], using MIMD parallelism on the nCUBE [], and using the same four versions of parallelism on the PASM prototype as mentioned above. The authors also developed a mechanism
to predict the algorithm performance on each machine using algorithm analysis and performance measurements from each machine. By using the PASM prototype as a common basis to compare the modes of parallelism, the authors concluded that a cyclic reduction algorithm using mixedmode was better than a purely SIMD or purely MIMD algorithm. Execution-time measurements from the other two parallel machines emphasized the need to carefully choose the number of processing elements and distribution of data to obtain efficient computation. Tan, Siegel, and Siegel [] used several parallel machines to study block-based motion vector estimation. A SIMD MasPar MP- [], a MIMD Intel Paragon XP/S, a MIMD IBM SP, and the mixed-mode PASM prototype were used with different data partitioning schemes to perform the estimation, and the results were analyzed to contrast the benefits of each mode of parallelism. In this research, the PASM prototype used SIMD, SPMD, and mixed-mode parallelism. The research demonstrated a method to analytically predict the performance of various parallel implementations of the algorithm. It also described the impact of the number of processors and mode of parallelism used on algorithm performance. The authors found that, using the PASM prototype, the mixed-mode implementation slightly outperformed the pure SPMD implementation and significantly outperformed the pure SIMD implementation.
Conclusions Designing, simulating, prototyping, using, and evaluating the PASM partitionable SIMD/MIMD mixed-mode system, based on a multistage cube network, was a great research and educational experience for a large group of faculty and students at Purdue University from the s through the s. As stated earlier, PASM was flexible along three dimensions: partitionability, mode of parallelism, and variable connectivity among PEs. We found that advantages of systems with such flexibility over a pure SIMD machine or a pure MIMD machine included the following []:
1. Multiple simultaneous users: Because there can be multiple simultaneous independent submachines, there can be multiple simultaneous users of the system, each executing a different program (not allowed in a pure SIMD machine).
2. Program development: Rather than trying to debug a new parallel program on, for example, , PEs, it can be debugged on a smaller size submachine of PEs, and then extended to , PEs.
3. Variable submachine size for increased utilization: If a task requires only N′ of N available PEs, the other N − N′ can be used for another task (not allowed in a pure SIMD machine).
4. Variable submachine size for decreased execution time: There are some algorithms for which the minimum execution time is obtained when fewer than N PEs are used due to inter-PE communication overhead (e.g., image smoothing []); thus, it is desirable to create a submachine consisting of the optimal number of PEs.
5. Subtask parallelism: Two independent subtasks that are part of the same job can be executed in parallel, sharing resources if necessary, which may result in improved overall task execution time [] (not allowed in a pure SIMD machine).
6. Multiple processing modes: An algorithm can be executed by using a combination of SIMD and MIMD control with the same set of PEs (mixed-mode parallelism), using the mode that best matches the computations required at each step of the program (Fig. ).
7. Matching inter-PE connectivity to the task: The multistage cube allows different connection patterns among PEs to be established depending on the task (as opposed to, for example, having a fixed mesh network).
8. System fault tolerance: If a single PE fails, only those submachines that include the failed PE are affected. This provides some robustness, as mentioned in section “Introduction,” and is not allowed in a pure SIMD machine.
9. Submachine fault tolerance: If a PE in a submachine fails, it may be possible to redistribute data and make use of mixed-mode parallelism (i.e., changing from SIMD to MIMD mode) and the variable connectivity (i.e., to establish connection patterns that do not include the faulty PE) so that the job executing on the submachine may continue on that submachine with minimal degradation (this is another form of robustness).
The references cited in this list of advantages, and the complete reading list of PASM-related publications (http://hdl.handle.net//), give much more detail about our experiences.
Related Entries
Connection Machine
Distributed-Memory Multiprocessor
Flynn's Taxonomy
Illiac IV
MasPar
MPP
nCube
Networks, Multistage
Acknowledgments The preparation of this entry was supported by the National Science Foundation under grants CNS and CNS-, and by the Colorado State University George T. Abell Endowment. The large group of faculty and students who have participated in the PASM project are the coauthors of the papers listed in the PASM-related reading list (http://hdl.handle.net/ /). Numerous agencies supported aspects of PASM-related research: Air Force Office of Scientific Research, Army Research Office, Ballistic Missile Defense Agency, Defense Mapping Agency, Naval Ocean Systems Center, Naval Research Laboratory, National Science Foundation, Office of Naval Research, and Rome Laboratory. IBM provided a grant for much of the prototype equipment. Donations for various parts for the prototype were provided by Amphenol Products, Augat Inc., Belden, Motorola, and Power One.
Bibliographic Notes and Further Reading This entry is a brief summary of the PASM architecture; details are available in the papers in the PASMrelated reading list available at http://hdl.handle.net/ /.
Bibliography . Ali S, Maciejewski AA, Siegel HJ, Kim J () Measuring the robustness of a resource allocation. IEEE Trans Parallel Distrib Syst ():– . Armstrong JB, Watson DW, Siegel HJ () Software issues for the PASM parallel processing system. Software for parallel computation. Springer, Berlin
. Barnes GH, Brown RM, Kato M, Kuck DJ, Slotnick DL, Stokes RA () The ILLIAC IV computer. IEEE Trans Comput C ():– . Batcher KE () Bit serial parallel processing systems. IEEE Trans Comp C-():– . Blank T () The MasPar MP- architecture. In: Proceedings IEEE compcon spring ’, pp – . Bouknight WJ, Denenberg SA, McIntyre DE, Randall JM, Sameh AH, Slotnick DL () The ILLIAC IV system. Proc IEEE ():– . Fineberg SA, Casavant TL, Siegel HJ () Experimental analysis of a mixed-mode parallel architecture using bitonic sequence sorting. J Parallel Distrib Comput ():– . Flynn MJ () Very high-speed computing systems. Proc IEEE ():– . Hayes JP, Mudge T () Hypercube supercomputers. Proc IEEE ():– . Lipovski GJ, Malek M () Parallel computing: theory and comparisons. Wiley, New York . Lumpp JE, Fineberg SA, Nation WG, Casavant TL, Bronson EC, Siegel HJ, Pero PH, Schwederski T, Marinescu DC () CAPS – a coding aid used with the PASM parallel processing system. Commun ACM ():– . Nation WG, Maciejewski AA, Siegel HJ () A methodology for exploiting concurrency among independent tasks in partitionable parallel processing systems. J Parallel Distrib Comput (): – . Nichols MA, Siegel HJ, Dietz HG () Data management and control-flow aspects of an SIMD/SPMD parallel language/compiler. IEEE Trans Parallel Distrib Syst ():– . Nutt GJ () Microprocessor implementation of a parallel processor. In: Proceedings th annual symposium on computer architecture, pp – . Pfister GF, Brantley WC, George DA, Harvey SL, Kleinfelder WJ, Mcauliffe KP, Melton ES, Norton VA, Weiss J () The IBM research parallel processor prototype (RP): introduction and architecture. In: Proceedings International conference parallel processing, pp – . Saghi G, Siegel HJ, Gray JL () Predicting performance and selecting modes of parallelism: a case study using cyclic reduction on three parallel machines. J Parallel Distrib Comput (): – . Shestak V, Smith J, Maciejewski AA, Siegel HJ () Stochastic robustness metric and its use for static resource allocations. J Parallel Distrib Comput ():– . Siegel HJ () Interconnection networks for large-scale parallel processing: theory and case studies, nd edn. McGraw-Hill, New York . Siegel HJ, Armstrong JB, Watson DW () Mapping computervision-related tasks onto reconfigurable parallel-processing systems. IEEE Comput ():– . Siegel HJ, Schwederski T, Nation WG, Armstrong JB, Wang L, Kuehn JT, Gupta R, Allemang MD, Meyer DG, Watson DW () The design and prototyping of the PASM reconfigurable parallel processing system. Parallel computing: paradigms and applications. International Thomson Computer Press, London
. Siegel HJ, Siegel LJ, Kemmerer F, Mueller PT Jr, Smalley HE Jr, Smith SD () PASM: a partitionable SIMD/MIMD system for image processing and pattern recognition. IEEE Trans Comput C ():– . Tan M, Siegel JM, Siegel HJ () Parallel implementations of block-based motion vector estimation for video compression on four parallel processing systems. Int J Parallel Program (): – . Tucker LW, Robertson GG () Architecture and applications of the connection machine. IEEE Comput ():–
Path Expressions Roy H. Campbell University of Illinois at Urbana-Champaign, Urbana, IL, USA
Synonyms Coordination; Concurrency control; Mutual exclusion; Process synchronization
Definition A path expression is a declaration of the permitted sequences of operations on an object that is shared by parallel processes. For example, consider a path expression based on a regular expression notation for a shared buffer. The shared buffer may have operations read and write that may only occur in a sequence specified by the regular expression (write; read)∗, where a write to the buffer must be followed by a read before it can be repeated. The reads and writes are mutually exclusive. The expression indicates the sequence of actions that may occur, independent of the number of processes or the order in which the processes invoke the operations.
Discussion Path expressions provide a declarative specification of synchronization and coordination of parallel processes performing operations on an object in parallel computing systems []. They provide a language mechanism to separate the specification of synchronization from the algorithmic means of providing that synchronization []. Although the regular expression is used in our example, other languages that are well defined can be used instead.
In general, the synchronization for an object being manipulated by parallel processes involves mutual exclusion, sequence, and repetition, and these may be combined to form path expressions. In more detail, a path expression language based on regular expressions can be described as:
1. A sequence of operations on a resource.
2. A selection from one or more operations on a resource.
3. A repetition of a sequence or selection.
4. A path expression composed of (1), (2), or (3) above.
Other primitives besides these three have been considered, including:
1. A burst of parallel sequences of executions: If there is one execution of the operations in a path expression, then there can be parallel executions of the sequences of operations in a path expression up until the event when there are no more parallel executions of those operations occurring. For example, the expression (write, {read})∗ for a buffer would specify that a write or a burst of parallel reads can occur, but not a read and a write in parallel [].
2. A concurrency restriction: This allows parallel executions of the path expression up to the limit of the restriction. For example, the expression :(write; read) would allow up to parallel instances of a write followed by a read [].
3. Guarded selection: A path expression selection that describes the synchronization of the parallel execution of its operations when a guard is true. For example, the expression #buffers < ∣ (put), #buffers > ∣ (get) would describe synchronization where if the variable #buffers is less than , a put may occur, or if the variable #buffers is greater than , a get may occur (see also []).
More general forms of path expression may also be described []. For example, a general path expression can be composed of independent path expressions such that if all the path expressions allow the execution of an operation, it may occur. The general path expression:
path (A; B)∗ end path (A; C)∗ end
requires an operation A to be followed by operations B and C. Operations B and C may possibly occur in parallel, since there are no constraints between B and C, but there are constraints between A and B and between A and C.
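To make the composition concrete, the following C sketch (POSIX semaphores; the semaphore names and operation bodies are placeholders, and this is an illustration rather than a defined translation of the notation) realizes path (A; B)∗ end path (A; C)∗ end by giving each constituent expression its own semaphore pair: A must satisfy both, while B and C remain unconstrained with respect to each other.

#include <semaphore.h>
#include <stdio.h>

/* Each constituent path expression contributes one alternating semaphore pair. */
static sem_t a_for_b, b_ready;   /* realizes (A; B)* */
static sem_t a_for_c, c_ready;   /* realizes (A; C)* */

void A(void) { sem_wait(&a_for_b); sem_wait(&a_for_c);
               puts("A");                          /* <A body> */
               sem_post(&b_ready); sem_post(&c_ready); }
void B(void) { sem_wait(&b_ready); puts("B"); sem_post(&a_for_b); }
void C(void) { sem_wait(&c_ready); puts("C"); sem_post(&a_for_c); }

int main(void)
{
    sem_init(&a_for_b, 0, 1); sem_init(&b_ready, 0, 0);
    sem_init(&a_for_c, 0, 1); sem_init(&c_ready, 0, 0);
    A(); B(); C();   /* B and C could equally run from separate threads */
    return 0;
}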
Implementation An operation on an object can be considered as a transition that changes the state of the object. When a process invokes an operation on an object, it can be viewed as a request for a transition from one state of the object to another. The implementation of a path expression decides whether to accept the request and proceed or whether to delay the request until an appropriate state is reached, blocking the process invocation of that operation. When a path expression is built upon a regular expression that has the property that an operation can only occur when the object is in one particular state, then a simple implementation can be built using semaphores to represent the states. In this exposition, the implementation for open path expressions is described. An open path expression allows much of the power of a semaphore to be used in a declarative expression and forms the basis for the implementation of “Path Pascal,” a version of Pascal that includes concurrency and synchronization []. Open path expressions are composed of four constraints: sequence, selection, restriction, and derestriction. The open path expression notation allows a simple rewriting scheme to translate a path expression synchronization declaration into the prologues and the epilogues of semaphore operations of each of the operations mentioned in the path expression. A rewrite rule is defined for each constraint: the selection rule (comma) gives every alternative the same prologue and epilogue, while the sequence rule (semicolon) introduces a new semaphore, initialized to zero, whose V operation is appended to the epilogue of the first operand and whose P operation is prepended to the prologue of the second operand.
The following are some examples of the implementations of open path expressions:
1. path x, y end creates empty prologues and epilogues in the methods x and y, allowing them to be executed without synchronization constraints.
2. path 1:(x) end rewrites to S1.P(); <x_body>; S1.V(); with Semaphore S1 = 1; the body of x is executed in mutual exclusion.
3. path 1:(x, y) end rewrites to S1.P(); <x_body>; S1.V(); and S1.P(); <y_body>; S1.V(); with Semaphore S1 = 1; the bodies of x and y are executed in mutual exclusion with respect to themselves and to each other.
4. A bounded-buffer path expression of the form path n:(1:(write); 1:(read)) end rewrites so that the body of write is executed in mutual exclusion, as is the body of read; however, there must always be at least as many writes as reads, and there may be up to n more writes than reads.
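As a concrete illustration of the prologue/epilogue scheme, the following C sketch uses POSIX semaphores to realize a bounded-buffer-style path expression of the form n:(1:(write); 1:(read)); the names (N_SLOTS, write_op, read_op) and the particular bound are assumptions for this example, not generated Path Pascal code.

#include <semaphore.h>

/* Illustrative sketch of the prologues and epilogues for
   path N_SLOTS:(1:(write); 1:(read)) end.                                   */
#define N_SLOTS 10

static sem_t slots;    /* restriction n:(...)    -- initialized to N_SLOTS */
static sem_t items;    /* sequence write ; read  -- initialized to 0       */
static sem_t w_mutex;  /* restriction 1:(write)  -- initialized to 1       */
static sem_t r_mutex;  /* restriction 1:(read)   -- initialized to 1       */

void path_init(void)
{
    sem_init(&slots, 0, N_SLOTS);
    sem_init(&items, 0, 0);
    sem_init(&w_mutex, 0, 1);
    sem_init(&r_mutex, 0, 1);
}

void write_op(void)
{
    sem_wait(&slots); sem_wait(&w_mutex);   /* prologue */
    /* <write body> */
    sem_post(&w_mutex); sem_post(&items);   /* epilogue */
}

void read_op(void)
{
    sem_wait(&items); sem_wait(&r_mutex);   /* prologue */
    /* <read body> */
    sem_post(&r_mutex); sem_post(&slots);   /* epilogue */
}

int main(void) { path_init(); write_op(); read_op(); return 0; }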
Uses of Path Expressions Path expressions as a declarative form of specifying synchronization have found use in several parallel programming languages [, , ]. They have also been employed in single-assignment languages [], real-time control languages [], distributed objects [], event systems [], multimedia systems [], debugging [], VLSI design [], synchronization contracts for web services [], and workflow languages []. Typically,
processes and path expressions are dual approaches to specifying the traces created in a computation; the processes generate sequences of actions which are constrained by path expressions to achieve particular synchronization constraints associated with the operations on abstract data types.
Summary In practice, synchronization of parallel processes remains a difficult problem that becomes ever more complex with increased concurrency in architectures. Path expressions provide a specification of synchronization that is separate from the code of processes and that, under some circumstances, can simplify parallel programming. Although not widely adopted, their declarative approach to specifying synchronization has been found useful in various parallel applications. Path expressions have been used to synchronize parallel operations on objects in parallel languages, real-time languages, event systems, VLSI, workflow, and debugging.
Bibliographic Notes and Further Reading The semantics of parallel programs may be described using trace-based semantics. Path expression implementations restrict or recognize the permitted traces of parallel programs as described by Lauer and Campbell []. A semantics for path expressions is given by Dinning and Mishra using partially ordered multisets [] and the authors provide a fully parallel implementation for a path expression language on MIMD shared memory architectures. Path Pascal and open path expressions were used as pedagogical tools for teaching parallel programs [, ].
Bibliography
. Anantharaman TS, Clarke EM, Mishra B () Compiling path expressions into VLSI circuits. Distrib Comput :–
. Andler S () Predicate path expressions. In: POPL' proceedings of the th ACM SIGACT-SIGPLAN symposium on principles of programming languages. ACM Press, New York, pp –
. Bruegge B, Hibbard P () Generalized path expressions: a high-level debugging mechanism. J Syst Softw :– (Elsevier Science Publishing)
. Campbell RH () Path expressions: a technique for specifying process synchronization. PhD thesis, The University of Newcastle upon Tyne
. Campbell RH, Habermann AN () The specification of process synchronization by path expressions. In: Gelenbe E, Kaiser C (eds) Operating systems. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Lauer PE, Campbell RH () Formal semantics of a class of high-level primitives for coordinating concurrent processes. Acta Informatica ():–
. Comte D, Durrieu G, Gelly O, Plas A, Syre JC () Parallelism, control and synchronization expression in a single assignment language. ACM SIGPLAN Not ():–
. Dinning A, Mishra B () A fully parallel algorithm for implementing path expressions. J Parallel Distrib Comput :–
. Dowsing RD, Elliott R () Programming a bounded buffer using the object and path expression constructs of Path Pascal. Comput J ():–
. Heinlein C () Workflow and process synchronization with interaction expressions and graphs. PhD thesis (in German), Fakultät für Informatik, Universität Ulm
. Hoepner P () Synchronizing the presentation of multimedia objects. Comput Commun ():–
. Kidd M-EC () Ensuring critical event sequences in high integrity software by applying path expressions. Sandia Labs, Albuquerque
. Kolstad RB, Campbell RH () Path Pascal user manual. SIGPLAN Not ():–
. Laure E () ParBlocks – a new methodology for specifying concurrent method executions in Opus. In: Amestoy P, Berger P, Dayde M, Ruiz D, Duff I, Fraysse V, Giraud L (eds) Euro-Par'. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Preiss O, Shah AP, Wegmann A () Generating synchronization contracts for web services. In: Khosrow-Pour M (ed) Information technology and organizations: trends, issues, challenges & solutions, vol . Idea Group Publishing, Hershey, pp –
. Rees O () Using path expressions as concurrency guards. Technical report, ANSA
. Schoute AL, Luursema JJ () Realtime system control by means of path expressions. In: Proceedings Euromicro ' Workshop on Real Time, Horsholm, Denmark, pp –
. Shaw AC () Software description with flow expressions. IEEE Trans Softw Eng SE-():–
PaToH (Partitioning Tool for Hypergraphs) Ümit Çatalyürek , Cevdet Aykanat The Ohio State University, Columbus, OH, USA Bilkent University, Ankara, Turkey
Synonyms Partitioning tool for hypergraphs (PaToH)
Definition PaToH is a sequential, multilevel, hypergraph partitioning tool that can be used to solve various combinatorial scientific computing problems that could be modeled as hypergraph partitioning problem, including sparse matrix partitioning, ordering, and load balancing for parallel processing.
Discussion Introduction Hypergraph partitioning has been an important problem widely encountered in VLSI layout design []. Recent works since the late s have introduced new application areas, including one-dimensional and two-dimensional partitioning of sparse matrices for parallel sparse-matrix vector multiplication [–, ], sparse matrix reordering [, ], permuting sparse rectangular matrices into singly bordered block-diagonal form for parallel solution of LP problems [], and static and dynamic load balancing for parallel processing []. PaToH [] has been developed to provide fast and high-quality solutions for these motivating applications. In simple terms, the hypergraph partitioning problem can be defined as the task of dividing a hypergraph into two or more roughly equal sized parts such that a cost function on the hyperedges connecting vertices in different parts is minimized. The hypergraph partitioning problem is known to be NP-hard []; therefore, a wide variety of heuristic algorithms have been developed in the literature to solve this complex problem [, , , , ]. Following the success of multilevel partitioning schemes in ordering and graph partitioning [, , ], PaToH [] has been developed as one of the first multilevel hypergraph partitioning tools.
Preliminaries A hypergraph H = (V, N ) is defined as a set of vertices (also called cells) V and a set of nets (hyperedges) N among those vertices. Every net n ∈ N is a subset of vertices, that is, n ⊆ V. The vertices in a net n are called its pins in PaToH. The size of a net, s[n], is equal to the number of its pins. The degree of a vertex is equal to the number of nets it is connected to. A graph is a special instance of a hypergraph in which each net has exactly two pins. Vertices and nets of a hypergraph can be associated with weights. For simplicity in the presentation, net weights are referred to as cost here and denoted with c[.], whereas w[.] will be used for vertex weights. Π = {V1, V2, . . . , VK} is a K-way partition of H if the following conditions hold:
● Each part Vk is a nonempty subset of V, that is, Vk ⊆ V and Vk ≠ ∅ for 1 ≤ k ≤ K.
● Parts are pairwise disjoint, that is, Vk ∩ Vℓ = ∅ for all 1 ≤ k < ℓ ≤ K.
● The union of the K parts is equal to V, that is, ⋃Kk=1 Vk = V.
In a partition Π of H, a net that has at least one pin (vertex) in a part is said to connect that part. The connectivity λn of a net n denotes the number of parts connected by n. A net n is said to be cut (external) if it connects more than one part (i.e., λn > 1), and uncut (internal) otherwise (i.e., λn = 1). In a partition Π of H, a vertex is said to be a boundary vertex if it is incident to a cut net. A K-way partition is also called a multiway partition if K > 2 and a bipartition if K = 2. A partition is said to be balanced if each part Vk satisfies the balance criterion:

Wk ≤ Wavg (1 + ε), for k = 1, 2, . . . , K.  (1)

In (1), the weight Wk of a part Vk is defined as the sum of the weights of the vertices in that part (i.e., Wk = ∑v∈Vk w[v]), Wavg denotes the weight of each part under the perfect load balance condition (i.e., Wavg = (∑v∈V w[v])/K), and ε represents the predetermined maximum imbalance ratio allowed. The set of external nets of a partition Π is denoted as NE. There are various [, ] cutsize definitions for representing the cost χ(Π) of a partition Π. Two relevant definitions are:

(a) χ(Π) = ∑n∈NE c[n]  and  (b) χ(Π) = ∑n∈NE c[n](λn − 1).  (2)

In (a), the cutsize is equal to the sum of the costs of the cut nets. In (b), each cut net n contributes c[n](λn − 1) to the cutsize. The cutsize metrics given in (a) and (b) will be referred to here as the cut-net and connectivity metrics, respectively. The hypergraph partitioning problem can be defined as the task of dividing a hypergraph into two or more parts such that the cutsize is minimized, while a given balance criterion (1) among part weights is maintained. A recent variant of the above problem is multi-constraint hypergraph partitioning [, , , , ] in which each vertex has a vector of weights associated
with it. The partitioning objective is the same as above, and the partitioning constraint is to satisfy a balancing constraint associated with each weight. Let w[v, i] denote the C weights of a vertex v for i = 1, . . . , C. Then balance criterion (1) can be rewritten as:

Wk,i ≤ Wavg,i (1 + ε) for k = 1, . . . , K and i = 1, . . . , C,  (3)

where the ith weight Wk,i of a part Vk is defined as the sum of the ith weights of the vertices in that part (i.e., Wk,i = ∑v∈Vk w[v, i]), Wavg,i is the average part weight for the ith weight (i.e., Wavg,i = (∑v∈V w[v, i])/K), and ε again represents the allowed imbalance ratio. Another variant is hypergraph partitioning with fixed vertices, in which some of the vertices are fixed in some parts before partitioning. In other words, in this problem, a fixed-part function is provided as an input to the problem. A vertex is said to be free if it is allowed to be in any part in the final partition, and it is said to be fixed in part k if it is required to be in Vk in the final partition Π.
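A small worked example may help connect the definitions: the following C sketch (toy hypergraph and part assignments invented for illustration) computes the connectivity λn of each net and evaluates both the cut-net metric (a) and the connectivity metric (b) for a 3-way partition.

#include <stdio.h>

#define K 3   /* number of parts in this toy example */

/* Connectivity of one net: number of distinct parts its pins touch. */
int connectivity(const int *pins, int npins, const int *part)
{
    int seen[K] = {0}, lambda = 0;
    for (int i = 0; i < npins; i++)
        if (!seen[part[pins[i]]]) { seen[part[pins[i]]] = 1; lambda++; }
    return lambda;
}

int main(void)
{
    int part[6] = {0, 0, 1, 1, 2, 2};                 /* parts of cells 0..5 */
    int n0[] = {0, 1, 2}, n1[] = {2, 3}, n2[] = {0, 4, 5};
    const int *nets[] = {n0, n1, n2};
    int sizes[] = {3, 2, 3}, cost[] = {1, 1, 1};
    int cutnet = 0, con = 0;
    for (int j = 0; j < 3; j++) {
        int lambda = connectivity(nets[j], sizes[j], part);
        if (lambda > 1) cutnet += cost[j];            /* metric (a) */
        con += cost[j] * (lambda - 1);                /* metric (b) */
    }
    printf("cut-net cutsize = %d, connectivity cutsize = %d\n", cutnet, con);
    return 0;
}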
Using PaToH
PaToH provides a set of functions to read, write, and partition a given hypergraph, and to evaluate the quality of a given partition. In terms of partitioning, PaToH provides user-customizable hypergraph partitioning via a multilevel partitioning scheme. In addition, PaToH provides hypergraph partitioning with fixed cells and multi-constraint hypergraph partitioning. Application developers who would like to use PaToH can either directly use PaToH through a simple, easy-to-use C library interface in their applications, or they can use the stand-alone executable.
PaToH Library Interface
The PaToH library interface consists of two files: a header file patoh.h, which contains constants, structure definitions, and function prototypes, and a library file libpatoh.a. Before starting to discuss the details, it is instructive to have a look at a simple C program that partitions an input hypergraph using PaToH functions. The program is displayed in Fig. . The first statement is a function call to read the input hypergraph file, which is given by the first command-line
argument. PaToH partition functions are customizable through a set of parameters. Although the application user can set each of these parameters one by one, it is a good habit to call the PaToH function PaToH_Initialize_Parameters to set all parameters to one of the three preset default values by specifying PATOH_SUGPARAM_<preset>, where <preset> is DEFAULT, SPEED, or QUALITY. After this call, the user may prefer to modify the parameters according to his/her needs before calling PaToH_Alloc. All memory that will be used by the PaToH partitioning functions is allocated by the PaToH_Alloc function, that is, there will be no more dynamic memory allocation inside the partitioning functions. Now, everything is set to partition the hypergraph using PaToH's multilevel hypergraph partitioning functions. A call to PaToH_Partition (or PaToH_MultiConst_Partition) will partition the hypergraph, and the resulting partition vector, part weights, and cutsize will be returned in the parameters. Here, the variable cut will hold the cutsize of the computed partition according to cutsize definition (2b), since this metric is specified by initializing the parameters with the constant PATOH_CONPART. The user may call the partitioning functions as many times as he/she wants before calling the function PaToH_Free. There is no need to reallocate the memory before each partitioning call, unless either the hypergraph or the desired customization (like changing the coarsening algorithm, or the number of parts) is changed. A hypergraph and its representation can be seen in Fig. . In the figure, large circles are cells (vertices) of the hypergraph, and small circles are nets. The xpins and pins arrays store the beginning index of the pins (cells) connected to each net, and the IDs of the pins, respectively. Hence, xpins is an array of size equal to the number of nets plus one (12 in this example), and pins is an array of size equal to the number of pins in the hypergraph (31 in this example). Cells connected to net nj are stored in pins[xpins[j]] through pins[xpins[j+1]-1].
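As an illustration of how this representation can be used, the following C sketch (not part of the PaToH library; names are illustrative) computes both cutsize metrics of Eq. (2) for a given part vector, assuming unit net costs.

#include <stdlib.h>
#include <string.h>

/* Compute the cut-net (2a) and connectivity (2b) cutsizes of a K-way
   partition, assuming c[n] = 1 for every net. xpins/pins is the net-to-pin
   representation described above; partvec[v] is the part of cell v. */
void compute_cutsizes(int nnets, int K, const int *xpins, const int *pins,
                      const int *partvec, int *cutnet, int *con)
{
    int *seen = malloc(K * sizeof(int));
    int j, p, lambda;

    *cutnet = 0; *con = 0;
    for (j = 0; j < nnets; ++j) {
        memset(seen, 0, K * sizeof(int));
        lambda = 0;                         /* number of parts net j connects */
        for (p = xpins[j]; p < xpins[j+1]; ++p)
            if (!seen[partvec[pins[p]]]) {
                seen[partvec[pins[p]]] = 1;
                ++lambda;
            }
        if (lambda > 1) {                   /* net j is cut */
            *cutnet += 1;                   /* contributes c[n] = 1 to (2a)   */
            *con    += lambda - 1;          /* contributes lambda_n - 1 to (2b) */
        }
    }
    free(seen);
}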
Stand-Alone Program
The distribution includes a stand-alone program, called patoh, for single-constraint partitioning (this executable will not work with multiple vertex weights; for multi-constraint partitioning there is a library interface and some sample source codes). The program patoh gets
its parameters from command-line arguments: it is invoked with the hypergraph file, the desired number of parts, and optional parameter settings, as in the sample run shown below.
This output shows that the cutsize (cut cost) according to the cut-net metric is 2. The final imbalance ratios (in parentheses) for the least loaded and the most loaded parts are 0% (perfect balance with four vertices in each part), and the partitioning only took about 1 ms. The input hypergraph and the resulting partition are displayed in Fig. . A quick summary of the input file format (the details are provided in the PaToH manual []) is as follows: the first non-comment line of the file is a header containing the index base (0 or 1) and the size of the hypergraph, and information for each net (only pins in this case) and cells (none in this example) follows. All of the PaToH customization parameters that are available through the library interface are also available as command-line options. The PaToH manual [] contains details of each of those customization parameters.
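For illustration, a small hypothetical hypergraph with four cells, three nets, and seven pins could be stored in this format as follows (unit weights assumed; the optional weighting fields described in the manual are omitted). Comment lines start with %, the header gives the index base and sizes, and each following line lists the pins of one net:

% hypothetical toy hypergraph: base 0, 4 cells, 3 nets, 7 pins
0 4 3 7
0 1 2
1 3
2 3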
Customizing PaToH's Hypergraph Partitioning
PaToH achieves K-way hypergraph partitioning through recursive bisection (two-way partitioning), and at each bisection step it uses a multilevel hypergraph bisection
> patoh sample.u 3 RA=6 UM=U
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ PaToH v3 (c) Nov 1999-, by Umit V. Catalyurek
+++ Build # 872  Date: Fri, 09 Oct 2009
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
********************************************************************************
Hypergraph : sample.u    #Cells : 12    #Nets : 11    #Pins : 31
********************************************************************************
3-way partitioning results of PaToH:
Cut Cost: 2
Part Weights : Min= 4 (0.000)  Max= 4 (0.000)
--------------------------------------------------------------------------------
I/O           : 0.000 sec
I.Perm/Cons.H : 0.000 sec ( 2.9%)
Coarsening    : 0.000 sec ( 1.1%)
Partitioning  : 0.000 sec (75.8%)
Uncoarsening  : 0.000 sec ( 3.7%)
Total         : 0.001 sec
Total (w I/O) : 0.001 sec
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include "patoh.h"

int main(int argc, char *argv[])
{
    PaToH_Parameters args;
    int _c, _n, _nconst, *cwghts, *nwghts,
        *xpins, *pins, *partvec, cut, *partweights;

    /* Read the input hypergraph given as the first command-line argument */
    PaToH_Read_Hypergraph(argv[1], &_c, &_n, &_nconst, &cwghts, &nwghts,
                          &xpins, &pins);
    printf("Hypergraph %10s -- #Cells=%6d #Nets=%6d #Pins=%8d #Const=%2d\n",
           argv[1], _c, _n, xpins[_n], _nconst);

    /* Initialize parameters to a preset and set the number of parts */
    PaToH_Initialize_Parameters(&args, PATOH_CONPART, PATOH_SUGPARAM_DEFAULT);
    args._k = atoi(argv[2]);

    partvec     = (int *) malloc(_c * sizeof(int));
    partweights = (int *) malloc(args._k * sizeof(int));

    /* Allocate all memory used by the partitioning functions */
    PaToH_Alloc(&args, _c, _n, _nconst, cwghts, nwghts, xpins, pins);

    if (_nconst == 1)
        PaToH_Partition(&args, _c, _n, cwghts, nwghts, xpins, pins,
                        partvec, partweights, &cut);
    else
        PaToH_MultiConst_Partition(&args, _c, _n, _nconst, cwghts,
                                   xpins, pins, partvec, partweights, &cut);

    printf("%d-way cutsize is: %d\n", args._k, cut);

    free(cwghts); free(nwghts);
    free(xpins);  free(pins);
    free(partweights); free(partvec);

    PaToH_Free();
    return 0;
}
PaToH (Partitioning Tool for Hypergraphs). Fig. A simple C program that partitions an input hypergraph using PaToH functions
algorithm. In the recursive bisection, first a bisection of H is obtained, and then each part of this bipartition is further partitioned recursively. After lg K steps, hypergraph H is partitioned into K parts. Please note that K is not restricted to be a power of 2. For any K > 1, one can achieve K-way hypergraph partitioning through recursive bisection by first partitioning H into two parts with a load ratio of ⌊K/2⌋ to (K − ⌊K/2⌋), and then recursively partitioning those parts into ⌊K/2⌋ and (K − ⌊K/2⌋) parts, respectively, using the same approach.
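The following small, self-contained C program illustrates how this ⌊K/2⌋ : (K − ⌊K/2⌋) recursion determines the part counts and load ratios for an arbitrary K; it only prints the split tree and is not part of PaToH.

#include <stdio.h>

/* Print the recursive-bisection tree for a K-way partition: at each step
   the current group of K parts is split into floor(K/2) and K - floor(K/2)
   parts, which is also the load ratio used for that bisection. */
void print_splits(int K, int depth)
{
    int lo;
    if (K <= 1) return;
    lo = K / 2;
    printf("%*ssplit %d parts into %d : %d\n", 2 * depth, "", K, lo, K - lo);
    print_splits(lo, depth + 1);
    print_splits(K - lo, depth + 1);
}

int main(void)
{
    print_splits(5, 0);   /* K = 5 is not a power of 2: 5 -> 2:3 -> ...   */
    return 0;
}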
A pseudo-code of the multilevel hypergraph bisection algorithm used in PaToH is displayed in Algorithm . Mainly, the algorithm has three phases: coarsening, initial partitioning, and uncoarsening. In the first phase, a bottom-up multilevel clustering is successively applied, starting from the original hypergraph, until either the number of vertices in the coarsened hypergraph falls below a predetermined threshold value or clustering fails to reduce the size of the hypergraph significantly. In the second phase, the coarsest
PaToH (Partitioning Tool for Hypergraphs). Fig. A sample hypergraph and its representation (the cells and nets of the hypergraph together with the corresponding xpins and pins arrays)
PaToH (Partitioning Tool for Hypergraphs). Fig. Text file representation of the sample hypergraph in Fig. and illustration of a 3-way partition (parts V0, V1, V2) found by PaToH. The file begins with the comment "% base:(0/1) #cells #nets #pins", followed by the header line "0 12 11 31" (index base 0, 12 cells, 11 nets, 31 pins) and one line per net listing its pins
hypergraph is bipartitioned using one of the initial partitioning techniques. In the third phase, the partition found in the second phase is successively projected back towards the original hypergraph while it is being improved by one of the iterative refinement heuristics. These three phases are summarized below.
1. Coarsening Phase: In this phase, the given hypergraph H = H0 = (V0, N0) is coarsened into a sequence of smaller hypergraphs H1 = (V1, N1), H2 = (V2, N2), . . ., Hℓ = (Vℓ, Nℓ) satisfying |V0| > |V1| > |V2| > . . . > |Vℓ|. This
coarsening is achieved by coalescing disjoint subsets of vertices of hypergraph Hi into clusters such that each cluster in Hi forms a single vertex of Hi+1. The weight of each vertex of Hi+1 becomes equal to the sum of the weights of the constituent vertices of the respective cluster in Hi. The net set of each vertex of Hi+1 becomes equal to the union of the net sets of the constituent vertices of the respective cluster in Hi. Here, multiple pins of a net n ∈ Ni in a cluster of Hi are contracted to a single pin of the respective net n′ ∈ Ni+1 of Hi+1. Furthermore, the single-pin
Algorithm Multilevel Bisection.
function PaToHMLevelPartition(H = (V, N))
  H0 ← H
  ℓ ← 0
  /* Coarsening Phase: */
  while |Vℓ| > CoarseTo do
    find a clustering, Cℓ, using one of the coarsening algorithms
    construct Hℓ+1 using Cℓ
    if (|Vℓ| − |Vℓ+1|)/|Vℓ| < CoarsePercent then
      break
    else
      ℓ ← ℓ + 1
    end if
  end while
  /* Initial Partitioning Phase: */
  find an initial partitioning Πℓ of Hℓ
  /* Uncoarsening Phase: */
  while ℓ ≥ 0 do
    refine Πℓ using one of the refinement algorithms
    if ℓ > 0 then
      project Πℓ to Πℓ−1
    end if
    ℓ ← ℓ − 1
  end while
  return Π0
end function
nets obtained during this contraction are discarded. The coarsening phase terminates when the number of vertices in the coarsened hypergraph falls below the predetermined number or clustering fails to reduce the size of the hypergraph significantly. In PaToH, two types of clusterings are implemented: matching-based, where each cluster contains at most two vertices, and agglomerative-based, where clusters can have more than two vertices. The former is simply called matching in PaToH, and the latter is called clustering. The matching-based clustering works as follows. Vertices of Hi are visited in a user-specified order (which could be random, degree-sorted, etc.). If a vertex u ∈ Vi has not been matched yet, one of its unmatched adjacent vertices is selected according to a criterion. If such a vertex v exists, the matched pair u and v are merged into a cluster. If there is no unmatched adjacent vertex of
u, then vertex u remains unmatched, that is, u remains as a singleton cluster. Here, two vertices u and v are said to be adjacent if they share at least one net, that is, nets[u] ∩ nets[v] ≠ ∅. In the agglomerative clustering schemes, each vertex u is assumed to constitute a singleton cluster Cu = {u} at the beginning of each coarsening level. Then, vertices are again visited in a user-specified order. If a vertex u has already been clustered (i.e., |Cu| > 1), it is not considered for being the source of a new clustering. However, an unclustered vertex u can choose to join a multi-vertex cluster as well as a singleton cluster. That is, all adjacent vertices of an unclustered vertex u are considered for selection according to a criterion. The selection of a vertex v adjacent to u corresponds to including vertex u in cluster Cv to grow a new multi-vertex cluster Cu = Cv = Cv ∪ {u}. PaToH includes a total of 17 coarsening algorithms: eight matching and nine clustering algorithms, and the default method is a clustering algorithm that uses an absorption metric. In this method, when selecting the adjacent vertex v to cluster with vertex u, vertex v is selected to maximize ∑n∈nets[u]∩nets[Cv] |Cv ∩ n|/(s[n] − 1), where nets[Cv] = ⋃w∈Cv nets[w].
2. Initial Partitioning Phase: The goal in this phase is to find a bipartition of the coarsest hypergraph Hℓ. PaToH includes various random partitioning methods as well as variations of the Greedy Hypergraph Growing (GHG) algorithm for bisecting Hℓ. In GHG, a cluster is grown around a randomly selected vertex. During the course of the algorithm, the selected and unselected vertices induce a bipartition on Hℓ. The unselected vertices connected to the growing cluster are inserted into a priority queue according to their move-gain [], where the gain of an unselected vertex corresponds to the decrease in the cutsize of the current bipartition if the vertex moves to the growing cluster. Then, a vertex with the highest gain is selected from the priority queue. After a vertex moves to the growing cluster, the gains of its unselected adjacent vertices that are currently in the priority queue are updated and those not in the priority queue are inserted. This cluster-growing operation continues until a predetermined bipartition balance criterion is reached. The quality of this algorithm is sensitive to the choice of the initial random vertex. Since the coarsest hypergraph Hℓ is small, the initial partitioning heuristics can be run multiple times and the best bipartition can be selected for refinement during the uncoarsening
phase. By default, PaToH runs multiple initial partitioning algorithms and selects the bipartition with the lowest cost.
3. Uncoarsening Phase: At each level i (for i = ℓ, ℓ−1, . . . , 1), the bipartition Πi found on Hi is projected back to a bipartition Πi−1 on Hi−1. The constituent vertices of each cluster in Hi−1 are assigned to the part of the respective vertex in Hi. Obviously, Πi−1 of Hi−1 has the same cutsize as Πi of Hi. Then, this bipartition is refined by running KL/FM-based iterative improvement heuristics on Hi−1 starting from the initial bipartition Πi−1. PaToH provides refinement algorithms that are based on the well-known Kernighan–Lin (KL) [] and Fiduccia–Mattheyses (FM) [] algorithms. These iterative algorithms try to improve the given partition by either swapping vertices between parts or moving vertices from one part to the other, while not violating the balance criterion. They also provide heuristic mechanisms to avoid local minima. These algorithms operate in passes. In each pass, a sequence of unmoved/unswapped vertices with the highest gains is selected for move/swap, one by one. At the end of a pass, the prefix subsequence of moves/swaps with the maximum prefix sum, that is, the one incurring the maximum decrease in the cutsize, is constructed, allowing the method to jump over local minima. The permanent realization of the moves/swaps in this maximum prefix subsequence is efficiently achieved by rolling back the remaining moves at the end of the overall sequence. The overall refinement process in a level terminates if the maximum prefix sum of a pass is not positive. PaToH includes original KL and FM implementations, hybrid versions, like one pass of FM followed by one pass of KL, as well as improvements like the multilevel-gain concept of Krishnamurthy [] that adds a lookahead ability, or the dynamic locking of Hoffmann [] and of Dasdan and Aykanat [], which relaxes vertex locking, allowing a vertex to be moved multiple times in the same pass. PaToH also provides heuristic trade-offs, like early termination of a pass of the KL/FM algorithms, or boundary KL/FM, which only considers vertices that are on the boundary, to speed up the refinement. The default refinement scheme is boundary FM+KL.
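To make the notion of move gain concrete, the following C sketch computes the FM-style gain of moving a single cell to the other part of a bipartition under the cut-net metric with unit net costs. The xnets/vnets arrays denote a cell-to-net counterpart of the xpins/pins arrays; all names are illustrative and this is not PaToH's actual implementation.

/* Gain of moving cell v to the other part of a bipartition (cut-net metric,
   unit net costs): nets that would become internal add +1, nets that are
   currently internal and would become cut add -1. part[] holds 0 or 1. */
int move_gain(int v, const int *xnets, const int *vnets,
              const int *xpins, const int *pins, const int *part)
{
    int gain = 0, j, p;
    for (j = xnets[v]; j < xnets[v+1]; ++j) {
        int n = vnets[j];
        int same = 0, other = 0;            /* other pins of net n, by part */
        for (p = xpins[n]; p < xpins[n+1]; ++p) {
            int u = pins[p];
            if (u == v) continue;
            if (part[u] == part[v]) ++same; else ++other;
        }
        if (same == 0)  ++gain;   /* v is the only pin of n in its part */
        if (other == 0) --gain;   /* n is internal now; moving v cuts it */
    }
    return gain;
}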
Related Entries
Chaco
Data Distribution
Graph Algorithms
Graph Partitioning
Hypergraph Partitioning
Linear Algebra, Numerical
Preconditioners for Sparse Iterative Methods
Bibliographic Notes and Further Reading
The latest PaToH binary distributions, including the recently developed MATLAB interface [], and related papers can be found on the Web site listed in []. The "Hypergraph Partitioning" entry contains some use cases of hypergraph partitioning.
Bibliography
. Alpert CJ, Kahng AB () Recent directions in netlist partitioning: a survey. VLSI J (–):–
. Aykanat C, Cambazoglu BB, Uçar B (May ) Multi-level direct k-way hypergraph partitioning with multiple constraints and fixed vertices. J Parallel Distrib Comput ():–
. Aykanat C, Pinar A, Çatalyürek UV () Permuting sparse rectangular matrices into block-diagonal form. SIAM J Sci Comput ():–
. Bui TN, Jones C () A heuristic for reducing fill-in in sparse matrix factorization. In: Proceedings of the th SIAM conference on parallel processing for scientific computing, Norfolk, Virginia, pp –
. Catalyurek U, Boman E, Devine K, Bozdag D, Heaphy R, Riesen L (Aug ) A repartitioning hypergraph model for dynamic load balancing. J Parallel Distrib Comput ():–
. Çatalyürek UV () Hypergraph models for sparse matrix partitioning and reordering. Ph.D. thesis, Bilkent University, Computer Engineering and Information Science, Nov . http://www.cs.bilkent.edu.tr/tech-reports//ABSTRACTS..html
. Çatalyürek UV, Aykanat C (Dec ) A hypergraph model for mapping repeated sparse matrix-vector product computations onto multicomputers. In: Proceedings of international conference on high performance computing
. Çatalyürek UV, Aykanat C () Hypergraph-partitioning based decomposition for parallel sparse-matrix vector multiplication. IEEE Trans Parallel Distrib Syst ():–
. Çatalyürek UV, Aykanat C () PaToH: a multilevel hypergraph partitioning tool, version .. Bilkent University, Department of Computer Engineering, Ankara, Turkey. http://bmi.osu.edu/~umit/software.html (accessed on November , )
. Çatalyürek UV, Aykanat C () A hypergraph-partitioning approach for coarse-grain decomposition. In: ACM/IEEE SC, Denver, CO, November
. Çatalyürek UV, Aykanat C, Kayaaslan E () Hypergraph partitioning-based fill-reducing ordering. Technical Report
OSUBMI-TR--n and BU-CE-, The Ohio State University, Department of Biomedical Informatics and Bilkent University, Computer Engineering Department, . Submitted for publication
. Çatalyürek UV, Aykanat C, Uçar B () On two-dimensional sparse matrix partitioning: models, methods, and a recipe. SIAM J Sci Comput ():–
. Cheng C-K, Wei Y-C () An improved two-way partitioning algorithm with stable performance. IEEE Trans Comput Aided Des ():–
. Dasdan A, Aykanat C (February ) Two novel multiway circuit partitioning algorithms using relaxed locking. IEEE Trans Comput Aided Des ():–
. Fiduccia CM, Mattheyses RM () A linear-time heuristic for improving network partitions. In: Proceedings of the th ACM/IEEE design automation conference, pp –
. Hendrickson B, Leland R () A multilevel algorithm for partitioning graphs. Technical report, Sandia National Laboratories
. Hoffmann A () Dynamic locking heuristic – a new graph partitioning algorithm. In: Proceedings of IEEE international symposium on circuits and systems, pp –
. Karypis G, Kumar V () A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput ():–
. Karypis G, Kumar V () Multilevel algorithms for multi-constraint graph partitioning. Technical Report -, University of Minnesota, Department of Computer Science/Army HPC Research Center, Minneapolis, MN, May
. Kernighan BW, Lin S (Feb ) An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J ():–
. Krishnamurthy B (May ) An improved min-cut algorithm for partitioning VLSI networks. IEEE Trans Comput ():–
. Lengauer T () Combinatorial algorithms for integrated circuit layout. Wiley–Teubner, Chichester, UK
. Sanchis LA (Jan ) Multiple-way network partitioning. IEEE Trans Comput ():–
. Schloegel K, Karypis G, Kumar V () Parallel multilevel algorithms for multi-constraint graph partitioning. In: Euro-Par, pp –
. Schweikert DG, Kernighan BW () A proper model for the partitioning of electrical circuits. In: Proceedings of the th ACM/IEEE design automation conference, pp –
. Uçar B, Çatalyürek ÜV, Aykanat C () A matrix partitioning interface to PaToH in MATLAB. Parallel Comput (–):–
. Wei Y-C, Cheng C-K (July ) Ratio cut partitioning for hierarchical designs. IEEE Trans Comput Aided Des ():–
Partitioning Tool for Hypergraphs (PaToH) PaToH (Partitioning Tool for Hypergraphs)
PC Clusters Clusters
PCI Express Jasmin Ajanovic Intel Corporation, Portland, OR, USA
Synonyms
3GIO; PCI-Express; PCIe; PCI-E
Definition
PCI (Peripheral Component Interconnect) Express is a highly scalable interconnect technology that is the most widely adopted IO interface standard used in the computer and communication industry []. By providing scalable speed/width, extendable protocol capabilities, a common configuration/software model, and various mechanical form-factors, PCI Express supports a broad range of applications. It allows implementation of flexible connectivity between a processor/memory complex and IO subsystems, including peripheral controllers such as graphics, networking, storage, etc. PCI Express technology development is managed by the PCI-SIG (PCI Special Interest Group), an industry association comprising hundreds of member companies.
Discussion
Introduction – A Brief History of PCIe
PCI Express has its roots in Peripheral Component Interconnect (PCI), an open standard specification that was developed by the computing industry in the early 1990s. PCI was a replacement for the ISA bus, which was the mainstream PC architecture IO expansion standard at the time. Although there were several alternative solutions, such as MicroChannel, EISA, and VL-bus, that were aiming to replace/supplement ISA, none of them fully addressed the needs of an evolving PC industry. The PCI specification covered both the hardware and software interfaces between the PC's CPU/memory complex and add-in cards, such as graphics, network, and disk controllers. One of the most important aspects of PCI was support for the so-called "plug-and-play" mechanisms
which allowed operating system software to detect installed hardware components, including both add-in cards and motherboard-down devices, configure the system resources required for their operation, including memory address ranges and interrupts, and install the appropriate software device drivers. From a hardware perspective, PCI was initially defined as a 32-bit multiplexed address/data bus, shared among multiple devices attached to it, operating at 5 Volts and at a speed of 33 MHz. Over time, faster variants of PCI evolved to support the ever increasing performance requirements of CPUs, whose operating frequencies rose from tens of megahertz in the early 1990s to several gigahertz a decade later. These variants include:
● 66 MHz 32- and 64-bit PCI operating at 3.3 V or 5 V.
● AGP (Accelerated Graphics Port), a graphics-optimized interface operating at a common clock speed of 66 MHz and carrying data transfers at 2x, 4x, and 8x of that nominal speed, resulting in maximum bandwidths of 0.5, 1, and 2 GB/s over the 32-bit interface.
● PCI-X, a server-optimized solution operating at clock speeds of 66, 100, and 133 MHz and resulting in maximum bandwidths of 0.5, 0.8, and 1 GB/s over the 64-bit interface.
Towards the end of the twentieth century, none of these solutions proved to be an adequate answer to emerging applications that required higher scalability (performance and protocol capabilities) and a more efficient interface (pins, bandwidth, and power). In 2001, Intel Corporation, together with several other industry leaders, spearheaded the development of the next-generation IO architecture. Initially called 3GIO [], this architecture was later endorsed by the PCI-SIG as PCI Express (PCIe). Appearing in systems starting in 2004, PCI Express was rapidly adopted by the PC/PCI ecosystem, entirely replacing AGP and PCI-X, and in some instances PCI, within a few product generations. The key to the PCI Express success was in leveraging the PCI base architecture by supporting full backwards compatibility at the software level while providing a more optimized interconnect solution. The PCI Express physical interface is based on width- and speed-scalable, point-to-point,
differential serial signaling technology operating initially at 2.5 GT/s (Giga Transfers/second). As fast as the new PCI Express standard was, processor architectures have continued to accelerate to meet the demands of emerging applications, and PCI Express needed to keep up. For example, increasing bandwidth was needed to improve the performance of data-intensive graphics workloads as well as server-class storage and network solutions. PCI Express continued to improve by scaling up speed as well as adding architectural/protocol extensions. Figure shows the evolution of PCI Express technology.
Technology Overview
Basic Elements and Concepts
Link and Lane
PCI Express is a point-to-point interconnect technology where the Link represents a fundamental element of the PCIe-based interconnect fabric. A basic Link is a dual-simplex communications channel between two components that consists of a single Lane. A Lane consists of two low-voltage, differentially driven signal pairs: a Transmit pair and a Receive pair, as shown in Fig. . To aggregate the bandwidth of a PCIe Link, multiple Lanes can be grouped together to provide a "wider" Link. In PCIe terminology, Link width is expressed using the xN ("by N") notation, where N is the number of Lanes that form the Link. The PCIe architecture specification defines operations for x1, x2, x4, x8, x12, x16, and x32 Link widths. Companion form-factor specifications (add-in card and connector) define operations for a subset of the operational widths (x1, x4, x8, and x16).
Signaling, Speed, and Bandwidth
To carry communication over the Link, PCIe uses Low-Voltage Differential Signaling (LVDS) with embedded clocking. Embedded clocking is a mechanism where clock information is embedded within the transmitted data by providing a guaranteed number of transitions between "0"s and "1"s that are required to correctly extract the clock on the receiver side. Revision 1.0 (Gen1) and Revision 2.0 (Gen2) of the PCIe Specification use the standard 8b/10b encoding scheme [], which is a common mechanism for a number of industry standard interfaces based on serial signaling technology (e.g., Infiniband, Fibre Channel, SATA, etc.).
PCI Express. Fig. PCI express technology roadmap (PCIe 1.0: Gen1 at 2.5 GT/s with 8b/10b encoding; PCIe 2.0: Gen2 at 5 GT/s with 8b/10b encoding; PCIe 3.0: Gen3 at 8 GT/s with 128b/130b encoding; Rev 2.1 protocol extensions: atomic ops, caching hints, improved PM, multicast, enhanced SW model; IOV extensions: IO virtualization, device sharing; bandwidth shown for x16 PCIe, 1999–2013)
This scheme comes with an overhead of 20% because it uses 10 bits to carry 8 bits of actual information. Since it is relatively simple and very robust, 8b/10b was used by PCIe as a reasonable tradeoff between efficiency and complexity, allowing PCIe Gen1 and Gen2 to hit target effective bandwidths while operating at moderately high speeds of 2.5 GT/s and 5 GT/s, respectively. However, continuing with a 2x increase of operational speed proved to be difficult from a technology enabling/adoptability and ecosystem perspective. For PCIe 3.0 (Gen3), the PCI-SIG defined a new, more efficient 128b/130b encoding scheme, which with its ∼1.5% overhead effectively doubles the bandwidth of 5 GT/s Gen2 while operating only at 8 GT/s instead of 10 GT/s. (Note that the 20% overhead of 8b/10b brings the effective data transfers of Gen2 from 5 GT/s down to 4 GT/s.) The following Table shows the raw and effective speeds of all three generations of PCIe, including the total bandwidth for the example of a x16 PCIe link. To support backwards compatibility at the system and component/add-in card level, the PCI Express Specification requires newer-generation devices that operate at higher speeds to support operations at
lower speeds. Examples: a Gen2 PCIe device operating at a nominal 5 GT/s must support operations at 2.5 GT/s; a Gen3 device must support operations at both 5 GT/s and 2.5 GT/s.
Link Configuration
The PCI Express architecture allows devices that support different Link widths and speeds to be configured for proper operation. The negotiation of width and speed is done as part of the Link initialization process, where the two devices exchange information about their capabilities and use a lowest-common-denominator method to determine the operational width and speed. For example, if Device A supports a wider Link and a higher speed than Device B, the Link that connects them will be initialized with Device B's narrower width and lower speed. Note that PCIe allows only configurations of devices with symmetric Link width. Link width/speed configuration occurs at the Link hardware level without any involvement of software.
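The lowest-common-denominator rule can be summarized in a few lines of C. This is only an illustrative sketch (the structure and field names are invented for the example, and it assumes each device supports all defined widths and speeds up to its maximum), not actual configuration logic:

/* Pick the operational width and speed supported by both devices. */
typedef struct {
    int    max_width;        /* e.g., 1, 2, 4, 8, 12, 16, or 32 lanes */
    double max_speed_gts;    /* e.g., 2.5, 5.0, or 8.0 GT/s          */
} LinkCaps;

static LinkCaps negotiate_link(LinkCaps a, LinkCaps b)
{
    LinkCaps link;
    link.max_width     = a.max_width < b.max_width ? a.max_width : b.max_width;
    link.max_speed_gts = a.max_speed_gts < b.max_speed_gts ? a.max_speed_gts
                                                           : b.max_speed_gts;
    return link;
}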
PCI Express. Fig. PCI express link examples - x1 and x4 (each Lane consists of differentially driven Transmit and Receive pairs between Device A and Device B, each device using its own reference clock)
PCI Express. Table PCI express speeds and bandwidths
PCIe Generation   Raw bit rate   Effective bit rate   Bandwidth per lane per direction   Total bandwidth* for x16 link
PCIe 1.x          2.5 GT/s       2 Gb/s               ∼250 MB/s                          ∼8 GB/s
PCIe 2.0          5.0 GT/s       4 Gb/s               ∼500 MB/s                          ∼16 GB/s
PCIe 3.0          8.0 GT/s       8 Gb/s               ∼1 GB/s                            ∼32 GB/s
* Total bandwidth represents the aggregate interconnect bandwidth in both directions
Once the Link is initialized and configured for proper width/speed operations, it may be reconfigured during run-time as a result of a RAS (Reliability, Availability, Serviceability) event due to, for example, data integrity problems. Speed and/or width can be reduced in an attempt to correct the problem and keep the system running with reduced performance/functionality. Note that the PCIe Specification defines a mechanism where software, through access to configuration/control registers, can override the established hardware configuration of the Link and force the Link to operate at a lower speed than it is nominally capable of.
Packet-Based Protocol
Instead of using dedicated signals for address, control, and data (as did its predecessor PCI), PCI Express uses packets to communicate information between components connected to the PCIe fabric. Packets contain all of the information related to transactions, such as: source and target identification/address, type (e.g., Memory Read, Memory Write, Message), attributes (e.g., Isochronous, No Snoop), and data. In addition to this information, Link hardware logic inserts additional information required for correct packet transfer across the Link, such as framing, sequencing, and packet integrity protection (CRC). Packets can be differentiated as Request and Response packets. To complete an entire transaction (e.g., a Memory Read), a single Request packet may cause one or multiple Response packets. Note that some request transaction types (e.g., Message) do not require responses (Fig. ).
PCI Express. Fig. Packet-based protocol (Request and Completion packets, each delimited by framing symbols and protected by a sequence number and CRC, are exchanged between Device A and Device B)
PCI Express Layering Overview
The PCIe interface stack is defined using a layered approach, with the PCIe Transaction, Data Link, and Physical Layers being formally defined in the PCIe Base Specification and the rest of the stack, related to Software and Mechanical aspects, being covered in companion documents. The following simplified Fig. shows the PCIe interface stack and highlights major aspects of each layer.
PCI Express. Fig. PCI express interface stack: Software (PCI compatible configuration and device driver software model), Transaction (packet-based split-transaction protocol), Data Link (reliable data transport services), Physical (electrical interface and signaling mechanisms), Mechanical (market segment specific form factors)
Transaction Layer
The Transaction Layer is responsible for assembly (for transmit) and disassembly (upon receive) of Transaction Layer Packets (TLPs). PCIe supports load-store as well as message-based transactions, where TLPs are used to communicate transactions such as reads, writes, messages, and events using different types of semantics (Memory, I/O, Configuration, and Message), address types/formats (32-bit/64-bit, device ID), and attributes (No Snoop, Relaxed Ordering, and ID-Based Ordering). The Transaction Layer manages the flow of packets between transmitting and receiving devices using a credit-based flow control scheme. Instead of using physical pins/wires to support sideband signals, such as interrupts and power-management requests, PCIe uses Messages as "virtual wires" that carry information in-band. This improves the overall efficiency of the interface by eliminating a large number of pins/wires.
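The credit-based flow control just mentioned can be illustrated with a small C sketch. The types and functions below are hypothetical (the actual PCIe protocol tracks separate credit types for headers and data per virtual channel); the sketch only shows the basic rule that a packet may be sent only when the receiver has advertised enough credits:

/* Illustrative credit-based flow control state for one link. */
typedef struct { int credits_available; } FlowControlState;

int can_transmit(const FlowControlState *fc, int packet_credits)
{
    return fc->credits_available >= packet_credits;   /* enough credits? */
}

void on_transmit(FlowControlState *fc, int packet_credits)
{
    fc->credits_available -= packet_credits;          /* consume credits */
}

void on_credit_update(FlowControlState *fc, int returned_credits)
{
    fc->credits_available += returned_credits;        /* receiver drained buffers */
}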
Data Link Layer
This layer provides Link management and data integrity, including error detection and error correction. On the transmit side, the Data Link Layer calculates and applies a CRC (Cyclic Redundancy Check) data protection code on the TLP submitted by the Transaction Layer. It also adds a TLP sequence number, and passes the entire packet to the Physical Layer for transmission across the Link. On the receive side, the Data Link Layer checks the integrity of received TLPs before passing them to the Transaction Layer for further processing. If an error is detected, this Layer requests retransmission of TLPs until the information is correctly received, or the Link is determined to have failed.
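As an illustration of the kind of data-integrity check applied here, the following generic bitwise CRC-32 routine shows how a check code can be computed over a packet. The exact polynomial, seed, and bit ordering used for the PCIe LCRC are defined in the PCIe Base Specification, so this is not the actual LCRC algorithm, only a sketch of the idea:

#include <stdint.h>
#include <stddef.h>

/* Generic reflected CRC-32 (polynomial 0xEDB88320) over a byte buffer. */
uint32_t crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int b;
    for (i = 0; i < len; ++i) {
        crc ^= data[i];
        for (b = 0; b < 8; ++b) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return ~crc;    /* final inversion */
}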
Physical Layer
The Physical Layer converts information received from the Data Link Layer into an appropriate serialized format and transmits it across the Link. This Layer consists of Electrical and Logical functional blocks. The Electrical block includes all circuitry required for interface operation: output and input buffers, parallel-to-serial and serial-to-parallel conversion, PLL(s), and transmitter/receiver signal conditioning circuitry. The Logical block supports functions required for interface initialization and maintenance, including configuration of Link speed and width.
PCI Express Platform Examples
The PCIe architecture supports a variety of platform configurations by allowing the mixing and matching of PCIe link speeds and widths to support different applications. Figure shows examples of PCIe-based client and server computer platforms. To support high-performance graphics applications, client/workstation computer platforms typically utilize x16 PCIe. Lower-bandwidth functions such as network adapters typically use x1 PCIe. Various additional functions (e.g., enhanced audio) can be provided using add-in card slots. Note that the IO Bridge shown on the client platform in Fig. also provides support for some legacy functions, such as the old 5 V/33 MHz PCI bus, which is required until the technology transition completes. Server platforms typically use x4 and x8 PCIe ports, which are required by high-bandwidth applications such as high-performance network (10 GigE) and storage (SAS) adapters, Infiniband, Fibre Channel, etc. An important building block for servers is the PCIe Switch, which is architecturally supported by the PCIe Specification. A PCIe Switch is typically used for expansion of the IO subsystem by providing additional PCIe interface ports.
Packet Flow Through the Layers
The TLPs are formed in the Transaction Layer to carry the information from the transmitting component to the receiving component. As the transmitted packets flow through the other layers, they are extended with additional information necessary to carry out the transfers. At the receiving side, the reverse process occurs and packets get transformed from their Physical Layer representation to the Data Link Layer representation and finally to the form that can be processed by the Transaction Layer of the receiving device. Figure shows the conceptual flow of transaction-level packet information through the layers. Note that for the purpose of Link management, a simpler form of packet communication is supported between two Data Link Layers that are connected to the same Link. These packets are referred to as Data Link Layer Packets (DLLPs).
PCI Express. Fig. Packet formation (Frame 1 byte | Sequence# 2 byte | Header 12/16 byte | Data 0–4 Kbyte | ECRC 4 byte | LCRC 4 byte | Frame 1 byte; the header, data, and ECRC are produced by the Transaction layer, the sequence number and LCRC are added by the Data link layer, and the framing by the Physical layer)
Architecture Features
PCI Express is a scalable architecture that provides a highly flexible and expandable set of capabilities. This section highlights essential features of PCIe.
PCI Express. Fig. Examples of PCIe based platforms (a client platform in which a memory bridge provides PCIe x16 to graphics and PCIe x1 links, e.g., for 1Gb Ethernet and a wireless NIC, while an IO bridge provides legacy PCI, USB, SATA, and PCIe x1 add-in slots; and a server platform in which an IO hub provides PCIe x4/x8 for 10Gb Ethernet, PCIe x8 for a SAS RAID adapter, PCIe x8/x16 expansion, and a PCIe switch connecting Infiniband, RAID storage, and PCIe x4 expansion slots)
Scalable Protocol
PCIe uses a fully packetized split-transaction protocol, as described under the Technology Overview section. In addition to load-store semantics, it provides messaging support that is used by the Virtual Wire mechanism to eliminate side-band signals by converting them into in-band messages. PCIe uses a credit-based scheme to control the flow of packets within the PCIe fabric. This scheme applies per Link, and it is used to prevent buffer overflow. To minimize the dependency between different traffic flows in a system, PCI Express provides a Virtual Channel (VC) mechanism and transaction ordering relaxations. These mechanisms mitigate the overhead of a strict producer-consumer ordering model (which may create head-of-line blocking conditions) and allow support for differentiated Quality of Service at the platform level. For future scalability of supported packet formats, PCIe defines the Transaction Layer Packet Prefix, a mechanism for extending TLP headers.
PCI Compatible Software Model
PCIe leverages the standard mechanisms defined in the PCI Plug-and-Play specification for device initialization, enumeration, and configuration. This allows legacy
software operating systems to boot without modifications on a PCIe-based system. It also preserves compatibility with the device driver software model, enabling existing APIs and application software to execute unchanged. To remove the platform overhead and limitations associated with legacy PCI software configuration, PCIe provides an Enhanced Configuration mechanism. This mechanism needs to be supported by newer operating systems to take full advantage of PCIe advanced capabilities (e.g., Advanced Error Logging/Reporting, the Virtual Channel mechanism). Enhanced Configuration also allows new capabilities to be added in the future.
Scalable Performance
The primary factor in PCIe performance scaling, as measured by the interface bandwidth, is a function of interface width and operational frequency and can be expressed using the following formula:
Total Link Bandwidth [GB/s] = Link_Width [number of lanes] × Effective_Signaling_Rate [Gb/s] × 2 [directions] / 8 [bits per byte]
GB/s = Gigabyte per second, Gb/s = Gigabit per second, GT/s = Gigatransfer per second. For Gen1 and Gen2: Effective_Signaling_Rate [Gb/s] = Raw_Signaling_Rate [GT/s] × 0.8; for Gen3: Effective_Signaling_Rate [Gb/s] ≈ Raw_Signaling_Rate [GT/s]. A secondary factor of PCIe performance is related to the fact that PCIe is a point-to-point interconnect with two unidirectional signaling paths which are contention-free, i.e., they do not require arbitration. This means that in systems that contain multiple PCIe Links, traffic on each Link can occur independently and simultaneously, which improves total system throughput. An additional contributor to performance is related to PCIe pin/bandwidth efficiency. By improving this aspect, PCIe allows a tighter IO integration within the platform, e.g., direct PCIe attach to high-integration CPUs. This can result in lower IO system latencies in an optimized system. Note that the above formula shows only theoretical bandwidths, and that actual performance as seen at the application level depends on a number of factors. These factors include PCIe interface and system implementation aspects such as: supported data payload size, interface-level data buffering/queuing and effectiveness of flow control, arbitration mechanisms throughout the platform (e.g., competing for host memory access), power-management mechanisms and policies, software device driver and API overheads, etc.
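As a worked example of the formula, the following small C program (illustrative only, not part of any PCIe software) reproduces the x16 aggregate bandwidths listed in the table above:

#include <stdio.h>

/* Aggregate link bandwidth in GB/s: lanes x effective Gb/s per lane per
   direction x 2 directions / 8 bits per byte. coding_eff is 0.8 for
   8b/10b (Gen1/Gen2) and approximately 1.0 for 128b/130b (Gen3). */
static double link_bandwidth_gbs(int lanes, double raw_gts, double coding_eff)
{
    double effective_gbps = raw_gts * coding_eff;
    return lanes * effective_gbps * 2.0 / 8.0;
}

int main(void)
{
    printf("x16 Gen1: %.0f GB/s\n", link_bandwidth_gbs(16, 2.5, 0.8)); /* ~8  */
    printf("x16 Gen2: %.0f GB/s\n", link_bandwidth_gbs(16, 5.0, 0.8)); /* ~16 */
    printf("x16 Gen3: %.0f GB/s\n", link_bandwidth_gbs(16, 8.0, 1.0)); /* ~32 */
    return 0;
}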
Advanced Power Management
PCIe defines a Link-level power-management scheme, including the active-state power-management (ASPM) protocol. ASPM manages power-state transitions on the Link transparently to run-time software by detecting idle conditions on the Link, i.e., when no actual data is being communicated over the Link. Note that in signaling schemes that use embedded clocking, data needs to be transmitted continuously to maintain synchronization between transmitter and receiver. To reduce the power consumption during "idle," PCIe defines low-power link states. These states save power, but transitions into and out of them require recovery time to resynchronize the transmitter and receiver, which may potentially impact the Link latency and affect the overall system performance. In addition to Link-level power management, PCIe defines related system-level power-management mechanisms such as:
● Latency Tolerance Reporting – reduces platform power based on PCIe device service requirements.
● Opportunistic Buffer Flush and Fill – aligns PCIe-device activity with platform power-management events to further reduce platform power.
● Dynamic Power Allocation – dynamic control of the power/thermal budget per PCIe device.
Reliability, Availability, Serviceability (RAS) Support
RAS capabilities are defined to ensure data integrity, provide the ability to identify and manage errors, and allow installation/removal of components in the running system without requiring an operating system shutdown.
● Data Integrity – As a basic/required mechanism, PCIe defines a data integrity scheme where a 32-bit CRC (Cyclic Redundancy Check) is used to protect packets transmitted over a single Link. For platforms that use more complex topologies (i.e., with PCIe Switches) and require end-to-end data integrity, PCIe defines an optional end-to-end 32-bit CRC. This CRC code is used in addition to the Link-local 32-bit CRC to protect the data in high-reliability server applications.
● Hot Plug and Hot Swap – PCIe defines native support for hot plugging or hot swapping of IO cards/modules. Solutions based on PCIe hot plug/swap address requirements of both server and portable computer platforms and support industry standard software stacks for platform management and configuration.
● Advanced Error Reporting/Handling – PCIe defines support for error logging/reporting to improve system-fault isolation and enable recovery solutions.
Differentiated Quality of Service (QoS) Support
Using the Virtual Channel (VC) mechanism as a foundation, PCIe supports eight levels of QoS-differentiated traffic, including an isochronous traffic type. The PCIe specification defines the VC-system configuration and programmable VC arbitration mechanisms necessary to support an end-to-end solution designed for applications, such as isochronous ones, that require real-time delivery of voice- and video-related data.
IO Virtualization and Device Sharing Support
PCIe defines an IO interface-level mechanism for support of platform virtualization. In virtualized platforms, a single instance of platform hardware is mapped onto multiple independent Virtual Machines, each running their own operating system and application software stacks. In addition to the PCI Express Base Specification, the PCI-SIG maintains a set of specifications that cover PCIe support for:
● Address Translation Services: allows PCIe devices to obtain copies of translated system addresses to mitigate address translation latencies in an IO-virtualized platform.
● Single-Root IO Virtualization and Device Sharing: improves system performance scaling by allowing a single PCIe device to be presented to the software as multiple independent/pseudo-independent hardware devices.
● Multi-Root IO Virtualization: allows multiple independent PCIe-based platforms (such as in blade-server systems) to overlap/share system resources.
Note that IO Virtualization capabilities require new software, both at the PCIe fabric configuration/management level as well as at the system level.
Support for Heterogeneous Processing and Application Acceleration
There is a set of capabilities that are provided to support high-performance applications which require efficient co-processing and data sharing, such as GP-GPU, computational accelerators, and network/storage accelerators. These include:
● Transaction Layer Packet Processing Hints – request hints from a PCIe device to enable optimized processing within the host memory/cache hierarchy.
● Atomic Read-Modify-Write Transactions – reduce synchronization overhead for shared data structures.
● Address Based Multicast – allows significant gains in efficiency compared to multiple unicast transactions by carrying a transaction between a single source and multiple destinations.
Form-Factors
In addition to chip-to-chip connectivity usage, PCI Express supports a variety of system board and add-in card form-factors. To address the diversity of applications across many segments (desktop and mobile PCs, servers, embedded/communications, etc.), the following add-in form-factors are defined:
● PCIe Card Electromechanical (CEM), used for desktops, workstations, and servers.
● PCIe Mini Card, used for portable computers.
● PCIe Module, also known as Server IO Module (SIOM), optimized for server/communication systems.
● ExpressCard, used for portable computers and small form-factor desktops.
● PCIe External Cabling Specification, used for embedded/communication systems.
Figure shows a sample of form-factors, including the most common PCIe CEM add-in card specified by the PCI-SIG. This card is defined to fit into traditional desktop PC and server systems and comes with different connector widths (x1, x4, x8, and x16). It supports power usage from a 25 W baseline up to 300 W in the x16 variant used for high-end graphics and similar applications. In addition to the PCI-SIG, there are several other industry associations, such as PCMCIA and PICMG, that have been involved in the development of PCI Express based form-factors.
PCI Express Today
Products based on the first generation (2.5 GT/s) and second generation (5 GT/s) of PCIe technologies are deployed broadly across the computer and communication/embedded industries. When you consider this in light of the PCI-SIG's many member companies, a large number of them very active, the PCIe product development and user community represents a large ecosystem. According to a report from an industry analyst organization, In-Stat [], the market for components that include a PCIe interface will grow to hundreds of millions of units per year. The PCI-SIG is marching towards the milestone of releasing a third generation of the PCI Express spec, i.e., PCIe Rev 3.0, in the second half of 2010. In addition to 2.5 and 5 GT/s, PCIe 3.0 supports an 8 GT/s operational speed by using a new, more efficient signaling encoding scheme, which allows a doubling of bandwidth compared to the prior generation. In addition, the new generation of PCIe products based on Rev 3.0 will support enhanced power management as well as features for
PCI Express. Fig. Example of PCIe form-factors (a traditional desktop/server PCIe add-in card with x1, x4, x8, and x16 connectors, a mobile ExpressCard, and a Server IO Module)
application performance acceleration through: atomic read-modify-write, data reuse/caching hints, multicast operations, etc.
Future Directions
There are two very significant trends within the mainstream computer and communication industry that will influence the evolution of PCI Express technology:
● An increasing level of integrated functionality, resulting in very high-integration single-die devices, known as SoCs (System on a Chip), or high-integration multi-die components/modules, known as SiPs (System in a Package).
● The evolution of the Internet, with an ever increasing need for improved computational, storage, and communication performance of the main building blocks that provide the service.
The first trend is mainly driven by applications that dictate small form-factors, low power, and low cost. They range from smart-phones and entertainment devices to the components used in embedded control applications such as home appliances, cars, printers, etc. For some of these products, the levels of integration will result in the removal of external interfaces such as PCIe; for the others, external interfaces will still be needed, but with a requirement for significantly improved power efficiency. This may dictate the evolution of PCIe technology in two directions:
● On-die interconnects that are architecturally equivalent to PCIe and allow integration of PCIe devices in the form of IP (Intellectual Property) building blocks.
● Lower-power and lower-speed variants of PCIe.
Note that a significant part of the value of the PCIe ecosystem comes from the ability to support standard operating system software, device drivers, and applications. On-die interconnects that are an architectural equivalent (not necessarily a physical/electrical equivalent) of PCIe will enable integration of hardware functionality without the requirement to change the software stack. This may result in huge R&D savings and faster time-to-market for SoC/SiP designs that use PCIe IP building blocks. The second trend, the evolution and expansion of the Internet, will translate into a requirement for the computer industry to provide new generations of communication
servers and data centers that are capable of handling increasing and diverse workloads. This will drive a need for hardware and software optimizations for virtualization and cloud computing. Evolution of traditional rack-servers and blade-servers may open room for a new interconnect backbone/fabric based on PCIe technology. This will drive the following enhancements to the PCIe technology:
● Enhanced routing and configuration to support more elaborate topologies (besides the PCI/PCIe hierarchical tree) to allow system scalability through aggregation of a large number of discrete devices as well as high-integration devices (that consist of multiple logical devices).
● Capability for tunneling and mapping other protocols (e.g., for storage, networking, system management, etc.), which will allow consolidation of interconnect technologies within the rack/blade and perhaps even a data center.
● Doubling/multiplying the bandwidth needed to support the next generation of Internet, HPC (High-Performance Computing), and cloud computing. This will eventually result in the emergence of PCIe optical connectivity, starting with discrete solutions and evolving to silicon-photonics-based integrated solutions.
Evolution of PCI Express will likely track the future path of the computer industry. New IO technologies will continue to emerge in the future, using some elements of PCI Express as their foundation.
Related Entries
Bandwidth-Latency Models (BSP, LoGP)
Benchmarks
Buses and Crossbars
Collective Communication
Computer Graphics
Data Centers
Deadlocks
Ethernet
Flow Control
InfiniBand
Interconnection Networks
Network Interfaces
Routing (Including Deadlock Avoidance)
Switch Architecture
Switching Techniques
Synchronization
Bibliographic Notes and Further Reading
As mentioned in the introduction, development of PCI Express technology as an industry standard is coordinated through the PCI Special Interest Group (PCI-SIG). This organization maintains the website with officially published specification documents and other collaterals that are part of the PCI Express standard. For detailed information go to:
● PCI-SIG: www.pcisig.com
Note that, although some documents are available to a general audience, membership in the PCI-SIG is required for unrestricted access to all documents and specifications managed by the SIG. In addition to the PCI-SIG, there are several other industry organizations, such as:
● PCMCIA: www.pcmcia.org
● PICMG: www.picmg.org
that manage development of complementary industry standards based on the PCIe Base Specification. These organizations enhance PCI Express technology by defining additional form-factors and design collaterals that allow even broader adoption of PCI Express component-level hardware products as well as companion software. Among the component and platform vendors that provide significant additional support for PCI Express technology is Intel Corporation with the Intel Developer Network for PCI Express Architecture. This forum provides members with technical data, marketing support, and industry connections needed to accelerate the innovation and marketing of PCI Express based solutions. For more details see:
● PCIe DevNet: http://www.intel.com/technology/pciexpress/devnet/
There are several technical books that provide overviews of PCI Express technology as well as detailed implementation guidelines. See references [–].
Bibliography
. Intel white paper. Creating a Third Generation I/O Interconnect. www.intel.com/technology/pciexpress/devnet/docs/WhatisPCIExpress.pdf
. In-Stat – In Depth Analysis (www.in-stat.com): I/O, I/O, Changing the Status Quo: Chip-to-Chip Interconnects, February , , http://www.instat.com/newmk.asp?ID=
. Widmer AX, Franaszek PA () A DC-balanced, partitioned-block 8B/10B transmission code. IBM J Res Dev ():–
. Budruk R, Anderson D, Shanley T () PCI express system architecture. Mindshare, Colorado
. Wilen A, Schade JP, Thornburg R () Introduction to PCI express: a hardware and software developer's guide. Intel, Hillsboro
. Solari E, Congdon B, Clark D () The complete PCI express reference: design implications for hardware and software developers (Engineer to Engineer series). Intel, Hillsboro
. OpenSystems Media – Articles: PCI Express. http://www.opensystems-publishing.com/articles/search/?topic=PCI+Express
. Compact PCI Systems. http://www.compactpci-systems.com
. DSP-FPGA.com – Articles, Videos and White Papers: PCI Express. http://www.dsp-fpga.com/articles/search/index.php?mag=&max=&op=ew&q=pcie&skip=
. Embedded Systems. www.embedded.com
. PCI Express Architecture Frequently Asked Questions, PCI-SIG. http://www.pcisig.com/news_room/faqs/faq_express/
. PCI Express External Cabling . Specification. http://www.pcisig.com/specifications/pciexpress/pcie_cabling./
. PCI-SIG Announces PCI Express . Bit Rate For Products In And Beyond. Aug . http://www.pcisig.com/news_room/__/
. PHY Interface for the PCI Express Architecture, version . (PDF). http://download.intel.com/technology/pciexpress/devnet/docs/pipe_.pdf
. PCI Express . Frequently Asked Questions, PCI-SIG. http://www.pcisig.com/news_room/faqs/pcie._faq/
PCIe PCI Express
PCI-E PCI Express
PCI-Express PCI Express
Peer-to-Peer
Stefan Schmid, Roger Wattenhofer
Telekom Laboratories/TU Berlin, Berlin, Germany; ETH Zürich, Zurich, Switzerland
Synonyms Distributed hash table (DHT); Overlay network; Decentralization; Open distributed systems; Consistent hashing
Definition
The term peer-to-peer (p2p) is ambiguous, and is used in a variety of different contexts, such as:
● In popular media coverage, p2p is often synonymous with software or protocols that allow users to "share" files (music, software, books, movies, etc.). p2p file sharing is very popular and a large fraction of the total Internet traffic is due to p2p.
● In academia, the term p2p is used mostly in two ways. A narrow view essentially defines p2p as the "theory behind file-sharing protocols." In other words, how do Internet hosts need to be organized in order to deliver a search engine to find (share) content (files) efficiently? A popular term is "distributed hash table" (DHT), a distributed data structure that implements such a content search engine. A DHT should support at least a search (for a key) and an insert(key, object) operation (a minimal code sketch of this idea is given after this list). A DHT has many applications beyond file sharing, e.g., the Internet domain name system (DNS).
● A broader view generalizes p2p beyond file sharing: Indeed, there is a growing number of applications operating outside the juridical gray area, e.g., p2p Internet telephony à la Skype, p2p massive multiplayer games, p2p live audio and video streaming as in PPLive, StreamForge, or Zattoo, or p2p social storage and cloud computing systems such as Wuala. Trying to account for the new applications beyond file sharing, one might define p2p as a large-scale distributed system that operates without a central server bottleneck. However, with this definition almost "everything decentralized" is p2p!
● From a different viewpoint, the term p2p may also be synonymous with privacy protection, as various
Peer-to-Peer
pp systems such as Freenet allow publishers of information to remain anonymous and uncensored. In other words, there is no single well-fitting definition of pp, as some definitions in use today are even contradictory. In the following, an academic viewpoint is assumed (second and third definition above).
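To make the interface concrete, the following is a minimal sketch of the two operations a DHT must support, insert(key, object) and search(key), with an ordinary hash map standing in for the distributed routing substrate; the class and function names are illustrative only.

# Minimal sketch of the DHT interface mentioned above: insert(key, object)
# and search(key). A real DHT spreads the key space over many peers; here a
# single dict stands in for that substrate, purely for illustration.
import hashlib

class ToyDHT:
    def __init__(self):
        self._store = {}                      # key hash -> object

    def _hash(self, key: str) -> str:
        # Map arbitrary keys to fixed-length identifiers, as DHTs do.
        return hashlib.sha1(key.encode()).hexdigest()

    def insert(self, key: str, obj) -> None:
        self._store[self._hash(key)] = obj

    def search(self, key: str):
        return self._store.get(self._hash(key))

dht = ToyDHT()
dht.insert("song.mp3", "peer-42 has this file")
print(dht.search("song.mp3"))                 # -> "peer-42 has this file"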
Discussion
The Paradigm
At the heart of p2p computing lies the idea that each network participant serves both as a producer ("server") and a consumer ("client") of services. Depending on the application, the shared resources can be data (files), CPU power, disk storage, or network bandwidth. Often p2p systems have an open clientele and do not rely on the availability of specific individual machines; rather, they can deal with dynamic resources and do not exhibit single points of failure or bottlenecks. Compared to centralized solutions, the p2p paradigm features better scalability (the amount of resources grows with the network size), availability (avoiding a single point of failure), reliability, fairness, cooperation incentives, privacy, and security – just about everything researchers expect from a future Internet architecture. As such, it is not surprising that new "clean slate" Internet architecture proposals often revolve around p2p concepts. One might naively assume that, for instance, scalability is not an issue in today's Internet, as even the most popular web pages are generally highly available. However, this is not necessarily due to our well-designed Internet architecture, but rather due to the help of so-called overlay networks: The Google Web site, for instance, manages to respond so reliably and quickly because Google maintains a large distributed infrastructure, essentially a p2p system. Similarly, companies like Akamai sell "p2p functionality" to their customers to make today's user experience possible in the first place. Quite possibly today's p2p applications are just testbeds for tomorrow's Internet architecture.
Implications
p2p networks are often highly dynamic in nature. While traditional computer systems are typically based on fixed infrastructures and are under a single administrative domain (e.g., owned and maintained by a single company or corporation), the participating machines in p2p networks are under the control of individual (and to some extent: anonymous) users who can join and leave at any time and concurrently. In p2p parlance, such membership changes are called churn. A second implication of the autonomy of the machines in p2p networks is that the network consists of different stakeholders. Users can have various reasons for joining the network. For instance, an (anonymous) user may not voluntarily contribute his or her bandwidth, disk space, or CPU cycles to the system, but prefer to free ride. This adds a socioeconomic aspect to p2p computing. As the p2p paradigm relies on the contributions of the participating machines, effective incentive mechanisms have to be designed, which foster cooperation and punish free riders. Another source of inequality in p2p systems, apart from selfishness, is heterogeneity: Due to the open membership, different machines run different operating systems, have different Internet connections, and so on.
Applications
The best-known representatives of p2p technology are probably the numerous file-sharing applications such as Napster, Gnutella, KaZaA, eMule, or BitTorrent. Also, the Internet telephony tool Skype is very popular and used by millions every day. Zattoo, PPLive, and StreamForge, among many others, use p2p principles to stream video or audio content. The cloud computing service Wuala offers free online storage by exploiting the participants' disks and Internet connections to improve performance. Recently, the power and anonymity of decentralized internetworking has gained the attention of operators of botnets, in order to attack certain infrastructure components by denial-of-service attacks. Finally, p2p technology is used for large-scale computer games.
Architecture Variants
Several p2p architectures are known:
● Client/Server goes p2p: Even though Napster is known to be the first p2p system, by today's standards its architecture would not deserve the label p2p anymore. Napster clients accessed a central server that managed all the information about the shared files, i.e., which file was to be found on which client. Only the downloading process itself was between clients ("peers") directly, hence p2p. In the early days of Napster the load of the server was relatively small, so the simple Napster architecture was sufficient. Over time, it turned out that the server may become a bottleneck – and an attractive target for an attack. Indeed, eventually a judge ruled the server to be shut down (a "juridical denial of service attack"). However, it remains to note that many popular p2p networks today still include centralized components, e.g., KaZaA or the eDonkey network accessed by the eMule client. Also, the peer swarms downloading the same file in the BitTorrent network are organized by a so-called tracker, whose functionality today is still centralized (although initiatives exist to build distributed trackers).
● Unstructured p2p: The Gnutella protocol is the antithesis of Napster, as it is a fully decentralized system, with no single entity having a global picture. Instead, each peer connects to a random sample of other peers, constantly changing the neighbors of this virtual overlay network by exchanging neighbors with neighbors of neighbors. (Any unstructured system also needs to solve the so-called bootstrap problem, namely how to discover a first neighbor in a decentralized manner. A popular solution is the use of well-known peer lists.) The fact that users often turn off their clients once they have downloaded their content implies high levels of churn (peers joining and leaving at high rates), and hence selecting the right "random" neighbors is an interesting research problem. The Achilles' heel of unstructured p2p architectures such as Gnutella is the cost of searching: a search request is typically flooded in the network, and each search operation will cost m messages, m being the number of virtual edges in the architecture. In other words, such an unstructured p2p architecture will not scale. Indeed, when Napster was unplugged, Gnutella broke down as well soon afterward due to the inrush of former Napster users. (A sketch of such a flooding search follows this list.)
● Hybrid p2p: The synthesis of client/server architectures such as Napster and unstructured architectures such as Gnutella are hybrid architectures. Some powerful peers are promoted to so-called superpeers (or, similarly, trackers). The set of superpeers may change over time, and taking down a fraction of superpeers will not harm the system. Search requests are handled on the superpeer level, resulting in much fewer messages than in flat/homogeneous unstructured systems. Essentially, the superpeers together provide a more fault-tolerant version of the Napster server, as all regular peers connect to a superpeer. As of today, almost all popular p2p systems have such a hybrid architecture, carefully trading off reliability and efficiency.
● Structured p2p: Inspired by the early success of Napster, the academic world started to look into the question of efficient file sharing. Indeed, even earlier, Plaxton et al. [] proposed a hypercubic architecture for p2p systems. This was a blueprint for many so-called structured p2p architecture proposals, such as Chord [], CAN [], Pastry [], Tapestry [], Viceroy [], Kademlia [], Koorde [], SkipGraph [], and SkipNet []. Maybe surprisingly, in practice, structured p2p architectures have not taken off yet, apart from certain exceptions such as the Kad architecture (from Kademlia []), which is accessible with the eMule client.
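The search-cost argument for unstructured overlays can be made concrete with a small sketch of TTL-limited flooding over a random overlay; the graph construction, the TTL value, and the message counting are illustrative and do not reflect Gnutella's actual wire protocol.

# Illustrative TTL-limited flooding over a random unstructured overlay,
# counting messages to make the "one message per virtual edge" cost concrete.
import random
from collections import deque

def random_overlay(n=100, degree=4, seed=1):
    random.seed(seed)
    neighbors = {v: set() for v in range(n)}
    for v in range(n):
        while len(neighbors[v]) < degree:
            u = random.randrange(n)
            if u != v:
                neighbors[v].add(u)
                neighbors[u].add(v)
    return neighbors

def flood_search(neighbors, start, have_file, ttl=4):
    messages = 0
    seen = {start}
    frontier = deque([(start, ttl)])
    while frontier:
        v, t = frontier.popleft()
        if v in have_file:
            return v, messages
        if t == 0:
            continue
        for u in neighbors[v]:
            messages += 1                      # every forwarded query costs a message
            if u not in seen:
                seen.add(u)
                frontier.append((u, t - 1))
    return None, messages

overlay = random_overlay()
hit, cost = flood_search(overlay, start=0, have_file={77}, ttl=4)
print(hit, cost)                               # found node (or None) and message count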
Scientific Origins
The scientific foundations of p2p computing were laid many years before the most simple "real" p2p systems like Napster emerged. As already mentioned, a blueprint for structured systems was proposed in []. Indeed, also the [] paper was standing on the shoulders of giants. Some of its eminent precursors are the following:
● Research on linear and consistent hashing, e.g., [] (see the sketch after this list).
● Research on locating shared objects, e.g., [] or [].
● Research on so-called compact routing: The idea is to construct routing tables such that there is a trade-off between memory (size of routing tables) and stretch (quality of routes), e.g., [] or [].
● Even earlier, hypercubic networks, see below.
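The consistent-hashing idea cited in the first item can be sketched as follows: peers and keys are hashed onto the same circular identifier space, and each key is served by the first peer encountered clockwise, so that a join or leave only affects the keys adjacent to that peer. The hash function and peer names are illustrative.

# Sketch of consistent hashing: peers and keys share one hash ring; a key is
# assigned to the first peer clockwise from its position, so membership
# changes only move the keys adjacent to the joining or leaving peer.
import bisect, hashlib

def ring_pos(name: str) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, peers):
        self._ring = sorted((ring_pos(p), p) for p in peers)

    def lookup(self, key: str) -> str:
        pos = bisect.bisect_right(self._ring, (ring_pos(key), chr(0x10FFFF)))
        return self._ring[pos % len(self._ring)][1]   # wrap around the ring

    def join(self, peer: str):
        bisect.insort(self._ring, (ring_pos(peer), peer))

ring = ConsistentHashRing(["peer-a", "peer-b", "peer-c"])
print(ring.lookup("file.iso"))
ring.join("peer-d")         # only keys between peer-d and its predecessor move
print(ring.lookup("file.iso"))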
Hypercubic Overlays and Consistent Hashing
Every application run on multiple machines needs a mechanism that allows the machines to exchange information. A naive solution is to store at each machine the domain name or IP address of every other machine.
While this may work well for a small number of machines, large-scale distributed applications such as file sharing, grid computing, cloud computing, or data center networking systems need a different, more scalable approach: instead of forming a clique (where everybody knows everybody else), each machine should only be required to know some small subset of other machines. This graph of knowledge can be seen as a logical network interconnecting the machines; it is also known as an overlay network. A prerequisite for an overlay network to be useful is that it has good topological properties. Among the most important are small peer degree, small network diameter, robustness to churn, and absence of congestion bottlenecks.
The most basic network topologies used in practice are trees, rings, grids, or tori. Many other suggested networks are simply combinations or derivatives of these. The advantage of trees is that routing is very easy: for every source-destination pair there is only one possible path. However, the root of a tree can be a severe bottleneck. An exception is a p2p streaming system where the single content provider forms the network root. However, trees are also highly vulnerable, e.g., with respect to membership changes.
Essentially all state-of-the-art p2p networks today have some kind of hypercubic topology (e.g., Chord, Pastry, Kademlia). Hypercube graphs have many interesting properties, e.g., they allow for efficient routing: although each peer only needs to store a logarithmic number of other peers in the system (the peers' neighbors), by a simple routing scheme, a peer can reach each other peer in a logarithmic number of steps (or "hops"). In a nutshell, this is achieved by assigning each peer a unique d-bit identifier. A peer is connected to all d peers that differ from its identifier at exactly one bit position. In the resulting hypercube network, routing is done by adjusting the bits in which the source and the destination peers differ – one at a time (at most d many). Thus, if the source and the destination differ by k bits, there are k! routes with k hops. The figure gives an example.
Peer-to-Peer. Fig. A simplified p2p topology: a three-dimensional hypercube. Each peer has a three-bit identifier and is connected to the three peers whose identifiers differ at exactly one position. In order to route a message from one peer to another, one bit is fixed after the other; one possible routing path is depicted in the figure, and alternative paths exist whenever the identifiers differ in more than one bit.
Given a hypercubic topology, it is then simple to construct a distributed hash table (DHT): Assume there are n = 2^d peers that are connected in a hypercube topology as described above. Now a globally known hash function f is used, mapping file names to long bit strings. Let f_d denote the first d bits (prefix) of the bitstring produced by f. If a peer is searching for file name X, it routes a request message f(X) to peer f_d(X). Clearly, peer f_d(X) can only answer this request if all files with hash prefix f_d(X) have been previously registered at peer f_d(X).
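A small sketch of the bit-fixing routing scheme and of the prefix-based lookup just described is given below; the dimension d, the hash function, and the file name are illustrative choices rather than part of any specific DHT.

# Sketch of the bit-fixing route in a d-dimensional hypercube and of the
# prefix-based DHT lookup described above. d and the hash are illustrative.
import hashlib

def bitfix_route(src: int, dst: int, d: int):
    """Fix differing bits one at a time (highest bit first); at most d hops."""
    path, cur = [src], src
    for i in reversed(range(d)):
        if (cur ^ dst) & (1 << i):
            cur ^= (1 << i)              # flip one differing bit -> hypercube neighbor
            path.append(cur)
    return path

def responsible_peer(filename: str, d: int) -> int:
    """Peer f_d(X): the first d bits of the hash of the file name."""
    h = hashlib.sha1(filename.encode()).hexdigest()   # 160-bit hash as hex
    return int(h, 16) >> (len(h) * 4 - d)

d = 3
print(bitfix_route(0b000, 0b110, d))     # [0, 4, 6], i.e., 000 -> 100 -> 110
print(responsible_peer("X.mp3", d))      # peer id in [0, 2**d)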
There are some additional issues to be addressed in order to design a DHT from a hypercubic topology, in particular how to allow peers to join and leave without notice. To deal with churn the system needs some level of replication, i.e., a number of peers which are responsible for each prefix, such that the failure of some peers will not compromise the system. In addition, there are security and efficiency issues that can be addressed to improve the system.
There are many hypercubic networks that are derived from the hypercube: among these are the butterfly, the cube-connected-cycles, the shuffle-exchange, and the de Bruijn graph. For example, the butterfly graph is basically a "rolled out" hypercube (hence directly providing replication!) of constant degree. Another important class of hypercubic topologies are skip graphs [, ].
A simple, interesting way to design dynamic p2p systems is the continuous–discrete approach described by Naor and Wieder []. This approach is based on a "think continuously, act discretely" strategy, and can be used to design a variety of hypercubic topologies. The continuous–discrete approach gives a unified method
for performing join/leave operations and for dealing with the scalability issue, thus separating it from the actual network. The idea is as follows: Let I be a Euclidean space, e.g., a (cyclic) one-dimensional space. Let Gc be a graph whose vertex set is the continuous set I. Each point in I is connected to some other points. The actual network then is a discretization of this continuous graph, based on a dynamic decomposition of the underlying space I into cells, where each "server" is responsible for a cell. Two cells are connected if they contain adjacent points in the continuous graph. Clearly, the partition of the space into cells should be maintained in a distributed manner. When a join operation is performed, an existing cell splits; when a leave operation is performed, two cells are merged into one. The task of designing a dynamic and scalable network follows these design rules: (1) Choose a proper continuous graph Gc over the continuous space I. Design the algorithms in the continuous setting, which is often simpler (also in terms of analysis) than in the discrete case. (2) Find an efficient way to discretize the continuous graph in a distributed manner, such that the algorithms designed for the continuous graph would perform well in the discrete graph. The discretization is done via a decomposition of I into the cells. If the cells that compose I are allowed to overlap, then the resulting graph would be fault tolerant.
To give an example, in order to build a dynamic de Bruijn network (a so-called Distance Halving DHT), a peer at position x ∈ [0, 1) (written in binary as 0.b1 b2 . . ., so that x = Σ_{i≥1} b_i 2^{-i}) connects in Gc to the positions l(x) := x/2 ∈ [0, 1) and r(x) := (1 + x)/2 ∈ [0, 1) (out-degree two per peer). Observe that if position x is written in binary form, then l(x) effectively shifts in a "0" from the left and r(x) shifts in a "1" from the left. Thus, routing is straightforward: based solely on the current position and the destination (without the overhead of maintaining routing tables), a message can be forwarded by a peer by fixing one bit per hop. The set of peers in the cyclic [0, 1) space then defines the p2p network: Let x_i denote the position of the i-th peer (ordered in increasing order with respect to position). Peer i is responsible for the cell [x_i, x_{i+1}), computed in a modulo manner, i.e., this peer is responsible for storing the data mapped to this cell and for establishing the corresponding connections defined in Gc. The figure gives an example.
Peer-to-Peer. Fig. The continuous–discrete approach for the dynamic de Bruijn graph. Peers are indicated using circles, files using rectangles. In the continuous setting, the peer at position x_i (in binary notation) is connected to positions x_i/2 and (1 + x_i)/2. In the discrete setting, it is responsible for the cell (i.e., the connections and files that are mapped there) between positions x_i and x_{i+1}.
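The construction can be sketched as follows, under the assumption that the sorted peer positions are known; the positions used below are illustrative.

# Sketch of the continuous-discrete de Bruijn construction described above:
# every point x in the cyclic space [0, 1) has the two out-edges
# l(x) = x/2 and r(x) = (1 + x)/2, and each peer owns the cell between its
# own position and the next peer's position.
import bisect

def l(x): return x / 2            # shifts a 0 in from the left (binary view)
def r(x): return (1 + x) / 2      # shifts a 1 in from the left

def responsible_peer(positions, x):
    """positions: sorted peer positions in [0, 1); peer i owns [x_i, x_{i+1})."""
    i = bisect.bisect_right(positions, x) - 1
    return positions[i % len(positions)]   # wrap: last peer owns the cell across 0

peers = sorted([0.05, 0.30, 0.55, 0.80])   # illustrative peer positions
x = 0.30                                   # current position of a message
step = r(x)                                # forward along one de Bruijn edge
print(step, responsible_peer(peers, step)) # 0.65 falls into peer 0.55's cell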
Dealing with Churn
A distinguishing property of p2p systems is the frequent membership changes. Measuring the churn levels of existing p2p systems is challenging, and one has to be careful when generalizing a given measurement to entire application classes (e.g., []). Nevertheless, several insightful measurement studies have been conducted. For instance, [, ] reported on the dynamic nature of early p2p networks such as Napster and Gnutella, and [] analyzed low-level data of a large Internet Service Provider (ISP) to estimate churn. Also the Kad DHT has been subject to measurement studies, and the reader is referred to the results in [] and [].
It is widely believed that hypercubic structures are a good basis for churn-resilient p2p systems. As written earlier, a DHT is essentially a hypercubic structure with peers having identifiers such that they span the ID space of the objects to be stored. A simple approach to map the ID space onto the peers has already been described for the hypercube. To give another example, in the butterfly network, one may use its layers for replication, i.e., all peers with the same ID redundantly store the data of the same hash prefix. Other hypercubic DHTs can be more difficult to design, e.g., networks based on the pancake graph [].
For many well-known systems, theoretic analyses exist showing that the networks remain well-structured after some joins, leaves, or failures occur. In order to evaluate the robustness formally, metrics such as the network expansion (for deterministic failures) or the span [] (for randomized failures) are used. Unfortunately, the span is difficult to compute, and the span value is known only for the most simple topologies.
The continuous–discrete approach [] already mentioned constitutes the basis of several dynamic systems. For example, the SHELL system [] is robust to certain attacks by connecting older or more reliable peers in a core network where access can be controlled; SHELL also allows organizing heterogeneous peers in an efficient topology. Many systems proposed in the literature offer a high robustness in the average case, i.e., they provide probabilistic guarantees that hold with high probability. Robustness under attacks or worst-case dynamics is less well understood. In [], a system is developed that achieves an optimal worst-case robustness in the sense that there is no alternative system that can tolerate higher churn rates without disconnecting. The basic idea is to simulate a hypercube: each peer is part of a distinct hypercube node, and each hypercube node consists of a logarithmic number of peers. Peers have connections to other peers of their hypercube node and to peers of the neighboring hypercube nodes. After a number of joins and leaves, some peers may have to change to another hypercube node such that, up to constant factors, all hypercube nodes have the same cardinality at all times. If the total number of peers grows or shrinks above or below a certain threshold, the dimension of the hypercube is increased or decreased by one, respectively.
The balancing of peers among the nodes can be seen as a dynamic token distribution problem on the hypercube: Each node of a graph (hypercube) has a certain number of tokens, and the goal is to distribute the tokens along the edges of the graph such that all nodes end up with the same or almost the same number of tokens. Thus, the system builds on two basic components: (1) an algorithm which performs the described dynamic token distribution, and (2) an information aggregation algorithm which is used to estimate the number of peers in the system and to adapt the hypercube's dimension accordingly. These techniques also work for alternative graphs, like pancake graphs [].
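The token distribution component can be illustrated with the classic dimension-exchange scheme on a hypercube, in which every node balances its token count with its neighbor along one dimension per round; this is a sketch in the spirit of the balancing component described above, not the exact published algorithm.

# Sketch of dynamic token distribution on a d-dimensional hypercube by
# dimension exchange: in round i every node balances its count with its
# neighbor across dimension i (a classic scheme, shown for illustration).
def dimension_exchange(tokens, d):
    """tokens: list of length 2**d giving the peer count of each hypercube node."""
    for i in range(d):
        for v in range(len(tokens)):
            u = v ^ (1 << i)
            if v < u:                          # handle each edge exactly once
                total = tokens[v] + tokens[u]
                tokens[v], tokens[u] = total // 2, total - total // 2
    return tokens

print(dimension_exchange([16, 2, 7, 3, 0, 9, 1, 10], d=3))
# -> [5, 6, 6, 6, 6, 6, 6, 7]: all nodes end up with almost the same count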
An appealing notion of robustness is topological self-stabilization: A p2p topology is called self-stabilizing if it is guaranteed that, from any weakly connected initial state (e.g., after an attack), it will quickly converge to a desirable network in the absence of further membership changes. In contrast to the worst-case churn considered in [], self-stabilization focuses on the convergence time in periods without membership changes, but allows for general initial system states. While until recently, self-stabilizing algorithms with guaranteed runtime have only been known for simple one-dimensional or two-dimensional linearization problems [], a construction for a variation of skip graphs, namely SKIP+ graphs [], has now been proposed. Single joins and leaves in SKIP+ can be handled locally, and require only logarithmic time and polylogarithmic work. However, there remains the important open question of how to provide degree guarantees during convergence from arbitrary states.
Fostering Cooperation
The appeal of p2p computing arises from the collaboration of the system's constituent parts, the peers. If all the participating peers contribute some of their resources, highly scalable decentralized systems can be built. However, in reality, peers may act selfishly and strive to maximize their own utility by benefitting from the system without contributing much themselves []. Hence, the performance of a p2p system crucially depends on its capability of dealing with selfishness. Adar and Huberman [] already noticed that there exists a large fraction of free riders in the file-sharing network Gnutella. The problem of selfish behavior in p2p systems has been a hot topic in p2p research ever since, and many mechanisms to encourage cooperation have been proposed [].
Perhaps the simplest fairness mechanism is to directly incorporate contribution monitoring into the client software. For instance, in the file-sharing system KaZaA, the client records the contribution of its user. However, such a solution can simply be bypassed by implementing a different client that hard-wires the contribution level to the maximum, as was the case with KaZaA Lite. Inspired by real economies, some researchers have also proposed the introduction of some form of virtual money, which is used for the transactions.
BitTorrent has incorporated a fairness mechanism from the beginning and has hence been subject to intensive research (e.g., [, , ]). Although this mechanism has similarities to the well-known tit-for-tat scheme, the strategy employed in BitTorrent distinguishes itself from the classic mechanism in many respects. For instance, it is possible for peers to obtain parts of a file "for free," i.e., without reciprocating. While this may be a useful property for bootstrapping newly joined peers, it has been shown that the BitTorrent mechanism can be exploited: the BitThief BitTorrent client [] allows downloading entire files quickly without uploading any data. It has also been demonstrated in [] that sharing communities are particularly vulnerable to such exploits. (A simplified sketch of a reciprocation-based unchoking policy follows at the end of this subsection.)
BitThief is not the only client cheating BitTorrent. Piatek et al. [] presented BitTyrant. BitTyrant's strategy is to exploit the BitTorrent protocol in order to maximize download rates. For instance, BitTyrant uses a smart neighbor selection strategy and connects to those peers with the best reciprocation ratios. In contrast to BitThief, BitTyrant does not free ride. BitTyrant seeks to provide the minimal necessary contribution, and also increases the active neighbor set if this is beneficial to the download rate. The authors claim that their client provides a substantial median performance gain in certain environments.
There can be many other forms of strategic behavior in open distributed systems. One subject that has recently gained attention, especially by the game-theoretic research community, is neighbor selection in unstructured p2p networks (e.g., []). There may be several reasons for a peer to prefer connecting to some peers rather than others. For instance, a peer may want to connect to peers with high bandwidths, peers storing many interesting files, or peers having large degrees that hence provide quick access to many other peers. At the same time, a selfish peer may not be eager to store and maintain too many neighbors itself.
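The reciprocation idea underlying BitTorrent-style fairness can be sketched, in strongly simplified form, as a periodic choice of upload slots based on recent download rates plus one optimistic slot; this is not the actual BitTorrent client logic, and all names and parameters are illustrative.

# Simplified sketch of reciprocation-based unchoking in the spirit of
# BitTorrent's mechanism: give upload slots to the peers that recently gave
# us the best download rates, plus one randomly chosen peer (optimistic
# unchoke) so that newcomers without history can bootstrap.
import random

def choose_unchoked(download_rate, slots=3):
    """download_rate: dict peer -> bytes/s recently received from that peer."""
    by_rate = sorted(download_rate, key=download_rate.get, reverse=True)
    unchoked = set(by_rate[:slots])
    others = [p for p in download_rate if p not in unchoked]
    if others:
        unchoked.add(random.choice(others))    # optimistic unchoke
    return unchoked

rates = {"p1": 120_000, "p2": 90_000, "p3": 0, "p4": 45_000, "p5": 3_000}
print(choose_unchoked(rates))                  # top reciprocators plus one random peer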
Current Trends and Outlook
One can argue that today, p2p computing is already a relatively mature (research) field; nevertheless, there are still many active discussions and developments, also in the context of future Internet design. Moreover, there exists a discrepancy between the technology of the systems in use and what is actually known in theory.
For example, the Kad network is still vulnerable to quite simple attacks []. If employed by the wrong people, the flexibility and robustness of p2p technology also constitutes a threat. Denial-of-service attacks are arguably one of the most cumbersome problems in today's Internet, and it is appealing to coordinate botnets in a p2p fashion. A DHT can be used by the bots, e.g., to download new instructions. For instance, it was estimated that the DHT-based Storm botnet [] ran on several million computers. Apart from mechanisms to detect or prevent attacks even before they take place, a smart redundancy management may improve availability during the attack itself (see, e.g., the Chameleon system []).
In terms of cooperation, there is a tension between the goal of providing incentive-compatible mechanisms that exclude free riders and the goal of designing heterogeneous p2p systems that also tolerate (and make use of!) weak participants. Moreover, in addition to designing mechanisms that deal with pure selfishness, there is a trend toward p2p systems that are also resilient to malicious behavior (see, e.g., [] or []).
Another active discussion regards the interface between p2p systems and ISPs. The large amount of p2p traffic raises the question of how ISPs should deal with p2p, e.g., by caching contents. p2p networks often employ inefficient overlay-to-ISP mappings, as the logical overlay network is typically not aware of the underlying "real" networks and constraints, and much overhead can be avoided by improving the interface between p2p networks and ISPs, e.g., by an oracle []. For a critical point of view on the subject, the reader is referred to [].
It seems that while a few years ago the lion's share of Internet traffic was due to p2p, the proportion now seems to be declining []. Especially web services and server-based solutions such as the popular YouTube and RapidShare are catching up. The measured data traces should be interpreted with care, however, as they do not take into account what happens behind the scenes of big corporations. Indeed, it is believed that there is a paradigm shift in p2p computing: While p2p retreats (relative to other applications) from public Internet traffic, today p2p technology plays a crucial role in the coordination and management of large data centers and server farms of corporations such as Akamai or Google.
Related Entries
Hypercubes and Meshes
Bibliographic Notes and Further Reading
Beyond the specific literature pointed to directly in the text, there are several recommendable introductory books on p2p computing. In particular, the reader is referred to the classic books [, , ] and two more recent issues [, ]. The theoretically more inclined reader may also be interested in [], which provides an overview of compact routing solutions, and [], which discusses trade-offs in local algorithms that achieve global goals based on local information only and without any centralized entities whatsoever. Regarding the challenges of distributed cooperation, the recent book [] gives a thorough and up-to-date survey of current (game-theoretic) trends, and also includes a chapter on p2p-specific questions.
Bibliography
. Adar E, Huberman B () Free riding on Gnutella. First Monday
. Aggarwal V, Feldmann A, Scheideler C () Can ISPs and p2p users cooperate for improved performance? ACM Comput Commun Rev
. Aspnes J, Shah G () Skip graphs. In: Proceedings of the annual ACM-SIAM symposium on discrete algorithms (SODA), Baltimore
. Awerbuch B, Peleg D () Sparse partitions. In: Proceedings of the annual symposium on foundations of computer science (SFCS), Washington
. Awerbuch B, Peleg D () Online tracking of mobile users. JACM
. Bagchi A, Bhargava A, Chaudhary A, Eppstein D, Scheideler C () The effect of faults on network expansion. In: Proceedings of the annual ACM symposium on parallelism in algorithms and architectures (SPAA), Barcelona
. Baumgart M, Scheideler C, Schmid S. A DoS-resilient information system for dynamic data management. In: Proceedings of the ACM symposium on parallelism in algorithms and architectures (SPAA), Calgary, Alberta
. Buford J, Yu H, Lua EK () P2P networking and applications. Morgan Kaufmann, San Francisco
. Gummadi K, Dunn R, Saroiu S, Gribble SD, Levy HM, Zahorjan J () Measurement, modeling, and analysis of a peer-to-peer file-sharing workload. In: Proceedings of the ACM symposium on operating systems principles (SOSP), Bolton Landing
. Haeberlen A, Mislove A, Post A, Druschel P () Fallacies in evaluating decentralized systems. In: Proceedings of the international workshop on peer-to-peer systems (IPTPS), Santa Barbara
. Harvey NJA, Jones MB, Saroiu S, Theimer M, Wolman A. Skipnet: a scalable overlay network with practical locality properties. In: Proceedings of the USENIX symposium on internet technologies and systems (USITS), Seattle
. IPOQUE () Internet study. http://www.ipoque.com/resources/internet-studies/internet-study--
. Jacob R, Richa A, Scheideler C, Schmid S, Täubig H () A distributed polylogarithmic time algorithm for self-stabilizing skip graphs. In: Proceedings of the ACM symposium on principles of distributed computing (PODC), New York
. Jacob R, Ritscher S, Scheideler C, Schmid S () A self-stabilizing and local delaunay graph construction. In: Proceedings of the international symposium on algorithms and computation (ISAAC), Hawaii
. Kaashoek F, Karger DR () Koorde: a simple degree-optimal distributed hash table. In: Proceedings of the international workshop on peer-to-peer systems (IPTPS), Berkeley
. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D () Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the ACM symposium on theory of computing (STOC), New York
. Khan J, Wierzbicki A () Foundation of peer-to-peer computing. Elsevier Computer Communication
. Kuhn F, Moscibroda T, Wattenhofer R () The price of being near-sighted. In: Proceedings of the ACM-SIAM symposium on discrete algorithms (SODA), Miami
. Kuhn F, Schmid S, Wattenhofer R () Towards worst-case churn resistant peer-to-peer systems. J Distrib Comput (DIST)
. Larkin E () Storm worm's virulence may change tactics. British Computer Society
. Legout A, Urvoy-Keller G, Michiardi P () Rarest first and choke algorithms are enough. In: Proceedings of the ACM SIGCOMM conference on internet measurement (IMC), Rio de Janeiro
. Levin D, LaCurts K, Spring N, Bhattacharjee B () Bittorrent is an auction: analyzing and improving bittorrent's incentives. SIGCOMM Comput Commun Rev
. Li H, Clement A, Marchetti M, Kapritsos M, Robinson L, Alvisi L, Dahlin M () Flightpath: obedience vs choice in cooperative services. In: Proceedings of the symposium on operating systems design and implementation (OSDI), San Diego
. Locher T, Moor P, Schmid S, Wattenhofer R () Free riding in bittorrent is cheap. In: Proceedings of the workshop on hot topics in networks (HotNets), Irvine
. Malkhi D () Locality-aware network solutions. Technical report, The Hebrew University of Jerusalem
. Malkhi D, Naor M, Ratajczak D () Viceroy: a scalable and dynamic emulation of the butterfly. In: Proceedings of the annual symposium on principles of distributed computing (PODC), Monterey
. Maymounkov P, Mazières D () Kademlia: a peer-to-peer information system based on the XOR metric. In: Proceedings of the international workshop on peer-to-peer systems (IPTPS), Cambridge
. Moscibroda T, Schmid S, Wattenhofer R () On the topologies formed by selfish peers. In: Proceedings of the annual symposium on principles of distributed computing (PODC), Denver
. Naor M, Wieder U () Novel architectures for p2p applications: the continuous-discrete approach. In: Proceedings of the annual ACM symposium on parallel algorithms and architectures (SPAA), San Diego
. Nisan N, Roughgarden T, Tardos E, Vazirani VV () Algorithmic game theory. Cambridge University Press, Cambridge
. Peleg D, Upfal E () A tradeoff between space and efficiency for routing tables. In: Proceedings of the annual ACM symposium on theory of computing (STOC), Chicago
. Piatek M, Isdal T, Anderson T, Krishnamurthy A, Venkataramani A () Do incentives build robustness in bittorrent? In: Proceedings of the USENIX symposium on networked systems design and implementation, Cambridge
. Piatek M, Madhyastha HV, John JP, Krishnamurthy A, Anderson T () Pitfalls for ISP-friendly p2p design. In: Proceedings of HotNets, New York
. Plaxton C, Rajaraman R, Richa AW () Accessing nearby copies of replicated objects in a distributed environment. In: Proceedings of the ACM symposium on parallel algorithms and architectures (SPAA), Newport
. Qiu D, Srikant R () Modeling and performance analysis of bittorrent-like peer-to-peer networks. In: Proceedings of the conference on applications, technologies, architectures, and protocols for computer communications (SIGCOMM), New York
. Ratnasamy S, Francis P, Handley M, Karp R, Schenker S () A scalable content-addressable network. In: Proceedings of the ACM SIGCOMM conference on applications, technologies, architectures, and protocols for computer communications, New York
. Rowstron AIT, Druschel P () Pastry: scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In: Proceedings of the IFIP/ACM international conference on distributed systems platforms (Middleware), Heidelberg
. Saroiu S, Gummadi PK, Gribble SD () A measurement study of peer-to-peer file sharing systems. In: Proceedings of multimedia computing and networking (MMCN), San Jose
. Scheideler C () How to spread adversarial nodes? Rotate! In: Proceedings of the annual ACM symposium on theory of computing (STOC), Baltimore
. Scheideler C, Schmid S () A distributed and oblivious heap. In: Proceedings of the international colloquium on automata, languages and programming (ICALP), Rhodes
. Sen S, Wang J () Analyzing peer-to-peer traffic across large networks. IEEE/ACM Trans Netw
. Shneidman J, Parkes DC () Rationality and self-interest in peer to peer networks. In: Proceedings of the international workshop on peer-to-peer systems (IPTPS), Berkeley
. Steiner M, Biersack EW, Ennajjary T () Actively monitoring peers in KAD. In: Proceedings of the international workshop on peer-to-peer systems (IPTPS), Bellevue
. Steiner M, En-Najjary T, Biersack EW () Exploiting KAD: possible uses and misuses. Comput Commun Rev
. Steinmetz R, Wehrle K () Peer-to-peer systems and applications. Springer, Heidelberg
. Stoica I, Morris R, Karger D, Kaashoek F, Balakrishnan H () Chord: a scalable peer-to-peer lookup service for internet applications. In: Proceedings of the ACM SIGCOMM conference on applications, technologies, architectures, and protocols for computer communications, San Diego
. Stutzbach D, Rejaie R () Understanding churn in peer-to-peer networks. In: Proceedings of the internet measurement conference (IMC), New York
. Subramanian R, Goodman B () Peer-to-peer computing: the evolution of a disruptive technology. IGI, Hershey
. Thorup M, Zwick U () Compact routing schemes. In: Proceedings of the annual ACM symposium on parallel algorithms and architectures (SPAA), Crete, Greece
. Zhao BY, Huang L, Stribling J, Rhea SC, Joseph AD, Kubiatowicz J () Tapestry: a resilient global-scale overlay for service deployment. IEEE Journal on Selected Areas in Communications
Pentium Intel Core Microarchitecture, x86 Processor Family
PERCS System Architecture
E. N. (Mootaz) Elnozahy, Evan W. Speight, Jian Li, Ram Rajamony, Lixin Zhang, Baba Arimilli
IBM Research, Austin, TX, USA; IBM Systems and Technology Group, Austin, TX, USA
Definition
IBM started to develop a pioneering supercomputer, codenamed PERCS (Productive, Easy-to-use, Reliable Computing System), with support from the
Defense Advanced Research Projects Agency (DARPA). The project was part of DARPA's High Productivity Computing Systems (HPCS) [] initiative, a multi-year research program that sought to change the landscape of high-end computing by shifting the focus away from just floating-point performance toward overall system productivity. Enhancing system productivity required unprecedented innovation in the system design to simplify system usage and programming tasks, all while maintaining the system cost within the requirements of a realistic commercial offering. The system's productivity is orders of magnitude better than that of previous supercomputers, as expressed by the High Productivity Challenge benchmark suite [].
Discussion
Introduction
High-end computing went through a generational transformation during the 1990s, which saw emerging clusters of commodity processors and networks gradually replace traditional supercomputer systems based on vector and large shared-memory systems. Commodity clusters offered advantages in cost and scalability compared with the technologies they replaced. The simultaneous emergence of the Message Passing Interface (MPI) provided programmers with a portable, standard programming environment that exploited these clusters well. Interest in high-performance computing gained momentum, and a semiannual "top supercomputers" contest was created based on the Linpack benchmark [] to assess the available supercomputers in the market and to provide an arena for vendors to compete. With the Linpack benchmark's emphasis on floating-point performance, designers focused on improving processor performance, and the resulting systems were showing great leaps in Linpack performance. These improvements, however, did not benefit mainstream applications that require a system design that balances floating-point operations with commensurate memory and communication bandwidths. For example, streaming data applications, message-intensive applications exemplified by the GUPS benchmark [], and digital-signal processing applications based on the Fast Fourier Transform [] are sensitive to network performance and memory bandwidth. Programmers were forced to adapt
their algorithms to reduce data transfers and harness the increased computational capability of commodity clusters. The approach proved futile, as programmers generally failed to achieve performance gains to show for the added code complexity and programming effort. A programmer and system productivity crisis was clearly at hand.
Visionaries at DARPA and various U.S. defense agencies took the initiative to confront the situation and started the High Productivity Computing Systems (HPCS) program. HPCS took the form of three competitive stages, the first two being devoted to pure research, with each round culminating in the elimination of proposals that were deemed noncompetitive or noninnovative. The third stage was devoted to the commercialization of the research results of the first two stages into a mainstream product. Five companies competed in the initial stage, namely Cray, HP, IBM, SGI, and Sun; ultimately, only Cray and IBM were supported to proceed with the third stage of commercialization. The HPCS vision called for balanced system attributes and high-level computing abstractions that relieve programmers from the chores of adapting code to improve performance. The program aimed to foster research that would result in systems that are amenable to productive use without too much effort on the part of the programmer.
This article provides a short summary of IBM's PERCS project. Compared to state-of-the-art high-performance computing (HPC) systems, PERCS achieves high performance and productivity goals through tight integration of computing, networking, storage, and software. PERCS focuses on scaling all parts of the system while easing programming complexity.
Key Elements of the PERCS Design
Compute Node Design
The building block for the PERCS design is the compute node, as shown in Fig. . There are four POWER chips [] in a node, with a single operating system image that controls resource allocation. Each chip features eight cores, four threads per core, and two memory controllers that can connect to GB of DRAM memory. Applications executing on a single compute node can thus utilize 32 cores, 128 SMT threads, eight memory controllers, up
to GB of memory capacity, one teraflop of compute power, and over GB/s of memory bandwidth. The four POWER chips are cache coherent and are tightly coupled using three pairs of buses. Each processor has six fabric bus interfaces to connect to the three other processors in a multi-chip module (MCM). These one-teraflop, tightly coupled shared-memory nodes deliver GB/s of bandwidth (. Byte/flop) to a specialized hub chip used for I/O, messaging, and switching. The POWER chip represents a large leap forward in single-chip performance, both in terms of computational ability and achievable memory bandwidth, over other processor chip offerings.
PERCS System Architecture. Fig. PERCS compute node architecture
PERCS Interconnect
The challenge of building a highly productive system required the entire system infrastructure to scale along with the microprocessor's capabilities. Previous solutions such as fat-tree interconnects suffered from substantial communication latency due to message copying, protocol processing, multiple data conversions between optical and electrical signaling, and switching overheads. Other approaches such as toroidal networks mitigated some of these issues at the expense of handing the programmer very complex communication topologies. Such limitations have traditionally forced programmers to consolidate data communications into large messages that amortize the communication overhead. Applications that required frequent exchange of short messages [] either had to settle for poor performance or had to be modified in nontrivial ways to exploit large messages. To solve this problem, we had to design an
interconnect that obeyed the following design principles in a cost-effective product:
(a) Minimum data copying
(b) Only one conversion from the electrical to the optical domain was allowed for any message
(c) Simple communications protocol with limited processing overhead
(d) Maintain a simple topology that simplifies programming
We approached the problem in an unorthodox way by building on the shared-memory protocol that IBM uses to build highly scalable shared-memory machines. The communication is treated at three levels:
(a) Within a node, communications take place over the shared-memory bus.
(b) A second level consists of a supernode, which is a group of nodes connected via noncoherent shared-memory buses. Physically, two supernodes fit within a single rack.
(c) A third level consists of communications between supernodes. Supernodes are connected using a fully connected graph of optical cables. Thus, each supernode has an equal distance to other supernodes, simplifying the topology and the programming task.
A communication coprocessor [] resides on the memory bus within each node, and thus data can be streamed from source to destination using the shared-memory protocol, using a -byte payload. This eliminates the data copying common in previous solutions, where data had to be copied from memory to the I/O bus.
PERCS System Architecture
Routing Between Nodes
PERCS System Architecture. Fig. Diagram of PERCS interconnect
Supernode
Node Node
Supernode
Supernode
Supernode
Node
Node
Node
Node
Supernode
The above-described principles form the basis for direct routing in the PERCS system. A direct route employs a shortest path between any two compute nodes in the system. Since a pair of supernodes can be connected together by more than one D link, there can be multiple shortest paths between a given set of compute nodes. With only two levels in the topology, the longest direct route L-D-L can have at most three hops made up of no more than two L hops and at most one D hop. PERCS also supports indirect routes to guard against potential interconnect hot spots. An indirect route is one that has an intermediate compute node in
Supernode
Node
switching bandwidth of the hub chip exceeds . GB/s. When all links are populated and operate at peak, the injection bandwidth to network bandwidth ratio is :.. Note though that by performing the dual roles of data routing and interconnect gateway, the majority of traffic through the hub chip will typically be destined for other compute nodes. The low injection to network bandwidth ratio means that plenty of bandwidth is available for that other traffic. The topology used by PERCS permits routes to be made up of very small numbers of hops. Within a supernode, any compute node can communicate with any other compute node using a distinct L link. Across supernodes, a compute node has to employ at most one L hop to get to the “right” compute node within its supernode that is connected to the destination supernode (recall that every supernode pair is connected by at least one D link). At the destination supernode, at most one L hop is again sufficient to reach the destination compute node.
Supernode
The communication coprocessor, also called the hub-chip [], is directly connected to the POWER MCM at GB/s and provides GB/s to seven other nodes in the same drawer on copper connections; GB/s to nodes in the same supernode (composed of four drawers, or compute nodes) on optical connections; GB/s to other supernodes on optical connections; and GB/s for general I/O, for a total of , GB/s peak bandwidth per hub chip (see Fig. ). Two key design goals for PERCS were to dramatically improve bisection bandwidth (over other topologies such as fat-tree interconnects) and to eliminate the need for external switches. With these goals in mind, the hub chip was designed to support a large number of links that connect it to other hub chips. These links are classified into two categories “L,” and “D,” which permit the system to be organized into a two-level directconnect topology. Every hub chip has thirty-one L links that connect to other hub chips. Within this group of thirty-two hub chips, every chip has a direct communication link to every other chip. The hub chip implementation further divides the L links into two categories: seven electrical LL links with a combined bandwidth of GB/s and optical LR links with a combined bandwidth of GB/s. The L links bind thirty-two compute nodes into a supernode. Every hub chip also has D links that are used to connect to other supernodes with a combined bandwidth of GB/s. The topology maintains at least one D link between every pair of supernodes in the system, although smaller systems can employ multiple D links between supernode pairs. The hub chip is connected to the POWER chips in the compute node at a bandwidth of GB/s and has GB/s of bandwidth for general I/O. The peak
P
P
P
PERCS System Architecture
the route that resides on a different supernode from that of the source and destination compute nodes. An indirect route must employ a shortest path from the source compute node to the intermediate one, and a shortest path from the intermediate compute node to the destination compute node. The longest indirect route L-D-L-D-L can have at most five hops, made up of no more than three L hops and at most two D hops. The figure illustrates direct and indirect routing within the PERCS system. A specific route can be selected in three ways when multiple routes exist between a source-destination pair. First, software can specify the intermediate supernode but let the hardware determine how to route to and then from the intermediate supernode. Second, hardware can select among the multiple routes in a round-robin manner for both direct and indirect routes. Finally, the hub chip also provides support for route randomization, whereby the hardware can pseudo-randomly pick one of the many possible routes between a source-destination pair.
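A direct L-D-L route can be sketched as follows, with compute nodes addressed as (supernode, node) pairs; the gateway mapping is a purely hypothetical stand-in for the hardware routing tables.

# Sketch of direct L-D-L routing in the two-level PERCS-style topology: nodes
# are (supernode, node) pairs; within a supernode every pair of nodes shares
# an L link, and each destination supernode is reached through some gateway
# node that owns the D link. The gateway mapping below is hypothetical.
def gateway(src_sn, dst_sn, nodes_per_sn=32):
    # hypothetical: the D link toward supernode dst_sn hangs off node dst_sn mod 32
    return dst_sn % nodes_per_sn

def direct_route(src, dst, nodes_per_sn=32):
    """src, dst: (supernode, node). Returns the hop list; at most L, D, L."""
    (ssn, sn), (dsn, dn) = src, dst
    route = [src]
    if ssn == dsn:
        if sn != dn:
            route.append(dst)                         # one L hop inside the supernode
        return route
    g_out = (ssn, gateway(ssn, dsn, nodes_per_sn))    # L hop to the outgoing gateway
    g_in = (dsn, gateway(dsn, ssn, nodes_per_sn))     # D hop lands on the remote gateway
    for hop in (g_out, g_in, dst):
        if hop != route[-1]:
            route.append(hop)
    return route                                      # at most three hops: L, D, L

print(direct_route((0, 5), (7, 20)))    # [(0, 5), (0, 7), (7, 0), (7, 20)]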
The minimum direct-routed message latency in the system is approximately μs. The peak unidirectional link bandwidths are as follows: GB/s from each POWER chip in an SMP to its hub/switch; GB/s from a given hub/switch to each of the other seven hub/switches in the same drawer ( GB/s total); GB/s from a given hub/switch to each of the other hub/switches in different drawers in the same supernode ( GB/s total); and GB/s from a given hub/switch in one supernode to a hub/switch in another supernode to which that particular hub/switch is directly connected ( GB/s total).
PERCS System Architecture. Fig. Direct and indirect routing in PERCS (maximum direct route L-D-L; maximum indirect route L-D-L-D-L)
Novel Features of the PERCS Hub Chip
The PERCS hub chip, detailed in Fig. , incorporates many features to enable the high-performance targets set for the system on HPC applications.
PERCS System Architecture. Fig. Details of the PERCS Hub chip (POWER7 QCM connect 192 GB/s; inter-node connect 336 GB/s; intra-supernode connect 240 GB/s; inter-supernode connect 320 GB/s; PCIe connect 40 GB/s)
Collective Accelerator Unit (CAU)
Many HPC applications perform collective operations that involve all or a large subset of the nodes
participating in the computation. Forward progress can only be realized when each node has completed the collective operation and results have been returned to all participating nodes. The PERCS hub chip provides specialized hardware to accelerate frequently used collective operations such as multicast, barriers, and reduction operations. For reductions, a dedicated arithmetic logic unit (ALU) within the CAU supports the following operations and data types:
● Fixed point: NOP, SUM, MIN, MAX, OR, AND, XOR (signed and unsigned)
● Floating point: MIN, MAX, SUM, PROD (single and double precision)
Software organizes the CAUs in the system into collective trees; a CAU fires when data on all of its inputs is available, with the result being fed to the next "upstream" CAU in a data-flow manner. Each hub chip contains a single CAU. A multiple-entry content addressable memory (CAM) structure per CAU supports multiple independent trees that can be concurrently used by different applications, for different collective patterns within the same application, or some combination thereof.
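The dataflow behavior can be sketched as follows: a node in a collective tree fires only once all of its inputs have arrived, applies the reduction operation, and forwards the result to its parent. The tree wiring and the Python representation are illustrative and do not model the CAU hardware interface.

# Sketch of the dataflow-style reduction described above: a node in the
# collective tree fires only once all of its inputs have arrived, applies
# the reduction, and forwards the result to its parent. Purely illustrative.
class ReduceNode:
    def __init__(self, n_inputs, op, parent=None):
        self.n_inputs, self.op, self.parent = n_inputs, op, parent
        self.pending = []
        self.result = None

    def deliver(self, value):
        self.pending.append(value)
        if len(self.pending) == self.n_inputs:        # all inputs present: fire
            out = self.pending[0]
            for v in self.pending[1:]:
                out = self.op(out, v)
            if self.parent is not None:
                self.parent.deliver(out)              # feed the next upstream node
            else:
                self.result = out                     # root holds the final value

root = ReduceNode(n_inputs=2, op=max)
left = ReduceNode(n_inputs=2, op=max, parent=root)
right = ReduceNode(n_inputs=2, op=max, parent=root)
for node, values in ((left, (3, 9)), (right, (7, 1))):
    for v in values:
        node.deliver(v)
print(root.result)     # MAX reduction over {3, 9, 7, 1} -> 9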
Power Bus Interface
The on-chip interconnect for the POWER processor is the newly designed PowerBus architecture. Each hub chip contains a PowerBus interface that enables it to participate in the coherency operations taking place between the four POWER chips in the compute node, providing the hub chip visibility into the coherence transactions taking place in the node.
Host Fabric Interface
The two Host Fabric Interface (HFI) units in the hub chip manage communication to and from the PERCS interconnect. The HFI was designed to provide user-level access to applications. The basic construct provided by the HFI to applications for delineating different communication contexts is the "window." The HFI supports many hundreds of such windows, each with its associated hardware state. An application invokes the operating system to reserve a window for its use. The reservation procedure maps certain structures of the HFI into the application's address space, with window control being possible from that point onward through user-level reads and writes to the HFI-mapped structures.
The HFI supports three APIs for communication: a general packet transfer mechanism that can be used for composing either reliable or unreliable protocols upon which to build MPI or other messaging layers; a protocol for global address space operations that allow user-level codes to directly manipulate memory of a task residing on a different compute node; and direct internet protocol transfer ability. The HFI can extract data that needs to be communicated over the interconnect from either the POWER memory or directly from the POWER caches. The choice of source is transparent, and data is automatically sourced from the faster location (caches can typically source data faster than memory). In addition to writing network data to memory, the HFI can also inject network data directly into a processor’s L cache, lowering the data access latency for code executing on that processor [].
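The window abstraction can be modeled, very loosely, as follows; none of these names correspond to the real PERCS software interface, and the "put" below is an ordinary memory copy standing in for a user-level global-address-space operation.

# Hypothetical model of the "window" abstraction described above: an
# application reserves a window (a communication context with its own state)
# and then issues user-level global-address-space operations through it.
# None of these names reflect the actual PERCS software interface.
class Window:
    def __init__(self, window_id, local_memory):
        self.window_id = window_id
        self.local_memory = local_memory    # stands in for HFI-mapped structures

    def put(self, remote_memory, offset, data):
        # global-address-space style write into another task's memory,
        # performed here as a plain copy purely for illustration
        remote_memory[offset:offset + len(data)] = data

class ToyHFI:
    def __init__(self, n_windows=4):
        self._free = list(range(n_windows))

    def reserve_window(self, local_memory):
        return Window(self._free.pop(0), local_memory)

hfi = ToyHFI()
task_a = bytearray(16)
task_b = bytearray(16)
win = hfi.reserve_window(task_a)
win.put(task_b, offset=4, data=b"ping")
print(task_b)      # bytes 4..7 of the remote task's memory now hold b"ping"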
Integrated Switch Router (ISR)
The ISR implements the two-tiered full-graph network utilized in the PERCS system. It is organized as a full crossbar that operates at up to GHz. In addition to the L and D ports described previously, the ISR also has eight ports to the two local Host Fabric Interfaces, and one service port. The ISR uses both input and output buffering with a packet replay mechanism to tolerate transient link errors. This feature is especially important since the D links can be several tens of meters in length. The ISR operates in units of -byte FLITs with a maximum packet size of bytes. Messages are composed of multiple packets, with the packets making up a message being potentially delivered out of order. High-performance computing applications benefit from having access to a single global clock across the entire system. The ISR implements a global clock feature, whereby an onboard clock is globally distributed across the interconnect and kept consistent with the clocks on other hub chips. Deadlock prevention is achieved through virtual channels, each corresponding to a hop in the L-D-L-D-L worst-case route.
POWER Processor Overview
PERCS System Architecture. Fig. Die photo of the POWER processor chip
PERCS systems use a version of the new POWER processor chip fabricated in IBM's nm SOI CMOS technology, a die photo of which is shown in Fig. . POWER incorporates eight high-performance processor cores; L, L, and L caches per core; on-chip interconnect; multiple I/O controllers; memory controllers; and the SMP fabric controller. The processor design is closely linked with the technology node development, which is co-developed with the high-performance server processors to produce a high-performance server
processor as well as a world-class CMOS technology node. The version of the POWER processor chip used in PERCS is fabricated utilizing IBM's nm technology at frequencies ranging up to GHz. The chip size is mm and it contains eight multithreaded cores.
The processor design offers an excellent balance between floating-point and fixed-point operations, and between computation and memory access. Each core has four double-precision floating-point units (FPUs) implemented as two -way SIMD engines (scalar floating-point instructions can only use two FPUs), two fixed-point units (FXUs), two load-store units (LSUs), a vector (VMX) unit, a decimal floating-point unit, a branch unit, and a condition register. Each FPU is capable of performing one multiply-add instruction (FMA) in one cycle, and each LSU can execute simple integer operations. Thus the chip is capable of GF/s and GOp/s at GHz. The chip also features two memory controllers serving up to eight memory channels, with an aggregate bandwidth of GB/s. This offers about B/F or B/Op, a comparatively high memory bandwidth balance.
Processor Core
POWER implements the -bit Power PC-AS architecture and is backward compatible with previous POWER chips, ensuring that existing binaries will run on the new architecture without the need to recompile. IBM's continued focus on single-thread performance in POWER remains an important component of its viability in the HPC marketplace. The core uses an extremely power-efficient, high-frequency design with a superscalar, out-of-order execution engine and a highly accurate branch prediction mechanism. Up to eight instructions may be executed in a given cycle. POWER also provides excellent performance for throughput-oriented workloads. Simultaneous multithreading consisting of four hardware contexts per core provides up to 32 different instruction streams per chip for use by software. The multithreaded design ensures that a thread does not block the flow of instructions of other threads through the instruction pipeline or the cache and memory hierarchy, while keeping the area and power overhead low.
Processor Cache Hierarchy The POWER processor chip has three levels of cache, all sharing a common line size of bytes. ECC is used extensively to ensure reliability and data integrity across all cache levels. While the Level (L) and Level (L) caches contain both instruction and data, the Level (L) cache is split into separate instruction and data caches. The L instruction cache is KB, -way set-associative and provides a maximum bandwidth of B per processor cycle. The L data cache is KB, -way set-associative and can be accessed in every processor cycle to deliver two B load operations and one B store operation. Each core has a private, -way set-associative, KB L cache. The L cache access latency is about seven processor cycles, and in every processor cycle, the processor core can load B of data or instructions from L and store B of data. Each core has a private -way set-associative L cache of size MB constructed from on-chip embedded DRAM (eDRAM), which offers low power, high density, and excellent access times. The L cache access latency is about processor cycles, and in every processor cycle, it can supply B of data to the L cache and receive B of data from the L cache. The L cache serves as a victim cache of the L cache. All L caches on a chip may be virtually aggregated into a quasi-shared cache by causing Ls to write back dirty lines to each other before writing them off-chip to memory. A sophisticated multi-tier LRU algorithm maintains balance among this confederation of L caches. The L cache arrays can also be reconfigured so that two adjacent banks are combined into a larger L cache of size MB shared between the two corresponding cores, which is useful for commercial workloads. This is another instance of the configurability of the architecture, and how this configurability is used to meet different workload characteristics and enhance the commercial viability of the system as outlined in the vision set forth by DARPA for the HPCS program.
On-chip Integrated Fabric and Chip Interconnect Another innovative aspect of the POWER design is the use of existing system buses to both implement shared-memory traffic and integrate the network switching functionality. Effectively, the POWER implements a
distributed switch function within the computer rack. The POWER Fabric Bus Controller (FBC) is the building block that implements this distributed switch functionality. The FBC is integrated on the POWER chip and acts as the central point of cache-coherent data traffic. It is responsible for implementing cache-to-cache data transfers, bus arbitration, address routing, etc. It maintains simple and configurable routing tables and implements all necessary flow control to ensure freedom from deadlocks and livelocks. The FBC acts as the hub of all coherent and noncoherent communications from all internal chip units (eight processor cores, two memory controllers, L and L caches, Non-Cacheable Unit, Gigabit Ethernet, and PCI-Express controller) to other internal units on or outside of the chip. The FBC provides all of the interfaces, buffering, and sequencing of address and data operations within the coherent memory subsystem. Physically, the Fabric bus is an -Byte wide, split-transaction, multiplexed address and data bus. Data packets are four beats, each carrying bytes of data. It takes four data packets to transfer the entire Byte cache line. Configuration switches assign node identification for each chip and also for data routing tables. Data transactions within the node are always sent along a unique point-to-point path. A tag travels with the data to help make routing decisions along the way. The Fabric busses provide a fully connected topology to reduce latency.
Memory Subsystem Special care has been taken in the design of the POWER chip memory subsystem to ensure the maximum bandwidth within reasonable constraints of cost and power. The design is flexible and configurable, offering different memory access modes, each optimized to a certain application profile, ranging from sequential access of memory to random access of memory. The memory subsystem also can perform partial cache line loads, improving the efficiency and bandwidth for applications with poor spatial locality such as sparse matrix computations. The PERCS memory subsystem has two memory controllers with support for up to eight DIMM sockets and uses custom DDR DRAM chips running at . GHz. These DIMMs include specialized buffer chips
to increase the amount of memory and bandwidth to memory per chip. Peak memory bandwidth (read + write) of a single channel is rated at GB/s, two thirds of which is for read bandwidth and the remaining one third for write bandwidth. The address space is split across the multiple fully buffered DIMM arrays in a manner that allows for an evenly distributed access pattern across them. This access pattern results in the highest bandwidth available at the processor interface and an evenly distributed thermal pattern.
Blue Waters: The First PERCS Installation The PERCS system design will first be implemented in the Blue Waters supercomputer to be placed at the National Center for Supercomputing Applications (NCSA) at the University of Illinois []. Blue Waters will comprise supernodes with nearly , POWER cores, more than PB memory, more than PB disk storage, more than . EB archival storage, and will achieve close to PF/s peak performance. Acquisition of the Blue Waters system is supported by the National Science Foundation and leverages the investment made by the Defense Advanced Research Projects Agency’s High Productivity Computing Systems (HPCS) program. IBM, the University of Illinois, and the NCSA will work together throughout Blue Waters’ lifespan to enhance IBM’s high-performance computing environment, ensuring that applications can take full advantage of Blue Waters and achieve performance on a variety of real-world applications. The enhanced high-performance computing environment will also increase the productivity of applications developers, system administrators, and researchers by providing an integrated toolkit for using, analyzing, monitoring, and controlling Blue Waters. Blue Waters is expected to be one of the most powerful supercomputers in the world when it comes online. It will have a peak performance close to petaflops ( quadrillion calculations every second) and will achieve sustained performance of petaflop per second running a range of science and engineering codes. Scientists will create breakthroughs in nearly all fields of science using Blue Waters, including predicting the behavior of complex biological systems; understanding how the cosmos
evolved after the Big Bang; designing new materials at the atomic level; predicting the behavior of hurricanes and tornadoes; and simulating complex engineered systems like the power distribution system, airplanes, and automobiles.
Bibliography
. Arimilli B, Arimilli R, Chung V, Clark S, Denzel W, Drerup B, Hoefler T, Joyner J, Lewis J, Li J, Ni N, Rajamony R () The PERCS high-performance interconnect. Proceedings of Hot Interconnects, Mountain View, CA, August
. Blue Waters project at the National Center for Supercomputing Applications. http://www.ncsa.illinois.edu/BlueWaters/. Accessed January
. DARPA high productivity computer systems. http://www.highproductivity.org/
. Dongarra J, Luszczek P, Petitet A () The LINPACK benchmark: past, present, and future. J Concur Comput Pract Exp :–
. Duhamel P, Vetterli M () Fast Fourier transforms: a tutorial review and a state of the art. Signal Process :–
. Kalla R, Sinharoy B () POWER: IBM’s next generation server processor. IEEE symposium on high-performance chips (Hot Chips), Stanford, August
. Luszczek P, Dongarra J, Koester D, Rabenseifner R, Lucas B, Kepner J, McCalpin J, Bailey D, Takahashi D () Introduction to the HPC Challenge benchmark suite. http://icl.cs.utk.edu/projectsfiles/hpcc/pubs/hpcc-challenge-benchmark.pdf. March
. Milenkovic A, Milutinovic V () Cache injection: a novel technique for tolerating memory latency in bus-based SMPs. EUROPAR parallel processing. Lecture notes in computer science, vol /, Munich, pp –
Perfect Benchmarks
The Perfect Benchmarks [] were created in the late s under the leadership of the University of Illinois’ Center for Supercomputing Research and Development (CSRD) with the participation of several other institutions (the Perfect Club). The name Perfect is an acronym for PERFormance Evaluation by Cost-effective Transformations. The Benchmarks came from real applications, in contrast to the Livermore Loops and other simple codes used in the s to evaluate machines. Thirteen Benchmarks were included: ADM, which solves the hydrodynamics equations to simulate air pollution, contributed by IBM Kingston; ARCD, a finite difference fluid dynamics code developed at NASA Ames; BDNA, a molecular dynamics code for nucleic acid simulation contributed by IBM Kingston; DYFESM, a structural dynamics finite element code contributed by NASA Langley Research Center; FLOQ, a computational fluid dynamics code contributed by Princeton University; MDG, which uses molecular dynamics to simulate liquid water; MGD, a signal processing code contributed by Tel Aviv University; OCEAN, a computational fluid dynamics code contributed by Princeton University; QCD, a quantum chromodynamics code contributed by Caltech; SPEC, a weather simulation code contributed by CSRD; SPICE, a circuit simulator contributed by UC Berkeley; TRACK, which determines the course of a collection of targets from observations taken at regular intervals, contributed by Caltech; and TRFD, a quantum mechanics computation code contributed by IBM Kingston.
Bibliography
. Berry M, Chen D, Koss P, Kuck D, Lo S, Pang Y, Pointer L, Roloff R, Sameh A, Clementi E, Chin S, Schneider D, Fox G, Messina P, Walker D, Hsiung C, Schwarzmeier J, Lue K, Orszag S, Seidl F, Johnson O, Goodrum R, Martin J () The Perfect Club benchmarks: effective performance evaluation of supercomputers. International Journal of Supercomputing Applications ():–
Performance Analysis Tools Michael Gerndt Technische Universität München, München, Germany
Synonyms Performance measurement; Profiling; Tracing
Definition Performance analysis tools support the application developer in tuning the application’s performance for a given architecture. They measure performance data during the execution of the application and provide means to analyze and interpret the provided data and to detect performance bottlenecks.
Discussion Introduction The development of high-performance applications requires a careful adaptation of the program to the underlying parallel architecture. Due to the manifold interrelations of the parallel program and the architecture, designing an application with optimal performance on parallel systems is almost impossible. Therefore, the application goes through a tuning cycle which consists of measuring performance, detecting performance bottlenecks, and applying program transformations. The assumption for this tuning approach is that the performance will be the same for different runs with the same resources and the same input data. Performance analysis tools support the programmer in the first two tasks of the tuning cycle. Performance data are gathered during program execution by monitoring the application’s execution. Performance data are either summarized and stored as profile data or all the details are stored in so-called trace files. In addition to application monitoring, performance analysis tools also provide the means to analyze and interpret the provided performance data and thus to detect performance problems. The following sections present the basis of performance analysis, i.e., the execution event model, the different monitoring techniques, and the most important analysis techniques.
Event Model The abstraction of the execution used in performance analysis is an event model. Events happen at a specific point in time in a process or thread. Events belong to event classes, such as enter and exit events of a user-level function, start and finish events of a send operation in MPI programs, iteration assignment in work-sharing constructs in OpenMP, and cache misses in sequential execution. In profiling, information about events of the same class is aggregated during runtime. For example, cache misses are counted, the time spent between the start event and finish event of a send operation is accumulated as the communication time of the call site, and the time for taking a scheduling decision in work-sharing loops is accumulated as parallelization overhead. In tracing, specific information is recorded for each event in a trace file. The event record written to the file at least contains a time stamp and an identification of the executing process or thread. Frequently, additional information is recorded, such as the message receiver and the message length or the scheduling time for an iteration assignment.
Monitoring The general technique for collecting information about events is called Monitoring. Three different monitoring techniques can be distinguished: hardware monitoring, sampling, and instrumentation. These techniques are explained in the next three paragraphs.
Hardware Monitoring The monitoring of applications should not influence the program’s execution. Therefore, hardware monitoring is required for frequent events, such as events in the processor or the caches. If cache misses, e.g., were to be counted via hardware interrupts, the intrusion would be immense. Therefore, current microprocessors provide event counters. A small number of hardware counters can be used to count a large number of different event types. Careful selection of the right event type is thus required. The low-level APIs for accessing the system’s hardware counters are quite different on different architectures. Therefore, standard APIs have been developed. The most notable one is the Performance Application Programming Interface (PAPI) [, ].
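As a concrete illustration of counter access through PAPI, the following minimal C sketch counts L1 data-cache misses and total cycles around an arbitrary loop. The chosen preset events, the workload, and the assert-based error handling are illustrative assumptions, not prescriptions from this entry.

/* Minimal PAPI sketch: count L1 data-cache misses and total cycles
 * around a region of interest. Compile with -lpapi (assuming PAPI is
 * installed); error handling is reduced to asserts for brevity. */
#include <assert.h>
#include <stdio.h>
#include <papi.h>

int main(void) {
    int events[2] = { PAPI_L1_DCM, PAPI_TOT_CYC };  /* two preset events */
    long long values[2];
    int eventset = PAPI_NULL;

    assert(PAPI_library_init(PAPI_VER_CURRENT) == PAPI_VER_CURRENT);
    assert(PAPI_create_eventset(&eventset) == PAPI_OK);
    assert(PAPI_add_events(eventset, events, 2) == PAPI_OK);

    assert(PAPI_start(eventset) == PAPI_OK);

    /* Region of interest: an arbitrary memory-bound loop. */
    static double a[1 << 20];
    for (int i = 0; i < (1 << 20); i++)
        a[i] = 0.5 * i;

    assert(PAPI_stop(eventset, values) == PAPI_OK);
    printf("L1 data cache misses: %lld\n", values[0]);
    printf("Total cycles:         %lld\n", values[1]);
    return 0;
}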
Sampling Many profiling tools such as the Unix prof utility collect statistical performance information via sampling. The processor is interrupted with a certain frequency, known as the sampling rate. The interrupt routine determines the address of the current instruction from the program counter and adds, e.g., the length of the sampling interval to the accumulated execution time of the currently executed source location. Instead of the execution time, other metrics such as the number of cache misses or of floating point operations can be used. The most important advantage of this technique is that its intrusion is usually low and can be controlled by setting the sampling rate appropriately. The major drawback is that only statistical information is gathered. More precise information can be obtained via instrumentation.
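To make the sampling principle concrete, the POSIX-style sketch below arms a profiling interval timer so that SIGPROF fires at a fixed rate while the process consumes CPU time, and each signal accounts for one sampling interval. A real profiler would read the interrupted program counter (via the signal's ucontext argument, which is platform specific) and charge the interval to the corresponding source location; here only a global tick counter is kept. The sampling rate and the workload are illustrative assumptions.

/* Sampling sketch (POSIX): SIGPROF fires periodically while the program
 * runs; each delivery adds one tick. Mapping the interrupted program
 * counter to a source location is platform specific and omitted here. */
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static volatile sig_atomic_t samples = 0;

static void on_sample(int sig) {
    (void)sig;
    samples++;                 /* one tick == one sampling interval */
}

int main(void) {
    struct sigaction sa;
    sa.sa_handler = on_sample;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGPROF, &sa, NULL);

    /* Roughly 1,000 samples per second of CPU time used by this process. */
    struct itimerval it = { { 0, 1000 }, { 0, 1000 } };
    setitimer(ITIMER_PROF, &it, NULL);

    /* Workload to be profiled. */
    double sum = 0.0;
    for (long i = 1; i < 200000000L; i++)
        sum += 1.0 / (double)i;

    printf("sum = %f, samples taken = %d\n", sum, (int)samples);
    return 0;
}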
A similar technique is called Event-Based Sampling. Each time a hardware counter passes a threshold, an interrupt is generated and information is accumulated for the current source line. The user can control the intrusion of this method by specifying the overflow threshold. A larger threshold will give less accurate information, but is less intrusive. This technique is used, e.g., in the Digital Continuous Profiling Infrastructure (DCPI) [].
Instrumentation In contrast to sampling, precise information about the dynamic application behavior can be obtained via program instrumentation. Here the application is modified to gather information at specific events. For example, a call to a monitoring routine is inserted at the beginning and end of a user function to measure the execution time of the function. Three important instrumentation techniques can be distinguished: source code instrumentation, object code instrumentation, and library interposition. In source code instrumentation, the compiler or a source-to-source transformation tool inserts the monitor calls into the program before compilation. In object code instrumentation, the instrumenter patches the machine code. While the first approach is very portable, object code instrumentation is extremely machine specific. It depends not only on the machine’s instruction set but also on the compiler, the OS, and the object format. On the other hand, with object code instrumentation, the source language is unimportant and programs can even be instrumented for which no source code is available. Object code instrumentation can also be applied at runtime. While the program is already executing, instrumentation is inserted only at the required places. This technique was developed within the Paradyn environment and is supported in DYNINST [, ] and DPCL [, ]. Because object code instrumentation is limited to code regions that can be deduced from the binary, it is typically restricted to functions or basic blocks. Source code instrumentation can be used to gather information for arbitrary program regions. Instrumentation of functions can also be done based on a technique called library interposition. This
technique is used to instrument MPI functions via the MPI profiling interface (PMPI). It defines for each MPI function a second name with the PMPI prefix. The implementor of an MPI monitoring library can write wrappers for MPI functions and call the PMPI version inside. Within the wrapper, code can be inserted to collect performance information. The wrapper library is then linked to the application before the original MPI library so that the wrappers are called instead of the original functions.
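A minimal wrapper sketch illustrates the interposition mechanism: the wrapper carries the official MPI_ name, collects timing data, and forwards the call to the PMPI_ entry point of the underlying MPI library. The choice to time MPI_Send and to report from a wrapped MPI_Finalize is an illustrative assumption, not part of any particular tool.

/* PMPI interposition sketch: link this wrapper library before the MPI
 * library so that the wrappers are called instead of the originals. */
#include <mpi.h>
#include <stdio.h>

static double send_time = 0.0;   /* accumulated time inside MPI_Send */
static long   send_calls = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    double t0 = PMPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_time += PMPI_Wtime() - t0;
    send_calls++;
    return rc;
}

int MPI_Finalize(void) {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "rank %d: %ld MPI_Send calls, %.3f s total\n",
            rank, send_calls, send_time);
    return PMPI_Finalize();
}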
Analysis The techniques and tools for performance analysis presented in this entry all assume that the performance behavior of the application is the same over multiple experiments. Based on this assumption, multiple experiments can be performed with different tools or different tool configurations to identify and rank performance problems. The following aspects are relevant for performance analysis tools:
. Level of detail
. Performance aspects
. Application perturbation
. Level of automation
. Scalability
Level of Detail Three classes of performance analysis tools can be distinguished depending on how detailed the raw performance data are gathered and analyzed, i.e., profiles, profile time series, and traces. Profiling tools aggregate the raw performance data over possibly multiple dimensions. Most common is the aggregation over time, e.g., the number of cache misses for individual functions in each process. Some tools also aggregate the data over processes and threads or even into a single value for the entire execution, e.g., the number of floating point instructions executed by an application. Furthermore, tools apply aggregation with respect to the program regions, i.e., functions and loops. They either compute a flat profile, in which the data are aggregated for each region, or a call path profile, where the data are aggregated for each call path. Call path profiles are very useful if the behavior of functions is different for individual invocations. This is frequently
the case for library routines such as basic mathematical or MPI routines. Some profiling tools provide not only aggregated data over the entire execution but also time series of snapshots of profile information. Such time series of profile data allow the analysis of time-dependent variations of the performance behavior. The most detailed analysis is performed by tracing tools that store information about individual events in so-called traces. For each event a trace record with a time stamp is generated. A trace record includes additional data, such as the sender and receiver, the amount of data, and an identification of the message sent by an MPI send operation. The trace records are typically stored in trace files that are subsequently analyzed for performance problems. While profiling tools have the advantage that the amount of performance data is not proportional to the execution time of the application, tracing tools generate performance data proportional to the execution time and the number of processes and threads. On the other hand, profiling tools are limited in the analyses they can perform.
Performance Aspects Most performance analysis tools are specialized for specific performance aspects. Tuning an application with respect to all possible performance problems probably requires the use of multiple performance analysis tools. The performance tools are typically specialized in one or multiple of the following aspects:
. Execution time
. Instruction execution
. Memory access behavior
. Memory usage
. IO
. Parallel execution
The most basic and thus the most frequently used performance tools provide the user with a time profile. This allows the user to detect the hot code regions, i.e., those regions where most of the execution time is spent and thus have the highest potential for performance improvement. The next class of tools supports the programmer in the inspection of performance problems related to the utilization of the resources in an individual core. These
tools typically provide information about the instruction mix, highlight expensive operations such as divides or type conversions, point out exception handling, or identify pipeline stalls, e.g., due to branch mispredictions. They are based on measurements performed with the processor’s hardware performance counters. Very critical for the performance of applications is the memory access behavior. The long latency of memory accesses and the limited bandwidth require that most of the data accesses are served from the on-chip caches. Programs must be carefully tuned for data locality so that the caches are most effectively used. Performance tools support the analysis of the access behavior with information about the number of cache hits and misses as well as the number of stall cycles for memory references. More precise information can be obtained from cache simulators which are fed with memory reference traces of the application. Cache simulation tools, e.g., can determine the distribution of misses across miss classes, i.e., compulsory, conflict, capacity, and invalidation misses. They can identify false sharing as well as compute higher-level metrics, such as the reuse distance. Another memory-related analysis supported by performance analysis tools is the inspection of memory usage. Performance problems might arise from allocating too much memory, which may be due to the size of data structures or to memory leaks. High-performance applications tend to work on huge data sets. Thus performance tools need to help in the identification of performance problems with respect to file IO. Information can be obtained from runtime libraries implementing IO operations as well as from the OS. Many performance analysis tools support the detection of performance problems with respect to the parallel execution. Certainly, the two standard programming interfaces MPI and OpenMP have the best support. Profiling and tracing tools are available for message passing programs that support the analysis of communication among the processes. More elementary tools provide measurements on the time spent in certain MPI functions and the amount of data transferred between communication partners, while advanced tools provide information on different waiting times that indicate certain performance problems, e.g., the late sender problem where the sender of a message arrives later at the
send than the receiver at the receive statement. Tools for message passing also indicate load balancing problems based on the execution time of collective operations. Analysis tools for shared memory programming, i.e., OpenMP, focus on load imbalances, synchronization, and management overhead. Currently available tools do not yet fully support the extensions of OpenMP ., most notably the tasking concept.
Perturbation Performance analysis tools are based on measurements. If those measurements are not fully done by additional hardware, the tool will influence or change the code’s performance behavior. The intrusion of tools has multiple sources. First of all, the time spent in the monitoring library delays the execution. Another important overhead is flushing performance data to external files, which needs to be done by tracing tools since the trace records cannot be stored in main memory. These flushes can either be done at global synchronization points and thus delay the execution of all processes, or they can be done asynchronously in the processes, which might lead to artificial load imbalance. The intrusion can also be more subtle because memory accesses of the performance analysis tool might change the cache status, or execution delays might even influence the algorithmic behavior in the case of nondeterministic algorithms. Performance analysis tools try to work around that problem by either correcting the obtained measurements or by reducing the amount of perturbation. Since the first approach is extremely difficult, many current tools focus on the second approach. While the overhead induced by the measurements can be controlled in the sampling approach by selecting an appropriate sampling rate, instrumentation-based measurements can be optimized by reducing the amount of instrumentation and by tuning the instrumentation functions. Instrumentation of program regions that are frequently called but have only little execution time suffers most from the instrumentation overhead. Performance analysis tools provide the means to guide the instrumentation, such as to instrument only certain region types or to exclude individual program regions. Of course, perturbation not only results from the instrumentation but also from other time-consuming
activities of the analysis tools during the program’s execution. Examples are the management of trace records and the computation of call path profiles. Various techniques have been developed to tune those operations.
Automation The ultimate goal of performance analysis tools is to indicate performance problems and thus potential for performance improvement. Ideally, the tools should perform that task in a fully automatic way. Full automation requires the formalization of the performance analysis specialist’s and the application specialist’s knowledge. Until now, this has only partially been achieved. One area of automation is program instrumentation. The techniques here range from manual instrumentation to fully automatic instrumentation. Some tools require the user to insert instrumentation into the program. Other tools provide semiautomatic instrumentation via the compiler or a source-to-source instrumenter. The instrumentation can be configured to control the amount of program perturbation. Some tools assist the user in the configuration by pointing out regions that need not be or should not be instrumented based on a previous analysis run. In the fully automatic approach, the instrumentation is selected by the performance analysis tool. Based on the tool’s knowledge of which instrumentation is required to obtain the required performance data, the instrumentation is automatically configured. Automation is also required in the actual measurement of performance data via the monitoring. The analysis tool decides automatically which data will be measured. However for some tools, especially if they access the hardware performance counters, the measurement configuration is done manually. Depending on the type of performance problem the user is interested in, he or she can configure the analysis to measure certain events with the hardware counters. It might even be necessary to perform multiple runs of the application to gather all the required information. In addition, the last step of the analysis process can be automated. A performance analysis tool should automatically extract and rank the performance problems such that the user can start by tuning the application to solve the most severe performance problem.
In most tools, this step is not automated. The tools allow the user to manually inspect the performance data either via textual or graphical displays. Textual information, i.e., a sorted list of hot regions, is provided with references to the source code or the source code is actually annotated with information such as the number of cache misses. Graphical diagrams are typically used in tracing tools. The dynamic behavior is visualized via timeline diagrams, statistics are shown as bar charts, and many other chart types are utilized to visually represent the huge data sets. It is the task of the user of those tools to select the right displays to identify performance problems. Only a few tools automate the last step and automatically identify performance problems and their severity. The tools use a formal description of potential performance problems and verify whether those known potential problems can be found based on the measured data. Fully automatic tools automate the whole process, covering instrumentation, measurement, and analysis. The search for performance problems is driven based on formalized potential performance problems. Following a certain search strategy, they deduce from the performance problem specifications which information is required, measure the data, verify whether a problem exists, and report the found problems to the tool user. Fully automatic tools can perform the analysis online. Online tools have also been developed for semiautomatic tools that visualize performance data, but those require that the user performs the analysis while the application is executing. This is not appropriate especially since large-scale application runs are only possible in batch jobs. Therefore, most of those tools are offline tools. They produce the data by monitoring an actual run of the application but the inspection and analysis of the data is done independently of the execution.
Scalability Especially in the field of high-performance computing, performance analysis tools have to be scalable. Applications are run on thousands of processors and the performance behavior cannot be studied for scaled down runs. Tools have to be scalable with respect to the number of processors as well as to the duration of the application’s execution. Typical limitations of the tools are the
time spent in the analysis as well as the space requirements to store and process performance data. Scalability is obviously a big issue for tracing tools, but profiling-based tools can also easily run into scalability problems if hundreds of thousands of processors are used as in the current BlueGene systems. This problem can be alleviated by processing the data in a tree analysis network (MRNet []). The leaves measure the performance data and while the data travel up the tree, they are aggregated to coarsen the information. Several techniques for tracing tools have been developed to increase scalability. Most important is to reduce the size of the traces. Therefore, compression techniques as well as “on the fly” detection of recurring patterns in traces are applied. A few tools also parallelize the analysis step. SCALASCA post-processes the trace in parallel on the processors of the application after it has terminated. Vampir uses a parallel analysis server that processes the trace files while the user is working with the analysis tool to inspect the measured performance data. Periscope processes the performance data within a hierarchy of analysis agents while the application is executing.
Representative Tools This section identifies a few performance analysis tools and gives a short characterization. The mentioned tools are only a very small set of examples from the numerous tools developed. Gprof is the GNU Profiler tool. It provides a flat profile and a callpath profile for the program’s functions. The measurements are done by instrumentation. Vtune is a performance analysis tool from Intel that provides flat and callpath profiles based on sampling and instrumentation, as well as measurements based on the hardware performance counters. The measurements can be inspected on source and assembler level. pfmon is a tool for accessing the hardware performance counters developed by HP. It can be used to collect certain events specified via command line arguments. Application-specific as well as system-wide measurements can be performed. ompP is a profiling tool for OpenMP developed at Technische Universität München and University of Tennessee []. It is based on instrumentation with Opari
[] and determines certain overhead categories of parallel regions. HPC Toolkit from Rice University uses statistical sampling to collect a performance profile []. It relates the measurements back to the original source code based on a static analysis of the binary and presents the data by annotating the sources. Tau is a flexible performance environment from the University of Oregon []. It can generate application profiles as well as traces. While it provides tools to analyze the profiles, traces can be analyzed with Vampir. Vampir is a commercial trace-based performance analysis tool from the Technische Universität Dresden []. It provides a powerful visualization of traces and scales to thousands of processors based on a parallel visualization server. Paraver is a trace visualization and analysis environment from the Barcelona Supercomputing Center. It provides powerful analysis services such as automatic phase detection [] and clustering. Paradyn from the University of Wisconsin was the first automatic online analysis tool []. Its performance consultant guided the search for performance bottlenecks while the application was executing. SCALASCA is an automatic performance analysis tool developed at the Forschungszentrum Jülich []. It is based on performance profiles as well as on traces. The automatic trace analysis determines MPI wait time via a parallel trace replay after the application execution on the application’s processors. Periscope currently under development at the Technische Universität München follows the approach of Paradyn and performs an online search for performance problems []. It is based on a hierarchy of analysis agents to fulfill scalability requirements of current and future HPC systems.
Related Entries
Intel Thread Profiler
OpenMP Profiling with OmpP
Parallel Tools Platform
Periscope
PMPI Tools
Scalasca
Tau
Vampir
Bibliography
. Digital continuous profiling infrastructure. www.unix.digital.com/dcpi
. Dynamic probe class library. oss.software.ibm.com/developerworks/opensource/dpcl/
. Dyninst api. www.dyninst.org
. Performance application programming interface. icl.cs.utk.edu/papi
. Buck B, Hollingsworth JK () An API for runtime code patching. Int J High Perform Comput Appl ():–
. Casas M, Badia RM, Labarta J () Automatic phase detection of MPI applications. In: Bischof C et al (eds) Proceedings of the international conference on parallel computing (ParCo ’), Jülich/Aachen. Advances in Parallel Computing, vol , IOS, Amsterdam, pp –
. DeRose L, Hoover T, Hollingsworth JK () The dynamic probe class library – an infrastructure for developing instrumentation for performance tools. IBM
. Dongarra J, London K, Moore S, Mucci P, Terpstra D, You H, Zhou M () Experiences and lessons learned with a portable interface to hardware performance counters. In: IPDPS ’: Proceedings of the th international symposium on parallel and distributed processing, Washington, DC, IEEE Computer Society, p
. Fürlinger K, Gerndt M, Dongarra J () Scalability analysis of the SPEC OpenMP benchmarks on large-scale shared memory multiprocessors. In: Shi Y, van Albada GD, Dongarra J, Sloot PMA (eds) Computational science – ICCS , Beijing. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Geimer M, Wolf F, Wylie BJN, Mohr B () Scalable parallel trace-based performance analysis. In: Proceedings of the th European PVM/MPI users’ group meeting on recent advances in parallel virtual machine and message passing interface (EuroPVM/MPI ), Bonn, pp –
. Gerndt M, Ott M () Automatic performance analysis with Periscope. Concurr Comput Pract Exp ():–
. Miller BP, Callaghan MD, Cargille JM, Hollingsworth JK, Irvin RB, Karavanic KL, Kunchithapadam K, Newhall T () The Paradyn parallel performance measurement tool. IEEE Comput ():–
. Mohr B, Malony AD, Shende SS, Wolf F () Towards a performance tool interface for OpenMP: An approach based on directive rewriting. In: Proceedings of the third workshop on OpenMP (EWOMP’), Barcelona, September
. Müller MS, Knüpfer A, Jurenz M, Lieber M, Brunst H, Mix H, Nagel WE () Developing scalable applications with Vampir, VampirServer and VampirTrace. In: Bischof C et al (eds) Proceedings of the international conference on parallel computing (ParCo ), Jülich/Aachen. Advances in Parallel Computing, vol , IOS, Amsterdam, pp –
. Roth PC, Arnold DC, Miller BP () MRNet: A software-based multicast/reduction network for scalable tools. In: Proceedings of the conference on supercomputing (SC ), Phoenix, November
. Shende SS, Malony AD () The TAU parallel performance system. Int J High Perform Comput Appl, ACTS Collection Special Issue, SAGE ():–
. Tallent NR, Mellor-Crummey JM, Adhianto L, Fagan MW, Krentel M () Diagnosing performance bottlenecks in emerging petascale applications. In: SC ’: Proceedings of the conference on high performance computing networking, storage and analysis, Portland, ACM, New York, pp –
Performance Measurement Performance Analysis Tools
Performance Metrics Metrics
Periscope Michael Gerndt Technische Universität München, München, Germany
Definition Periscope is an automatic performance analysis tool for highly parallel applications written in MPI or OpenMP. It performs an online search for performance properties in a distributed fashion. The properties found by Periscope point to areas of parallel programs that might benefit from further tuning.
Discussion Introduction Performance analysis is an important step in the development of parallel applications for high-performance architectures. Due to the complexity of the architecture of these machines, applications can typically not be designed to be most efficient. Experiments are required to understand the performance obtained and to identify possible areas in the code that might profit from further tuning. Performance analysis tools help the developer in analyzing the application’s performance. During an execution of the application with a typical input data set
and a typical number of processors, performance data are measured. Afterwards those data are analyzed to detect performance problems of the application. Certain tuning actions can then be selected, e.g., selection of different compiler optimization switches or modifications of the source code, based on the detected performance problems. Periscope is a performance analysis tool developed at Technische Universität München. It is a representative of a class of automatic performance analysis tools automating the whole analysis procedure. Specific for Periscope is that it is an online tool and it works in a distributed fashion. This means that the analysis is done while the application is executing (online) and by a set of analysis agents that are searching for performance problems in a subset of the application’s processes (distributed). Periscope was inspired by the work of the European-American APART working group and especially by Paradyn developed at University of Wisconsin, Madison. Paradyn uses a performance consultant that does dynamic instrumentation and searches for bottlenecks based on summary information during the program’s execution. Automation in Periscope is based on formalized performance properties, e.g., inefficient cache usage or load imbalance. Those properties are formalized in the APART Specification Language (ASL). Based on a repository of performance properties the analysis agents can search for those properties in the program execution under investigation. They automatically determine which properties to search for, which measurements are required, which properties were found, and which are more specific properties to look for in the next step. The low-level performance data are analyzed in a distributed fashion and only high-level performance properties are finally reported to the user, significantly reducing the amount of data to be inspected by the user.
Performance Properties The basis for the analysis is the formalization of performance properties. Each performance property consists of a condition, a severity, and a confidence. The condition uses measured performance data and static information to decide whether the property holds and thus is a performance problem. The confidence defines whether
the property is only a hint for a performance problem or a proof. For example, properties based on static information, e.g., the number of prefetch operations, are typically only hints since the actual execution of those operations depends on runtime values. The severity states the importance of the performance problem in comparison to the other found performance problems. Typically, the severity is calculated as the amount of time lost due to the problem.
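The sketch below shows one way such a condition/severity/confidence triple could be represented in code. The property ("excessive MPI time"), its threshold, and its severity formula are invented for illustration; this is not the ASL formalization actually used by Periscope.

/* Hedged sketch of a formalized performance property: a condition over
 * measured data, a severity (share of lost time), and a confidence.
 * The concrete property and threshold are illustrative assumptions. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    double exec_time;   /* execution time of the phase in this process */
    double mpi_time;    /* time spent inside MPI calls in the phase    */
} Measurement;

typedef struct {
    const char *name;
    bool   holds;       /* condition: is this a performance problem?   */
    double severity;    /* importance, e.g., share of lost time        */
    double confidence;  /* 1.0 = proof, < 1.0 = hint                   */
} PropertyInstance;

/* Hypothetical property: MPI consumes more than 20% of the phase. */
static PropertyInstance excessive_mpi_time(Measurement m) {
    PropertyInstance p;
    p.name = "ExcessiveMpiTime";
    double share = (m.exec_time > 0.0) ? m.mpi_time / m.exec_time : 0.0;
    p.holds = share > 0.20;
    p.severity = share;
    p.confidence = 1.0;   /* based on measured times, not static hints */
    return p;
}

int main(void) {
    Measurement m = { 10.0, 3.5 };          /* example numbers */
    PropertyInstance p = excessive_mpi_time(m);
    if (p.holds)
        printf("%s: severity %.2f, confidence %.2f\n",
               p.name, p.severity, p.confidence);
    return 0;
}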
Search Strategies The overall search for performance problems is determined by search strategies. A search strategy defines in which order an analysis agent investigates the multidimensional search space of properties, program regions, and processes. Many of Periscope’s search strategies are multistep strategies, i.e., they consist of multiple search steps. A very important concept for multistep strategies is a program phase which is determined by a certain repetitive program region. A good example for such a phase region is the body of the time loop in scientific simulations where each iteration simulates one time step. In a first search step, the strategy determines a set of candidate properties, e.g., load imbalance at a barrier in a certain process. The agent then requests certain measurements from the monitor linked into the application. Afterwards the application is released for one execution of the phase. At the end of the phase the execution is stopped again and the measured performance data are returned to the agent. Based on the measurements the agent evaluates the candidate properties and determines the set of found properties. If a refinement is possible, i.e., for the found properties more precise properties are available, a next analysis step is started with a new candidate property set. Periscope provides search strategies for single-node performance, e.g., searching for inefficient use of the memory hierarchy, MPI, and OpenMP.
Architecture Periscope consists of an agent hierarchy shown in Fig. . The leaves in the hierarchy are the analysis agents that perform the actual performance analysis. Each analysis agent is responsible for a subset of the application’s processes. It communicates with the monitor linked with the application processes. The MRI monitor serves two purposes: It performs the measurements of performance data requested by the analysis agent and it controls the execution of the process, which means it takes commands from the agent to release or stop the execution. The root of the hierarchy is the master agent which takes commands from the frontend, propagates the commands to the analysis agents, and provides the list of found performance properties to the frontend. The agents in the middle of the hierarchy are responsible for forwarding information among the analysis agents and the master agent. The search is controlled by the frontend. It invokes the application and creates the agent hierarchy. How many agents are created depends on the number of application processes and of additional processors provided in the batch job. The processes and agents are mapped to the available processors in a way that communication is local with respect to the physical interconnection network topology.
Periscope. Fig. Architecture of Periscope
After the application and the agent hierarchy have been started, the frontend starts the search by propagating a command to all the analysis agents. If an agent requests a new experiment, i.e., an execution of the program phase with measurements to evaluate performance properties, the master agent starts the next experiment via another command. The reason for the global synchronization is that Periscope also supports automatic restart. If another experiment is requested but the application terminated, the application can automatically be restarted by the frontend. On top of the frontend, Periscope provides a graphical user interface based on Eclipse and the Parallel Tools Platform (PTP). The GUI allows the programmer to define a project with all the source files, start a
performance analysis via the frontend and, most important, to investigate the performance properties found by Periscope. Figure illustrates Periscope’s GUI. In the lower right a table view visualizes all the properties found in the application. It provides multi-criteria sorting to identify the most severe performance problems, filtering, regular expressions searching, multi-level grouping, as well as clustering. The clustering can be used to identify groups of processes with the same performance problems. Double clicking on one of the properties will focus on the source code of the respective region in the central source editor. On the right, an outline view of the entire program is provided. It presents an overview of
the regions, the nesting of regions, and the number of properties found for a region. Selecting a region in the outline view can be used to enforce that only those properties found for that region are shown in the region view below.
Periscope. Fig. Inspection of performance properties with the graphical user interface of Periscope
Summary Periscope is an automatic performance analysis tool for large-scale parallel systems. It performs a distributed online search for performance properties. The properties are formalized and the set of properties can be easily extended. Specialized search strategies determine how the analysis agents walk the multidimensional search space. The result of a search is a set of high-level performance properties that can be inspected via a graphical user interface implemented as an Eclipse plugin. Therefore, the user can follow an integrated approach of program development in Eclipse and performance analysis with Periscope.
Related Entries
Parallel Tools Platform
Performance Analysis Tools
Scalasca
Tracing
Bibliographic Notes and Further Reading A recent overview of Periscope can be found in []. The steps in applying Periscope to parallel applications as well as results obtained with Periscope are reported in []. Details on the graphical user interface can be found in []. More information on the formalization of performance properties with the APART Specification Language (ASL) can be found in [].
Bibliography
. Gerndt M, Ott M () Automatic performance analysis with Periscope. Concurr Comput Pract Exp ():–
. Benedict S, Petkov V, Gerndt M () PERISCOPE: an online-based distributed performance analysis tool. In: rd parallel tools workshop, Dresden. Springer, Berlin, pp –, ISBN ---
. Petkov V, Gerndt M () Integrating parallel application development with performance analysis in Periscope. In: th international workshop on high-level programming models and supportive environments (HIPS ), Atlanta, GA, April –. Springer, Berlin, pp –, ISBN ----
. Fahringer T, Gerndt M, Riley G, Träff JL () Specification of performance problems in MPI-programs with ASL. In: International conference on parallel processing (ICPP’), Toronto. IEEE Computer Society, Washington, DC, pp –
Personalized All-to-All Exchange All-to-All
Petaflop Barrier Roadrunner Project, Los Alamos
Petascale Computer Petascale computers are those capable of executing at least 10^15 floating point operations per second. This number of operations per second is known as a petaflop.
Related Entries
Roadrunner Project, Los Alamos
Top
P Petri Nets Jack B. Dennis Massachusetts Institute of Technology, Cambridge, MA, USA
Synonyms Place-transition nets
Definition A Petri Net is a graph model for the control behavior of systems exhibiting concurrency in their operation. The graph is bipartite, the two node types being places drawn as circles, and transitions drawn as bars. The arcs of the graph are directed and run from places to transitions or vice versa. Each place may be empty, or hold a finite number of tokens. The state of a Petri net is the distribution of tokens on its places, called a marking of the net.
A transition is enabled if each of its input places holds at least one token. Firing a transition means removing one token from each input place and adding one token to each output place. A run of a Petri net is any sequence of firings of enabled transitions; a run defines a sequence of markings. Because many transitions may be enabled in a state, there are often many possible distinct runs of a Petri net. Hence, a Petri net represents a kind of nondeterministic state machine, but in a convenient form for modeling and analyzing concurrent systems. Various extensions and generalizations of Petri nets have been found useful in applications.
Discussion Introduction The study of Petri nets began in with the dissertation of Carl Adam Petri entitled Communication with Automata. Petri’s work was promoted in the United States by Anatol Holt and Fred Commoner, who introduced the graphical form of Petri nets now widely used. Further development of properties and applications of Petri nets was done at the MIT Laboratory for Computer Science and elsewhere, and promulgated through workshop meetings held in and . Extensions to Coloured Petri nets, Timed Petri nets, and Continuous Petri nets have been developed for modeling additional properties of concurrent systems. Methods based on Petri nets have become widely used for control aspects of all sorts of concurrent systems, from digital hardware through cooperating computation processes, to diverse applications in areas such as project management and even biological systems.
Definition and Examples Figure shows a Petri net with three places (the circles) and three transitions (the bars) interconnected by directed arcs. The graph is bipartite, the two node types being places and transitions. The places may be empty, or may hold any finite number of tokens, represented by black dots. The distribution of tokens is called a marking of the Petri net and represents its state. Operation of a Petri net produces a sequence of markings by successive applications of the firing rule:
Firing Rule: A transition in a Petri net is enabled if each of its input places holds at least one token. Firing an enabled transition means removing one token from each of its input places and adding one token to each of its output places. Note that, because more than one transition may be enabled in a marking of a Petri net, the net is, in general, a nondeterministic system, that is, many distinct firing sequences may exist starting from a given initial marking.
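The firing rule translates directly into code. In the C sketch below, a marking is an array of token counts and each transition lists the indices of its input and output places; the small three-place net in main is an illustrative stand-in, not a transcription of the net in the figure.

/* Sketch of the Petri net firing rule: a marking is a token count per
 * place; a transition lists its input and output places. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_ARCS 8

typedef struct {
    int n_in, in[MAX_ARCS];    /* indices of input places  */
    int n_out, out[MAX_ARCS];  /* indices of output places */
} Transition;

static bool enabled(const int *marking, const Transition *t) {
    for (int i = 0; i < t->n_in; i++)
        if (marking[t->in[i]] < 1)
            return false;              /* an input place is empty */
    return true;
}

static void fire(int *marking, const Transition *t) {
    for (int i = 0; i < t->n_in; i++)
        marking[t->in[i]]--;           /* remove one token per input place */
    for (int i = 0; i < t->n_out; i++)
        marking[t->out[i]]++;          /* add one token per output place   */
}

int main(void) {
    /* Illustrative net: t0 moves a token from p0 to p1; t1 consumes one
       token from p1 and one from p2 and puts a token back on p0. */
    int marking[3] = { 1, 0, 1 };
    Transition t0 = { 1, {0}, 1, {1} };
    Transition t1 = { 2, {1, 2}, 1, {0} };

    if (enabled(marking, &t0)) fire(marking, &t0);
    if (enabled(marking, &t1)) fire(marking, &t1);

    printf("marking: p0=%d p1=%d p2=%d\n", marking[0], marking[1], marking[2]);
    return 0;
}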
Example: A Barber Shop Figure is a Petri net model for a barber shop. The net includes a place waiting that models a waiting area where customers wait for the barber to serve them, and two places that model the barber’s status, busy or idle. The transitions model events that can occur: A firing of transition enter corresponds to a customer entering the waiting area of the shop. Transition serve corresponds to a customer moving to the barber’s chair for service. Transition done corresponds to the customer leaving the barber chair after receiving service, with the barber becoming idle. This model of the barber shop permits any number of customers to enter the waiting room, but models the constraint that only one customer can occupy the barber chair. This net is an unbounded Petri net because the number of tokens in the waiting place can increase indefinitely.
Petri Nets. Fig. The barber shop with one barber. Places are shown as circles; transitions are shown as bars. Two tokens represent customers in the waiting area. A third token represents the waiting state of the barber
Figure shows a model of a barber shop with two additional features: there are now two barbers and the waiting area has bounded capacity. Place available models the number of empty spaces for customers in the waiting area and place occupied models the occupied spaces. This addition turns the model into a bounded Petri net.
Petri Nets. Fig. The barber shop with two barbers. Either barber may serve a waiting customer
Note that there are now two transitions that model a customer moving to a barber chair. These transitions share an input place and are said to be in conflict. This structure represents the nondeterminacy of deciding which barber serves the next customer in the case that both barbers are idle. A Petri net in which there can be no conflict is determinate, that is, its behavior is such that all firing sequences are equivalent, even though operation is nondeterministic. If one wishes more detail about the discipline of the waiting area, it can be modeled as a FIFO queue, as shown in Fig. . This net contains as many stages as desired (the capacity of the queue), each stage modeling a single queued item (waiting customer). The modeling techniques shown by the barber shop example illustrate application of Petri nets to such systems as pipelined processors and production systems for manufacturing or business workflow. Petri nets lack means to model the timing of actions represented by transition firings, but extensions have been developed to remedy this limitation (see below).
Petri Nets and Finite State Machines The Finite State Machine (FSM) model, a well-known tool used in the design and analysis of switching circuits and other engineering artifacts, lacks the ability
provided by Petri nets to capture the spatial structure of systems in a way that can aid study and analysis. This is illustrated by the FIFO queue of Fig. . By means of N places and N transitions, a queue of length N may be represented, including the occupancy state of each position in the queue. An equivalent FSM description of the queue would have N states. This difference is a strong argument in favor of using Petri nets in system design and analysis. Petri nets can exhibit some interesting properties. A Petri net with a marking may have no enabled transitions; no firings are possible and the net is in deadlock. Also, as seen in Fig. , a Petri net can have firing sequences in which the number of tokens in some places increases without limit. Two definitions restrict Petri nets to subclasses most useful for modeling certain kinds of realistic systems: Liveness: A Petri net with marking M is live if and only if given any marking M′ reachable from the given marking M, and any transition T of the net, there exists a firing sequence starting from M′ that fires transition T. Safety: A Petri net is N-safe for a given marking if and only if no marking reachable from the given marking has more than N tokens in any place. A net and marking that is -safe is said to be safe.
Petri Nets. Fig. Model of a FIFO queue with stage occupied
Note that the definition of live is formulated to rule out nets containing an initialization sequence that is never executed again in steady-state operation of the net. A Petri net cannot be both live and N-safe for a marking unless the net is connected, that is, the net contains a directed path between any pair of nodes. The nets of Fig. and Fig. are both live as marked, but only Fig. is safe because the waiting place in Fig. may accumulate tokens without limit. The class of live and safe Petri nets is especially interesting because these properties are characteristic of real systems that continue in operation indefinitely and are implementable with finite hardware.
Conflict and Determinacy Conflict in a Petri net occurs if, in some reachable marking, two transitions are both enabled and share an input place. A Petri net and marking for which no conflict is possible is determinate. The net in Fig. has no conflict and is determinate; the net of Fig. is not determinate because transitions serve_a and serve_b may be in conflict, representing the choice of barber in the case that both servers are idle.
The Petri Net Hierarchy A convenient hierarchy of classes of Petri nets having increasing modeling power may be specified by syntactic constraints on the structure of a net. The simplest of these subclasses is the marked graphs, which are Petri nets in which each place is an input place of exactly one transition and an output place of exactly one transition. These nets describe systems whose behavior is an endless repetition of a pattern of events. The name marked graphs is given to these nets because they may be drawn with each place, together with its input arc and output arc, represented as a simple arc, as shown in Fig. b for
the net of Fig. a. Transitions are drawn as dots instead of bars, and a marking is shown by placing tokens on the arcs. A marked graph with an initial marking that is live must be connected, and can be shown to be N-safe for some N. Another simple class of Petri nets is the state machines, which are nets in which each transition has exactly one input place and one output place. A state machine with a one-token marking corresponds to an FSM with as many states as the net has places. A Petri net can be both a marked graph and a state machine. Such a net is very simple and consists of a set of disconnected cycles of alternating transitions and places. Both marked graphs and state machines are determinate because conflict is impossible. Free choice Petri nets constitute the next interesting level toward greater expressivity. Free choice nets permit a constrained form of conflict to be represented, corresponding to decisions in a digital system, or to choices based on predicates in a system description. A Petri net is a free choice net if and only if, for each place P that is an input place of two transitions t1 and t2, place P is the only input place of both t1 and t2. This ensures that in any marking in which P holds at least one token, a free choice is available: both t1 and t2 are enabled and either may fire, disabling the other. Figure b is not a free choice net because the decision is conditioned by a barber being available. Free choice Petri nets that have conflict are not determinate for any live marking. A Petri net is a simple Petri net if and only if it has no transition with more than one shared input place. Simple nets can represent systems in which arbitration occurs among several activities competing for shared resources, behavior not representable with free choice nets.
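The structural definitions of marked graphs, state machines, and free choice nets translate directly into checks on the arc lists. The sketch below is illustrative only; it assumes the same {name: (input places, output places)} dictionary used in the earlier sketches.

def places(transitions):
    return {p for ins, outs in transitions.values() for p in ins + outs}

def consumers(transitions, p):      # transitions that have p as an input place
    return [t for t, (ins, _) in transitions.items() if p in ins]

def producers(transitions, p):      # transitions that have p as an output place
    return [t for t, (_, outs) in transitions.items() if p in outs]

def is_marked_graph(transitions):
    # Each place is an input place of exactly one transition and an output
    # place of exactly one transition.
    return all(len(consumers(transitions, p)) == 1 and
               len(producers(transitions, p)) == 1 for p in places(transitions))

def is_state_machine(transitions):
    # Each transition has exactly one input place and one output place.
    return all(len(ins) == 1 and len(outs) == 1
               for ins, outs in transitions.values())

def is_free_choice(transitions):
    # A place shared as input by several transitions must be the only input
    # place of each of those transitions.
    return all(len(consumers(transitions, p)) == 1 or
               all(set(transitions[t][0]) == {p} for t in consumers(transitions, p))
               for p in places(transitions))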
Petri Nets. Fig. A Petri net (a) and its representation as a marked graph (b)
The subclasses of Petri nets form a hierarchy of strictly increasing expressive power: Marked Graphs, State Machines, Free Choice Nets, Simple Nets, All Basic Petri Nets.
Extensions The basic Petri nets described above have found uses in describing digital systems, modeling process synchronization schemes using Dijkstra’s P and V operations, and monitors, and for the control structures underlying dataflow graphs. Methods have been developed for converting Petri nets into logic designs. Outside the realm of computer science, Petri nets have found application in modeling manufacturing systems, among many other uses. On the other hand, basic Petri nets are cumbersome and limited for application to many important problems in the analysis of concurrent systems such as performance analysis and job/work flow management. To make the formalism more generally useful, several extensions of basic Petri nets have been developed, and a rich body of formal analysis and application for these extensions has evolved. The most important extensions are Coloured Petri Nets, Timed Petri Nets, and Continuous/Hybrid Petri Nets.
Coloured Petri Nets In a Coloured Petri Net, each token may be labeled with a color from an arbitrary set of colors. The use of colored tokens permits more compact representations of large and complex systems. If transitions are permitted
to map the colors of input tokens to new colors of output tokens, a powerful model of arbitrary parallel computations is realized. If the set of colors is finite, then any colored Petri net has an equivalent basic Petri net, albeit possibly very large. Thus, the coloring of tokens does not increase the expressive power of basic nets. However, adding hierarchy to the colored Petri nets permits recursive computation to be directly represented.
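A minimal way to picture colored tokens is to key the marking by (place, color) pairs and let a transition map input colors to output colors; with a finite color set, each such pair could equally well become a place of an equivalent basic net. The sketch below is an illustration with assumed names and an assumed two-element color set, not part of the original text.

from collections import Counter

COLORS = {"red", "green"}

def fire_colored(marking, in_place, out_place, color_map):
    # Consume one token of any color from in_place and deposit a token of
    # color color_map[color] into out_place.
    for color in COLORS:
        if marking[(in_place, color)] > 0:
            m = Counter(marking)
            m[(in_place, color)] -= 1
            m[(out_place, color_map[color])] += 1
            return +m          # drop zero entries
    raise ValueError("transition not enabled")

m0 = Counter({("in", "red"): 1})
m1 = fire_colored(m0, "in", "out", {"red": "green", "green": "red"})
print(m1)   # Counter({('out', 'green'): 1})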
Timed Petri Nets One important extension is the addition of explicit timing. One form of this extension is to associate a fixed time with each transition. The interpretation is that when an enabled transition fires, tokens are removed from its input places; then, at the specified later time, tokens are added to the output places of the transition. In this interpretation, a transition may have several instances of operation progressing simultaneously. This may be seen as a generalization of the Program Evaluation and Review Technique (PERT) of critical path planning for project scheduling. Association of time specifications with places and with arcs has also been studied. Analysis methods have been developed for determining bounds on the overall execution time of timed Petri nets. Another variation associates a probability distribution of operation times with each transition, leading to a generalization of queuing networks. Timed Petri nets have significant applications in performance analysis of computer systems, workflow analysis and manufacturing systems.
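The timed interpretation just described (input tokens removed at firing time, output tokens produced after a fixed delay, with several instances of a transition possibly in progress at once) can be simulated with an event queue. The sketch below is illustrative; the transition set, delays, and horizon are assumptions, and it presumes the net cannot fire infinitely often at a single instant.

import heapq
from collections import Counter

def simulate(transitions, delays, marking, horizon=10.0):
    # transitions: {name: (input places, output places)}; delays: {name: time}
    marking, now, pending, trace = Counter(marking), 0.0, [], []
    while now <= horizon:
        fired = True
        while fired:                       # fire greedily at the current time
            fired = False
            for t, (ins, outs) in transitions.items():
                need = Counter(ins)
                if all(marking[p] >= n for p, n in need.items()):
                    marking.subtract(need)
                    heapq.heappush(pending, (now + delays[t], t, tuple(outs)))
                    trace.append((now, "start", t))
                    fired = True
        if not pending:
            break
        now, t, outs = heapq.heappop(pending)   # next completion event
        marking.update(outs)
        trace.append((now, "finish", t))
    return trace

trace = simulate({"work": (["ready"], ["done"])}, {"work": 2.5}, {"ready": 2})
print(trace)   # two firings start at t=0.0 and both finish at t=2.5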
Continuous and Hybrid Petri Nets Another direction in extension of Petri nets is motivated by applications to industrial control and manufacturing
systems. In a continuous Petri net, tokens carry values that are nonnegative real numbers and are thought of as comprising an infinite number of marks. Transitions are permitted to fire continuously so long as each input place holds a positive value. In such a net, a place can model, for example, the amount of liquid in a tank, the number of parts in a bin, or the rate of traffic flow on a highway. In a hybrid Petri net, two kinds of places are distinguished: continuous places that hold real values, and discrete places that hold a nonnegative, integral number of tokens as in basic Petri nets. Transitions are also of two kinds. By means of a transition that has both a continuous place and a discrete place as inputs, turning a flow on and off can be modeled. Many variations and extensions of the Petri net model have arisen in the years since the early developments. The principal record of these developments has been the series of conference proceedings published by Springer as Lecture Notes in Computer Science (LNCS) under the title Applications and Theory of Petri Nets.
Related Entries
CSP (Communicating Sequential Processes)
Data Flow Graphs
Synchronization
Bibliographic Notes and Further Reading The study of Petri nets began with the doctoral dissertation of Carl Adam Petri []. The graphical expression of Petri nets now widely used was presented by Holt and Commoner []. The MIT Project MAC report in which this paper is published includes annotated references to significant prior work in the field. The book by James Peterson [] was the first exposition of the theory and applications of Petri nets in book form, and includes a good summary of the subject’s early development and its literature. The survey paper of Murata [] is an excellent summary of work on basic Petri nets through the s. Timed Petri nets were first studied by Ramchandani [] in and many versions of Petri nets dealing with system timing have been studied since []. The three volumes by Jensen provide a comprehensive treatment of “colored” Petri nets []. Continuous and hybrid Petri nets are treated by David and Alla [], where a review of basic net theory and extensive bibliographic notes may be found. The principal current
record of advances and applications in Petri nets is the series of Conferences now titled “Applications and Theory of Petri Nets” []. An interesting web site is maintained by Carl Adam Petri and Wolfgang Reisig [], and these two authors are the contributors of the Petri net entry in scholarpedia [].
Bibliography
1. David R, Alla H () Discrete, continuous, and hybrid Petri Nets. Springer, Berlin
2. Girault C, Valk R () Petri Nets for systems engineering. Springer, New York
3. Holt A, Commoner F () Events and conditions. In: Record of the project MAC conference on concurrent systems and parallel computation. ACM, New York, pp –
4. Jensen K (, ) Coloured Petri Nets: basic concepts, analysis methods and practical use. Monographs in theoretical computer science, vol , . Springer, Berlin
5. Murata T (Apr ) Petri nets: properties, analysis and applications. Proc IEEE ():–
6. Peterson JL () Petri Net theory and the modeling of systems. Prentice-Hall, Englewood Cliffs
7. Petri CA () Kommunikation mit Automaten. Schriften des Institutes für Instrumentelle Mathematik. Ph.D. dissertation, University of Bonn, Germany
8. Petri CA. http://www.informatik.uni-hamburg.de/TGI/mitarbeiter/profs/petri_eng.html
9. Petri CA, Reisig W () Petri net. Scholarpedia ():. http://www.scholarpedia.org/article/Petri_net
10. Ramchandani C (Feb ) Analysis of asynchronous concurrent systems by Petri Nets. Technical Report MIT/LCS/TR-, MIT Laboratory for Computer Science
11. Springer () Application and theory of Petri Nets. In: Informatik Fachberichte; (–) Advances in Petri Nets. Lecture notes in computer science, vol ; (–) Applications and theory of Petri Nets. Lecture notes in computer science, vol . Springer, Berlin
PETSc (Portable, Extensible Toolkit for Scientific Computation) Barry Smith Argonne National Laboratory, Argonne, IL, USA
Definition The Portable, Extensible Toolkit for Scientific computation (PETSc, pronounced PET-see) is a suite of
open source software libraries for the parallel solution of linear and nonlinear algebraic equations. PETSc uses the Message Passing Interface (MPI) for all of its parallelism.
Discussion The numerical solution of linear systems with sparse matrix representations is at the heart of many numerical simulations, from brain surgery to rocket science. These linear systems arise from the replacement of continuum partial differential equation (PDE) models with suitable discrete models by the use of the finite element, finite volume, finite difference, collocation, or spectral methods and then possibly linearization by a fully implicit strategy such as Newton's method or semi-implicit techniques. The resulting linear systems can range from having a few thousand unknowns to billions of unknowns, thus requiring the largest parallel computers currently available. Large-scale linear systems also arise directly in optimization, economics modeling, and many other non-PDE-based models. The main focus of PETSc is on solving linear systems arising from PDE-based models, though it is applied to other problems as well. PETSc also has limited support for dense matrix computations (through an interface to LAPACK and PLAPACK); but if the computation involves exclusively dense matrices, then PLAPACK or ScaLAPACK are appropriate libraries. PETSc is a software library intended for use by mathematicians, scientists, and engineers with a solid understanding of programming, some basic understanding of the issues in parallel computing (though they need not have programmed in MPI), a basic understanding of numerical analysis, and an understanding of the basics of linear algebra. It has a steeper learning curve than software packages such as Matlab. PETSc can be used directly from Fortran /, C/C++, and Python, with bindings that are specialized to each language. PETSc is used by a variety of parallel PDE solver libraries, including freeCFD, a general-purpose CFD solver; OpenFVM, a finite-volume-based CFD solver; OOFEM, an object-oriented finite element library; libMesh, an adaptive finite element library; and DEAL.II, a sophisticated C++-based finite element simulation package. Magpar is a widely used,
parallel micromagnetics package written using PETSc. For large-scale optimization and the scalable computation of eigenvalues, PETSc has two companion packages, TAO and SLEPc, developed by other groups, that use all of the PETSc parallelism and linear solver infrastructure. The emphasis of the PETSc solvers is on iterative methods for the solution of linear systems, but it provides its own efficient sequential direct (LU and Cholesky factorization-based) solvers as well as interfaces to several parallel direct solvers; see Table . PETSc has a unique configuration system that will automatically download and install the multitude of optional packages that it can use. In addition to the direct solvers, it can use several parallel partitioning packages as well as preconditioners in the hypre and TRILINOS solver packages; see Table . A key design feature of PETSc is the composability of its linear solvers. Two or more solvers may be combined in various ways: by splittings, multigrid, and Schur complementing to produce efficient, problem-specific solvers. The parallelism in PETSc is usually achieved by domain decomposition. The geometry on which the PDE is being solved is divided among the processes,
PETSc (Portable, Extensible Toolkit for Scientific Computation). Table Partial list of direct solvers available in PETSc

Factorization  Package       Complex numbers support  Parallel support
LU             PETSc         x
               SuperLU       x
               SuperLU_Dist  x                        x
               MUMPS         x                        x
               Spooles       x                        x
               PaStiX        x                        x
               IBM's ESSL
               UMFPACK
               LUSOL
Cholesky       PETSc         x
               Spooles       x                        x
               MUMPS         x                        x
               PaStiX        x                        x
               DSCPACK                                x
PETSc (Portable, Extensible Toolkit for Scientific Computation). Table Partial list of preconditioners available in PETSc

Preconditioner        Package           Complex numbers support  Parallel support
ICC(k)                PETSc             x
ILU(k)                PETSc             x
                      Euclid/hypre                               x
ILUdt                 pilut/hypre                                x
Jacobi                PETSc             x
SOR                   PETSc             x
Block Jacobi          PETSc             x                        x
Additive Schwarz      PETSc             x                        x
Geometric multigrid   PETSc             x                        x
Algebraic multigrid   BoomerAMG/hypre                            x
                      ML/TRILINOS                                x
Approximate inverse   SPAI                                       x
                      Parasails/hypre   x                        x
and each process is assigned the unknowns and matrix elements associated with that domain. The communication required during the solution process is then nearest neighbor ghost (halo) point updates and global reductions (using MPI_Allreduce()) over a MPI communicator. PETSc has optimized code based on the inspector-executor model to perform the ghost point updates. In addition to its broad support for linear solvers, PETSc provides robust implementations of Newton’s method for nonlinear systems of equations. These include a variety of line-search and trust-region schemes for globalization. The solvers are extensible, allowing easy provision of user-provided convergence tests, line-search strategies, and damping strategies. Several variants of the Eisenstat–Walker convergence criteria for inexact Newton solves are available. There is also support for grid sequencing to efficiently generate high-quality initial solutions for fine grids. To compute the Jacobians commonly needed for Newton’s method, PETSc provides coloring of sparse matrices and efficient computation of the Jacobian entries using the coloring
with finite differencing, ADIC (automatic differentiation for C programs), and ADIFOR (automatic differentiation for Fortran programs). All of these run scalably in parallel. PETSc also provides a family of implicit and explicit ODE integrators, including an extensive suite of explicit Runge–Kutta methods. The implicit methods support all the functionality of the PETSc nonlinear solvers and use of any of the Krylov methods and preconditioners. The more sophisticated adaptive time-stepping ODE integrators of SUNDIALS can also be used with PETSc and allow use of all PETSc preconditioners. Provided in PETSc is an infrastructure for profiling the parallel performance of the application and the solvers it uses, including floating-point operations performed and the number and sizes of messages sent and received. It provides the results in a table that indicates the percentage of time spent in the various parts of the solver and application. Development of PETSc was started in by Bill Gropp, Lois Curfman McInnes, and Barry Smith at Argonne National Laboratory. They were joined shortly thereafter by Satish Balay. Aside from a small amount of National Science Foundation funding in the mid-s, the US Department of Energy has provided the funding for PETSc development and support. Since its origin, PETSc has received software contributions from many of its users. PETSc was the winner of an R&D award. It has also formed the basis of three Gordon Bell Prize application codes in , , and as well as several Gordon Bell finalists.
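The nearest-neighbor ghost (halo) update and global reduction pattern described earlier in this section can be illustrated outside PETSc. The rough sketch below uses mpi4py on a 1-D block decomposition; the array sizes and variable names are assumptions, not PETSc code.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nlocal = 4
u = np.zeros(nlocal + 2)                # local values plus one ghost on each side
u[1:-1] = rank                          # interior values owned by this rank
left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Exchange ghost (halo) points with the two neighbors.
comm.Sendrecv(u[1:2],  dest=left,  recvbuf=u[-1:], source=right)
comm.Sendrecv(u[-2:-1], dest=right, recvbuf=u[0:1], source=left)

# Global reduction over the communicator, as a norm in a convergence test would do.
local_sq = float(np.dot(u[1:-1], u[1:-1]))
norm = np.sqrt(comm.allreduce(local_sq, op=MPI.SUM))
if rank == 0:
    print("global 2-norm of u:", norm)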
Library Design PETSc follows the distributed-memory single program multiple data (SPMD) model of MPI, with the flexibility of having different types of computation running on different processes. Specifically PETSc allows users to create their own MPI communicators and designate computations for PETSc to perform on each of these communicators. A typical application code written with PETSc requires very few MPI calls by the developer. PETSc is written in C using the object-oriented programming techniques of data encapsulation, polymorphism, and inheritance. Opaque objects are defined that contain function tables (using C function pointers) used to call the code appropriate for the underlying
data structures. The six main abstract classes in PETSc are the Vec vector class for managing the system solutions, the Mat matrix class for managing the sparse matrices, the KSP Krylov solver class for managing the iterative accelerators, the PC preconditioner class, the SNES nonlinear solver class, and the TS ordinary differential equations (ODE) integrator class. The DM helper class manages transferring information about grids and discretizations into the Vec and Mat classes. Virtually all of the parallel communication required by PETSc (the MPI message passing and collective calls) takes place within these objects. The constructor for each PETSc object takes an MPI communicator, which determines on what processes the object and its computations will reside. The most common are MPI_COMM_WORLD, in which the object is distributed across all the user’s processes (and computations involving the object will require communication within that communicator), and MPI_COMM_SELF, in which the object lives on just that process and no communication is ever required for its computations. A typical application that requires linear solvers has a structure as depicted in Fig. . In this example, the DA object, which is an implementation of the DM class for structured grids, is used to construct the needed sparse matrix and vectors to contain the solution and right-hand side; it serves as a factory for Vec and Mat objects. Once the numerical values of the matrix are set, in this case by calls to MatSetValues(), the matrix is provided to the linear solver via KSPSetOperators(). Since MatSetValues() may be called with values that belong to any process, the calls to MatAssemblyBegin/End() are used to communicate the values to the process where they belong. Values may be set into vectors either by using VecSetValues(), with a concluding VecAssemblyBegin/End(), as with matrices, or by accessing the array of values using VecGetArray(), VecGetArrayF(), or DAVecGetArray() and putting values directly into the array. In this latter case, no communication of off-process values is done by PETSc. A typical application that requires nonlinear solvers has a structure as depicted in Fig. . In addition to serving as a factory for the Jacobian sparse matrix and solution vector (as in the linear case), the DA object is used as a factory for the ghosted representation of the solution xlocal and performs the ghost point updates with DAGlobalToLocal() in the routines
ComputeFunction() and ComputeJacobian(). These call-back routines are registered with the nonlinear solver object SNES with the routines SNESSetFunction() and SNESSetJacobian(). They are called when needed by the solver class. A typical application that requires ODE integration has a structure as depicted in Fig. . This simple example uses the Python interface to TS where the entire discretized ODE (in this case using the backward Euler method) is provided directly as the function and Jacobian. It is also possible to provide the function and Jacobian of the right-hand side of the ODE, that is, ut = F(u), and have the TS class manage the ODE discretization, with either an explicit or an implicit scheme. Each PETSc object has a method XXXSetFromOptions() that allows runtime control of almost all of the solver options through the PETSc options database. Command-line arguments (as keyword value pairs) are stored in a simple database. The XXXSetFromOptions() routines then search the options database, select any appropriate options, and apply them. For example, the option -ksp_type gmres is used by KSPSetFromOptions() to call KSPSetType() to set the solver type to GMRES. The options database may also be used directly by user code. Also common to all classes are the XXXView() methods. These provide a common interface to printing and saving information about any object to a PetscView object, which is an abstract representation of a binary file, a text file (like stdout), a graphical window for drawing, or a Unix socket. For example, MatView(Mat A,PetscViewer v) will present the matrix in a wide variety of ways depending on the viewer type and its state. Calling the viewer method on a solver class, such as SNES, displays the type of solver and all its options; see Fig. for an example. Note that the figure displays both the nonlinear and linear solver options. For the Mat class, PETSc provides several realizations. The most important of these are the following: ● ●
● Compressed sparse row (CSR) format (illustrated in the sketch following this list)
● Point-block version of the CSR where a single index is used for small dense blocks of the matrix
● Symmetric version of the point-block CSR that requires roughly one-half the storage
● User-provided format (via inheritance)
program main ! Solves the linear system J x = f #include "finclude/petscalldef.h" use petscksp; use petscda Vec x,f; Mat J; DA da; KSP ksp; PetscErrorCode ierr call PetscInitialize(PETSC_NULL_CHARACTER,ierr) call DACreate1d(MPI_COMM_WORLD,DA_NONPERIODIC,8,1,1,PETSC_NULL_INTEGER,da,ierr) call DACreateGlobalVector(da,x,ierr); call VecDuplicate(x,f,ierr) call DAGetMatrix(da,MATAIJ,J,ierr) call ComputeRHS(da,f,ierr) call ComputeMatrix(da,J,ierr) call call call call
KSPCreate(MPI_COMM_WORLD,ksp,ierr) KSPSetOperators(ksp,J,J,SAME_NONZERO_PATTERN,ierr) KSPSetFromOptions(ksp,ierr) KSPSolve(ksp,f,x,ierr)
call MatDestroy(J,ierr); call VecDestroy(x,ierr); call VecDestroy(f,ierr) call KSPDestroy(ksp,ierr); call DADestroy(da,ierr) call PetscFinalize(ierr) end subroutine ComputeRHS(da,x,ierr) #include "finclude/petscalldef.h" use petscda DA da; Vec x;PetscErrorCode ierr;PetscInt xs,xm,i,mx; PetscScalar hx;PetscScalar, pointer::xx(:) call DAGetInfo(da,PETSC_NULL_INTEGER,mx,PETSC_NULL_INTEGER,PETSC_NULL_INTEGER,PETSC_NULL_INTEGER,... call DAGetCorners(da,xs,PETSC_NULL_INTEGER,PETSC_NULL_INTEGER,xm,PETSC_NULL_INTEGER,... hx = 1.d0/(mx-1) call VecGetArrayF90(x,xx,ierr) do i=xs,xs+xm-1 xx(i) = i*hx enddo call VecRestoreArrayF90(x,xx,ierr) return end subroutine ComputeMatrix(da,J,ierr) #include "finclude/petscalldef.h" use petscda Mat J; DA da; PetscErrorCode ierr; PetscInt xs,xm,i,mx; PetscScalar hx call DAGetInfo(da,PETSC_NULL_INTEGER,mx,PETSC_NULL_INTEGER,PETSC_NULL_INTEGER,PETSC_NULL_INTEGER,... call DAGetCorners(da,xs,PETSC_NULL_INTEGER,PETSC_NULL_INTEGER,xm,PETSC_NULL_INTEGER,... hx = 1.d0/(mx-1) do i=xs,xs+xm-1 if ((i .eq. 0) .or. (i .eq. mx-1)) then call MatSetValue(J,i,i,1d0,INSERT_VALUES,ierr) else call MatSetValue(J,i,i-1,-hx,INSERT_VALUES,ierr) call MatSetValue(J,i,i+1,-hx,INSERT_VALUES,ierr) call MatSetValue(J,i,i,2*hx,INSERT_VALUES,ierr) endif enddo call MatAssemblyBegin(J,MAT_FINAL_ASSEMBLY,ierr); call MatAssemblyEnd(J,MAT_FINAL_ASSEMBLY,ierr) return end
PETSc (Portable, Extensible Toolkit for Scientific Computation). Fig. Example of linear solver usage in PETSc in Fortran
● "Matrix-free" representations, where the matrix entries are not explicitly stored; instead, matrix-vector products are performed by using one of the following:
– Finite differencing of the function evaluations
– Automatic differentiation of the function evaluations using either ADIC, for C language code, or ADIFOR, for Fortran language code
– User-provided C, C++, Fortran, or Python routine
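The difference between the CSR and point-block CSR formats listed above comes down to how column indices are stored and how the inner product loop is written. The sketch below is illustrative NumPy code with assumed data layouts, not PETSc's implementation.

import numpy as np

def csr_matvec(vals, cols, rowptr, x):
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(y)):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[cols[k]]          # one column index per nonzero
    return y

def bcsr_matvec(blocks, bcols, browptr, x, bs):
    # blocks: list of bs-by-bs dense NumPy arrays, one per stored block.
    nb = len(browptr) - 1
    y = np.zeros(nb * bs)
    for i in range(nb):
        for k in range(browptr[i], browptr[i + 1]):
            j = bcols[k]                           # one column index per block
            y[i*bs:(i+1)*bs] += blocks[k] @ x[j*bs:(j+1)*bs]
    return y

The block form loads bs values of x per column index and reuses them across the bs rows of the block, which is the register reuse discussed later in this entry.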
static char help[] = "Solves -Laplacian u - exp(u) = 0, 0 < x < 1\n\n";
#include "petscda.h"
#include "petscsnes.h"
int main(int argc,char **argv) { SNES snes; Vec x,f; Mat J; DA da; PetscInitialize(&argc,&argv,(char *)0,help); DACreate1d(PETSC_COMM_WORLD,DA_NONPERIODIC,8,1,1,PETSC_NULL,&da); DACreateGlobalVector(da,&x); VecDuplicate(x,&f); DAGetMatrix(da,MATAIJ,&J); SNESCreate(PETSC_COMM_WORLD,&snes); SNESSetFunction(snes,f,ComputeFunction,da); SNESSetJacobian(snes,J,J,ComputeJacobian,da); SNESSetFromOptions(snes); SNESSolve(snes,PETSC_NULL,x); MatDestroy(J); VecDestroy(x); VecDestroy(f); SNESDestroy(snes); DADestroy(da); PetscFinalize(); return 0;} PetscErrorCode ComputeFunction(SNES snes,Vec x,Vec f,void *ctx) { PetscInt i,Mx,xs,xm; PetscScalar *xx,*ff,hx; DA da = (DA) ctx; Vec xlocal; DAGetInfo(da,PETSC_IGNORE,&Mx,PETSC_IGNORE,PETSC_IGNORE,PETSC_IGNORE,PETSC_IGNORE,PETSC_IGNORE,... hx = 1.0/(PetscReal)(Mx-1); DAGetLocalVector(da,&xlocal);DAGlobalToLocalBegin(da,x,INSERT_VALUES,xlocal);DAGlobalToLocalEnd(da,x,... DAVecGetArray(da,xlocal,&xx); DAVecGetArray(da,f,&ff); DAGetCorners(da,&xs,PETSC_NULL,PETSC_NULL,&xm,PETSC_NULL,PETSC_NULL); for (i=xs; i<xs+xm; i++) { if (i == 0 || i == Mx-1) ff[i] = xx[i]/hx; else ff[i] = (2.0*xx[i] - xx[i-1] - xx[i+1])/hx - hx*PetscExpScalar(xx[i]); } DAVecRestoreArray(da,xlocal,&xx); DARestoreLocalVector(da,&xlocal);DAVecRestoreArray(da,f,&ff); return 0;} PetscErrorCode ComputeJacobian(SNES snes,Vec x,Mat *J,Mat *B,MatStructure *flag,void *ctx){ DA da = (DA) ctx; PetscInt i,Mx,xm,xs; PetscScalar hx,*xx; Vec xlocal; DAGetInfo(da,PETSC_IGNORE,&Mx,PETSC_IGNORE,PETSC_IGNORE,PETSC_IGNORE,PETSC_IGNORE,PETSC_IGNORE,... hx = 1.0/(PetscReal)(Mx-1); DAGetLocalVector(da,&xlocal);DAGlobalToLocalBegin(da,x,INSERT_VALUES,xlocal);DAGlobalToLocalEnd(da,x,... DAVecGetArray(da,xlocal,&xx); DAGetCorners(da,&xs,PETSC_NULL,PETSC_NULL,&xm,PETSC_NULL,PETSC_NULL); for (i=xs; i<xs+xm; i++) { if (i == 0 || i == Mx-1) { MatSetValue(*J,i,i,1.0/hx,INSERT_VALUES);} else { MatSetValue(*J,i,i-1,-1.0/hx,INSERT_VALUES); MatSetValue(*J,i,i,2.0/hx - hx*PetscExpScalar(xx[i]),INSERT_VALUES); MatSetValue(*J,i,i+1,-1.0/hx,INSERT_VALUES); } } MatAssemblyBegin(*J,MAT_FINAL_ASSEMBLY);MatAssemblyEnd(*J,MAT_FINAL_ASSEMBLY);*flag = SAME_NONZERO_... DAVecRestoreArray(da,xlocal,&xx);DARestoreLocalVector(da,&xlocal); return 0;}
PETSc (Portable, Extensible Toolkit for Scientific Computation). Fig. Example of nonlinear solver usage in PETSc in C
Because PETSc is focused on PDE problems, row-based storage of the sparse matrices (each process holds a collection of contiguous rows of the matrix) is satisfactory for higher-performance parallel matrix operations. Hence, all of PETSc's built-in sparse matrix implementations use this approach. Custom formats can be provided to handle parallelism for
“arrow-head” matrices where row-based distribution does not scale. PETSc has the point-block-based storage of sparse matrices for faster performance. The speed of sparse matrix computations is essentially always strongly limited by the memory bandwidth of the system, not by the CPU speed. The reason is that sparse matrix
import sys, petsc4py petsc4py.init(sys.argv) from petsc4py import PETSc import math class MyODE: def __init__(self,da): self.da = da def function(self, ts,t,x,f): mx = da.getSizes(); mx = mx[0]; hx = 1.0/mx (xs,xm) = da.getCorners(); xs = xs[0]; xm = xm[0] xx = da.createLocalVector() da.globalToLocal(x,xx) dt = ts.getTimeStep() x0 = ts.getSolution() if xs == 0: f[0] = xx[0]/hx; xs = 1; if xs+xm >= mx: f[mx-1] = xx[xm-(xs==1)]/hx; xm = xm-(xs==1); for i in range(xs,xs+xm-1): f[i] = (xx[i-xs+1]-x0[i])/dt + (2.0*xx[i-xs+1]-xx[i-xs]-xx[i-xs+2])/hx - hx*math.exp(xx[i-xs+1]) f.assemble() def jacobian(self,ts,t,x,J,P): mx = da.getSizes(); mx = mx[0]; hx = 1.0/mx (xs,xm) = da.getCorners(); xs = xs[0]; xm = xm[0] xx = da.createLocalVector() da.globalToLocal(x,xx) x0 = ts.getSolution() dt = ts.getTimeStep() P.zeroEntries() if xs == 0: P.setValues([0],[0],1.0/hx); xs = 1; if xs+xm >= mx: P.setValues([mx-1],[mx-1],1.0/hx); xm = xm-(xs==1); for i in range(xs,xs+xm-1): P.setValues([i],[i-1,i,i+1],[-1.0/hx,1.0/dt+2.0/hx-hx*math.exp(xx[i-xs+1]),-1.0/hx]) P.assemble() return True # same_nz da = PETSc.DA().create([9],comm=PETSc.COMM_WORLD) f = da.createGlobalVector() x = f.duplicate() J = da.getMatrix(PETSc.MatType.AIJ); ts = PETSc.TS().create(PETSc.COMM_WORLD) ts.setProblemType(PETSc.TS.ProblemType.NONLINEAR) ts.setType(’python’) ode = MyODE(da) ts.setFunction(ode.function, f) ts.setJacobian(ode.jacobian, J, J) ts.setTimeStep(0.1) ts.setDuration(10, 1.0) ts.setFromOptions() x.set(1.0) ts.solve(x)
PETSc (Portable, Extensible Toolkit for Scientific Computation). Fig. Example of ODE usage in PETSc in Python
computations involve few operations per matrix entry. For example, for matrix-vector products there are two floating-point operations (a multiply and an addition) for each entry in the matrix. Memory-bandwidth-limited computations are sometimes said to hit the memory wall. In the CSR format, there is a column index for every nonzero entry in the matrix, and each value of the source vector loaded from memory is used in only a single multiply-add. In the block CSR format (with block
SNES Object: type: ls line search variant: SNESLineSearchCubic alpha=0.0001, maxstep=1e+08, minlambda=1e-12 maximum iterations=50, maximum function evaluations=10000 tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 KSP Object: type: fgmres GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization GMRES: happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000 right preconditioning using UNPRECONDITIONED norm type for convergence test PC Object: type: mg MG: type is FULL, levels=2 cycles=v Coarse grid solver -- level 0 presmooths=1 postsmooths=1 ----KSP Object:(mg_coarse_) type: preonly PC Object:(mg_coarse_) type: lu LU: out-of-place factorization matrix ordering: nd LU: tolerance for zero pivot 1e-12 LU: factor fill ratio needed 1.875 Matrix Object: type=seqaij, rows=64, cols=64 total: nonzeros=1024, allocated nonzeros=1024 using I-node routines: found 16 nodes, limit used is 5 Down solver (pre-smoother) on level 1 smooths=1 -------------------KSP Object:(mg_levels_1_) type: gmres GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization GMRES: happy breakdown tolerance 1e-30 maximum iterations=1 tolerances: relative=1e-05, absolute=1e-50, divergence=10000 left preconditioning using nonzero initial guess using PRECONDITIONED norm type for convergence test PC Object:(mg_levels_1_) type: ilu ILU: 0 levels of fill ILU: factor fill ratio allocated 1 ILU: tolerance for zero pivot 1e-12 Matrix Object: type=seqaij, rows=196, cols=196 total: nonzeros=3472, allocated nonzeros=3472 using I-node routines: found 49 nodes, limit used is 5 Matrix Object: type=seqaij, rows=196, cols=196 total: nonzeros=3472, allocated nonzeros=3472 using I-node routines: found 49 nodes, limit used is 5
PETSc (Portable, Extensible Toolkit for Scientific Computation). Fig. Example of output using SNESView()
size bs), there is one column index per block, and the matrix-vector product may be coded as y[bs*i + k] = Σ_j Σ_c a[j][k][c] * x[bs*col[j] + c], where j runs over the blocks of block row i and c over the columns within each block, so that loop
unrolling can keep the reused values in registers. Using the block CSR when appropriate, depending on the particular processor, can improve the performance of the sparse matrix operators by a factor of to . The KSP Krylov accelerator class provides over a dozen Krylov methods; see Table . The data encapsulation and polymorphic design of the Vec, Mat, and PC classes in PETSc allow the immediate use of any of their
PETSc (Portable, Extensible Toolkit for Scientific Computation). Table Partial list of Krylov methods available in PETSc
Richardson (simple) iteration, x_{n+1} = x_n + B(b − A x_n)
Chebychev iteration
Conjugate gradient method
Biconjugate gradient
Biconjugate gradient stabilized (bi-CG-stab)
Conjugate residuals
Conjugate gradient squared
Minimum residuals (MINRES)
Generalized minimal residual (GMRES)
Flexible GMRES (fGMRES)
Transpose-free quasi-minimal residuals (QMR)
implementations with any of the Krylov solvers. When possible, these are implemented to allow left, right, or symmetric preconditioning and the use of various norms of the residual in the convergence tests including the “natural” (energy) norm. Custom convergence tests and monitoring routines can be provided to any of the solvers. The PC preconditioners class contains a variety of both classical and modern preconditioners including incomplete factorizations, domain decomposition methods, and multigrid methods. See Table for a partial list. In addition, several preconditioner classes are designed to allow composition of solvers. These include PCKSP, which allows using a Krylov method as a preconditioner; PCFieldSplit, which allows constructing solvers by composing solvers for different fields of the solution; KSPCOMPOSITE, which allows combining arbitrary solvers; and PCGALERKIN, which constructs preconditioners by the Galerkin process (that is, as projections of the error in some appropriate inner product). PCFieldSplit preconditioners are often called block preconditioners; for example, when one field is velocity and another pressure, the resulting Stokes solver is often solved with one block for velocity and one for pressure.
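The algebraic idea behind a field-split (block) preconditioner can be sketched independently of the PETSc API: a 2x2 block system gets one inner solve per field. The block-diagonal choice, the dense inner solves, and the outer Richardson iteration below are assumptions made for illustration; whether the outer iteration converges depends on the blocks.

import numpy as np

def fieldsplit_apply(A, D, r, nu):
    # Apply M^{-1} r with M = blkdiag(A, D); nu is the size of the first field
    # (e.g., velocity), the remainder the second field (e.g., pressure).
    z = np.empty_like(r)
    z[:nu] = np.linalg.solve(A, r[:nu])      # first-field inner solve
    z[nu:] = np.linalg.solve(D, r[nu:])      # second-field inner solve
    return z

def precond_richardson(K, A, D, b, nu, iters=50):
    # Outer iteration x <- x + M^{-1}(b - K x) using the block preconditioner.
    x = np.zeros_like(b)
    for _ in range(iters):
        x += fieldsplit_apply(A, D, b - K @ x, nu)
    return x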
Applications A wide variety of simulation applications have been written by using PETSc. These include fluid flow for
aircraft, ship, and automobile design; blood flow simulation for medical device design; porous media flow for oil reservoir simulation for energy development and groundwater contamination modeling; modeling of materials properties; economic modeling; structural mechanics for construction design; combustion modeling; and nuclear fission and fusion for energy development. PETSc-FUN3D was an early application based on Kyle Anderson's NASA code, FUN3D, that solves the Euler and Navier–Stokes equations, both compressible and incompressible, on unstructured grids. PETSc-FUN3D won a Gordon Bell special prize in , running on over , of the ASCI Red processors. This application, the dissertation work of Dinesh Kaushik, motivated many of the early optimizations of PETSc. The forward and inverse modeling of earthquakes using the PETSc algebraic solvers, developed by Volkan Akcelik, Jacobo Bielak, George Biros, Ioannis Epanomeritakis, Antonio Fernandez, Omar Ghattas, Eui Joong Kim, David O'Hallaron, and Tiankai Tu, resulted in a Gordon Bell special prize. The algebraic multigrid solver Prometheus was written by Mark Adams using the PETSc Vec, Mat, KSP, and PC classes. It takes advantage of the block CSR sparse format in PETSc to maximize performance. It was used to simulate whole-bone micromechanics with over half a billion degrees of freedom, resulting in a Gordon Bell special prize. PETSc has been used by several research groups to simulate heart arrhythmias, which are the cause of the majority of sudden cardiac deaths. These applications involve solving the nonlinear bidomain equations, which are two coupled partial differential equations that model the intracellular and extracellular potential of the heart. Numerical solutions to these equations (and more sophisticated models) explain much of the electrical behavior of the heart, including defibrillation. PFLOTRAN, led by Peter Lichtner of Los Alamos National Laboratory, is a subsurface flow and contaminant transport simulator that uses the PETSc DM class to manage the parallelism of its mesh, the SNES nonlinear solver class for the solutions needed at each time step, the Mat class to contain the sparse Jacobians, and the Vec class for its flow and contaminant solutions. It has been run on up to , cores of the Cray XT
and has been used to more accurately model uranium plumes at DOE's Hanford site. The UNIC neutronics package developed by Mike Smith and Dinesh Kaushik of Argonne National Laboratory has run full reactor core simulations on , cores of the IBM Blue Gene/P. It supports both the second-order Pn and Sn methods with dozens of energy groups. It parallelizes simultaneously over the geometry (by means of domain decomposition) and over angles, using a hierarchy of MPI communicators and PETSc solver objects.
Related Entries
Algebraic Multigrid
BLAS (Basic Linear Algebra Subprograms)
Chaco
Distributed-Memory Multiprocessor
Domain Decomposition
LAPACK
Memory Wall
METIS and ParMETIS
MPI (Message Passing Interface)
PLAPACK
Scalability
ScaLAPACK
SPAI (SParse Approximate Inverse)
SPMD Computational Model
SuperLU
solver BoomerAMG. SUNDIALS specializes in nonlinear solvers and adaptive ODE integrators; it expects the required linear solver to be provided by the user or another package. Many of the solvers in these other packages can be called through PETSc.
Bibliography
1. Balay S, Buschelman K, Eijkhout V, Gropp WD, Kaushik D, Knepley MG, McInnes LC, Smith BF, Zhang H () PETSc users manual. Argonne National Laboratory Technical Report ANL-/ - Revision ..
2. Balay S, Gropp WD, McInnes LC, Smith BF () Efficient management of parallelism in object oriented numerical software libraries. In: Arge E, Bruaset AM, Langtangen HP (eds) Modern software tools in scientific computing. Birkhauser Press, Boston, pp –
3. Balay S, Gropp WD, McInnes LC, Smith BF () Software for the scalable solution of PDEs. In: Dongarra J, Foster I, Fox G, Gropp B, Kennedy K, Torczon L, White A (eds) CRPC handbook of parallel computing. Morgan Kaufmann Publishers
4. J Dongarra's freely available software for linear algebra. http://www.netlib.org/utk/people/JackDongarra/la-sw.html
5. List of external software packages available from PETSc. http://www.mcs.anl.gov/petsc/petsc-as/miscellaneous/external.html
6. Saad Y () Iterative methods for sparse linear systems, nd edn. SIAM
7. Partial list of applications written using PETSc. http://www.mcs.anl.gov/petsc/petsc-as/publications/petscapps.html
8. PETSc's webpage. http://www.mcs.anl.gov/petsc
9. TRILINOS's webpage. http://trilinos.sandia.gov
10. Hypre's webpage. https://computation.llnl.gov/casc/linear_solvers/sls_hypre.html
11. SUNDIALS's webpage. https://computation.llnl.gov/casc/sundials/main.html
Bibliographic Notes and Further Reading The PETSc web site is the best location for up-to-date information on PETSc []. A complete list of external packages that PETSc can use is given in []. More details of the applications developed by using PETSc can be found at []. Further details on the design decisions made in PETSc may be found in []. Other related parallel solver packages include TRILINOS [], hypre [], and SUNDIALS []. TRILINOS is a large, general-purpose solver package much in the spirit of PETSc and written largely in C++; it currently has little support for use from Fortran. The hypre package specializes in high-performance preconditioners and includes a scalable algebraic multigrid
PGAS (Partitioned Global Address Space) Languages George Almasi IBM, Yorktown Heights, NY, USA
Definition PGAS (Partitioned Global Address Space) is a programming model suited for shared and distributed memory parallel machines, e.g., machines consisting of many (up to hundreds of thousands of) CPUs.
Shared memory in this context means that the total of the memory space is available to every processor in the system (although access time to different banks of this memory can be different on each processor). Distributed memory is scattered across processors; access to other processors’ memory is usually through a network. A PGAS system, therefore, consists of the following components:
ubiquity of shared memory, the method of accessing remote memory, or the choice of a parent programming language. Consequently there is a wide variety of PGAS-like languages and libraries: ●
● A set of processors, each with attached local storage. Parts of this local storage can be declared private by the programming model, and are not visible to other processors.
● A mechanism by which at least a part of each processor's storage can be shared with others. Sharing can be implemented through the network device with system software support, or through hardware shared memory with cache coherence. This, of course, can result in large variations of memory access latency (typically, a few orders of magnitude) depending on the location and the underlying access method for a particular address.
● Every shared memory location has an affinity – a processor on which the location is local and therefore access is quick. Affinity is exposed to the programmer in order to facilitate performance and scalability stemming from "owner compute" strategies.
All PGAS programming languages contain the components enumerated above, although the ways in which these are made available to the programmer differ. Every PGAS language allows programmers to distinguish between private and shared memory locations, and to determine the affinity of shared memory locations. Some PGAS languages provide work distribution primitives such as parallel loops based on affinity, or program syntax to allow special handling of remote (and therefore slow) data accesses. The rest of this entry expands on some of these differences between PGAS languages.
Discussion Introduction There exist a variety of choices for PGAS languages and implementations. Some of these choices are about the
● UPC [] (Unified Parallel C) is a language descended from C. It extends C arrays and pointers with shared arrays and shared pointers that address into global memory. UPC also features a forall loop that distributes iterations based on affinity of array elements.
● Coarray Fortran [] is a Fortran-based language that extends Fortran arrays with co-dimensions that allow accessing arrays on other processes (called images). A variant of Coarray Fortran is included in the Fortran standard, making it the only PGAS language with ISO approval.
● Split-C [] is a C-based PGAS language that acknowledges the latency of remote memory accesses by allowing split-phase, or non-blocking, transactions. This allows overlapping of remote accesses with computation, hiding latency.
● Titanium [] is a Java-based PGAS language. Titanium features SPMD parallelism, pointers to shared data, and an advanced distributed array model.
● ZPL [] is an array-based language featuring the global view programming model.
● Chapel [] is Cray Inc's flagship modern programming language. It incorporates elements of ZPL but also features the multiresolution paradigm, allowing users to bore down to performance from an initial high-level program.
● X10 [] is a PGAS language that provides task parallelism as well as data parallelism. The key feature of X10 is asynchronous task dispatching.
● HPF [] (High-Performance Fortran) is an early attempt to solidify concepts from global view array programming in a Fortran-based language. It is one of the bases from which the PGAS concepts grew.
● MPI [] (Message Passing Interface) is the de facto standard for high-performance parallel programming. It does not implement the PGAS programming model, since it does not have the concept of global memory: all inter-processor data exchange is explicit. However, MPI contains many ideas and concepts relevant to PGAS, and that makes it worth mentioning in this context.
● OpenMP [] is a cross-language standard for shared-memory programming used widely in the high-performance computing world. The standard allows loops to be annotated as executed in parallel, and variables as shared or private; the newer standard has task-parallel features as well. OpenMP is in a similar situation to MPI: not a PGAS language, but containing many relevant concepts.
● Global Arrays [] is a library for parallel array computing. It provides an abstraction of a shared array but is backed by distributed memory. Actual memory operations are implemented by a one-sided messaging library called ARMCI.
● HTAs (Hierarchical Tiled Arrays) are another library-based approach, providing the user with an array abstraction embedded into the multiple levels of a distributed system's memory hierarchy. HTAs can be laid out to reflect this hierarchy: levels of cache, shared memory with affinity to particular processors, and of course nonlocal memory accessed (under the covers) by messaging.
Local Versus Shared Memory All PGAS languages distinguish between local, shared local, and shared remote memory; however, the default assignment of memory to the local versus shared space varies greatly across the space of languages. All memory in MPI (the Message Passing Interface standard) is local, and the only way to convey information to another process space is through messages. In contrast, all memory in OpenMP (a GAS programming paradigm) is global, and the only way to make memory locations safe from other threads is to explicitly denote them as thread private. In UPC, Coarray Fortran, and Split-C, memory is declared as private by default and has to be made global with an explicit declaration modifier. In Titanium, program stacks are thread-private, but the heap is shared by default. In the array language ZPL and in the HTA library, all arrays are shared by default. In X10, memory is local and only accessible by sending units of work ("asyncs") to the remote locations to execute.
Computation and Address Spaces Parallelism implies multiple processing units executing a particular program. However, the relationship between executing programs and address spaces differs
across programming languages. In UPC and Coarray Fortran, address spaces are tightly bound to computation. Execution units are called threads in UPC; address affinity is calculated relative to UPC threads. Titanium calls the execution units processes, and locality is bound to these implicitly. By contrast, in Coarray Fortran it is the address spaces themselves that are named – images – and the implication is that each image has computation executing on it. X10 completely separates the notions of address space and computation. Every address space is called a place, and multiple computational threads called activities are allowed to execute simultaneously, subject to the capability of the hardware.
Messaging The PGAS programming model does not make any representation about the mechanics of accessing data in nonlocal address spaces. On distributed-memory hardware, data exchange is done by exchanging messages across whatever network devices are available on the hardware in question; PGAS programming models are implemented on top of a messaging system. The preferred messaging system for PGAS implementations is one sided: that is, one of the participants is active and is responsible for specifying all parameters of the exchange (identities of sender and receiver, addresses on both ends, amount of data), while the other participant is passive and contributes nothing but the data itself. Active messages are also used preferentially by PGAS languages. Active messages differ from one-sided messages in that the passive participant is called upon to execute user code as part of receiving the message. Every PGAS language makes a choice as to what extent language syntax hides the underlying messaging system. In Split-C, messages look like assignments, and provisions are made to hide the large latency of such messages. In UPC, local and shared assignments have the same syntax, making them indistinguishable; however, the programmer is allowed to write explicit one-sided messages into the program. Even third-party messages are allowed (e.g., UPC thread A specifying a data transfer between threads B and C). Coarray Fortran and Chapel do not allow explicit messaging. X10 exposes messaging to the programmer in the form of asyncs, which are very close in concept to active messages.
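The one-sided style described here can be mimicked with MPI's RMA windows. In the rough sketch below (mpi4py; the block size, names, and synchronization choices are assumptions), each rank exposes its block of a conceptual global array and any rank can read a remote element without the owner executing matching receive code.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

block = 4
local = np.arange(block, dtype='d') + rank * block     # this rank's partition
win = MPI.Win.Create(local, disp_unit=local.itemsize, comm=comm)

def global_get(gindex):
    # Read global element gindex, wherever its affinity places it; the target
    # displacement is in units of disp_unit.
    owner, offset = divmod(gindex, block)
    buf = np.empty(1, dtype='d')
    win.Lock(owner, MPI.LOCK_SHARED)
    win.Get([buf, MPI.DOUBLE], owner, target=offset)
    win.Unlock(owner)
    return buf[0]

comm.Barrier()                      # make sure every window has been created
print(rank, "reads element 0 of the last block:", global_get((size - 1) * block))
win.Free()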
References to Remote Memory Just as in the C language arrays and pointers are two sides of the same coin, in PGAS languages there is a close relationship between arrays and references in global address space especially in those languages rooted in C syntax, like Split-C and UPC. The syntax and semantics of references to remote memory, including pointer arithmetic, tends to follow that of normal pointers. The unique features of remote pointer access revolve around hiding of access latency. Remote accesses tend to be orders of magnitude slower than local ones. The increased latency can be partially mitigated by posting remote operations as soon as the initial conditions are met, e.g., both source data and destination buffers are ready for transfer. However, the operation need not be complete until the data is actually needed on the destination end. To implement this, Split-C features the split assignment operator, and the Berkeley UPC extensions (not part of the UPC standard) allow non-blocking remote memory operations.
Array Programming and Implicit Parallelism Array programming is a generic term describing a programming environment suitable for the processing of n-dimensional arrays. In these environments arrays are first-class citizens, allowing compact declaration and operators (unlike in conventional imperative programming languages where arrays are handled by loops). Some examples of array programming languages/environments are APL, Fortran , MATLAB, and R. The attraction of array languages is their ability to express operations on large amounts of data with few instructions. This has many benefits, including efficient programming of vector processors (e.g., Intel SSE, IBM Altivec) and graphics processors (NVIDIA GPUs), but array languages also lend themselves to explicit SPMD parallelism with the PGAS programming model. The programmer specifies the layout of array elements in distributed memory. The compiler and/or the runtime optimize array operations by staying as close as possible to the owner compute rule, i.e., scheduling computation on the CPUs closest to each array element. The execution model of pure array languages is SIMD; conceptually there is a single thread of control acting on a large amount of data. PGAS languages have
borrowed heavily from the array processing paradigm. The Fortran D and High-Performance Fortran (HPF) languages allow users to specify data layouts with the TEMPLATE and DISTRIBUTE commands. An HPF template declares a processor layout (and hence the structure of the partitioned address space). Global arrays are distributed across this template. A large set of intrinsic operators allow the concise expression of operations like shifting/transposing/summing up array slices. In Coarray Fortran, array data distribution takes the form of a co-dimension. Vector indexing in the Fortran style is permitted. The Chapel and ZPL languages offer a refinement of the Fortran /Matlab vector syntax by means of regions, or named subsets/slices of arrays: shifts, reductions, dimensional floods (i.e., broadcasts), and boundary exchanges can be expressed this way. The partitioned global address space is set up by means of distributions, an analogue of HPF templates. The Titanium programming language also follows this approach. Less conventional runtime-only approaches include the Hierarchical Tiled Arrays (HTAs) library, a pure runtime solution that provides multiple levels of data decomposition, one for each level of non-locality in a modern computer architecture. The Global Arrays toolkit also allows programmers to specify and optimize their own array layouts. The Matlab Parallel Toolbox uses the spmd keyword and specialized array distribution syntax to control data parallel execution. There is a natural affinity between array processing and parallelism. By putting arrays into global memory one transcends the memory limitations of any single CPU, while still allowing for quick access to the array from anywhere. Array operations are generally floating-point intensive, and therefore natural candidates for parallelization. The large number of operations causes more granular computation, resulting in less parallelism overhead and therefore fewer losses to Amdahl's law. Well-known parallel algorithms exist for many array operators. Some of these algorithms have good scaling properties (i.e., low cross-CPU communication requirements, good load balance) and can be coded into the supporting runtime system or even the compiler, allowing the programmer instant access to high-performance parallel array operations.
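A distribution descriptor in the spirit of HPF's DISTRIBUTE or ZPL's regions only has to answer two questions for the owner-compute rule: who owns a global index, and which indices are mine. The toy sketch below is an assumption-laden illustration, not any particular language's runtime.

class BlockDistribution:
    # Records a block layout of n global indices over nplaces address spaces.
    def __init__(self, n, nplaces):
        self.n, self.nplaces = n, nplaces
        self.block = -(-n // nplaces)          # ceiling division

    def owner(self, i):
        return i // self.block

    def my_indices(self, place):
        lo = place * self.block
        return range(lo, min(lo + self.block, self.n))

# Owner-computes: each place updates only the elements with affinity to it.
dist = BlockDistribution(n=10, nplaces=4)
a = [0.0] * 10
for place in range(4):                         # in SPMD code, 'place' = my rank
    for i in dist.my_indices(place):
        a[i] = 2.0 * i                         # any elementwise array operation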
Parallel Loops and Explicit Data Parallelism The parallel loop construct is an established way of expressing explicit parallelism; Fortran’s DOALL statement is one of the oldest such constructs. The essence of the construct is to divide the iteration space of a loop nest among processors, either statically or dynamically. OpenMP in particular is known for a wide variety of parallel loop options. Several PGAS languages have their own versions of parallel loops. Perhaps the most prominent of these is the UPC forall construct which ties execution of particular iterations to an affinity expression that can depend on the induction variable of the loop. ZPL, Chapel, X, and Titanium allow parallel loops to be run on affinity sets which implicitly determine which CPU executes what iteration.
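The affinity-driven loop can be pictured as every SPMD thread scanning the whole iteration space but executing only the iterations whose affinity expression matches it, which is roughly what UPC's forall does. The sketch below is illustrative; THREADS, the affinity expression, and the loop body are assumptions.

THREADS = 4

def forall(n, affinity, body, mythread):
    # Every thread scans 0..n-1 but runs only iterations with matching affinity.
    for i in range(n):
        if affinity(i) % THREADS == mythread:
            body(i)

# Sequentially emulate what each of the THREADS SPMD threads would execute:
for mythread in range(THREADS):
    mine = []
    forall(10, lambda i: i, mine.append, mythread)
    print(f"thread {mythread} executes iterations {mine}")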
Collectives, Teams, and Synchronization Collective operations in parallel programming languages denote operations that potentially involve more than two participants. Collective communication concepts were popularized by MPI, although basic ideas like parallel prefix are considerably older. Collectives are important in the context of parallel programming models for two major reasons. First, collective communication primitives succinctly express complex data movement operations, contributing to brevity and clarity in parallel programs. Second, because of their relatively simple and well-studied semantics, collectives are good optimization targets, resulting in improved performance and scalability. Collective operations are either pure data exchange protocols (such as broadcast, scatter, and all-to-all exchanges), or computational collectives (like reductions, where data are interpreted and recomputed during the collective). Another way to describe collectives is based on whether they have synchronizing properties. For example, the Alltoall collective causes synchronization between every pair of tasks involved, since completion of the collective involves bidirectional data dependencies on every pair. Other collectives, like Scatter, create much fewer data dependencies and therefore do not cause global synchronization. Finally, Barrier is an example of a collective that exchanges no data at all; its only purpose is to effect a synchronization.
Collective communication is further categorized by the number of participants. The simplest case is that of every task in a job participating in a collective. However, arbitrary teams of tasks (called communicators in MPI) can be set up for collective communication. There are several intriguing aspects that cause the mapping of collective communication to be nonobvious in a PGAS context. The most immediate of these involves data integrity. A natural way to think about data integrity in collective communication is as follows: Data buffers passed from the caller to a collective cannot be touched (either read or written) by the user until the collective completes. In other words, data buffers’ ownership changes when the collective is invoked and when it terminates. However, on a system with shared memory and onesided data access the invocation boundary is fuzzy. The collective may be entered at different times on different processors. For example, in the presence of one-sided communication the calling process is unable to provide a strong guarantee to the collective that the data buffer will not be touched – since other processes may not yet have entered the collective and may be in the middle of a remote update to the very buffer being processed by the collective. UPC attempts to deal with this problem by allowing the programmer to state pre- and post-conditions on the boundaries of a collective operation with regard to shared data. Just like the PGAS model extends point-to-point communication with non-blocking and split-phase transactions, a similar extension can be envisaged for collective communication. The evident advantage of non-blocking collective communication is the ability to overlap it with computation or other communication. There is a case to be made for one-sided collectives. Far from a contradiction in terms, one-sided collective communication involves the address spaces of multiple tasks, but possibly not every task participates actively in the collective. An example would be a one-sided broadcast similar to a one-sided write operation but targeting multiple address spaces. Some PGAS languages, like Coarray Fortran, feature synchronization operation on teams of tasks designated on an ad hoc basis. This extends the MPI notion in which MPI communicators are predefined
and relatively heavyweight objects. Collective communication exists in some of the PGAS languages today. Titanium features teams as well as exchange, broadcast, and reduction collectives. UPC has the usual complement of collectives but no teams. Coarray Fortran features ad hoc teams in its synchronization operations. Many array languages feature array operations that are essentially "syntactic sugar" for collective operations.
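To make the split-phase idea concrete, here is a small sketch of a non-blocking broadcast using mpi4py on top of an MPI-3 library (both the library choice and the helper function are assumptions for illustration): the collective is initiated, unrelated work is overlapped with it, and buffer ownership returns to the caller only after the completion call.

    from mpi4py import MPI
    import numpy as np

    def independent_work():
        # Computation that must not touch buf while the collective is in flight.
        return sum(i * i for i in range(100000))

    comm = MPI.COMM_WORLD
    buf = np.zeros(1000000, dtype="d")
    if comm.Get_rank() == 0:
        buf[:] = 42.0

    req = comm.Ibcast([buf, MPI.DOUBLE], root=0)  # split phase 1: initiate the collective
    overlapped = independent_work()               # overlap communication with computation
    req.Wait()                                    # split phase 2: complete; buf belongs to the caller again

    assert buf[0] == 42.0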
Memory Consistency The memory consistency model of PGAS programs deals with the effect of writing to remote memory, i.e., under what conditions a remote write becomes visible to the source, the destination, or third parties. The gold standard for memory consistency is sequential consistency: in this model, the memory behaves as if it were written by a single processor at a time. Sequential consistency is expensive to implement in a distributed-memory system because performance can be gated by the slowest write. Therefore, most PGAS languages implement a weak consistency model. For example, Coarray Fortran's consistency model is designed to avoid conflicts and to allow compiler optimization. Ordering of memory accesses made to remote locations is established explicitly by the programmer by breaking the program into ordered segments. Conflicting writes in the same segment are disallowed: the basic constraint is that if a variable is defined in a segment, it cannot be read or written by any other image in the same segment. UPC has two memory consistency modes, strict and relaxed, where strict consistency is understood to be sequential consistency. In Titanium, local dependencies are observed, and shared reads and writes performed in critical sections cannot appear to have executed outside of those sections.
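The segment rule can be mimicked in ordinary shared-memory code. The sketch below is only an analogy in Python threads (it is not PGAS code): a value defined by one worker in segment 1 is read by the other worker only in segment 2, after the explicit synchronization that separates the segments.

    import threading

    shared = {"x": 0}
    segment_boundary = threading.Barrier(2)  # plays the role of an image-control statement

    def image_1():
        shared["x"] = 42          # segment 1: only image 1 defines x
        segment_boundary.wait()   # end of segment 1, start of segment 2
        # segment 2: image 1 must not redefine x here, since image 2 reads it

    def image_2():
        segment_boundary.wait()   # wait for the segment boundary
        print("image 2 sees x =", shared["x"])  # segment 2: safe to read x

    t1 = threading.Thread(target=image_1)
    t2 = threading.Thread(target=image_2)
    t1.start(); t2.start()
    t1.join(); t2.join()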
Future Trends The future of the Partitioned Global Address Space programming model is difficult to predict. A variety of programming languages based on the model have been proposed, none meeting with universal approval. The state of the art in parallel programming continues to be MPI and OpenMP programming; it is safe to say that the PGAS programming model has not yet fulfilled its promise. The face of parallel computing is continuously changing. While single-processor performance has stopped following Moore's law, peak performance on the Top500 website continues to track an exponential
growth curve. This growth is achieved by machines with hybrid (shared and distributed memory) architectures, forcing a change in programming technology. In addition, an increasing share of high-performance programming targets hybrid architectures of another kind: dedicated accelerators based on, e.g., GPU compute engines. OpenMP and MPI face some difficulty in coping with these challenges, which may leave the field open for new software technology. Recognizing that PGAS languages are unlikely to replace MPI, the current trend is to enhance interoperability, allowing coexistence of multiple languages in the same executable. The challenge is both conceptual and practical, and includes reconciliation of the execution models, data representations, and execution semantics of different programming models. On a practical level, the trend is toward shared infrastructure with MPI and an expression of PGAS functionality through library calls, to enable a multiplicity of language implementations.
Related Entries
Array Languages
Chapel (Cray Inc. HPCS Language)
Coarray Fortran
Collective Communication
Fortress (Sun HPCS Language)
Global Arrays Parallel Programming Toolkit
HPF (High Performance Fortran)
Memory Models
MPI (Message Passing Interface)
OpenMP
SPMD Computational Model
Titanium
UPC
ZPL
Bibliography . Chamberlain BL, Choi S-E, Christopher Lewis E, Lin C, Snyder L, Weathersby D () ZPL: a machine independent programming language for parallel computers. Softw Eng ():– . The cascade high productivity language. HIPS, :–, . High Performance Fortran Forum (). High performance Fortran language specification, version .. Technical report CRPCTR, Houston . Nieplocha J, Palmer B, Tipparaju V, Krishnan M, Trease H, Apra E (). Advances, applications and performance of the global arrays shared memory programming toolkit. Int J High Perform Comput Appl :–
. Numrich RW, Reid J () Co-array fortran for parallel programming. SIGPLAN Fortran Forum ():– . Open MP () Simple, portable, scalable SMP programming. http://www.openmp.org/ . Snir M, Otto S, Huss-Lederman S, Walker D, Dongarra J. MPI-the complete reference. The MPI Core, vol . MIT Press, Cambridge, MA . Split-C website. http://www.eecs.berkeley.edu/Research/ Projects/CS/parallel/castle/split-c/ . UPC language Specification, V., May . The X programming language. http://x.sourceforge.net, . Yelick K, Semenzato L, Pike G, Miyamoto C, Liblit B, Krishnamurthy A, Hilfinger P, Graham S, Gay D, Colella P, Aiken A (). Titanium: a high-performance Java dialect. Concurrency Pract Experience (–):–
Phylogenetic Inference
See Phylogenetics
Phylogenetics Alexandros Stamatakis Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
Synonyms Molecular evolution; Phylogenetic inference; Reconstruction of evolutionary trees
Definition Phylogenetics, or phylogenetic inference (a bioinformatics discipline), deals with models and algorithms for the reconstruction of the evolutionary history – mostly in the form of a (binary) evolutionary tree – of a set of living biological organisms, based upon their molecular (DNA) sequence data or their morphological traits.
Discussion Introduction The reconstruction of phylogenetic (evolutionary) trees from molecular or morphological sequence data is a comparatively old bioinformatics discipline, given that likelihood-based statistical models for phylogenetic inference were introduced in the early s, while
discrete criteria that rely on counting changes in the sequence data date back to the late s and early s. Computationally, likelihood-based phylogenetic inference approaches represent a major challenge because of their high memory footprints and floating-point-intensive computations. The goal of phylogenetic inference is to reconstruct the evolutionary history of a set of n present-day organisms for which molecular sequence data can be obtained. In some cases it is also possible to extract ancient DNA or to establish the morphological properties (traits) of fossil records.
Input The input for a phylogenetic analysis is a list of organism names and their associated DNA or protein sequence data. Note that the DNA sequences for distinct organisms will typically have different lengths. In modern phylogenetics, instead of using the raw sequence data, a so-called multiple sequence alignment (MSA) of the molecular data of the organisms is used as input. Multiple sequence alignment is an important – generally NP-hard – bioinformatics problem. The key goal of MSA is to infer homology, that is, determine which nucleotide characters in the sequence data share a common evolutionary history. Because insertions and/or deletions of nucleotides may have occurred during the evolutionary history of the organisms (represented by their DNA sequences), such events are denoted by inserting the gap symbol - into the sequences during the MSA process. After the alignment step, all n sequences will have the same length m, that is, the MSA has m alignment columns (also called: characters, sites, positions). A simple example for an MSA of DNA data for the Human, the Mouse, the Cow, and the Chicken with n = 4 species and m = 27 sites is provided below:

Cow      ATGGCATATCCCA-ACAACTAGGATTC
Chicken  ATGGCCAACCACTCCCAACTAGGCTTA
Human    ATGGCACAT---GCGCAAGTAGGTCTA
Mouse    ATGG----CCCATTCCAACTTGGTCTA
Output The output of a phylogenetic analysis is mostly an unrooted binary tree topology. The present-day organisms under study (for which DNA data can be extracted) are assigned to the leaves (tips) of such a tree, whereas the inner nodes represent common extinct ancestors. The branch lengths of the tree represent the relative
evolutionary time between two nodes in the tree. A likelihood-based phylogenetic tree for the Human, the Mouse, the Cow, and the Chicken using the above MSA is provided in Fig. . The biological interpretation of this tree is that the Mouse and the Human are more closely related to each other than to the Cow and the Chicken.
Phylogenetics. Fig. Likelihood-based tree for the Cow, the Chicken, the Mouse, and the Human

Combinatorial Optimization In order to reconstruct a phylogenetic tree, criteria are required to assess how well a specific tree topology explains (fits) the underlying molecular sequence data. One may think of this as an abstract function f() that scores alternative tree topologies for a given, fixed MSA. Thus, the goal of a phylogenetic tree reconstruction algorithm is to find the tree topology with the best score according to f(); that is, phylogenetic inference is a combinatorial optimization problem. The algorithmic problem in phylogenetics is characterized by the number of possible distinct unrooted binary tree topologies for n organisms, which is given by ∏_{i=3}^{n} (2i − 5), that is, (2n − 5)!!. For little more than 50 organisms there already exist more possible tree topologies than there are atoms in the universe. Because of the size of the tree search space, phylogenetic inference under commonly used scoring criteria f() such as maximum likelihood or maximum parsimony is NP-hard. Therefore, a significant amount of research effort in phylogenetics has focused on the design of efficient heuristic search strategies. Moreover, the optimization of the scoring function f() by algorithmic and technical means also represents an important research objective in phylogenetics.
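For concreteness, the double factorial above is easy to evaluate exactly; a small illustrative Python sketch:

    def unrooted_topologies(n):
        """Number of distinct unrooted binary tree topologies for n >= 3
        organisms: the product of (2i - 5) for i = 3 .. n, i.e., (2n - 5)!!."""
        count = 1
        for i in range(3, n + 1):
            count *= 2 * i - 5
        return count

    print(unrooted_topologies(4))    # 3
    print(unrooted_topologies(10))   # 2027025
    print(unrooted_topologies(50))   # about 2.8e74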
Optimality Criteria The most straightforward approach to phylogenetic inference is to use distance-based methods as opposed to character-based methods (see below). Distance-based methods rely on initially building a symmetric n × n matrix D of pair-wise distances between the organisms under consideration, which is subsequently used to infer a tree. The optimality of a tree with given branch lengths is determined via a least squares method that is deployed to quantify the difference between the distances given by D and the distances induced by the tree topology (also called patristic distances). Thus, least squares optimization strives to find the tree that minimizes the difference between the pair-wise distances induced by the tree and the corresponding distances in D; this problem is known to be NP-hard. Commonly used heuristics for distance-based tree reconstruction are the Unweighted Pair Group Method with Arithmetic mean (UPGMA) and Neighbor Joining (NJ) methods. Character-based methods (parsimony and likelihood) directly operate on the sequences of the MSA. The sequences are assigned to the leaves of the tree and an overall score for the tree is computed via a post-order tree traversal with respect to a virtual root. One of the main properties of the likelihood and parsimony criteria is that the respective scores (the function f()) are invariant under the placement of such a virtual root, that is, the scores will be identical irrespective of where the virtual root is placed. Parsimony and likelihood criteria are characterized by two additional properties. (1) They assume that MSA columns have evolved independently, that is, given a fixed tree topology, one can simultaneously compute a parsimony or likelihood score for each column of the MSA. To obtain the overall tree score, the sum of all m per-column likelihood or parsimony scores (where m is the number of columns in the MSA) is computed at the virtual root. (2) Likelihood and parsimony scores are computed via a post-order tree traversal that proceeds from the tips toward the virtual root and computes ancestral sequence or ancestral probability vectors of length m at each inner node that is visited (see Fig. ).
Phylogenetics. Fig. Virtual rooting and post-order traversal of a phylogenetic tree. During the post-order traversal, ancestral state vectors are computed. The per-column parsimony or likelihood scores are summed up at the root to obtain the overall tree score
The parsimony criterion intends to minimize the number of nucleotide changes on a tree, while maximum likelihood strives to maximize the fit between the tree and the data (the MSA) using an explicit statistical model of sequence evolution. Bayesian approaches that integrate over the tree (parameter) space using (Metropolis-Coupled) Markov-Chain Monte-Carlo approaches have become popular since the late s. The underlying computational problems are similar, because, as for maximum likelihood, execution times are dominated (–%) by evaluations of the phylogenetic likelihood function. Finally, there exist methods that do not directly use molecular sequence data for phylogeny reconstruction. Instead, these methods use gene order data as input, that is, they strive to infer evolutionary relationships based on distinct arrangements of corresponding genes along the chromosome(s) of the organisms under study. In this context, two organisms are more closely related to each other, if the order of their corresponding genes has not substantially changed, that is, if their chromosomes have not been rearranged to a large extent in the course of evolution. The overall goal can be formulated as inferring a tree that explains the evolutionary history in a parsimonious way, that is, with a minimum number of gene rearrangement events. Even simple versions of this problem are NP-hard [].
Vectorization Both likelihood and parsimony computations can be vectorized at a low level. Parsimony operations for counting the number of changes in a tree can be represented as operations on bit vectors, since ancestral parsimony vectors require four bits per alignment column to denote the presence or absence of each of the DNA characters A, C, G, T. The bit-level operations that are required are: bit-wise and, bit-wise or, bit-wise nand, and population count (popcount; counting the number of bits that are set to one in a data word) operations. Using Streaming Single Instruction Multiple Data (SIMD) Extensions (SSE) vector instructions on x86 architectures, which are 128 bits wide, 32 ancestral states (128 divided by 4) can be computed during a single CPU cycle. Likelihood computations can also be vectorized, since the computation of the ancestral state at a position c, where c = 1, . . . , m, of the alignment entails computing the probabilities P(A), P(C), P(G), P(T) of observing a nucleotide A, C, G, or T at this position. At an abstract level, the phylogenetic likelihood computations for DNA data are dominated by a dense matrix–matrix multiplication of a 4 × 4 floating point matrix (the nucleotide substitution matrix) with a 4 × m floating point matrix (the ancestral probability vectors). The open-source phylogenetic inference program Randomized Axelerated Maximum Likelihood (RAxML) by Stamatakis [] provides SSE-vectorized implementations of the likelihood and parsimony functions for DNA data. Vector instructions for the likelihood function have also been deployed on the IBM CELL architecture by Blagojevic et al. []. In general, both optimality criteria allow for vectorization using wider (e.g., 256-bit) vector lengths. Auto-vectorization is not always possible because of code complexity or because the codes need to be redesigned to take advantage of vector instructions. In RAxML, for instance, intrinsic SSE functions have been used for explicit vectorization.
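The following deliberately scalar Python sketch (an illustration, not RAxML code) shows the bit-vector idea for parsimony on the four-taxon example above, using the assumed topology ((Cow, Chicken), (Human, Mouse)); production codes pack many columns into 128-bit SSE words and use popcount instead of per-column branching, but the bit logic is the same.

    # One bit per DNA character A, C, G, T; a gap is treated as "any state".
    CODE = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000, "-": 0b1111}

    def fitch_combine(left, right):
        """Fitch step at an inner node: intersection if nonempty,
        otherwise union plus one inferred change."""
        inter = left & right
        return (inter, 0) if inter else (left | right, 1)

    def fitch_column_score(states):
        """Parsimony score of one column on the tree ((t0, t1), (t2, t3))."""
        a, ca = fitch_combine(states[0], states[1])
        b, cb = fitch_combine(states[2], states[3])
        _, cr = fitch_combine(a, b)  # virtual root on the central branch
        return ca + cb + cr

    def parsimony_score(alignment):
        """Columns are independent; the overall score is the per-column sum."""
        return sum(fitch_column_score([CODE[c] for c in column])
                   for column in zip(*alignment))

    msa = ["ATGGCATATCCCA-ACAACTAGGATTC",   # Cow
           "ATGGCCAACCACTCCCAACTAGGCTTA",   # Chicken
           "ATGGCACAT---GCGCAAGTAGGTCTA",   # Human
           "ATGG----CCCATTCCAACTTGGTCTA"]   # Mouse
    print("parsimony score:", parsimony_score(msa))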
Distance-based methods can in principle also be vectorized, but the specific strategy depends on the function used to compute the pair-wise distances between sequences and also on the heuristic search strategy that is deployed.
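As a deliberately naive illustration of such a distance function, the sketch below computes uncorrected p-distances (the fraction of differing non-gap columns) and fills the symmetric matrix D; real analyses would use a model-corrected distance before handing D to NJ or UPGMA.

    def p_distance(s1, s2):
        """Fraction of alignment columns at which two aligned sequences differ,
        ignoring columns in which either sequence has a gap."""
        pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
        return sum(a != b for a, b in pairs) / len(pairs) if pairs else 0.0

    def distance_matrix(alignment):
        """Symmetric n x n matrix of pair-wise p-distances."""
        n = len(alignment)
        D = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):   # the entries are mutually independent
                D[i][j] = D[j][i] = p_distance(alignment[i], alignment[j])
        return D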
Fine-Grain Parallelization As outlined in Fig. , the computations of per-site parsimony or likelihood scores are completely independent of each other until the virtual root is reached. Given an MSA with m sites, this means that all per-site scores can be computed simultaneously and in parallel. The only limitation is that, to obtain the overall score for the tree, the per-site scores need to be accumulated via a parallel reduction operation when the virtual root is reached. The parallel efficiency of this approach depends on the speed of reduction operations in a parallel system and on the number of sites m in the MSA. Generally, scalability increases with m, since a large m will yield a more favorable computation-to-synchronization (via a reduction operation) ratio. Both vectorization and fine-grain parallelization approaches for the likelihood and parsimony criteria are independent of the search strategy used. To date, fine-grain parallelism has mainly been adopted by likelihood-based programs, since the likelihood function has significantly higher memory and computational requirements than parsimony. In terms of parallel programming paradigms, Open Multi-Processing (OpenMP), POSIX threads (Pthreads), the Message Passing Interface (MPI), and the Compute Unified Device Architecture (CUDA) have been deployed for exploring fine-grain parallelism. The following types of parallel computer architectures or accelerator devices have been used for phylogenetic likelihood computations to date: Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), the PlayStation 3 and the IBM CELL processor, Symmetric Multi-Processors (SMPs), multi-core architectures, clusters of SMPs with InfiniBand and Gigabit-Ethernet interconnects, the massively parallel IBM BlueGene/L, and a shared-memory Silicon Graphics (SGI) Altix supercomputer.
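A minimal sketch of this fork-join pattern, with a dummy stand-in for the per-site kernel (the function site_score below is an assumption for illustration, not an actual likelihood implementation): the m per-site values are computed independently by a pool of workers and then accumulated in a single reduction.

    from multiprocessing import Pool
    import math

    def site_score(site_index):
        # Stand-in for the per-column likelihood or parsimony kernel;
        # each call is independent of all the others.
        return math.log(1.0 + site_index)

    def tree_score(m, workers=4):
        with Pool(processes=workers) as pool:
            per_site = pool.map(site_score, range(m))  # independent per-site work
        return sum(per_site)                           # reduction at the virtual root

    if __name__ == "__main__":
        print(tree_score(m=10000))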
Medium-Grain Parallelization Medium-grain parallelization refers to parallelizing the search algorithms of parsimony- or likelihood-based methods and is also called inference parallelism. However, because of their diversity and complexity, the parallelization of the steps of heuristic search algorithms is a nontrivial task. Moreover, every such parallelization needs to be highly algorithm-specific and is thereby not generally applicable to other phylogenetic inference programs. An additional problem is that many modern search algorithms, as implemented for instance by Ronquist and Huelsenbeck in MrBayes [], by Zwickl in GARLI [], by Guindon and Gascuel in PHYML [], by Stamatakis in RAxML [], or by Goloboff in TNT [], are characterized by hard-to-resolve sequential dependencies. However, other search algorithms, like IQPNNI by Minh et al. [] and TreePuzzle by Strimmer et al. [], are straightforward to parallelize at this level.
Coarse-Grain Parallelization Likelihood-based and distance-based phylogenetic inference algorithms exhibit relatively easy-to-exploit sources of coarse-grain/embarrassing parallelism, which are discussed below.
Coarse-Grain Parallelism in Distance-Based Analyses Parallelization of distance-based analyses is straightforward. The execution times of distance-based analyses are dominated by the computation of the n × n symmetric distance matrix D that contains the pair-wise distances between all n organisms. All entries of D are independent of each other and can hence be computed in parallel. Coarse-Grain Parallelism in Maximum Likelihood Analyses There are two sources of coarse-grain parallelism in ML analyses. One can conduct a number of completely independent tree searches, starting from different reasonable, that is, nonrandom, starting trees. Such reasonable starting trees can, for instance, be obtained by using simpler (and less computationally intensive)
methods such as neighbor joining or maximum parsimony. Given a collection of distinct, nonrandom starting trees, individual ML tree searches can be conducted to find the best-scoring (remember that ML optimization is NP-hard) tree. Thereby, one may find a tree with a better likelihood score than by conducting a single search. The same technique can also be applied to parsimony searches for finding the most parsimonious tree. The second source of embarrassing parallelism in ML phylogenetic analyses is the phylogenetic bootstrapping procedure that was proposed by Felsenstein []. Phylogenetic bootstrapping serves as a mechanism for inferring support values, that is, for assigning confidence values to inner branches of a phylogenetic tree. Those confidence values at the inner branches are usually interpreted as the certainty that a particular evolutionary split of the set of organisms has been correctly inferred. Cutting the tree into two parts at an inner branch generates a split (also called bipartition) of the organisms into two disjoint sets. Therefore, the goal consists of obtaining support values for all possible splits/bipartitions induced by the internal branches of a tree. The phylogenetic bootstrap procedure works as follows: Initially, the input alignment is perturbed by drawing columns/sites (with replacement) at random to assemble a bootstrap replicate alignment of length m. Thus, a bootstrapped alignment has the same number m of sites/columns as the original alignment, but exhibits a different site composition. This re-sampling process is repeated many times, typically on the order of 100–1,000 times, that is, 100–1,000 bootstrap replicate alignments are generated. When those bootstrapped alignments have been generated, one applies the phylogenetic inference method of choice to infer a tree for each of the replicates. Thereby, as many trees as there are bootstrap replicates are generated. This set of bootstrap trees is then used to answer the question: How stable is the tree topology under slight alterations of the input data? The collection of bootstrapped trees is either used to compute a consensus tree, that is, the "average" tree topology, or for drawing support values on the best-known ML tree obtained on the original – nonbootstrapped – alignment. In the latter case, one just needs to count how frequently each bipartition of the
best-known ML tree occurs in the set of bootstrapped trees. The bootstrapping procedure is embarrassingly parallel because the tree searches on individual bootstrap replicates are completely independent of each other and can be parallelized by executing the individual bootstrap inferences on a cluster, grid, or cloud. A question that naturally arises in this context is: How many bootstrap replicates are required to obtain reliable support values? Hedges [] proposed a theoretical upper bound for phylogenetic bootstrapping. The number of required bootstrap replicates also appears to depend on the input data. Therefore, adaptive criteria as proposed by Pattengale et al. [] may be well suited to determine a sufficient number of bootstrap replicates.
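The resampling step itself is simple; a sketch, assuming the MSA is stored as a list of equally long strings (one per organism):

    import random

    def bootstrap_replicate(alignment, rng):
        """One bootstrap replicate: draw m columns with replacement."""
        m = len(alignment[0])
        cols = [rng.randrange(m) for _ in range(m)]
        return ["".join(seq[c] for c in cols) for seq in alignment]

    def bootstrap_replicates(alignment, count, seed=42):
        rng = random.Random(seed)
        return [bootstrap_replicate(alignment, rng) for _ in range(count)]

Each replicate is then analyzed completely independently, for example by submitting one tree inference per replicate as a separate cluster job.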
Coarse-Grain Parallelism in Bayesian Analyses Bayesian analyses exhibit a source of coarse-grain parallelism at the level of executing multiple chains in parallel, that is, using the Metropolis-Coupled Markov-Chain Monte-Carlo (MC3) approach (see Metropolis et al. []). In a typical program run of the widely used MrBayes code by Ronquist and Huelsenbeck [] for Bayesian phylogenetic inference, the program will use three heated Markov chains that accept more radical moves in parameter space (this entails topological moves as well as moves to sample other parameters of the likelihood model) and a cold chain that only accepts more conservative moves. The cold chain operates on the tree with the currently best likelihood. However, one of the heated chains may at some point encounter a tree with a higher likelihood than the cold chain. In this case, the cold chain and the heated chain with the better tree need to exchange states, that is, the cold chain will become a heated chain and the heated chain will become the cold chain. Therefore, while those four chains (three heated chains and one cold chain) may be started as independent Message Passing Interface (MPI) processes, the chains will need to be synchronized after a certain number of generations (proposals) to assess whether states need to be exchanged between chains. Thus, parallel load balance is a critical issue, because the chains will need to run at similar speeds in order to minimize synchronization delays. On a homogeneous cluster (equipped with a single CPU type) this
is not problematic, because the average execution times for a fixed number of proposals are expected to be very similar. This good expected load balance is due to the fact that the same mix of proposal types (tree proposals, model parameter proposals) will, on average, be executed in each chain. Finally, completely independent runs can be executed in an embarrassingly parallel manner.
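A toy sketch of this MC3 control flow on a one-dimensional target density (real proposals operate on trees and model parameters, which is beyond the scope of this sketch): all chains advance independently for a fixed number of generations and are then synchronized to decide whether the cold chain should swap states with a heated chain.

    import math
    import random

    def log_target(x):
        return -0.5 * x * x  # stand-in for the log likelihood of a tree

    def accept(log_ratio, rng):
        return rng.random() < math.exp(min(0.0, log_ratio))

    def mcmc_step(x, beta, rng):
        """One Metropolis step at inverse temperature beta (heated chains have beta < 1)."""
        y = x + rng.gauss(0.0, 1.0)
        return y if accept(beta * (log_target(y) - log_target(x)), rng) else x

    def mc3(n_sync=100, gens_per_sync=1000, betas=(1.0, 0.8, 0.6, 0.4), seed=1):
        rng = random.Random(seed)
        states = [0.0] * len(betas)   # chain 0 (beta = 1.0) is the cold chain
        for _ in range(n_sync):
            # Independent phase: in a parallel run, one chain per MPI process or core.
            for i, beta in enumerate(betas):
                for _ in range(gens_per_sync):
                    states[i] = mcmc_step(states[i], beta, rng)
            # Synchronization point: propose a state swap between the cold chain
            # and one randomly chosen heated chain.
            j = rng.randrange(1, len(betas))
            log_ratio = (betas[0] - betas[j]) * (log_target(states[j]) - log_target(states[0]))
            if accept(log_ratio, rng):
                states[0], states[j] = states[j], states[0]
        return states[0]              # the cold chain carries the usable sample

    print("final cold-chain state:", mc3())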
Phylogenetics Today Phylogenetics currently faces two main challenges. One major challenge is the significant advances in wet-lab sequencing technologies that have led to an unprecedented molecular data "flood." In fact, the amount of publicly available molecular data increases at a significantly higher speed than the number of transistors according to Moore's law (see Goldman and Yang [] for a respective plot). To this end, researchers today do not use sequence data from a single gene or just a couple of genes to reconstruct trees for the organisms they study. Instead, they use data from several hundreds or even thousands of genes (e.g., Hejnol et al. []). Thus, the number of sites m in alignments has grown from thousands to over a million. This transition from single-/few-gene phylogenies to many-gene phylogenies is also reflected by the increased use of the term phylogenomics, that is, phylogenetic inference at (almost) the whole-genome level. In current phylogenomic analyses the number of organisms n is typically on the order of tens to a few hundred, but given the constant innovations in wet-lab sequencing, this number may soon increase by one order of magnitude. Because of sequencing innovations, input datasets are also growing with respect to n, that is, analyses using up to ten genes for many thousands of organisms are becoming more common. To date, the largest published likelihood-based trees contained tens of thousands of organisms (see Smith and Donoghue []), while the largest published parsimony-based tree contained over 70,000 organisms (see Goloboff et al. []). The general growth in dataset sizes poses problems with respect to the memory requirements of phylogenetic analyses, in particular with respect to likelihood-based approaches. Memory footprints of many gigabytes for computing the likelihood on a single tree are not uncommon anymore. The highest memory footprint the author is aware of was reported for a likelihood-based analysis
of mammalian genomes (Ziheng Yang, personal communication, April ). Memory-related problems also affect distance-based methods (for large n) because of the space requirements of the n × n distance matrix (Wheeler [] and Price et al. [] discuss some technical solutions to this problem). Apart from memory-related problems, the computation of trees with tens of thousands of organisms also poses novel algorithmic challenges. Finally, the increasing complexity of the statistical models that are used for phylogenetic inference, such as mixture models (e.g., Lartillot and Philippe []), further increases the computational complexity of likelihood-based approaches. The second major challenge is the emergence of multi-core systems at the desktop level and of accelerator architectures such as GPUs (Graphics Processing Units) or the IBM Cell. Those new parallel architectures pose challenges regarding the parallelization of the likelihood or parsimony functions. Given the aforementioned high memory footprints, a parallelization at a fine-grain level, that is, multiple processors/cores working together to compute the score on a single tree, is required. In order to be of value and to be used by the biological user community, such parallelizations need to be readily available at the production level and also need to be easy to install and use. Developers of phylogenetics software, which has come of age by now, are also increasingly facing software engineering issues, because the codes have become more complex over recent years. Programs such as RAxML or MrBayes provide a plethora of substitution models and search algorithms. They also allow for analyzing different data types, for instance, binary, morphological, DNA, RNA secondary structure (see Savill et al. [] for a discussion of RNA secondary structure evolution models), or protein data. Therefore, the phyloinformatics and HPC communities are facing an unprecedented challenge in trying to keep pace with data accumulation and parallel architecture innovations to provide ever more scalable and powerful analysis tools. HPC in phylogenetics has thus become a key to the success of the field.
Future Directions One of the major future challenges in phylogenetics consists of efficiently exploiting parallel architectures to compute the likelihood or parsimony scoring
functions at the production level. There exists an ongoing effort to implement an open-source likelihood function library (http://code.google.com/p/beagle-lib/) that can be executed on GPUs and multi-core architectures (including a vectorization using SSE intrinsics and an OpenMP-based fine-grain parallelization). Fine-grain parallelizations of those functions are required to be able to accommodate and distribute the growing memory space requirements across several cores or – potentially hybrid – multi-core nodes. Nonetheless, alternative directions for handling memory consumption should be explored. While there exist suggestions for reducing memory requirements via algorithmic means (see Stamatakis and Ott [] and Stamatakis and Alachiotis []), trade-offs between using single and double precision arithmetic as well as their impact on numerical stability need to be further explored. Out-of-core execution may provide a solution for executing large-scale analyses on the desktop. Analyses of trees with tens of thousands of organisms require a – potentially difficult – parallelization at the algorithmic level, because scalability of fine-grain parallelism is limited to a relatively small number of cores due to the relatively small number of sites in such alignments. Finally, the simultaneous development of parallelization and algorithmic strategies appears to be the most promising approach to address current and future challenges.
Related Entries
Bioinformatics
Cell Broadband Engine Processor
Clusters
Collective Communication
Distributed-Memory Multiprocessor
Ethernet
Genome Assembly
Hybrid Programming With SIMPLE
IBM Blue Gene Supercomputer
InfiniBand
Load Balancing, Distributed Memory
Loops, Parallel
MPI (Message Passing Interface)
Multi-Threaded Processors
NVIDIA GPU
OpenMP
Parallelization, Automatic
POSIX Threads (Pthreads)
Reconfigurable Computers
Bibliographic Notes and Further Reading The textbooks by Felsenstein [] and Yang [] provide a detailed introduction to the field of phylogenetic inference and the underlying models of evolution. The phylogenetic likelihood function was introduced by Felsenstein in []. One of the early papers dealing with parsimony was published by Fitch and Margoliash []. NP-hardness was demonstrated by Day [] for the least squares approach, by Day et al. for parsimony [], and by Chor and Tuller for likelihood [] (also see Roch [] for a shorter proof). Morrison [] provides an overview and discussion of current heuristic search strategies. The vectorization of the likelihood function as well as numerical aspects with respect to single and double precision implementations of the likelihood function are addressed by Stamatakis and Berger []. FPGA implementations are covered by Alachiotis et al. [, ], Mak and Lam [–], Zierke and Bakos [], and Bakos [, ]; GPU implementations by Charalambous et al. [] and Suchard and Rambaut [], while Pratas et al. describe a performance comparison between GPUs, the IBM CELL, and general-purpose multi-core architectures []. Blagojevic et al. have published a series of papers on porting RAxML to the IBM CELL [–, ]. Pthreads- and OpenMP-based parallelizations of the parsimony and likelihood functions have been described by Stamatakis et al. [, ]. Ott et al. [, ] and Feng et al. [] describe fine-grain parallelizations with MPI on distributed memory machines. Stamatakis and Ott also address load-balance issues [] and assess the performance of Pthreads versus MPI versus OpenMP for fine-grained parallelism in the likelihood function []. Medium-grain parallelizations are covered by Minh et al. [], Stamatakis et al. [], Stewart et al. [], and Ceron et al. []. Hybrid parallelizations have been described by Minh et al. [], Feng et al. [], and Pfeiffer and Stamatakis []. The two post-analysis steps for analyzing bootstrapped trees (consensus tree building and drawing bipartitions on the best-known tree) have also been parallelized (see Aberer et al. [, ]; the papers also
contain a detailed description of the discrete algorithms for consensus tree building and drawing bipartitions on trees). A related discrete problem on trees, that of reconstructing a species tree from (potentially incongruent) per-gene trees by the minimum possible number of gene duplication events, has been parallelized by Wehe et al. [] on an IBM BlueGene/L supercomputer. The review by Maddison [] provides an overview of the gene tree/species tree problem. Heuristic algorithms for gene order phylogeny reconstruction have been implemented and optimized by Moret et al. [, ] in a tool called GRAPPA. GRAPPA has also been parallelized using a coarse-grain approach to simultaneously enumerate and evaluate all possible trees for a small number of organisms on a cluster. The original heuristic algorithm was proposed by Blanchette et al. []. Finally, the papers by Fleissner et al. [], Loytynoja and Goldman [], or Bradley et al. [], for instance, deal with the more challenging and advanced problem of simultaneous alignment (MSA) and tree building.
Bibliography . Aberer A, Pattengale N, Stamatakis A () Parallel computation of phylogenetic consensus trees. Procedia Comput Sci (): – . Aberer A, Pattengale N, Stamatakis A () Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci ():– . Alachiotis N, Sotiriades E, Dollas A, Stamatakis A () Exploring FPGAs for accelerating the phylogenetic likelihood function. In: IEEE international symposium on parallel & distributed processing, . IPDPS , pp –. IEEE . Alachiotis N, Stamatakis A, Sotiriades E, Dollas A () A reconfigurable architecture for the Phylogenetic Likelihood Function. In: International Conference on Field Programmable Logic and Applications, . FPL , pp –. IEEE, . Bakos J () FPGA acceleration of gene rearrangement analysis. In: Proceedings of th annual IEEE symposium on fieldprogrammable custom computing machines. IEEE, Napa, CA, pp – . Bakos J, Elenis P, Tang J () FPGA acceleration of phylogeny reconstruction for whole genome data. In: Proceedings of the th IEEE international conference on bioinformatics and bio engineering. IEEE, Boston, MA, pp – . Berger S, Stamatakis A () Accuracy and performance of single versus double precision arithmetics for maximum likelihood phylogeny reconstruction. Lecture notes in computer science, vol . Springer, pp – . Blagojevic F, Nikolopoulos D, Stamatakis A, Antonopoulos C () Dynamic multigrain parallelization on the cell broadband engine. In: Proceedings of PPoPP , San Jose, CA, March , pp –
. Blagojevic F, Nikolopoulos D, Stamatakis A, Antonopoulos C, Curtis-Maury M () Runtime scheduling of dynamic parallelism on accelerator-based multi-core systems. Parallel Comput :– . Blagojevic F, Nikolopoulos DS, Stamatakis A, Antonopoulos CD () RAxML-Cell: Parallel phylogenetic tree inference on the cell broadband engine. In: Proceedings of international parallel and distributed processing symposium (IPDPS), . Blanchette M, Bourque G, Sankoff D () Breakpoint phylogenies. In: Miyano S, Takagi T (eds) Workshop on genome informatics, vol . Univ. Academy Press, pp – . Bradley R, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L () Fast statistical alignment. PLoS Comput Biol ():e . Bryant D () The complexity of the breakpoint median problem. Technical report, University of Montreal, Canada . Ceron C, Dopazo J, Zapata E, Carazo J, Trelles O () Parallel implementation of DNAml program on message-passing architectures. Parallel Comput (–):– . Charalambous M, Trancoso P, Stamatakis A () Initial experiences porting a bioinformatics application to a graphics processor. Lecture notes in computer science, vol . Springer, New York, pp – . Chor B, Tuller T () Maximum likelihood of evolutionary trees: hardness and approximation. Bioinformatics ():– . Day W () Computational complexity of inferring phylogenies from dissimilarity matrices. Bulletin of Mathematical Biology ():– . Day W, Johnson D, Sankoff D () The computational complexity of inferring rooted phylogenies by parsimony. Mathematical biosciences (–): . Felsenstein J () Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol :– . Felsenstein J () Confidence limits on phylogenies: an approach using the bootstrap. Evolution ():– . Felsenstein J () Inferring phylogenies. Sinauer Associates, Sunderland . Feng X, Cameron K, Sosa C, Smith B () Building the tree of life on terascale systems. In: Proceedings of international parallel and distributed processing symposium (IPDPS), . Fitch W, Margoliash E () Construction of phylogenetic trees. Science ():– . Fleissner R, Metzler D, Haeseler A () Simultaneous statistical multiple alignment and phylogeny reconstruction. Syst Biol :– . Goldman N, Yang Z () Introduction. statistical and computational challenges in molecular phylogenetics and evolution. Philos Trans R Soc B Biol Sci (): . Goloboff P () Analyzing large data sets in reasonable times: solution for composite optima. Cladistics :– . Goloboff PA, Catalano SA, Mirande JM, Szumik CA, Arias JS, Källersjö M, Farris JS () Phylogenetic analysis of taxa corroborates major eukaryotic groups. Cladistics :– . Guindon S, Gascuel O () A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol ():–
Phylogenetics
. Hedges S () The number of replications needed for accurate estimation of the bootstrap P value in phylogenetic studies. Mol Biol Evolution ():– . Hejnol A, Obst M, Stamatakis A, Ott M, Rouse G, Edgecombe G, Martinez P, Baguna J, Bailly X, Jondelius U, Wiens M, Müller W, Seaver E, Wheeler W, Martindale M, Giribet G, Dunn C () Rooting the bilaterian tree with scalable phylogenomic and supercomputing tools. Proc R Soc B :– . Lartillot N, Philippe H () A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process.Mol Biol Evol ():– . Loytynoja A, Goldman N () Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science (): . Maddison W () Gene trees in species trees. Syst Biol (): . Mak T, Lam K () High speed GAML-based phylogenetic tree reconstruction using HW/SW codesign. In: Bioinformatics Conference, . CSB . Proceedings of the IEEE, pp – . Mak T, Lam K () Embedded computation of maximumlikelihood phylogeny inference using platform FPGA. In: Proceedings of IEEE Computational Systems Bioinformatics Conference (CSB ), pp – . Mak T, Lam K () FPGA-Based Computation for Maximum Likelihood Phylogenetic Tree Evaluation. In: Lecture notes in computer science, pp – . Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E et al () Equation of state calculations by fast computing machines. J Chem Phys (): . Minh B, Vinh L, Haeseler A, Schmidt H () pIQPNNI: parallel reconstruction of large maximum likelihood phylogenies. Bioinformatics ():– . Minh B, Vinh L, Schmidt H, Haeseler A () Large maximum likelihood trees. In: Proceedings of the NIC Symposium , pp – . Moret B, Tang J, Wang L, Warnow T () Steps toward accurate reconstructions of phylogenies from gene-order data∗ . J Comput Syst Sci ():– . Moret B, Wyman S, Bader D, Warnow T, Yan M () A new implementation and detailed study of breakpoint analysis. In: Pacific symposium on biocomputing :– . Morrison D () Increasing the efficiency of searches for the maximum likelihood tree in a phylogenetic analysis of up to nucleotide sequences. Syst Biol ():– . Ott M, Zola J, Aluru S, Johnson A, Janies D, Stamatakis A () Large-scale phylogenetic analysis on current HPC architectures. Scientific Programming (–):– . Ott M, Zola J, Aluru S, Stamatakis A () Large-scale maximum likelihood-based phylogenetic analysis on the IBM BlueGene/L. In: Proceedings of IEEE/ACM Supercomputing Conference (SC), IEEE, Reno, Nevada . Pattengale N, Alipour M, Bininda-Emonds O, Moret B, Stamatakis A () How many bootstrap replicates are necessary? J Comput Biol ():– . Pfeiffer W, Stamatakis A () Hybrid MPI/Pthreads parallelization of the RAxML phylogenetics code. In: IEEE
International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), , IEEE, Atlanta, Georgia, pp – Pratas F, Trancoso P, Stamatakis A, Sousa L () Fine-grain Parallelism using multi-core, Cell/BE, and GPU systems: accelerating the phylogenetic likelihood function. In: International conference on parallel processing, . ICPP’, IEEE, Vienna, pp – Price M, Dehal P, Arkin A () FastTree –approximately maximumlikelihood trees for large alignments. PLoS One ():e Roch S () A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM transactions on Computational Biology and Bioinformatics, pp – Ronquist F, Huelsenbeck J () MrBayes : Bayesian phylogenetic inference under mixed models. Bioinformatics (): – Savill NJ, Hoyle DC, Higgs PG () Rna sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum-likelihood methods. Genetics :– Smith S, Donoghue M () Rates of molecular evolution are linked to life history in flowering plants. Science ():– Stamatakis A () RAxML-VI-HPC: maximum likelihoodbased phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics ():– Stamatakis A, Alachiotis N () Time and memory efficient likelihoodbased tree searches on phylogenomic alignments with missing data. Bioinformatics ():i Stamatakis A, Blagojevic F, Antonopoulos CD, Nikolopoulos DS () Exploring new search algorithms and hardware for phylogenetics: RAxML meets the IBM Cell. J VLSI Sig Proc Syst ():– Stamatakis A, Ludwig T, Meier H () Parallel inference of a .-taxon phylogeny with maximum likelihood. In: Proceedings of Euro-Par , September , IEEE, Pisa Italy, pp – Stamatakis A, Ott M () Efficient computation of the phylogenetic likelihood function on multi-gene alignments and multicore architectures. Philos Trans R Soc B, Biol Sci :– Stamatakis A, Ott M () Exploiting fine-grained parallelism in the phylogenetic likelihood function with MPI, Pthreads, and OpenMP: a performance study. In: Chetty M, Ngom A, Ahmad S (eds) PRIB, Lecture notes in computer science, vol . Springer, Heidelberg, pp – Stamatakis A, Ott M () Load balance in the phylogenetic likelihood kernel. In: International conference on parallel processing, . ICPP’, IEEE, Vienna, Austria, pp – Stamatakis A, Ott M, Ludwig T () RAxML-OMP: an efficient program for phylogenetic inference on SMPs. Lecture notes in computer science, vol . Springer, Berlin, Heidelberg, pp – Stewart C, Hart D, Berry D, Olsen G, Wernert E, Fischer W () Parallel implementation and performance of fastDNAml – a program for maximum likelihood phylogenetic inference. In: Supercomputing, ACM/IEEE conference, ACM/IEEE, Denver, Colorado, pp –
. Strimmer K, Haeseler A () Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol :– . Suchard M, Rambaut A () Many-core algorithms for statistical phylogenetics. Bioinformatics (): . Wehe A, Chang W, Eulenstein O, Aluru S () A scalable parallelization of the gene duplication problem. J Parallel Distr Comput ():– . Wheeler T () Large-scale neighbor-joining with ninja. Lecture notes in computer science, vol . Springer, Berlin, pp – . Yang Z () Computational molecular evolution. Oxford University Press, USA . Zierke S, Bakos J () FPGA acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods. BMC Bioinformatics (): . Zwickl D () Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. PhD thesis, University of Texas at Austin, April
Pi-Calculus Davide Sangiorgi Universita’ di Bologna, Bologna, Italy
Synonyms Calculus of mobile processes
Definition The π-calculus is a process calculus that models mobile systems, i.e., systems with a dynamically changing communication topology. It refines the constructs of the calculus of communicating systems (CCS) by allowing the exchange of communication links. Ideas from the λ-calculus have also been influential.
Discussion Introduction A widely recognized practice for understanding programming languages, be they sequential or concurrent, is to distill small "core languages," or "calculi," that embody the essential ingredients of the languages. This is useful for developing the theory of a programming language (e.g., techniques for static analysis, behavioral specification, and verification), for studying implementations, and for devising new or better programming language constructs.
A well-known calculus in the realm of sequential languages is the λ-calculus. Invented by Church in the 1930s, it is a pure calculus of functions. Everything in the λ-calculus is a function, and computation is function application. The λ-calculus has been very influential in programming languages. It has been the basis for the so-called functional programming languages. Even more important, it has been essential in understanding concepts such as procedural abstraction and binding, parameter passing (e.g., call-by-name vs. call-by-value), and continuations. All these concepts have then made their way into mainstream programming languages. The λ-calculus can be regarded as a canonical model for sequential computations with functions: on the one hand, it describes the essential features of functions (function definition and function application) in a simple and intuitive way; on the other hand, all the known models of sequential computation have been proved to have the same expressive power as the λ-calculus. In other words, the λ-calculus describes all computable functions (a statement referred to as "Church's thesis"). For concurrent languages (possibly including distribution), a similar canonical model does not exist. This can be explained by the variety of such systems. In concurrency, the central computational unit is a process. However, there is no universally accepted definition of what a process is. For instance, the ways of conceiving interaction among processes may be very diverse: via synchronous signals, via shared variables, via broadcasting, via asynchronous message passing, via rendezvous. Therefore, in concurrency one does not find a single calculus, but, rather, a variety of calculi, referred to as process calculi, or process algebras when one wishes to emphasize that both the analysis and the description of a system are carried out in an algebraic setting. The best known such calculi are the calculus of communicating systems (CCS), communicating sequential processes (CSP), the algebra of communicating processes (ACP), and the π-calculus. Among these, the π-calculus is the closest to the λ-calculus. Many fundamental ideas from the λ-calculus can be carried over to the π-calculus. For instance, abstraction and type systems are central concepts in both calculi. Moreover, just as the λ-calculus underlies functional programming languages, so the π-calculus, or variants of it, are taken as the basis for new programming languages. The process concept of
the π-calculus is roughly that of structured entities interacting by means of message-passing. The π-calculus has two aspects. First, it is a theory of mobile systems, with a rich blend of techniques for reasoning about the behavior of systems. Second, it is a general model of computation, in which everything is a process and computation is process interaction. Syntactically, the main difference between the π-calculus and its principal process calculus ancestor, CCS, is that communication links are first-class values, and may therefore be passed around in communications. Thus, the set of communication links used by a term may change dynamically, as computation evolves. Such systems are called mobile, and the π-calculus is hence called a calculus of mobile processes. In CCS, in contrast, the set of communication links for a term is fixed by its initial syntax, and only very limited forms of mobility can be described. Mobility gives the π-calculus an amazing expressive power. For instance, functions and the λ-calculus can be elegantly modeled in it, reducing function application to special forms of process interactions.
Mobility A mobile concurrent system has a communication topology that can dynamically change, as the system evolves. Mobility can be found in many areas of computer science, such as operating systems, distributed computing, higher-order concurrent programming, and object-oriented programming. Two kinds of mobility can be broadly distinguished. In one kind, links move in an abstract space of linked processes. For example, hypertext links can be created, can be passed around, and can disappear; the connections between cellular telephones and a network of base stations can change as the telephones are carried around; and references can be passed as arguments of method invocations in object-oriented systems. In the second kind of mobility, processes move in an abstract space of linked processes. For instance, code can be sent over a network and put to work at its destination; mobile devices can acquire new functionality; an active laptop computer can be moved from one location to another, and made to interact with resources in a different environment. Languages in which terms of the language itself, such as processes, are first-class values, are called higher-order (in this sense, the λ-calculus is higher-order too).
The π-calculus models the first kind of mobility: it directly expresses movement of links in a space of linked processes. There are two kinds of basic entity in the (untyped) π-calculus: names and processes. Names specify links. Processes can interact by using names that they share. The crux is that the data that processes communicate in interactions are themselves names, and a name received in one interaction can be used to participate in another. By receiving a name, a process can acquire a capability to interact with processes that were previously unknown to it. Thus, the structure of a system – the connections among its component processes – can change over time, in arbitrary ways. The source of strength in the π-calculus is how it treats scoping of names and extrusion of names from their scopes. There are various reasons for having only mobility of names in the π-calculus. First, naming and name-passing are ubiquitous: think of addresses, identifiers, links, pointers, and references. Second, as will be shown later, higher-order constructs can often be modeled within the π-calculus. Further, by passing a name, one can pass partial access to a process, an ability to interact with it only in a certain way. Similarly, with name-passing one can easily model sharing, for instance of a resource that can be used by different sets of clients at different times. It can be complicated to model these things when processes are the only transmissible values. Third, the π-calculus has a rich and tractable theory. The theory of process-passing is harder, and important parts of it are not yet well understood.
Syntax Like the λ-calculus, the language of the π-calculus consists of a small set of primitive constructs. In the λ-calculus, they are constructs for building functions. In the π-calculus, they are constructs for building processes, similar to those one finds in CCS. The syntax is as follows. Capital letters range over processes, and small letters over names; the set of all names is infinite.

P ::= x(y). P  ∣  x⟨y⟩. P  ∣  τ. P  ∣  0  ∣  P + P′  ∣  P ∣ P′  ∣  νx P  ∣  !P  ∣  [x = y]P
The input prefix x(z). P can receive any name via x and continue as P with the received name substituted for z. The substitution of all occurrences of z in
P with y is written P{y/z} (some care is needed here to properly respect the bindings of a term). An output x⟨y⟩. P emits name y at x and then continues as P. The unobservable prefix τ. P can autonomously evolve to P, without the help of the external environment. As in CCS, τ can be thought of as expressing an internal action of a process. The inactive process 0 can do nothing. For instance, x(z). z⟨y⟩. 0 can receive any name via x, send y via the name received, and become inactive. A choice (or sum) process P + P′ may evolve either as P or as P′; if one of the processes exercises one of its capabilities, the other process is discarded. In the composition P ∣ P′, the components P and P′ can proceed independently and can interact via shared names. For instance, (x(z). z⟨y⟩. 0 + w⟨v⟩. 0) ∣ x⟨u⟩. 0 has four capabilities: to receive a name via x, to send v via w, to send u via x, and to evolve invisibly as an effect of an interaction between its components via the shared name x. In the restriction νz P, the scope of the name z is restricted to P. Components of P can use z to interact with one another but not with other processes. For instance, νx ((x(z). z⟨y⟩. 0 + w⟨v⟩. 0) ∣ x⟨u⟩. 0) has only two capabilities: to send v via w, and to evolve invisibly as an effect of an interaction between its components via x. The scope of a restriction may change as a result of interaction between processes. This important feature of the calculus will be explained later. (The π-calculus notation for the restriction operator is different from that of CCS; indeed, exchange of names makes restriction in the π-calculus semantically quite different from restriction in CCS.) The replication !P can be thought of as an infinite composition P ∣ P ∣ ⋯ or, equivalently, a process satisfying the equation !P = P ∣ !P. Replication is the operator that makes it possible to express infinite behaviors. For example, !x(z). !y⟨z⟩. 0 can receive names via x repeatedly, and can repeatedly send via y any name it does receive. In certain presentations, recursive process definitions are used in place of replication: the expressiveness injected into the calculus by these two constructs is the same. A match process [x = y]P behaves as P if names x and y are equal; otherwise it does nothing. For instance, x(z). [z = y]z⟨w⟩. 0, on receiving a name via x, can send w via that name just if that name is y; if it is not, the process can do nothing further. This is the syntax of the pure, untyped π-calculus. The calculus is monadic, in that only one value at a
time can be exchanged. In examples and applications, one often uses the polyadic π-calculus, in which the input and output constructs are refined as follows: a polyadic input process x(y1, . . . , yn). P waits for an n-tuple of names z1, . . . , zn at x and then continues as P{z1, . . . , zn/y1, . . . , yn} (i.e., P with the yi's replaced by the zi's); a polyadic output process x⟨y1, . . . , yn⟩. P emits names y1, . . . , yn at x and then continues as P. The input construct x(y). P and the restriction νy P are binders for the free occurrences of y in P, in the same sense as the abstraction construct of the λ-calculus. These binders give rise in the expected way to the definitions of free and bound names of a process. When writing processes, parentheses are used to resolve ambiguity, and the conventions are observed that prefixing and restriction bind more tightly than composition and sum. Thus x⟨v⟩. P ∣ Q is (x⟨v⟩. P) ∣ Q, and νz P ∣ Q is (νz P) ∣ Q. A process x⟨⟩. P is abbreviated as x. P and, similarly, x(). P as x. P.
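The basic computation rule, which the reduction examples below rely on, is the standard communication rule of the π-calculus: an output and an input on the same name can interact. Writing :→ for a single computation step (the notation used in the examples below):

x⟨y⟩. P ∣ x(z). Q :→ P ∣ Q{y/z}

If the transmitted name is restricted, the interaction extends the scope of the restriction (scope extrusion): νy (x⟨y⟩. P) ∣ x(z). Q :→ νy (P ∣ Q{y/z}), provided y is not free in Q. This is how a private name, such as the pointer used in the second example below, comes to be shared by sender and receiver.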
Examples Below some simple examples are given to illustrate the use of the main constructs. The first example is about encoding of data types. In general, processes interact by passing one another data. The only data of the π-calculus are names. However, it may well be convenient to admit other atomic data, such as integers, and structured data, such as tuples and multisets. It is in the spirit of process calculi generally to allow the data language to be tailored to the application at hand, admitting sets, lists, trees, and so on as convenient. And one should admit the relevant data when using the π-calculus to reason about systems. Remarkably, however, all such data can be expressed. For instance,
T = b(x, y). x̄. 0   and   F = b(x, y). ȳ. 0
represent encodings of the boolean values true and false, located at name b. These processes receive a pair of names via b; then T will respond by signaling on the first of them, whereas F will signal on the second. The following process R reads the value of a boolean located at b and, depending on the value of the boolean, it will behave as P or Q:
R = νt νf b⟨t, f⟩. (t. P + f. Q).
It is assumed that t and f are only used to read a boolean, hence they do not appear in P and Q. Now, a system
with both R and T can evolve as follows, using :→ to indicate a single computation step, i.e., a single interaction; below ∼ indicates the application of some simple garbage-collection algebraic laws of the π-calculus that will be discussed later (laws for garbage collecting trailing inactive processes and restrictions on names that are not used anymore).
T ∣ R :→ νt νf (t̄. 0 ∣ (t. P + f. Q)) :→ νt νf (0 ∣ P) ∼ P
In the first step, R sends to T two names, t and f, that were initially private to R; after the interaction, these names are shared between (what remains of) R and T; no other process in the system (in principle, other processes could run in parallel) will now be able to interfere with the following interaction between R and T along t. The capability of creating new names, and sending them around, is at the heart of the expressiveness and the theory of the π-calculus; without it, the π-calculus would not be much different from CCS. In the second step, the interaction along t selects P, and the other branch of the choice is discarded. In the final line, the garbage-collection laws are applied. In the example, the act of interrogating a process T or F for its value destroys it. Persistence of data is represented using replication. The encoding of persistent booleans would then be
TRUE = !b(x, y). x̄. 0
FALSE = !b(x, y). ȳ. 0
Instances of these processes can conduct arbitrarily many dialogues, yielding the same value in each. For example, with R as above, using ⇒ to indicate the transitive closure of :→, and using ∼ as above to indicate application of some cleanup laws, it holds that R ∣ R ∣ TRUE ⇒∼ P ∣ P ∣ TRUE and R ∣ R ∣ FALSE ⇒∼ Q ∣ Q ∣ FALSE. Our second example is about the encoding of higher-order constructs in the π-calculus. Consider a higher-order language with constructs similar to those of the π-calculus but with the possibility of passing processes. In such a language one could write, for instance,
P = a(x). (x∣x)∣a⟨R⟩. Q
On the right-hand side of the composition, a process R is emitted along a; on the left-hand side, an input at a is expecting a process and then two copies of it will be run. Process P evolves as follows: P ⇒ R ∣ R ∣ Q. This behavior can be mimicked in the π-calculus as follows. The communication of the higher-order value R is translated as the communication of a private name that acts as a pointer to (the translation of) R and that the recipient can use to trigger a copy of (the translation of) R; the mapping from the higher-order language into the π-calculus is indicated with [[ · ]]:
[[P]] = a(x). (x̄. 0 ∣ x̄. 0) ∣ νy a⟨y⟩. ([[Q]] ∣ !y. [[R]]) where name y does not occur elsewhere. It holds that
[[P]] :→ νy (ȳ. 0 ∣ ȳ. 0 ∣ [[Q]] ∣ !y. [[R]])
:→:→ νy (0 ∣ 0 ∣ [[Q]] ∣ [[R]] ∣ [[R]] ∣ !y. [[R]])
∼ [[Q]] ∣ [[R]] ∣ [[R]]
where, on the second line, :→:→ represents two consecutive reduction steps. On the third line, the same algebraic laws as above are used, plus a law for garbage-collecting an input-replicated process such as !y. [[R]] if the initial name y is restricted and does not occur elsewhere in the system. The translation separates the acts of copying and of activating the value R; copying is rendered by the replication, and activation by communications along the pointer y. Here again, it is essential that the pointer y is initially private to the process emitting at a. This ensures that the following outputs at y are not intercepted by processes external to [[P]].
Names In the π-calculus, names specify links. But what is a link? The calculus is not prescriptive on this point: the notion of a link is construed very broadly, and names can be put to very many uses. This point is important and deserves some attention. For example, names can be thought of as channels that processes use to communicate. Also, by syntactic means and using type systems, π-calculus names can be used to represent names of processes or names of objects in the sense of object-oriented programming. Further, although the π-calculus does not mention locations explicitly, often
when describing systems in π-calculus, some names are naturally thought of as locations. Finally, some names can be thought of as encryption keys as done in calculi that apply ideas from the π-calculus to computer security.
Types A type system is, roughly, a mechanism for classifying the expressions of a program. Type systems are useful for several reasons: to perform optimizations in compilers; to detect simple kinds of programming errors at compilation time; to aid the structure and design of systems; to extract behavioral information that can be used for reasoning about programs. In sequential programming languages, type systems are widely used and generally well-understood. In concurrent programming languages, by contrast, the tradition of type systems is much less established. In the π-calculus world, types have quickly emerged as an important part of its theory and of its applications, and as one of the most important differences with respect to CCS-like languages. The types that have been proposed for the π-calculus are often inspired by wellknown type systems of sequential languages, especially λ-calculi. Also, type systems specific to processes have been investigated, for instance, for preventing certain forms of interferences among processes or certain forms of deadlocks. One of the main reasons for which types are important for reasoning on π-calculus processes is the following. Although well-developed, the theory of the pure π-calculus is often insufficient to prove “expected” properties of processes. This is because a π-calculus programmer normally uses names according to some precise logical discipline (the same happens for the λ-calculus, which is hardly ever used untyped since each variable has usually an “intended” functionality). This discipline on names does not appear anywhere in the terms of the pure calculus, and therefore cannot be taken into account in proofs. Types can bring this structure back to light. This point is illustrated with an example that has to do with encapsulation. Facilities for encapsulation are desirable in both sequential and concurrent languages, allowing one to place constraints on the access to components such as data and resources. The need of encapsulation has led to the development of abstract
data types and is a key feature of objects in object-oriented languages. In CCS, encapsulation is given by the restriction operator. Restricting a channel x on a process P, written νx P, guarantees that interactions along x between subcomponents of P occur without interference from outside. For instance, suppose one has two one-place buffers, Buf1 and Buf2, the first of which receives values along a channel x and resends them along y, whereas the second receives on y and resends on z. They can be composed into a two-place buffer that receives on x and resends on z; thus: νy(Buf1 ∣ Buf2). Here, the restriction ensures that actions on y from Buf1 and Buf2 are not stolen by processes in the external environment. With the formal definitions of Buf1 and Buf2 at hand, one can indeed prove that the system νy (Buf1 ∣ Buf2) is behaviorally equivalent to a two-place buffer. The restriction operator provides quite a satisfactory level of protection in CCS, where the visibility of channels in processes is fixed. By contrast, restriction alone is often not satisfactory in the π-calculus, where the visibility of channels may change dynamically. Consider the situation in which several client processes cooperate in the use of a shared resource such as a printer. Data are sent for printing by the client processes along a channel p. Clients may also communicate channel p so that new clients can get access to the printer. Suppose that initially there are two clients C1 = p⟨j1⟩. p⟨j2⟩ . . . and C2 = b⟨p⟩ and therefore, writing P for the printer process, the initial system is νp (P ∣ C1 ∣ C2). One might wish to prove that C1's print jobs, represented by j1 and j2, are eventually received and processed in that order by the printer, possibly under some fairness condition on the printer scheduling policy. Unfortunately this is false: a misbehaving new client C3 that has obtained p from C2 can disrupt the protocol expected by P and C1 just by reading print requests from p and throwing them away: C3 = p(j). p(j′). 0.
In the example, the protection of a resource (the printer) fails if the access to the resource is transmitted, because no assumptions on the use of that access by a recipient can be made. Simple and powerful encapsulation barriers against the mobility of names can be created using type concepts familiar from the literature of typed λ-calculi. For instance, the misbehaving printer client C3 can be recognized by distinguishing between the input and the output capabilities of a channel. It suffices to assign the input capability on channel p to the printer and the output capability to the initial clients C1 and C2. In this way, new clients that receive p from existing clients will only receive the output capability on p. The misbehaving C3 is thus ruled out as ill-typed, as it uses p for input. The concept of a channel with direction can be formalized by means of type constructs, sometimes called the input/output types. They give rise to a natural subtyping relation, similar to those used for reference types in imperative languages. In the case of the π-calculus encodings of the λ-calculus, this subtyping validates the standard subtyping rules for function types. It is also important when modeling object-oriented languages, whose type systems usually incorporate some powerful form of subtyping. In the λ-calculus, where functions are the unit of interaction, the key type construct is the arrow type. In the π-calculus names are the unit of interaction and therefore the key type construct is the channel (or name) type ♯ T. A type assignment a : ♯ T means that a can be used as a channel to carry values of type T. As names can carry names, T itself can be a channel type. If one adds a set of basic types, such as integer or boolean types, one obtains the analog of the simply-typed λ-calculus, which is therefore called the simply-typed π-calculus. Type constructs familiar from sequential languages, such as those for products, unions, records, variants, recursive types, polymorphism, subtyping, and linearity, can be adapted to the π-calculus. Having recursive types, one may avoid basic types as initial elements for defining types. The calculus with channel, product, and recursive types is the polyadic π-calculus mentioned earlier on. Beyond types inherited from the λ-calculus, several other type systems have been put forward that are specific to processes; they formalize common patterns of interaction among processes. Types may even serve as
a specification of (parts of) the behavior of processes, as in session types. The behavioral guarantees offered by types are typically safety properties, such as the absence of certain communication errors, of interferences, of deadlocks, and of information leakage in security protocols. Safety is expressed in the subject reduction theorem, a fundamental theorem of type theory stating the invariance of types under reduction. Type systems for liveness properties have been studied too, for instance for properties such as termination (the fact that computation in a concurrent system will eventually stop) and responsiveness (the fact that a process will eventually provide an answer along a certain name). However, liveness properties, already difficult enough to prove in the presence of concurrency, can be even harder to prove for mobile processes (in the sense that it may be difficult to design a type system capable of guaranteeing the property of interest while being expressive enough to handle most common programming idioms).
Theory Fundamental for a theory of a concurrent language is a means of stating what the behavior of a process is, and what it means for two behaviors to be equal. These notions in concurrency are usually treated via operational semantics. A brief and informal account of how these issues are treated in the π-calculus is given below, referring to [] for details. Traditionally, the operational semantics of a process algebra is given in terms of a labeled transition system describing the possible evolutions of a process. This contrasts with what happens in term rewriting systems, which are based on an unlabeled reduction system. In the λ-calculus, probably the best known term-rewriting system, what makes a reduction system possible is that two terms having to interact are naturally in contiguous positions. This is not the case in process calculi, where interaction does not depend on physical contiguity. To put this another way, a redex of a λ-term is a subterm, while a “redex” in a process calculus is distributed over the term. To allow a reduction semantics on processes, axioms for a structural congruence relation, usually written ≡, are introduced prior to the reduction system, in order to break a rigid, geometrical view of concurrency; then reduction rules can easily be presented in which redexes are indeed subterms again.
The interpretation of the operators of the language emerges neatly with a reduction semantics, due to the compelling naturalness of each structural congruence and reduction rule. This is not quite the case in the labeled semantics, at least for process algebras expressing mobility: the manipulation of names and the side conditions in the rules are nontrivial and this can make it difficult to understand and justify the choices made. However, if the reduction system is available, the correctness of the labeled transition system can be shown by proving the correspondence between the two systems. On the other hand, the advantages of a labeled semantics appear later, when reasoning with processes. In a reduction semantics, the behavior of a process is understood relative to a context in which it is contained and with which it interacts. Instead, with a labeled semantics every possible communication of a process can be determined in a direct way. This allows us to get simple characterizations of behavioral equivalences. Moreover, with a labeled semantics the proofs benefit from the possibility of reasoning in a purely structural way. Another possible problem with a reduction semantics is that it allows one to accommodate only limited forms of the choice operator. The conclusion is that both semantics are useful and that they integrate and support each other. To see an example of the two semantics, consider the process
S = x(y). P ∣ z⟨w⟩. R ∣ x⟨v⟩. Q
A reduction semantics would only specify that S has a reduction
S :→ P{v/y} ∣ z⟨w⟩. R ∣ Q
To derive this, structural congruence is first used to bring the participants of the interaction into contiguous positions, using monoidal rules for parallel composition such as
T1 ∣ T2 ≡ T2 ∣ T1        T1 ∣ (T2 ∣ T3) ≡ (T1 ∣ T2) ∣ T3
with which the order of the components in parallel compositions can be modified. Then the rewriting rule
a(b). T ∣ a⟨u⟩. T′ :→ T{u/b} ∣ T′
can be applied.
In contrast, a labeled transition system reveals, beside the above reduction, also the potentials for S to interact with the environment, namely the input and output transitions:
S −x(y)→ P ∣ z⟨w⟩. R ∣ x⟨v⟩. Q
S −x⟨v⟩→ x(y). P ∣ z⟨w⟩. R ∣ Q
S −z⟨w⟩→ x(y). P ∣ R ∣ x⟨v⟩. Q
(Note that, with a restriction in front of S, name x becomes private to S and the input and output actions along x are forbidden.) The rules for the labeled transition semantics are formalized following the style of Plotkin's Structured Operational Semantics, where the derivation proof of a transition is uniquely determined by the position, in the syntax of the term, of the prefixes consumed in the transition. The notion of behavioral equivalence normally adopted in the π-calculus is barbed congruence. Its definition is couched in terms of a bisimulation game on reductions and a simple notion of observation on processes (for instance, a predicate revealing whether a process is capable of emitting a signal at some special name). Two π-terms are deemed barbed congruent if no difference can be observed between the processes obtained by placing them into an arbitrary π-context. The notion of observation has the flavor of the report of the successful outcome of an experiment, as in a theory of testing. Barbed congruence has the advantage of being simple and robust, in that it can be applied to different calculi. Moreover, as its definition involves quantification over contexts, it gives a natural notion of equivalence, properly sensitive to the calculus under consideration (this is important when considering, for instance, type systems, as types implicitly limit the class of contexts in which a process may be placed; as a consequence, more equalities among processes hold). The quantification over contexts, however, makes the definition difficult to work with. This fact motivates the study of auxiliary notions of behavioral equivalences that afford tractable techniques. These labeled equivalences are based on direct comparison of the actions that processes can perform, rather than on the observation of arbitrary systems containing them. There are in fact several notions of labeled equivalence, each of which has characteristics that make it useful in some way. Examples are
late bisimilarity, early bisimilarity, and open bisimilarity [, ]. The labeled equivalences are not as robust as barbed congruence. For instance, they can be much too discriminating on refinements of the π-calculus with types. A number of proof techniques for behavioral equalities on π-calculus processes have been developed, for instance enhancements of the bisimulation proof method, sometimes called “up-to” techniques []. There has also been some work on the development of (semi-)automatic tools to assist in reasoning, though this remains an active research area. Other behavioral equivalences for the π-calculus have been studied, too, for instance testing equivalence []. The π-calculus also has a well-developed algebraic theory. Equational reasoning is a central technique in process calculus. In carrying out a calculation, appeal can be made to any axiom or rule sound for the equivalence in question. In general, the equivalence of two processes is undecidable; indeed, it is not even semi-decidable. On finite processes, however, axiomatizations may be possible. By an axiomatization of an equivalence on a set of terms, one means some equational axioms that, together with the rules of equational reasoning, suffice for proving all (and only) the valid equations. The rules of equational reasoning are reflexivity, symmetry, transitivity, and congruence rules that make it possible to replace any subterm of a process by an equivalent term. Here are examples of axioms for the π-calculus. These axioms are sound, in that the processes so equated are barbed congruent. See [] for more details.
P + P = P
νx x⟨v⟩. P = 0
νx P = P    if x does not occur in P
x(y). P ∣ z⟨v⟩. Q = x(y). (P ∣ z⟨v⟩. Q) + z⟨v⟩. (x(y). P ∣ Q) + [x = z]τ. (P{v/y} ∣ Q)
The first law is an idempotence law for choice. The second law shows that an output at a name x preceded by a restriction on x is a blocked process. The third law allows the removal of a useless restriction. The final law, called expansion, explains the behavior of a parallel composition in terms of choice and matching.
A behavioral equivalence that abstracts from internal actions is sometimes called a weak equivalence, and one that does not a strong equivalence. The above laws are valid for strong barbed congruence, hence also for its weak counterpart. Here is an equality that is only valid for weak barbed congruence:
R ∣ νx (x(y). P ∣ x⟨v⟩. Q) = R ∣ νx (P{v/y} ∣ Q)
The law shows that an interaction between two processes along private names cannot be disturbed by other processes.
Variants and Extensions In the π-calculus, communication between processes is synchronous – it is a handshake synchronization between two processes. Variants of the π-calculus have been proposed in which communication may be thought of as being asynchronous. The most distinctive feature of such extensions is that there is no continuation underneath the output prefix (i.e., all outputs are of the form x⟨v⟩), and only limited forms of choice are allowed. A major advantage of the asynchronous variants is that their implementation is simpler, and indeed most programming languages, or constructs for programming languages, inspired by π-calculus are asynchronous. The ordinary π-calculus does not distinguish between channels, or ports, and variables: they are all names. In some variants, such a separation is made. The π-calculus also does not explicitly mention location or distribution of mobile processes. The issue of location and distribution is orthogonal. Several extensions, or variants, of the π-calculus have appeared with the goal of explicitly addressing distribution and all the associated phenomena. Ideas from π-calculus have contributed to the development of their theories. Examples of languages are the Distributed Join Calculus [], the Distributed π-calculus [], the Ambient Calculus [], and Oz [].
Related Entries Actors Bisimulation CSP (Communicating Sequential Processes) Process Algebras
Bibliographic Notes and Further Reading The first paper on the π-calculus, [], was written by Milner et al. A book that gives a gentle introduction to both CCS and the π-calculus, with an emphasis on motivating the interest in them, is []. A book with an in-depth treatment of the theory of the π-calculus and its main variants is []. A book with a focus on a distributed extension of the π-calculus is []. These textbooks may be consulted for details on the concepts outlined above, including type systems, variants of the π-calculus, behavioral theory, and comparison with higher-order languages. Recent developments, not covered in the above books, include session types [] and type systems for deadlock-freedom and lock-freedom, such as []. Experimental typed programming languages, or proposals for typed programming languages, inspired by the π-calculus include Pict [], Join [], and XLang, developed at Microsoft as a component of the .NET platform. An active research area is the application of concepts from the π-calculus to languages aimed at Web services (see, e.g., []), biology (see, e.g., [, ]), and security (see, e.g., [, ]).
Bibliography . Abadi M, Gordon AD () A calculus for cryptographic protocols: the spi calculus. Inf Comput ():– . Blanchet B, Abadi M, Fournet C () Automated verification of selected equivalences for security protocols. J Log Algebr Program ():– . Boreale M, De Nicola R () Testing equivalence for mobile processes. Inf Comput :– . Carbone M, Honda K, Yoshida N () Structured communication-centred programming for web services. In: Proceedings of the ESOP , vol , Lecture Notes in Computer Science. Springer, Heidelberg, pp –, . Cardelli L, Gordon AD () Mobile ambients. In: Proceedings of the FoSSaCS’, vol , Lecture Notes in Computer Science. Springer, Heidelberg, pp –, . Fournet C, Gonthier G, Lévy J-J, Maranget L, Rémy D () A calculus of mobile agents. In: Proceedings of the CONCUR’, vol , Lecture Notes in Computer Science. Springer, Heidelberg, pp –, . Fournet C, Gonthier G () The join calculus: a language for distributed mobile programming. In: Summer School APPSEM , vol , Lecture Notes in Computer Science. Springer, Heidelberg, pp –
. Hennessy M, Riely J () Resource access control in systems of mobile agents. In: Proceedings of the HLCL ’: High-Level Concurrent Languages, vol ., ENTCS. Elsevier Science, . Hennessy M () A distributed pi-calculus. Cambridge University Press, New York . Kobayashi N () A new type system for deadlock-free processes. In: Proceedings of the CONCUR’, vol , Lecture Notes in Computer Science. Springer, Bonn, pp –, . Milner R () Communicating and mobile systems: the π-calculus. Cambridge University Press, Cambridge . Milner R, Parrow J, Walker D () A calculus of mobile processes, (Parts I and II). Inf Comput :– . Milner R, Parrow J, Walker D () Modal logics for mobile processes. Theor Comput Sci :– . Pierce BC, Turner DN () Pict: a programming language based on the pi-calculus. In: Proof, Language and Interaction: Essays in Honour of Robin Milner. MIT Press, Cambridge . Priami C, Quaglia P, Romanel A () BlenX static and dynamic semantics. In: Proceedings of the CONCUR’, vol , Lecture Notes in Computer Science. Springer, Bologna, pp –, . Regev A, Panina EM, Silverman W, Cardelli L, Shapiro EY () BioAmbients: an abstraction for biological compartments. Theor Comput Sci ():– . Sangiorgi D () On the bisimulation proof method. In: Proceedings of the MFCS’, vol , Lecture Notes in Computer Science. Springer, pp –, . Sangiorgi D, Walker D () The π-calculus: a theory of mobile processes. Cambridge University Press, Cambridge . Smolka G () The definition of kernel Oz. Research Report RR-, Deutsches Forschungszentrum für Künstliche Intelligenz, Kaiserslautern, Germany . Vasconcelos VT () Fundamentals of session types. In: SFM School, vol , Lecture Notes in Computer Science. Springer, Heidelberg, pp –
Pipelining Pipelining [] is a parallel processing strategy in which an operation or a computation is partitioned into disjoint stages. The stages must be executed in a particular order (could be a partial order) for the operation or computation to complete successfully. Each stage is implemented as a component which could be a hardware device or a software thread. When a stage completes, it becomes available to do other work. Parallelism results from the execution of a sequence of operations
or computations so that at any given time several components of the sequence are under execution and each one of these is at a different stage of the pipeline. Pipelining is pervasive in today’s machines. Processor control units and arithmetic units are typically pipelined. Also, programs take advantage of pipelined parallelism by partitioning computations into stages.
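As a hedged illustration (not part of the original entry), the sketch below shows a minimal two-stage software pipeline using POSIX threads: the stages run concurrently and are connected by a small bounded buffer, so at any time each stage works on a different item. The number of items, the queue size, and the work done in each stage are placeholders.

#include <pthread.h>
#include <stdio.h>

#define N      16   /* number of work items                */
#define QSIZE   4   /* capacity of the inter-stage queue   */

/* A minimal bounded queue connecting the two stages. */
static int queue[QSIZE];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

static void enqueue(int v) {
    pthread_mutex_lock(&lock);
    while (count == QSIZE) pthread_cond_wait(&not_full, &lock);
    queue[tail] = v; tail = (tail + 1) % QSIZE; count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

static int dequeue(void) {
    pthread_mutex_lock(&lock);
    while (count == 0) pthread_cond_wait(&not_empty, &lock);
    int v = queue[head]; head = (head + 1) % QSIZE; count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return v;
}

/* Stage 1: transform items and hand them to stage 2. */
static void *stage1(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++)
        enqueue(i * i);              /* "compute" step of stage 1 */
    return NULL;
}

/* Stage 2: consume items; runs concurrently with stage 1. */
static void *stage2(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++)
        printf("stage2 got %d\n", dequeue());
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}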
Related Entries Cray Vector Computers Floating Point Systems FPS-B and Derivatives Fujitsu Vector Computers Stream Programming Languages
Bibliography . Kogge PM () The Architecture of Pipelined Computers. Hemisphere Publishing Corporation, New York
Place-Transition Nets Petri Nets
PLAPACK John A. Gunnels IBM Corp, Yorktown Heights, NY, USA
Definition
PLAPACK is a software library for dense parallel linear algebra computations.
Discussion Introduction The PLAPACK (van de Geijn, Robert., Using PLAPACK, pp. , , , , , , , © Massachusetts Institute of Technology, by permission of the MIT Press) library is a modern, dense parallel linear algebra library that is extensible, easy to use, and available under an open source license. It is designed to be user-friendly while offering competitive performance when compared to the more traditionally constructed ScaLAPACK library.
The PLAPACK Project Motivation PLAPACK's design was motivated by the observation that the parallel implementation of most dense linear algebra operations is a relatively well-understood process. Nonetheless, the creation of general-purpose, high-performance parallel dense linear algebra libraries is severely hampered by the fact that translating the sequential algorithms to a parallel code requires careful manipulation of indices and parameters describing the data, its distribution to processors, and/or the communication required. The creators of PLAPACK believe that these details make such parallel programming highly error prone and that this overhead stands in the way of the parallel implementation of more sophisticated algorithms.
An Object-Based Approach The PLAPACK library is constructed so as to allow the user to express their parallel algorithms in a very concise manner. It does this, largely, through the encapsulation of data in objects. To achieve this, PLAPACK adopted an "object-based" (or object-oriented) approach to programming, inspired by efforts including the Message Passing Interface (MPI) library [], and the PETSc library []. In object-based programming, all of the data related to an operand (matrix, vector, etc.) is cohesive, maintained in a data structure that describes the object. Further, the descriptor is interrogated or modified only indirectly, through the use of accessor and modifier functions.
Objects and Communications
Object Types In the PLAPACK library there are several different types of linear algebra objects. These object types have a one-to-one correspondence with the manner in which they are distributed. The PLAPACK object types include:
● PLA_Matrix: This object is distributed as a two-dimensional object, in a particular block cyclic fashion referred to as the Physically Based Matrix Distribution (PBMD).
● PLA_Mvector: The multivector is distributed over the compute fabric in blocked fashion, viewing the grid as one dimensional. It is used as a vector operand and as an intermediate form for copying data from one object (form) to another.
A multivector object may consist of one or more "columns" of data.
● PLA_Pmvector: The projected multivector object is distributed in the same manner as (a set of) rows or columns of a matrix. This object is often duplicated in one dimension of the computational grid. For example, a column-oriented projected multivector is distributed across rows of the processor mesh and may be duplicated across processor columns. PLAPACK code often uses this object type to create buffers for communication.
● PLA_Mscalar: The multiscalar object is most conveniently thought of as a local matrix. It is not distributed, but replicated on all processors.
The Special Role of the Multivector Multivectors can be thought of as a number of vectors, side-by-side, distributed over the entire processor mesh, where that mesh is viewed as a one-dimensional array of processors. The multivector can be used as a first-class object or as an intermediate form in various kinds of communication. An example of the former would be the redistribution of the panel from a blocked LU decomposition. In taking the object from, typically, a column of processors to the entire processor mesh, more computation per unit of time can be applied to the object. The latter use of the multivector appears in the PLA_Copy and PLA_Reduce routines. Copying data from one object to another or reducing (e.g., summing) data from several objects into another potentially involves complex mapping and communication, but these are encapsulated in the PLA_Copy and PLA_Reduce routines, respectively. The copy routine is used to seamlessly move data between objects of potentially differing distributions and types while the reduction operator is used to do the same, but is logically limited to using a duplicated object as its source. The central nature of the multivector in the copy operation is illustrated in Fig. .
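As a small, hedged illustration of the copy routine (this fragment is not taken from the original text; the creation calls are omitted and the two-argument form of PLA_Copy is an assumption to verify against the library documentation):

PLA_Obj A_panel = NULL, x_mv = NULL;
/* ... A_panel: a view of a block column of a matrix;
       x_mv:    a multivector of matching global length ... */
PLA_Copy( A_panel, x_mv );  /* redistribute the panel over the whole mesh */
/* ... apply compute-intensive updates to x_mv on all nodes ...           */
PLA_Copy( x_mv, A_panel );  /* copy the updated data back into the matrix */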
Referencing (Sub)Objects Views In order to both facilitate concise algorithmic encodings and to eliminate many indexing errors, PLAPACK makes heavy use of linear algebra object (matrix, vector, etc.) “views.” In this context, a view is, abstractly, a subset of an object (a submatrix is the most commonly used view) with its own descriptor.
The promotion of the view to a first-class entity in PLAPACK stems from the fact that many common linear algebra algorithms such as parallel BLAS routines, solvers, and factorization routines, can be expressed in terms of windows into matrices and vectors. Thus, the use of the view (implicit or explicit) is common. By formalizing the use of the view and localizing the functionality related to creating and modifying these objects, PLAPACK both removes the need for the user to create this functionality and places it in a well-tested routine. As the routines used for views are heavily tested and the level of abstraction used to create views is high, relative to explicit indexing, this functionality is intended to reduce the opportunity for programmer error and to make such errors, when introduced, easier to locate. A view is identical to an object created via the PLA_Obj_create function except that it references the same data as the parent object or view from which it was derived. Thus, a change to the values in the data that the view references translates to a corresponding change in the data that the parent object describes. It is legitimate in PLAPACK to derive views from views. In PLAPACK, a view of a linear algebra object is created by a call to the PLA_Obj_view routine. The user specifies the dimensions of the new view as well as its offset relative to the parent object and this routine creates a new view into an existing object. Routines for shifting (sliding) views as well as specialized subview operators, partitioning matrices into multiple submatrices or coalescing submatrices into a single-view object are also supplied in the PLAPACK library.
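As a brief sketch of how views are typically used, modeled on the partitioning idiom of the Cholesky code shown later in this entry (the loop body is elided here), a matrix object A can be traversed through a 2 x 2 partitioning whose quadrants are views into A:

PLA_Obj ATL = NULL, ATR = NULL,
        ABL = NULL, ABR = NULL;

/* Initially the top-left view ATL is empty and ABR views all of A. */
PLA_Part_2x2( A, &ATL, &ATR,
                 &ABL, &ABR, 0, 0, PLA_TL );

while ( PLA_Obj_length( ATL ) < PLA_Obj_length( A ) ) {
  /* Expose the next diagonal block with PLA_Repart_2x2_to_3x3,
     operate on the exposed views, and absorb them again with
     PLA_Cont_with_3x3_to_2x2, exactly as in the Cholesky example below. */
}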
Distributing and Interfacing with Parallel Operands Physically Based Matrix Distribution PLAPACK employs an object distribution called the “Physically Based Matrix Distribution" (PBMD). This distribution scheme is based on the thesis that the elements of vectors are typically associated with data of physical significance, and it is therefore their distribution to nodes that is directly related to the distribution of the problem to be solved. From this point of view, a matrix (discretized operator) merely represents the relation between two vectors (discretized spaces): y = Ax
PLAPACK. Fig. A systematic approach to copying (duplicated) (projected) (multi)vectors from and to (duplicated) (projected) (multi)vectors
PLAPACK partitions x and y, and assigns portions of these vectors to nodes. The matrix A should then be distributed to nodes in a fashion consistent with the distribution of the vectors. To employ PBMD, one must start by describing the distribution of the vectors, here, x and y, to nodes, after which the matrix distribution is induced (derived) by the vector distribution as illustrated in Fig. . The Application Programming Interface PLAPACK was architected under the assumption that applications will employ the library to: ●
Create PLAPACK vector, multivector, and matrix objects.
● Fill the entries in these PLAPACK objects.
● Perform a series of parallel linear algebra operations.
● Query elements of the updated PLAPACK objects.
Many applications are inherently set up to generate numerous sub-vector, sub-multivector, or submatrix contributions to global linear algebra objects. The contributions of these applications can be viewed as dimensional "sub-objects." However, frequently they are also partial sums of the global linear algebra objects, in which case the contribution must be added to existing entries. One approach for parallelizing these applications is to partition the numerous sub-object computations among processors. Such a parallelization of
the applications’ linear algebra object generation phase produces sub-vector, sub-multivector, and submatrix partial sums that may or may not be entirely local with respect to the data distribution of the global linear algebra objects. In order to fill or retrieve entries in a linear algebra object, an application enters the API-active state of PLAPACK. In this state, objects can be opened in a shared-memory-like mode, which allows an application to either fill or retrieve individual elements, sub-vectors, or subblocks of linear algebra objects. The PLAPACK application interface supports filling global linear algebra objects with sub-object partial sums.
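The overall flow might be sketched as follows. The API routine names below are reproduced from memory of the PLAPACK application interface and should be treated as assumptions to be checked against the manual; only the structure (begin, open, accumulate local partial sums, close, end) follows the description above.

PLA_API_begin();                      /* enter the API-active state           */
PLA_Obj_API_open( A );                /* open A in shared-memory-like mode    */
/* Each process adds its locally computed m x n partial sum, stored in
   local_buf with leading dimension ldb, into A at global offset (i, j);
   the argument order of this call is an assumption.                          */
PLA_API_axpy_matrix_to_global( m, n, &alpha, local_buf, ldb, A, i, j );
PLA_Obj_API_close( A );               /* make the accumulated entries visible */
PLA_API_end();                        /* leave the API-active state           */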
An Illustrative Example Parallel Cholesky Factorization Fig. is a PLAPACK implementation of a blocked, parallel Cholesky factorization. It utilizes level- BLAS routines locally and is a simple, but efficient implementation of this algorithm. The representation is compact and relatively easy to explain to those familiar with the underlying sequential algorithm (see, e.g., the encyclopedia entry on libflame). The algorithm proceeds by applying (local) Cholesky factorization to the upper left corner of the matrix of interest, updating the remainder of the matrix accordingly, and reducing the active area (the part that will be further updated) of the matrix. The same algorithm
PLAPACK. Fig. Inducing a matrix distribution from vector distributions. Top: Here each box represents a node of a × mesh. The sub-vectors of x and y are assigned to the mesh in column-major order. Thus the number in the box represents the index of the sub-vector assigned to that node. By projecting the indices of y to the left, the distribution of the matrix row-blocks of A is established. By projecting the indices of x to the top, the distribution of the matrix column-blocks of A is determined. Bottom-left: The resulting distribution of the subblocks of A is given where the indices refer to the indices of the subblocks of A. Bottom-right: The same information, except now from the matrix point of view. The figure shows the matrix partitioned into subblocks, with the indices in the subblocks indicating the node to which the subblock is mapped
is applied to this active part of the matrix until there is no active matrix remaining and the algorithm is complete. In the implementation in Fig. a, the while loop continues until the active portion of the matrix no longer exists. That condition is signaled when the top-left (A_TL) quadrant of the matrix (the part that has
been completely factored) is the entire matrix. Since the matrix is assumed to be (globally) square this condition can be tested by comparing the (global) length of the entire matrix to the factored (sub)section. The calls to PLA_Obj_split_size are used to determine the matrix dimensions of the top-left quadrant of the active matrix that resides on a single
int Chol_blk_var3( PLA_Obj A, int nb_alg )
{
  PLA_Obj ATL = NULL, ATR = NULL, A00 = NULL, A01 = NULL, A02 = NULL,
          ABL = NULL, ABR = NULL, A10 = NULL, A11 = NULL, A12 = NULL,
                                  A20 = NULL, A21 = NULL, A22 = NULL;
  PLA_Obj MINUS_ONE = NULL, ZERO = NULL, ONE = NULL;
  int b, size_top, owner_top, size_left, owner_left;

  /* Create constants -1, 0, 1 of the appropriate type */
  PLA_Create_constants_conf_to( A, &MINUS_ONE, &ZERO, &ONE );

  PLA_Part_2x2( A, &ATL, &ATR,
                   &ABL, &ABR, 0, 0, PLA_TL );

  while ( PLA_Obj_length( ATL ) < PLA_Obj_length( A ) ) {
    /* Determine how big of a block A11 exists within one block */
    PLA_Obj_split_size( ABR, PLA_SIDE_TOP,  &size_top,  &owner_top );
    PLA_Obj_split_size( ABR, PLA_SIDE_LEFT, &size_left, &owner_left );
    b = min( min( size_top, size_left ), nb_alg );

    PLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                        /* ************ */ /* ******************** */
                                             &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                           b, b, PLA_BR );
    /* ------------------------------------------------------------ */
    /* Update A_11 <- L_11 = Chol. Fact.( A_11 ) */
    PLA_Local_chol( PLA_LOWER_TRIANGULAR, A11 );

    /* Update A_21 <- L_21 = A_21 inv( L_11' ) */
    PLA_Trsm( PLA_SIDE_RIGHT, PLA_LOWER_TRIANGULAR, PLA_TRANSPOSE,
              PLA_NONUNIT_DIAG, ONE, A11, A21 );

    /* Update A_22 <- A_22 - L_21 * L_21' */
    PLA_Syrk( PLA_LOWER_TRIANGULAR, MINUS_ONE, A21, ONE, ABR );
    /* ------------------------------------------------------------ */
    PLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                           /* ************** */ /* ***************** */
                              &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                              PLA_TL );
  }

  PLA_Obj_free( &ATL );  PLA_Obj_free( &ATR );
  PLA_Obj_free( &ABL );  PLA_Obj_free( &ABR );
  PLA_Obj_free( &A00 );  PLA_Obj_free( &A01 );  PLA_Obj_free( &A02 );
  PLA_Obj_free( &A10 );  PLA_Obj_free( &A11 );  PLA_Obj_free( &A12 );
  PLA_Obj_free( &A20 );  PLA_Obj_free( &A21 );  PLA_Obj_free( &A22 );
  PLA_Obj_free( &MINUS_ONE );  PLA_Obj_free( &ZERO );  PLA_Obj_free( &ONE );

  return PLA_SUCCESS;
}
PLAPACK. Fig. An efficient variant of Cholesky factorization implemented in the PLAPACK API
processor. This is done so that an efficient, local (no communication) Cholesky factorization can be performed via PLA_Local_chol. The PLA_Repart_2x2_to_3x3 and PLA_Cont_with_3x3_to_2x2 calls create multiple views from a single object and coalesce multiple views into a single object, respectively. The comment bars in the code are visual cues as to the semantics of these functions. In the case of the PLA_Repart_2x2_to_3x3 function, ABR is split into four subviews. In order to unambiguously carry out this operation only the size
of the A11 subview needs to be specified (here, that size is b by b). The PLA_Cont_with_3x3_to_2x2 function coalesces nine views into four. This is done in order to update the view of the active part of the matrix, shrinking it. Computational work is performed through the calls to PLA_Local_chol, which performs the sequential Cholesky factorization, PLA_Trsm whose purpose is to perform a parallel triangular solve with multiple right-hand-sides, and PLA_Syrk, a parallel version of the symmetric rank-k update.
Related Entries BLAS (Basic Linear Algebra Subprograms) LAPACK libflame ScaLAPACK
Bibliographic Notes and Further Reading The PLAPACK notation and APIs were used as the basis for those employed in the FLAME [] project and the libflame library []. The first papers that outlined the ideas that PLAPACK is based upon were published in [] and a book has been published that describes PLAPACK in detail [].
Bibliography . Gropp W, Lusk E, Skjellum A () Using MPI. The MIT Press, Cambridge, MA . Balay S, Gropp W, McInnes LC, Smith B () PETSc . users manual. Technical report ANL-/, Argonne National Laboratory, Cass Avenue Argonne . Gunnels JA, Gustavson FG, Henry GM, van de Geijn RA () FLAME: formal linear algebra methods environment. ACM T Math Software ():– . Van Zee FG () Libflame: the complete reference. http://www.lulu.com/content// . Alpatov P, Baker G, Edwards HC, Gunnels J, Morrow G, Overfelt J, van de Geijn R () PLAPACK: parallel linear algebra package design overview. In: Proceedings of the ACM/IEEE conference on supercomputing, ACM, New York, pp . van de Geijn RA () Using PLAPACK. The MIT Press, Cambridge, MA
PLASMA Jack Dongarra, Piotr Luszczek, University of Tennessee, Knoxville, TN, USA
Definition Parallel Linear Algebra Software for Multi-core Architectures (PLASMA) is a free and open-source software library for numerical solution of linear equation systems on shared memory computers with multi-core processors. In particular, PLASMA is designed to give high efficiency on homogeneous multi-core processors and
multi-socket systems of multi-core processors. As of today, the majority of such systems are on-chip symmetric multiprocessors with classic super-scalar processors as their building blocks, augmented with short-vector SIMD extensions (such as SSE and AltiVec). PLASMA is available for download from the PLASMA Web site (to obtain the PLASMA library, log on to: http://icl.cs.utk.edu/plasma/).
Discussion The emergence of multi-core microprocessor designs marked the beginning of a forced march toward an era of computing in which research applications must be able to exploit parallelism at a continuing pace and unprecedented scale []. To answer this challenge PLASMA redesigns LAPACK [] and ScaLAPACK [] for current and future multi-core processor architectures. To achieve high performance on this type of architecture, PLASMA relies on tile algorithms, which provide fine granularity parallelism. The standard linear algebra algorithms can then be represented as a Directed Acyclic Graph (DAG) [] where nodes represent tasks, either panel factorization or update of a block-column, and edges represent dependencies among them. Figure shows a small DAG for a tile QR factorization. Tasks of the same type (implemented by the same function) share the same color of nodes. Moreover, the development of programming models that enforce asynchronous, out of order scheduling of operations is the concept used as the basis for the definition of a scalable yet highly efficient software framework for computational linear algebra applications. In PLASMA, parallelism is no longer hidden inside Basic Linear Algebra Subprograms (BLAS – for more details see http://www.netlib.org/blas/) but is brought to the fore to yield much better performance. Each of the one-sided tile factorizations presents unique challenges to parallel programming. Cholesky factorization is represented by a DAG with relatively little work required on the critical path. LU and QR factorizations have a corresponding dependency pattern between the nodes of the DAG. These two factorizations exhibit much more severe scheduling constraints than the Cholesky factorization. Currently, PLASMA schedules tasks statically while balancing the trade-off between load balancing and data reuse. PLASMA's performance depends
PLASMA. Fig. Task DAG for tile QR factorization
strongly on tunable execution parameters, the outer and inner blocking sizes, that trade off utilization of different system resources. The outer block size (NB) trades off parallelization granularity and scheduling flexibility with single core utilization, while the inner block
size (IB) trades off memory load with extra flops due to redundant calculations. Tuning PLASMA consists of finding (NB, IB) pairs that maximize the performance depending on the matrix size and the number of cores.
In terms of numerical calculations, the Cholesky factorization represents one case in which a well-known LAPACK algorithm can be easily reformulated in a tiled fashion. Each operation that defines an atomic step of the LAPACK algorithm can be broken into a sequence of tasks where the same algebraic operation is performed on smaller portions of data, i.e., the tiles. In most of the cases, however, the same approach cannot be applied and novel algorithms must be introduced. For the LU and QR [, ] factorizations, algorithms based on the updating factorizations are used to reformulate the algorithms as was done for out-of-core solvers. Updating factorization methods can be used to derive tiled algorithms for LU and QR factorizations that provide very fine granularity of parallelism and the necessary flexibility that is required to exploit dynamic scheduling of the tasks. It is worth noting, however, that, as in any case where such fundamental changes are made, trade-offs have to be taken into account. For instance, in the case of the LU factorization the tiled algorithm replaces partial pivoting with block pairwise pivoting which results in, on average, slightly worse numerical stability. On the positive side, the current analysis of PLASMA’s tile algorithms suggests that they are numerically backward stable and the ongoing research aims to determine if they are stable. PLASMA provides routines to solve dense general systems of linear equations, symmetric positive definite systems of linear equations, and linear least squares problems using LU, Cholesky, QR, and LQ factorizations. Real arithmetic and complex arithmetic are supported in both single precision and double precision.
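As a hedged usage sketch (not part of the original entry), a Cholesky factorization through PLASMA's LAPACK-style interface might look as follows; the function names and signatures follow recent PLASMA releases and should be checked against the installed headers.

#include <plasma.h>

/* Factor a symmetric positive definite matrix A (column-major, n x n)
   on the requested number of cores; illustrative sketch only. */
int cholesky_with_plasma( int n, double *A, int lda, int ncores )
{
    int info;
    PLASMA_Init( ncores );                          /* start PLASMA's worker threads  */
    info = PLASMA_dpotrf( PlasmaLower, n, A, lda ); /* tiled Cholesky factorization   */
    PLASMA_Finalize();
    return info;
}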
Related Entries LAPACK Linear Algebra, Numeric ScaLAPACK
Bibliography . Sutter H () A fundamental turn toward concurrency in software. Dr. Dobb's J ():– . Anderson E, Bai Z, Bischof C, Blackford LS, Demmel JW, Dongarra J, Du Croz J, Greenbaum A, Hammarling S, McKenney A, Sorensen D () LAPACK Users' guide. SIAM, Philadelphia . Blackford LS, Choi J, Cleary A, D'Azevedo E, Demmel J, Dhillon I, Dongarra J, Hammarling S, Henry G, Petitet A, Stanley K, Walker D, Whaley RC () ScaLAPACK users' guide. SIAM, Philadelphia
. Christofides N () Graph theory: an algorithmic approach. Academic, New York . Buttari A, Langou J, Kurzak J, Dongarra JJ () Parallel tiled QR factorization for multicore architectures. In: PPAM’: Seventh international conference on parallel processing and applied mathematics, Gdansk, Poland, pp – . Buttari A, Langou J, Kurzak J, Dongarra JJ () Lapack working note : A class of parallel tiled linear algebra algorithms for multicore architectures. Technical Report UT-CS--, Electrical Engineering and Computer Sciences Department, University of Tennessee
PMPI Tools Bernd Mohr Forschungszentrum Jülich GmbH, Jülich, Germany
Synonyms MPI introspection interface; MPI monitoring interface; MPI profiling interface; PMPI
Definition MPI is the predominant parallel programming model used in scientific computing which is demonstrated to work on the largest computer systems available. With PMPI there exists a standardized, and therefore portable, MPI monitoring interface. Since it was part of the MPI standard from the beginning, a multitude of tools for MPI programming exist. In the MPI standard, PMPI is called the Profiling Interface, which is a little bit misleading as it allows tools to intercept every MPI call made in a parallel program and can be used for all kinds of tools, e.g., tracing or validation tools, and not just profiling ones.
Discussion Section of the MPI standard [] describes the so-called Profiling Interface of MPI. The objective of this interface is to ensure that it is relatively easy for authors of MPI tools to interface their codes to MPI implementations on different machines. Since MPI is a machine independent standard with many different implementations, it is unreasonable to expect that the authors of profiling tools for MPI will have access to the source code that implements MPI on any particular machine. It was therefore necessary to provide a portable mechanism by which the implementers of such tools can
collect whatever performance information they wish without access to the underlying implementation. This is accomplished by dictating that an implementation of the MPI functions must provide a mechanism through which all of the MPI-defined functions may be accessed with a name shift. Thus, all of the MPI functions (which normally start with the prefix “MPI”) should also be accessible with the prefix “PMPI.” This can be done by using weak symbols or simply by compiling each MPI function twice but with the different name. It is now possible for the implementer of an MPI tool to intercept all of the MPI calls that are made by the user program by implementing a library of so-called wrapper functions, which implement the same interface as the MPI function they intercept. In the wrapper function one can collect whatever information is required for the tool’s task before or after calling the underlying MPI implementation (through its name shifted entry points) to achieve the needed effects. One also has access to all parameter values passed to and returned from the MPI function. The MPI standard also requires that the standard MPI library be built in such a way that the inclusion of MPI functions can be achieved one at a time. This is necessary so that the authors of the wrapper library need only define those MPI functions which they wish to intercept, references to any others being fulfilled by the normal MPI library. So in order to use an MPI tool, the program is first linked to the tool’s wrapper library and only then to the MPI library, as shown in Fig. .
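For example, with a typical MPI compiler wrapper the build step might look like

  mpicc app.c -L. -lmytool -lmpi

where libmytool is a hypothetical name for the tool's wrapper library and the name of the MPI library differs between implementations; the only essential point, as described above, is that the wrapper library appears before the MPI library on the link line.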
P
Simple Usage Example The basic structures of any PMPI tool is a collection of wrapper routines that collect the necessary data for each MPI call. At the end of the program execution, e.g., in the MPI_Finalize wrapper, the collected data is potentially aggregated across all processes using MPI communication and then written to disk or presented to the user. Figure shows a very simplistic version of an MPI performance tool that only counts the number of messages sent via MPI_Send. A more realistic version of this wrapper library example would not only count messages, but much more performance metrics, e.g., the time spent in executing the function. Professional MPI performance tools also determine the distribution of the recorded metrics over the receiver rank or the amount of bytes sent. Of course, to get a complete recording of MPI messages all functions that send messages need to be wrapped, i.e., all variants of MPI_Send (MPI_Bsend, MPI_Ssend, MPI_Rsend, …) and some other functions like MPI_Start. The latter function belongs to the group of functions implementing MPI-persistent communication. To monitor these messages, it is also necessary to track all MPI requests in the tool (implemented by wrapping all MPI calls that create, modify, or delete the opaque MPI request objects). Finally, to support all programming languages the MPI standard supports, both the C and the Fortran version of the wrappers need to be implemented. As the MPI standard defines over functions, this means implementing
PMPI Tools. Fig. Linking of user program, wrapper library, and MPI library. The user program calls MPI_Send and MPI_Bcast. If the user program is first linked to the tool’s wrapper library and then to the MPI library, the tool only intercepts MPI_Send calls which uses PMPI_Send to implement the actual sending of the message. Calls to unwrapped functions (here MPI_Bcast) directly call the corresponding function of the MPI library
P
P
PMPI Tools
#include <stdio.h>
#include "mpi.h"

static int numsend = 0;

/* Wrapper: count every message sent, then call the real MPI_Send. */
int MPI_Send( void *buf, int count, MPI_Datatype type,
              int dest, int tag, MPI_Comm comm )
{
  numsend++;
  return PMPI_Send( buf, count, type, dest, tag, comm );
}

/* Wrapper: report the per-rank count before finalizing MPI. */
int MPI_Finalize()
{
  int me;
  PMPI_Comm_rank( MPI_COMM_WORLD, &me );
  printf( "%d sent %d messages.\n", me, numsend );
  return PMPI_Finalize();
}
PMPI Tools. Fig. Simplistic wrapper library example. The wrapper for MPI_Send increments a global variable which was initialized to zero. The wrapper for MPI_Finalize prints how many messages this particular rank had sent. As MPI_Finalize is executed by every rank of the program, each rank is reporting its own result. Inside the wrapper, the rank is determined via PMPI_Comm_rank in order to avoid invoking the corresponding wrapper function
over wrapper functions, which is why many MPI tool projects use wrapper generation tools to simplify the implementation of the wrapper library.
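As a sketch of the "more realistic" wrapper mentioned above, the MPI_Send wrapper can also accumulate the time spent in the call using PMPI_Wtime; the variable names are illustrative only:

#include "mpi.h"

static int    numsend   = 0;
static double send_time = 0.0;           /* accumulated time spent in MPI_Send */

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
  double t0  = PMPI_Wtime();             /* timestamp before the real send */
  int    err = PMPI_Send(buf, count, type, dest, tag, comm);
  send_time += PMPI_Wtime() - t0;        /* accumulate elapsed wall-clock time */
  numsend++;
  return err;
}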
Performance Measurement Tools Based on PMPI Many MPI performance tools exist, and all of them use the PMPI interface to implement the tool. Two widely accepted, portable, and scalable examples are FPMPI- and mpiP. They are both open source, so that the reader can easily download them and study their implementation and usage. FPMPI- is a portable, open-source, very lightweight, and scalable MPI profiling library from Argonne National Laboratory []. For each MPI function, it provides the average and the maximum of the sum of metrics over all processes in a single textual output file. The provided metrics are the number of calls and the total execution time of each MPI function. FPMPI- has two special features that set it apart from other MPI profiling libraries: First, for communication functions, it records not only the amount of data transferred but also the distribution of the message sizes (using a histogram with adaptive bins). Second, it optionally tries to determine the actual synchronization time within blocking MPI calls by replacing the actual MPI implementation with a logically
equivalent implementation using busy-wait at the user level, allowing the blocking time to be estimated. mpiP is also a portable, more complete, but still scalable MPI profiling library, originating from Lawrence Livermore National Laboratory but meanwhile maintained as an open-source project on sourceforge.net []. It also provides the number of calls, the total execution time, and the bytes sent for each MPI function, optionally broken down by MPI call site. Captured call paths are determined by a user-specified traceback level. It also provides MPI I/O statistics where applicable. The collected data is not aggregated across the processes but is collected in a scalable manner into one single output file which contains the complete data for all processes. Besides these two MPI profiling libraries, there are many more performance tools for MPI that utilize the PMPI interface. Some other tools worth mentioning are:
● Integrated Performance Monitoring (IPM) [, ] is a portable, open-source toolkit with a focus on providing a low-overhead profile of the performance aspects and resource utilization in an MPI program. The level of detail is selectable at runtime and presented through a variety of text and web reports. Aside from overall performance, reports are available for load balance, task topology, bottleneck detection, and message size distributions.
● There are also many vendor-specific MPI tools that are commercial products and typically run only on the systems sold by the vendor. Examples are Intel's Trace Collector and Cray's CrayPat.
● VampirTrace [, ] is a portable, open-source tool for collecting detailed traces of MPI programs in the Open Trace Format (OTF), which can be analyzed and visualized by the commercial Vampir trace browser. The latest version is also distributed as part of Open MPI.
● The TAU Parallel Performance System [, ] is a very portable and versatile, open-source toolkit for performance analysis of parallel programs. Among many other things, it supports profiling and tracing of MPI programs based on the PMPI interface.
● The Scalasca toolset [, ] provides call-path profiling and event tracing of MPI, OpenMP, and hybrid MPI/OpenMP programs. Its focus is on extreme scalability: in summer , the first successful tracing experiments on a ,, core BlueGene/P system were reported. Together with VampirTrace, it is the only MPI tool that does complete MPI communicator tracking, enabling a detailed analysis of collective communication. Finally, Scalasca is currently the only tool supporting a detailed analysis of MPI- RMA functions [].
Verification Tools Based on PMPI While most of the PMPI tools measure and analyze the performance of MPI programs, there are certainly other uses for the PMPI interface. For example, it is also possible to verify the correct and portable use of the MPI standard. MPI is widely used to write parallel programs, but it does not guarantee full portability between different MPI implementations. When an application runs without any problems on one platform but crashes or gives wrong results on another platform, developers tend to blame the compiler/architecture/MPI implementation. However, in some cases, the problem is a subtle programming error in the application that went undetected on the first platform. Finding such a bug can be a very strenuous and difficult task. The solution is to use an automated tool designed to check the correctness of MPI applications during runtime. Examples of
such violations are the introduction of irreproducibility, deadlocks, the incorrect management of resources such as communicators, groups, and datatypes, or the use of non-portable constructs. A PMPI-based correctness checking tool has the advantage that it can be used with any standard-conforming MPI implementation and may thus be deployed on any development platform available to the programmer. Although high-quality MPI implementations detect some of these errors themselves, there are many cases where they do not give any warning. For example, non-portable implementation-specific behavior is not indicated by the implementation itself, nor are checks performed that would decrease the performance too much, such as consistency checks. What is worse, MPI implementations tolerate quite a few errors without warnings or crashes, simply giving wrong results. MARMOT [] is one example of a PMPI-based tool designed to detect correctness and portability problems during runtime. For all tasks that require a global view, e.g., deadlock detection or the control of the execution flow, MARMOT uses an additional process, the so-called debug server. Each client registers with the debug server, which in turn gives its clients the permission for execution in a round-robin way. In order to ensure that this additional debug process is transparent to the application, MARMOT maps MPI_COMM_WORLD to a MARMOT communicator containing only the application processes. Since all other communicators are derived from MPI_COMM_WORLD, they automatically exclude the debug server process. Everything that can be checked locally, e.g., the verification of arguments such as tags, communicators, or ranks, is performed by the clients. Additionally, the clients and the debug server use MPI internally to transfer information. Unfortunately, this server/client architecture introduces a bottleneck, affecting the scalability and performance of the tool, especially for communication-intensive applications. UMPIRE [] is a second example of a PMPI-based MPI correctness-checking tool. Like MARMOT, it performs checks on a rank-local and a global level. UMPIRE uses a time-out mechanism and dependency graphs to detect deadlocks. Further errors detected by UMPIRE include the wrong ordering of collective communication
calls within a communicator, mismatching collective call operations, or errant writes to send buffers before non-blocking sends are completed. UMPIRE does extensive resource tracking. Consequently it is able to unearth resource leaks. For instance, applications can repeatedly create opaque objects without freeing them, leading to memory exhaustion, or there can be lost requests due to overwriting of request handles.
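For illustration, the last-mentioned error class can be sketched as follows (this snippet is made up for this purpose and is not taken from the UMPIRE paper): the send buffer is modified before the non-blocking send has completed.

#include "mpi.h"

void buggy_send(int peer, MPI_Comm comm)
{
  int buf = 42;
  MPI_Request req;
  MPI_Isend(&buf, 1, MPI_INT, peer, 0, comm, &req);
  buf = 43;                     /* error: buffer overwritten before the send completes */
  MPI_Wait(&req, MPI_STATUS_IGNORE);
}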
Future Directions While PMPI certainly has been used successfully for many tools, it also comes with a few limitations: One problem is that it requires users to relink their application. More importantly, the PMPI interface supports only one tool at a time. The user cannot use multiple tools in a single run of the application – a significant limitation, since not only might application programmers want to perform multiple performance analyses concurrently, but tool builders also cannot use existing tools, such as an MPI profiler, to evaluate the quality of the implementation of new tools, such as a correctness checker. Further, tool builders cannot easily create tool modules, since functionality cannot be split into individual PMPI-accessing layers that could be reused by other tools. Thus, the interface discourages code reuse during tool development. To overcome these limitations, researchers at LLNL propose a new infrastructure called PnMPI [, ]. PnMPI allows users to dynamically load and execute one or more PMPI-based tools concurrently. This is accomplished by linking the PnMPI infrastructure into applications by default. PnMPI then transforms the wrapper libraries included in the PMPI tools into a single tool stack. Once initialized, PnMPI redirects any MPI routine executed by the application into this dynamically created stack and independently calls each tool that contains a wrapper for the routine. This eliminates the need to create a separate executable for each tool and to run each tool separately. Since PnMPI is lightweight by design, it can be included in the default build process, thereby removing the need for recompilation to include or remove a tool. PnMPI also provides tool interaction functionality through services in the PnMPI core. Thus, separate modules can now implement common tool functionality to improve code reuse, modularity, and flexibility as well as tool interoperability. Possible usage scenarios for the PnMPI
infrastructure are the transparent use of tracing and profiling tools together, or efficient MPI debugging by combining deterministic replay mechanisms with MPI checker libraries like UMPIRE. Another proposal, the so-called Universal MPI Correctness Interface (UniMCI) [], specifically targets the efficient integration of performance and verification tools. Like PnMPI, it is based on the observation that using multiple tools simultaneously can help pinpoint issues faster; especially the combination of a tracing and a correctness tool can provide the history that leads to a correctness event and add further details to a detected error. Thus, it simplifies the identification of the root cause of an error. Unlike PnMPI, it also provides a solution for a generic, portable coupling of any performance host tool with a correctness guest tool, provided they both follow the UniMCI interface. This interface provides two functions for each MPI call: one to analyze the initial arguments of the MPI call, called pre check, and one to analyze the results of the MPI call, called post check. All runtime MPI checkers need an analysis of the initial MPI call arguments to detect errors in the given arguments, which is invoked with the pre check function of UniMCI. The post check is usually needed for MPI calls that create resources, e.g., MPI_Isend, which creates a new request. MPI correctness tools have to add these new resources to their internal data structures in order to be aware of all valid handles and their respective states. By splitting the analysis of the MPI call into two parts, it is possible to return the result of the check to the host tool before the actual MPI call is issued, which is important to guarantee that errors are handled before the application might crash. Results of checks are returned with additional interface functions that have to be issued after each check function. A simple correctness message record is used to return problems detected by the guest tool. Future versions of the interface will contain extensions and rules to handle asynchronous correctness checking tools and multi-threaded applications.
Related Entries Debugging Formal Methods–Based Tools for Race, Deadlock, and Other Errors Metrics
MPI (Message Passing Interface) Parallel Tools Platform Performance Analysis Tools
Periscope Profiling Scalasca
TAU Tracing
Vampir
Bibliography
. Louis Turcotte () Message passing interface forum: MPI: a message-passing interface standard. IJSA special issue on MPI (/)
. Accelerated strategic computing initiative: the ASC SMG benchmark code () https://asc.llnl.gov/computing_resources/purple/archive/benchmarks/smg/
. Argonne National Laboratory: the FPMPI- MPI profiling library () http://www-unix.mcs.anl.gov/fpmpi/
. Vetter J, Chambreau C () The mpiP MPI profiling library. http://mpip.sourceforge.net/
. Skinner D, Wright N, Fuerlinger K, Yelick K, Snavely A () Integrated performance monitoring. http://ipm-hpc.sourceforge.net/
. Fürlinger K, Skinner D (August ) Capturing and visualizing event flow graphs of MPI applications. In: Workshop on productivity and performance (PROPER ), in conjunction with Euro-Par
. Jurenz M, Knüpfer A, Brendel R, Lieber M, Doleschal J, Mickler H, Hackenberg D, Heyde H, Müller M () VampirTrace. http://www.tu-dresden.de/zih/vampirtrace/
. Knüpfer A, Brunst H, Doleschal J, Jurenz M, Lieber M, Mickler H, Müller M, Nagel W () The Vampir performance analysis tool-set. In: Tools for high performance computing. Springer, Stuttgart, pp –
. Shende S, Malony A () Tuning and analysis utilities. http://tau.uoregon.edu/
. Shende S, Malony A () The TAU parallel performance system. Int J High Perform Comput Appl :–, SAGE Publications
. Wolf F, Mohr B, Wylie B, Geimer M () Scalable performance analysis of large-scale applications. http://www.scalasca.org/
. Geimer M, Wolf F, Wylie BJN, Mohr B () A scalable tool architecture for diagnosing wait states in massively parallel applications. Parallel Comput :–
. Hermanns M-A, Geimer M, Mohr B, Wolf F () Scalable detection of MPI- remote memory access inefficiency patterns. In: Proceedings of the th European PVM/MPI users' group meeting (EuroPVM/MPI), Lecture Notes in Computer Science, vol , Springer, Espoo, Finland, pp –
. Krammer B, Bidmon K, Müller M, Resch M () MARMOT: an MPI analysis and checking tool. In: Proceedings of PARCO , Advances in Parallel Computing, vol , Elsevier, pp –
. Vetter J, de Supinski B () Dynamic software testing of MPI applications with Umpire. In: Proceedings of the ACM/IEEE conference on supercomputing (CDROM), Article No.
. Schulz M, de Supinski B () PnMPI tools: a whole lot greater than the sum of their parts. In: Proceedings of supercomputing, ACM/IEEE conference
. Schulz M () A flexible and dynamic infrastructure for MPI tool interoperability. In: International conference on parallel processing (ICPP)
. Hilbrich T, Jurenz M, Mix H, Brunst H, Knüpfer A, Müller M, Nagel W () An interface for integrated MPI correctness checking. In: Chapman B et al (eds) Advances in parallel computing. Parallel computing: from multicores and GPU's to petascale, vol . IOS Press, pp –
Pnetcdf NetCDF I/O Library, Parallel
Point-to-Point Switch Buses and Crossbars
Polaris
Rudolf Eigenmann
Purdue University, West Lafayette, IN, USA
Synonyms Parallelization
Definition Polaris is the name of a parallelizing compiler and research compiler infrastructure. Polaris performs source-to-source translation; programs written in the Fortran language are converted into restructured Fortran programs – typically annotated with directives that express parallelism. Polaris was created in the mids at the University of Illinois and was one of the most advanced freely available tools of its kind.
Discussion Introduction Polaris aimed at pushing the forefront of automatic parallelization (see Parallelization, Automatic) and, at the same time, providing the research community with an infrastructure for exploring program analysis and transformation techniques. Polaris followed several earlier projects with similar goals, a key distinction being that the development of this new compiler was driven by real applications. While the benchmarks that prompted earlier parallelizing compiler research were typically small, challenging program kernels, the Polaris project was preceded and motivated by an effort to manually parallelize the most realistic application program suite available at that time – the Perfect Benchmarks []. This effort identified a number of beneficial transformation techniques [], the automation of which constituted the Polaris project. Polaris was able to improve the state of the art in automatic parallelization significantly. Earlier autoparallelizers were measured to parallelize to a significant degree only of the Perfect Benchmarks. By contrast, Polaris achieved success in six of them – increasing the parallelization success rate in this class of science and engineering applications from less than % to nearly %. Polaris was originally developed at the University of Illinois from to [], with significant extensions made at Purdue University and Texas A&M University, in later project phases. Among the examples and predecessors were several parallelizers developed at the University of Illinois and Rice University. An important contemporary was the Stanford SUIF project []. Among the more recent, related compiler infrastructures are Rose [], LLVM [], and Cetus []. The architecture of Polaris exhibits the classical structure of an autoparallelizer. A number of program analysis and transformation passes detect parallelism and map it to the target machine. The passes are supported by an internal program representation (IR) that represents the source code being transformed and offers a range of program manipulation functions.
Detecting Parallelism Polaris includes the program analysis and transformation techniques that were found to be most important
in a prior manual parallelization project []. At the core of any autoparallelizer is a data-dependence detection mechanism. Data dependences prevent parallelism, making dependence-removing techniques essential parts of an autoparallelizer's arsenal. To this end, Polaris includes passes for data privatization, reduction recognition, and induction variable substitution. The compiler focuses on detecting fully parallel loops, which have independent iterations and can thus be executed simultaneously by multiple processors. For the basics of the following techniques, see Parallelization, Automatic.
Data-dependence test. Typical data-dependence tests detect whether or not two accesses to a data array in two different loop iterations could reference the same array element. The detection works well where array subscripts are linear – of the form a∗i + b∗j, where a, b are integer constants and i, j are index variables of enclosing loops. The Polaris project developed new dependence analysis techniques that are able to detect parallelism in the presence of symbolic and nonlinear array subscript expressions. For example, in the above expression, if a is a variable, the subscript is considered symbolic; if a term such as i∗i appears, the subscript is nonlinear. If the compiler cannot determine the value of a symbolic term, it cannot assume it is linear. Hence, symbolic and nonlinear expressions are related. In real programs it is common for expressions, including array subscripts, to contain symbolic terms other than the loop indices. Through nonlinear, symbolic data-dependence testing, Polaris was able to parallelize several important programs that previous compilers could not.
Privatization. Data privatization [] is a key enabler of improved parallelism detection. A privatization pattern can be viewed as one where a variable, say t, is being used as temporary storage during a loop iteration. The compiler recognizes this pattern in that t is first defined (written) before used (read) in the loop iteration. By giving each iteration a separate copy of the storage space for t, accesses to t in multiple iterations do not conflict. Polaris extended the basic technique so that it could detect entire arrays that can be privatized. For example, the following loop can be parallelized after privatizing the array tmp (the notation tmp(1:m) means "the array elements from index 1 to m"):
DO i=1,n
   tmp(1:m) = a(1:m)+b(1:m)
   c(1:m) = tmp(1:m)+sqrt(tmp(1:m))
ENDDO
Privatization gives each processor a separate instance of tmp. Without this transformation, each loop iteration would write to and read from the same tmp variable, creating a data dependence and thus inhibiting parallelization. Implementing array privatization is substantially more complex than the basic scalar privatization technique. The compiler must determine the sections of each array that are being defined and used in a loop. If each element of an array that is being used has previously been defined in the same loop iteration, the array is privatizable. More sophisticated analysis may privatize sections of arrays that fit this pattern. To do so, the compiler analysis must be able to combine array accesses into sections. As array subscript expressions may contain symbolic terms, these operations must be supported by advanced symbolic manipulation functions, which is one of Polaris' strengths.
Reduction recognition. This transformation is another important enabler of parallelization. Similar to the way Polaris extended the privatization technique from scalars to arrays, it extended reduction recognition []. The following loop shows an example array reduction (sometimes referred to as irregular or histogram reduction). Different loop iterations modify different elements of the hist array. The pattern of modification is not important; any two loop iterations may modify the same or different array elements.
DO i=1,n
   val = <some computation>
   hist(tab(i)) = hist(tab(i)) + val
ENDDO
Polaris recognizes array reductions by searching for loops that contain statements with the following pattern: An assignment statement has a right-hand-side expression that is the sum of the left-hand-side plus a term not involving the reduction array. A loop may have several such statements, but the reduction array must not be used in any other statement of the loop.
Because two iterations may access the same element, the compiler has to assume a possible dependence. Reduction parallelization takes advantage of the mathematical property that sum operations can be reordered (even under limited-precision computer arithmetic, reordering is usually valid, although not always). A reduction loop can be executed in parallel like this: Each processor performs the sum operations over the assigned loop iteration space on a private copy of the original reduction array (hist, in the above example). At the end of the loop, the local reduction arrays from all participating processors are summed into the original reduction array. Reduction patterns are common in science and engineering applications. Array reduction parallelization was a key enabler of parallel performance in a number of important loops in the Perfect Benchmarks.
Induction variable substitution. Induction variable substitution eliminates dependences by replacing an arithmetic sequence by a closed-form computation. In the following loop, the induction variable ind forms a sequence.
ind = ind0
DO i=1,n
   ind = ind + k
   a(ind) = 0
ENDDO
In the basic form of an induction variable, the next value in the sequence is generated by adding a constant to the previous value. The term k could be an integer constant or a loop-invariant expression. The closed-form expression for the sequence is ind+i∗k. By substituting this expression into the array subscript, a(ind+i∗k), the induction statement can be removed and the dependence on the previous value disappears. Polaris extended this well-known transformation to a more general form, where k can be nonconstant, such as another induction variable. For example, if the expression k is the loop index (ind = ind + i), the closed form becomes ind+i∗(i+1)/2. The closed form of a generalized induction variable usually contains nonlinear terms. Parallelization in the presence of such terms requires the application of nonlinear data-dependence tests. Further complication arises when the induction variable is used after the loop. In this case, the compiler must compute and assign
the last value. In the above example, it would insert the statement ind = ind+n*k after the loop. Such last-value assignment will be correct if assignment to the induction variable is guaranteed in all iterations. Polaris makes use of symbolic program analysis techniques to make this guarantee where possible.
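The effect of this transformation can be sketched as follows (rendered in C for compactness; Polaris itself operates on Fortran source, and the function names here are made up for illustration):

/* Illustration only: C rendering of the induction variable example above. */
void original(int *a, int ind0, int k, int n)
{
  int ind = ind0;
  for (int i = 1; i <= n; i++) {
    ind = ind + k;              /* induction: loop-carried dependence on ind */
    a[ind] = 0;
  }
}

void substituted(int *a, int ind0, int k, int n)
{
  for (int i = 1; i <= n; i++)  /* iterations are now independent */
    a[ind0 + i * k] = 0;
  int ind = ind0 + n * k;       /* last-value assignment after the loop */
  (void)ind;                    /* needed only if ind is used after the loop */
}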
Advanced Program Analysis
Symbolic analysis. In addition to powerful transformations, key to Polaris' performance is the availability of advanced symbolic analysis and manipulation utilities. For example, when performing generalized induction variable substitution, the compiler needs to evaluate sum expressions, such as ∑_{j=1}^{n} j = n(n + 1)/2. Polaris' symbolic range analysis technique is able to determine symbolic value ranges that program variables may assume during execution. The technique looks at assignment statements, loop statements, and condition statements to determine constraints on the value range of variables. After an assignment, the left-hand-side variable is known to have the newly given value or symbolic expression. Within a loop, the loop variable is guaranteed to be between the loop bounds. In the then clause of an if statement, the if condition is guaranteed to hold (and not hold in the else clause). For example, the analysis can determine that inside the following loop the inequality 1 ≤ i ≤ n holds and the loop accesses the array elements from position 2 to n + 1.
DO i=1,n
   a(i+1)=0
ENDDO
Polaris created powerful tools for compilation passes to manipulate and reason with symbolic ranges, including comparison, intersection, and union operations. All of the described parallelization techniques depend on the availability of symbolic analysis.
Interprocedural analysis. Because structuring a program into subroutines is an important software engineering principle, the ability of compilers to analyze programs across procedure calls is crucial. Advanced parallelizers, such as Polaris, attempt to find parallelism in outer loops. Larger parallel regions can better amortize the overheads associated with parallel execution, as will be discussed later. However, outer loops tend to encompass subroutine calls, making the application of the described techniques difficult. Furthermore,
interprocedural analysis is important even for code sections that do not contain subroutine calls. For many optimization decisions, the compilation passes must collect information from across the program. For example, symbolic analysis will find the value ranges of program variables in the entire program and propagate them to the subroutines where needed. The needs for interprocedural operation are different for each compiler technique. Creating specialized techniques for each optimization pass can be prohibitively expensive. Instead, Polaris includes a capability to expand subroutines inline. By default, small (by a configurable threshold) subroutines are expanded in place of their call statements. This capability obviates the need for specialized interprocedural analysis in most common cases. The drawback of subroutine inline expansion is code growth. Depending on the subroutine calling structure, a code expansion of an order of magnitude is possible. While Polaris has demonstrated that significant additional parallelism can be found this way, the needed compilation time can grow substantially. To address this issue, Polaris also included an interprocedural data access analysis framework [] as a basis for unified interprocedural parallelism detection.
Mapping Parallel Computation to the Target Machine Mapping parallel computation to the target machine has two objectives. First, the parallelism uncovered by an autoparallelizer may not necessarily fit the model of parallel execution by the eventual machine platform. Additional transformations of the parallel execution or the program’s data space may be needed. Second, almost all program transformations incur overheads. Privatization uses additional storage; reduction parallelization adds computation (e.g., summing up local reduction arrays); induction variable substitution creates expressions of higher strength (e.g., addition is replaced by multiplication); starting/ending parallel loops incur fork/join overheads (e.g., communicating to the participating processors what to do and warming up their cache). Advanced optimizing compilers make use of a performance model to estimate the overheads and to decide which transformations best to apply.
Scheduling parallel execution. Polaris detects loops that are dependence-free and marks them as fully parallel loops. To express parallel execution, it inserts OpenMP (openmp.org) directives, leaving the code generation up to the target platform's OpenMP compiler. A simple parallel loop in OpenMP looks like this:
!$OMP PARALLEL DO PRIVATE(t)
DO i=1,n
   t = a(i)+b(i)
   c(i) = t + t*t
ENDDO
The "OMP" directive expresses that the loop is to be scheduled for parallel execution and the data element t must be placed in private storage. By default, the OpenMP compiler assigns the loop iteration space to the available parallel threads (or cores) in chunks. For example, on an eight-core platform, the first core would usually execute the first n/8 iterations, etc. Polaris can influence this choice via OpenMP "Schedule" clauses. For example, if the compiler detects that the amount of computation per loop iteration is irregular, it may choose a "dynamic" schedule, which maps iterations to available processors at runtime. To determine the profitability of parallel execution, Polaris applies a simple test. It estimates the size of a loop by considering the number of statements and the number of iterations. If the size can be determined and is below a configurable threshold, the compiler leaves the loop in its original, serial form. While the creation of advanced performance models is an important research area, this simple test worked well in practice.
Data placement. The data model is of further importance. Polaris assumes that all non-private variables are shared. All processors simply refer to the data in the same way the original, serial code does. In the above example, all processors participating in the execution of the parallel loop see the arrays a, b, and c in the same way and have direct access. The variable t is stored in processor-private memory. Depending on the architecture, private storage may be actual processor-local memory or it may simply be a processor-private slice of the global address space. Current multicore architectures implement this shared-memory model in hardware and are thus suitable targets for Polaris. Another important class
of parallel computers is not: Many high-performance computer platforms have a distributed memory model. Data need to be partitioned and distributed onto the different memories. Processors do not have direct access to data placed on other processors’ memories; explicit communication must be inserted to read and write such data. Autoparallelizers, including Polaris, have not yet been successful in targeting this machine class. Explicit parallel programming by the software engineer is needed.
Internal Organization Polaris is organized into a number of program analysis and transformation passes, which make use of the functionality for program manipulation offered by the abstract internal program representation (IR). The IR represents the Fortran source program in a way that corresponds to the original code closely. The Syntax Tree has a structure that reflects the program’s subroutines, statements, and expressions. The compiler passes see the IR in an abstract form – as a C++ class hierarchy; they perform all operations, such as finding properties of a statement or inserting an expression, via access functions. The objects of the IR class hierarchy are organized in lists, which are traversable with several iteration methods. A typical compiler pass would begin by obtaining a list of Compilation Units (subroutines), traverse them to the desired subroutine, and obtain a reference to the list of statements of that subroutine. Next, the statement list would be traversed to the desired point, where the statement’s properties and expressions could be obtained. Various “filters” are available that let pass writers iterate over objects of a specific type, only. For example, for some loop analysis pass, it may be desirable to skip from loop to loop, ignoring the statements in between them. Among the most advanced IR functions are those that allow a pass to traverse the IR until a certain statement or expression pattern is found. A key design principle of Polaris was that these functions keep the IR consistent at all times. It would not be possible to insert a statement without properly connecting it to the surrounding scope or rename a variable without properly updating the symbol table. These bookkeeping operations are performed internally to the extent possible; they are not visible in the IR’s access functions. This design offers the compiler researcher a
convenient, high-level interface. It was one reason for Polaris’ popularity as a compiler infrastructure. The implementation of the compiler consists of approximately , lines of C++ code. Forty-five percent of the code represents the IR with its access and manipulation functions; % implements the compilation passes.
Uses of Polaris The primary use of Polaris was as an autoparallelizer and compiler infrastructure in the research community. Many contributions to autoparallelization technology have been made using Polaris as an implementation and evaluation test bed. Polaris supported other applications as well. Its ability to perform source-to-source transformations made it a good platform for writing program instrumentation passes. A simple such pass might insert calls to a timing subroutine at the beginning and end of every loop, producing a tool that can create loop profiles. Other uses of Polaris included the creation of functionality that measures the maximum possible parallelism in a program, predicts the parallel performance of application programs, and creates profiles of detected dependences. Eighteen years after it was first conceived, Polaris continues to be distributed to the research community. The last download was recorded in , in the same month this entry was written.
Challenges and Future Directions Polaris was able to advance autoparallelization technology to the point where one in two science and engineering programs can be profitably executed in parallel on shared-memory machines. Nonnumerical programs and distributed-memory computer architectures are not yet amenable to this technology and remain an elusive research goal. In pursuing this goal, the increasing complexity of the compiler is a challenge. Learning the underlying theory and realizing its implementation is highly time-consuming, both for the researchers exploring new analysis and transformation passes and for the engineers developing production-strength compiler products. One of the severe limitations of Polaris – and autoparallelization in general – is the lack of information that can be gathered from a program at compile time. Both the detection of parallelism and the mapping
to the target architecture depend on knowledge of information that may be available only from the program's input data set or the target platform. Therefore, many optimization decisions cannot be made at compile time or can only be made by the compiler making guesses. Guesses are only legal where they do not affect the correctness of the transformed program. Therefore, compilers often must make conservative assumptions, which may limit the degree of optimization. Future compilers will increasingly need to merge with runtime techniques that gather information as the program executes and perform or tune optimizations dynamically. Polaris explored one aspect of this area with runtime data-dependence techniques. These techniques detect at runtime whether or not a loop is dependence-free and choose between serial and parallel execution [].
Related Entries Banerjee’s Dependence Test Code Generation Dependence Analysis GCD Test HPF (High Performance Fortran) Loop Nest Parallelization Modulo Scheduling and Loop Pipelining Omega Test Parallelization, Automatic Parallelization, Basic Block Pipelining Speculative Parallelization of Loops Run Time Parallelization Scheduling Algorithms Semantic Independence Speculation, Thread-Level Trace Scheduling Unimodular Transformations
Bibliography
. Blume W, Doallo R, Eigenmann R, Grout J, Hoeflinger J, Lawrence T, Lee J, Padua D, Paek Y, Pottenger B, Rauchwerger L, Tu P (December ) Parallel programming with Polaris. IEEE Comput ():–
. Dave C, Bae H, Min S-J, Lee S, Eigenmann R, Midkiff S () Cetus: a source-to-source compiler infrastructure for multicores. IEEE Comput ():–
. Eigenmann R, Hoeflinger J, Padua D (January ) On the automatic parallelization of the perfect benchmarks. IEEE Trans Parallel Distrib Syst ():–
. Quinlan DJ et al. Rose compiler project. http://www.rosecompiler.org/
. Berry M et al () The perfect club benchmarks: effective performance evaluation of supercomputers. Int J Supercomput Appl ():–
. Hall MW, Anderson JM, Amarasinghe SP, Murphy BR, Liao S-W, Bugnion E, Lam MS (December ) Maximizing multiprocessor performance with the SUIF compiler. Computer, pp –
. Hoeflinger J, Paek Y, Yi K () Unified interprocedural parallelism detection. Int J Parallel Program ():–
. Lattner C, Adve V () LLVM: a compilation framework for lifelong program analysis & transformation. In: CGO': Proceedings of the international symposium on code generation and optimization. IEEE Computer Society, Washington, DC, p
. Pottenger B, Eigenmann R () Idiom recognition in the Polaris parallelizing compiler. In: Proceedings of the th ACM international conference on supercomputing, Barcelona
. Rauchwerger L, Padua D () The LRPD test: speculative runtime parallelization of loops with privatization and reduction parallelization. In: PLDI': Proceedings of the ACM SIGPLAN conference on programming language design and implementation. ACM, New York, pp –
. Tu P, Padua D (August ) Automatic array privatization. In: Proceedings of the th workshop on languages and compilers for parallel computing, Lecture Notes in Computer Science, vol , pp –
Polyhedra Scanning Code Generation
Polyhedron Model
Paul Feautrier, Christian Lengauer
CNRS École Normale Supérieure de Lyon, Lyon Cedex, France
University of Passau, Passau, Germany
Synonyms Polytope model
Definition The polyhedron model (earlier known as the polytope model [, ]) is an abstract representation of a loop program as a computation graph in which questions such as program equivalence or the possibility and nature of parallel execution can be answered. The
nodes of the computation graph, each of which represents an iteration of a statement, are associated with points of Zn . These points belong to polyhedra, which are inferred from the bounds of the surrounding loops. In turn, these polyhedra can be analyzed and transformed with the help of linear programming tools. This enables the automatic exploration of the space of equivalent programs; one may even formulate an objective function (such as the minimum number of synchronization points) and ask the linear programming tool for an optimal solution. The polyhedron model has stringent applicability constraints (mainly to FOR loop programs acting on arrays), but extending its limits has been an active field of research. Beyond autoparallelization, the polyhedron model can be useful in many situations which call for a program transformation, such as in memory or performance optimization.
Discussion The Basic Model Every compiler must have representations of the source program in various stages of elaboration, as for instance by character strings, abstract syntax trees, control graphs, three-address codes, and many others. The basic component of all these representations is the statement, be it a high-level language statement or a machine instruction. Unfortunately, these representations do not meet the needs of an autoparallelizer simply because parallelism does not occur between statements, but between statement executions or instances. Consider:
for i = to n− do
S: a[i] = .
od
It makes no sense to ask whether S can be executed in parallel with itself; in this case, parallelism depends both on the way S accesses memory and on the way the loop counter i is updated at each iteration. A loop program must therefore be represented as a set of instances, its iteration domain, here named E. Each instance has a distinct name and consists in the execution of the related statement or instruction, depending on the granularity of the analysis. This set is finite, in the case of a terminating program, or infinite, in the case of a reactive or streaming system. However, this is not sufficient to specify the object program. One needs to know in which order the
instances are executed; E must be ordered by some relation ≺. If u, v ∈ E, u ≺ v means that u is executed before v. Since an operation cannot be executed before itself, ≺ is a strict order. It is easy to see that the usual control constructs (sequences, loops, conditionals, jumps) are compact ways of defining ≺. It is also easy to see that, in a sequential program, two arbitrary instances are always ordered: one says that, in this case, ≺ is a total order. Consideration of an elementary parallel program (in OpenMP notation):
#pragma omp parallel sections
S1
#pragma omp section
S2
#pragma omp end parallel sections
shows that S1 may be executed before or after or simultaneously with S2, depending on the available resources (processors) and the overall state of the target system. In that case, neither S1 ≺ S2 nor S2 ≺ S1 is true: one says that ≺ is a partial order. As an extreme case, an embarrassingly parallel program, in which instances can be executed in any order, has the empty execution order. Therefore, one may say that parallelization results in replacing the total execution order of a sequential program by a partial one, under the constraint that the outcome of the program is not modified. This in turn raises the following question: Under which conditions are two programs with the same iteration domain but different execution orders equivalent? Since program equivalence is undecidable in general, one must be content with conservative answers, i.e., with sufficient but not necessary equivalence conditions. The usual approach is based on the concept of dependences (see also the Dependences entry in this encyclopedia). Assuming that, given the name of an instance u, one can characterize the sets (or supersets) of read and written memory cells, R(u) and W(u), u and v are in dependence, written u δ v, if both access some memory cell shared by them and at least one of them modifies it. In symbols: u δ v if at least one of the sets R(u) ∩ W(v), W(u) ∩ R(v), or W(u) ∩ W(v) is not empty. The concept of a dependence was first formulated by Bernstein []. One can prove that two programs are equivalent if dependent instances are executed in the same order in both.
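For the one-statement loop above, for instance, W(⟨S, i⟩) = {a[i]} and R(⟨S, i⟩) = ∅ if the loop counter is left aside; any two distinct instances therefore satisfy Bernstein's conditions, no dependence exists, and the loop is embarrassingly parallel.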
▸ Aside. Proving equivalence starts by showing that, under Bernstein’s conditions, two independent consecutive instances can be interchanged without modifying the final state of memory. In the case of a terminating program, this is done by specifying a succession of interchanges that convert one order into the other without changing the final result. The proof is more complex for nonterminating programs and depends on a fairness hypothesis, namely, that every instance is to be executed eventually. One can then prove that the succession of values assigned to each variable – its history – is the same for both programs. One first shows that the succession of assignments to a given variable is the same for both programs since they are in dependence, and, as a consequence, that the assigned values are the same, provided all instances are deterministic, i.e., return the same value when executed with the same arguments (see also the Bernstein’s Conditions in this encyclopedia).
To construct a parallel program, one wants to remove all orderings between independent instances, i.e., construct the relation δ ∩ ≺, and take its transitive closure. This execution order may be too complex to be represented by the available parallel constructs, like the parallel sections or the parallel loops of OpenMP. In this case, one has to trade some parallelism for a more compact program. It remains to explain how to name instances, how to specify the index domain of a program and its execution order, and how to compute dependences. There are many possibilities, but most of them ask for the resolution of undecidable problems, which is unsuitable for a compiler. In the polyhedron model, sets are represented as polyhedra in Zn, i.e., sets of (integer) solutions of systems of affine inequalities (inequalities of the form A x ≤ b, where A is a constant matrix, x a variable vector, and b a constant vector). It so happens that these sets are the subject of a well-developed theory, (integer) linear programming [], and that all the necessary tools have efficient implementations. The crucial observation is that the iterations of a regular loop (a Fortran DO loop, or a Pascal FOR loop, or restricted forms of C, C++, and Java FOR loops) are represented by a segment (which is a one-dimensional polyhedron), and that the iterations of a regular loop nest are represented by a polyhedron
with as many dimensions as the nest has loops. Consider, for instance, the first statement of the loop program in Fig. a. It is enclosed in two loops. The instances it generates can be named by stating the values of i and j, and the iteration domain is defined by the constraints 1 ≤ i ≤ n, 1 ≤ j ≤ i + m, which are affine and therefore define a polyhedron. In the same way, the iteration domain of the second statement is 1 ≤ i ≤ n. For better readability, the loop counters are usually arranged from outside inward in a vector, the iteration vector of the instance. Observe that, in this representation, n and m are parameters, and that the size of the representation is independent of their values. Also, the upper bound of
the loop on j is not constant: iteration domains are not limited to parallelepipeds. The iteration domain of the program is the disjoint union of these two polyhedra, as depicted in Fig. b with dependences. (The dependences and the right side of the figure are discussed below.) To distinguish the several components of the union, one can use statement labels, as in:
E = {⟨S1, i, j⟩ ∣ 1 ≤ i ≤ n, 1 ≤ j ≤ i + m} ∪ {⟨S2, i⟩ ∣ 1 ≤ i ≤ n}
The execution order can be deduced from two observations:
● In a program without control constructs, the execution order is the textual order.
a  Source loop nest:
for i = 1 to n do
   for j = 1 to i+m do
      S1 : A(i,j) = A(i−1,j) + A(i,j−1)
   od
   S2 : A(i,i+m+1) = A(i−1,i+m) + A(i,i+m)
od
b  Source iteration domain (axes i and j)
c  Target iteration domain (axes t and p)
d  Target loop nest:
for t = 0 to m+2∗n−1 do
   parfor p = max(0,t−n+1) to min(t, ⌈(t+m)/2⌉) do
      if 2∗p = t+m+1 then
         S2 : A(p−m,p+1) = A(p−m−1,p) + A(p−m,p)
      else
         S1 : A(t−p+1,p+1) = A(t−p,p+1) + A(t−p+1,p)
      fi
   od
od
Polyhedron Model. Fig. Loop nest transformation in the basic polyhedron model
● Loop iterations are executed according to the lexicographic order of the iteration vectors.
In more complex cases, these two observations may be combined to give:
⟨R, x⟩ ≺ ⟨S, y⟩ ≡ x[1 : N] ≪ y[1 : N] ∨ (x[1 : N] = y[1 : N] ∧ R ≺txt S)
where N is the number of loops enclosing both R and S, x[1 : N] denotes the vector of the first N components of x, ≪ denotes the lexicographic order, and ≺txt the textual order.
It remains to compute dependences, i.e., the sets R(u) and W(u). To this end, array subscripts are assumed to be affine functions of the iteration vector: the subscript function of a reference in statement R to an array A is written
fR (x) = FR x + gR ,
where FR is a matrix of dimension dA × dR , with dA being the number of dimensions of A and dR the number of loops surrounding R, and gR is a vector of dimension dA . One may associate with each candidate dependence a system of constraints by gathering the subscript equations, the constraints which define the iteration domains of R and S, and the sequencing predicate above. All of these constraints are affine, with the exception of the sequencing predicate, which is a disjunction of affine constraints. Each disjunct can be tested for solutions, either by ad hoc conservative methods – see the Banerjee's Dependence Test entry in this encyclopedia – or by linear programming algorithms – see the Dependences entry. In summary, a program can be handled in the polyhedron model – and is then called a regular or static control program – if its only control constructs are (also called regular) loops with affine bounds and its data structures are either scalars or arrays with affine subscripts in the surrounding loop counters. It should be noted that these restrictions must not be taken syntactically but semantically. For instance, in the program: i = ; k = ; while i < n do a[k] = . ; i = i + ; k = k + od the loop is in fact regular with counter i, and the subscript of a is really i, which is affine. There are many classical techniques – here, induction variable detection – for transforming such constructs into a more "polyhedron-friendly" form. Regular programs are mainly found in scientific computing, linear algebra, and signal processing, where unbounded iteration domains are frequent. Perhaps more surprisingly, many variants of the Smith and Waterman algorithm [], which is the basic tool for genetic sequence analysis, are regular and can be optimized with polyhedral tools []. Also, while large programs rarely fit in the model, it is often possible to extract regular kernels and to process them in isolation.
Transformations
The main devices for program optimization in the polyhedron model are coordinate transformations of the iteration domain.
An Example Consider Fig. as an illustration of the use of transformations. Figure a presents a sequential source program with two nested loops. The loop nest is imperfect: not all statements belong to the innermost loop body. Figure b depicts the iteration domain of the source program, as explained in the previous section. The arrows represent dependences and impose a partial
order on the loop steps. Apart from these ordering constraints, steps can be executed in any order or in parallel. In the source iteration domain, parallelism is in some sense hidden. The loop on j is sequential since the value stored in A(i, j) at iteration j is used as A(i, j−1) in the next iteration. The same is true for the loop on i. However, parallelism can be made visible by applying a skewing transformation as in Fig. c. For a given value of t, there are no dependences between iterations of the p loop, which is therefore parallel. The required transformation can be viewed as a change of coordinates or a renaming:
S1 : t = i + j − 2, p = j − 1
S2 : t = 2i + m − 1, p = i + m
Observe that the transformation for S1 has the nonsingular matrix
( 1 1 )
( 0 1 )
and, hence, is bijective. Furthermore, the determinant of this matrix is 1 (the matrix is unimodular), which means that the transformation is bijective in the integers. A target loop nest which corresponds to the target iteration domain is depicted in Fig. d. The issue of target code generation is addressed later. For now, just note that the target loop nest is much more complex than the source loop nest, and that it would be cumbersome and error-prone to derive it manually. On the other hand, the fact that both transformation matrices are unimodular simplifies the target code: both loops have unit stride.
The Search for a Transformation The fundamental constraint on a transformation in the polyhedron model is affinity. As explained before, each row of the transformation matrix corresponds to one axis of the target coordinate system. Each axis represents either a sequential loop or a parallel loop. Iterations of a sequential loop are executed successively; hence, the loop counter can be interpreted as (logical)
time. Iterations of a parallel loop are executed simultaneously (available resources permitting) by different processors; the values of their loop counters correspond to processor names. Finding the coefficients for the sequential axes constitutes a problem of scheduling, finding the coefficients for the parallel axes one of placement or allocation. Different methods exist for solving these two problems. The order in which sequential and parallel loops are nested is important. One can show that it is always possible to move the parallel loops deeper inside the loop nest, which generates lock-step parallelism, suitable for vector or VLIW processors. For less tightly coupled parallelism, suitable for multicores or message-passing architectures, one would like to move the parallel loops farther out, but this is not always possible.
Scheduling A schedule maps each instance in the iteration domain to a logical date. In contrast to what happens in (task graph scheduling), the number of instances is large, or unknown at compile time, or even infinite, so that it is impossible to tabulate this mapping. The schedule must be a closed-form function of the iteration vector; it will become clear presently that its determination is easy only if restricted to affine functions. Let θ R (i) be the schedule of instance ⟨R, i⟩. Since the source of a dependence must be executed before its destination, the schedule must satisfy the following causality constraint: ∀i, j : ⟨R, i⟩ δ ⟨S, j⟩ ⇒ θ R (i) < θ S ( j). There are as many such constraints as there are dependences in the program. The unknowns are the coefficients of θ R and θ S . The first step in the solution is the elimination of the quantifiers on i and j. There are general methods of quantifier elimination [] but, due to the affinity of the constraints in the polyhedron model, more efficient methods can be applied. In fact, the form of the causality constraint above asserts that the affine delay θ S ( j) − θ R (i) must be positive inside the dependence polyhedron {i, j ∣ ⟨R, i⟩ δ ⟨S, j⟩}. To this end, it is necessary and sufficient that the delay be positive at the vertices of the dependence polyhedron, or that it be
an affine positive combination of the dependence constraints (Farkas lemma). The result of quantifier elimination is a linear system of inequalities which can be solved by any linear programming tool. This system of constraints may not be feasible, i.e., it may have no solution. This means simply that no linear-time parallel execution exists for the source program. The solution is to construct a multidimensional schedule. In the target loop nest, there will be as many sequential loops as the schedule has dimensions. More information on scheduling can be found in the Scheduling Algorithms entry of this encyclopedia.
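For the loop nest of Fig. , for instance, the t component of the transformation shown earlier can be read as a one-dimensional affine schedule, and the causality constraints can be checked directly against the dependences of Fig. b:
θS1(i, j) = i + j − 2,  θS2(i) = 2i + m − 1,
θS1(i, j) − θS1(i, j − 1) = 1 > 0  (dependence through A(i, j − 1)),
θS1(i, j) − θS1(i − 1, j) = 1 > 0  (dependence through A(i − 1, j)),
θS2(i) − θS1(i, i + m) = 1 > 0  (S2 reads A(i, i + m), written by S1).
Each delay is affine and positive over the whole parametric dependence polyhedron, which is exactly what the vertex-based or Farkas-based formulation certifies.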
Placement A placement maps each instance to a (virtual) processor number. Again, this mapping must be in the form of a closed affine function. In contrast to scheduling, there is no legality constraint for placements: any placement is valid, but may be inefficient. For each dependence between instances that are assigned to distinct processors, one must generate a communication or a synchronization, depending on whether the target architecture has distributed or shared memory. These are costly operations, which must be kept at a minimum. Hence, the aim of a placement algorithm is to find a function π : E → [, P], where P is the number of processors, such that the size of the set C = {u, v ∈ E ∣ u δ v, π(u) ≠ π(v)} is minimal. Since counting integer points inside a polyhedron is difficult, one usually uses the following heuristic: try to "cut" as many dependences as possible. A dependence from statement R to S can be cut if the following constraint holds: ⟨R, i⟩ δ ⟨S, j⟩ ⇒ πR (i) = πS ( j). This condition can be transformed into a system of homogeneous linear equations for the coefficients of π. The problem is that, in most cases, if one tries to satisfy all the cutting constraints, the only solution is π(u) = ,
which corresponds to execution on only one processor: this, indeed, results in the minimal number of synchronizations (namely zero)! A possible way out is to solve the cutting constraints one at a time, in order of decreasing size of the dependence polyhedron, and to stop just before generating the trivial solution. The uncut dependences induce synchronization operations. If all dependences can be cut, the program has communication-free parallelism and can be rewritten with one or more outermost parallel loops. In the special case of a perfect loop nest with uniform dependences, one may approximate the dependence graph by the translations of the lattice generated by the dependence vectors. If the determinant of this lattice is larger than , the program can be split into as many independent parts []. Lastly, instead of assigning a processor number to each instance, one may assign all iterations of one statement to the same processor [, ]. This results in the construction of a Kahn process network [].
Code Generation In the polyhedron model, a transformation of the source iteration domain which optimizes some objective function can be found automatically. The highest execution speed (i.e., the minimum number of steps to be executed in sequence) may be the first objective that comes to mind, but many other objective functions are possible. Unfortunately, it is not trivial to generate efficient target code from the optimal solution in the model. Several factors can degrade performance seriously. The enumeration of the points in the target iteration domain involves tests against the lower and upper loop bounds. If the generated code is not chosen wisely, these tests will often degrade scalability. For example, in Fig. , a maximum and a minimum are involved. The example of Fig. also shows that additional control (the IF statement) may be introduced, which degrades performance. Of course, synchronizations and communications can also degrade performance seriously. For details on code generation in the polyhedron model, see the Code Generation entry of this encyclopedia.
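As a concrete, hand-written illustration of such bound expressions (a sketch for the toy nest used above, not the output of an actual code generator), the target code for the wavefront schedule t = i + j contains max/min loop bounds that must be evaluated at run time:

import numpy as np

N = 8
A = np.zeros((N, N))
A[0, :] = A[:, 0] = 1.0

# Target code for t = i + j: the outer loop is sequential time, the inner loop
# enumerates one parallel wavefront; note the max/min bounds.
for t in range(0, 2 * N - 1):
    for i in range(max(0, t - N + 1), min(t, N - 1) + 1):
        j = t - i
        if i > 0 and j > 0:
            A[i, j] = A[i - 1, j] + A[i, j - 1]

The guard inside the inner loop plays the role of the additional IF control mentioned above.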
Extensions The following extensions have successively been made to the basic polyhedron model.
WHILE Loops The presence of a WHILE loop in the loop nest turns the iteration domain from a finite set (a polytope) into an infinite set (a polyhedron). If the control dependence imposed by the termination test of the loop is respected, the iteration must necessarily be sequential. However, the steps of a WHILE loop in a nest with further (FOR or WHILE) loops may be distributed in space. There have been two approaches to the parallelization of WHILE loops. The conservative approach [, ] respects the control dependence. One challenge here is the discovery of global termination. The speculative approach [] does not respect the control dependence. Thus, several loop steps may be executed in parallel if there is no other dependence between them. The price paid is the need to store intermediate results, in case a rollback must be performed when the point of termination has been discovered but further steps have already been executed. In some cases, overshooting the termination point does not jeopardize the correctness of the program and no rollback is needed. Discovering this property is beyond the capability of present compilers.

Conditional Statements The basic model permits only assignment statements in the loop body. The challenge of conditionals is that a dependence may hold only for certain executions, i.e., not for all branches. A static analysis can only reveal the union of these dependences [].

Iteration Domain Splitting In some cases, the schedule can be improved by orders of magnitude if one splits the iteration domain in appropriate places []. One example is depicted in Fig. . With the best affine schedule of ⌊i/⌋, each parallel step contains two loop iterations, i.e., the execution is sped up by a factor of . (The reason is that the shortest dependence has length .) The domain split on the right yields two partitions, each without dependences between its iterations. Thus, all iterations of the upper loop (enumerating the left partition) can be executed in a first parallel step, and the iterations of the lower loop (enumerating the right partition) in a second one, for a speedup of n/.

for i = 0 to 2∗n−1 do
  A(i) = ... A(2∗n−i−1)
od
      =⇒
for i = 0 to n−1 do
  A(i) = ... A(2∗n−i−1)
od
for i = n to 2∗n−1 do
  A(i) = ... A(2∗n−i−1)
od

Polyhedron Model. Fig. Iteration domain splitting

Tiling The technique of domain splitting has a further, larger significance. The polyhedron model is prone to yielding very fine-grained parallelism. To coarsen the grain when not enough processors are available, one partitions (parts of) the iteration domain into equally sized and shaped tiles. Each tile covers a set of iterations, and the points in a tile are enumerated in time rather than in space, i.e., the iteration over a tile is resequentialized. One can tile the source iteration domain or the target iteration domain. In the latter case, one can tile space and also time. Tiling time corresponds to adding hands to a clock and has the effect of coarsening the grain of processor communications. The habilitation thesis of Martin Griebl [] offers a comprehensive treatment of this topic and an extensive bibliography. See also the Tiling entry of this encyclopedia.

Treatment of Expressions In the basic model, expressions are considered atomic. There is an extension of the polyhedron model to the parallelization of the evaluation of expressions []. It also permits the identification of common subexpressions and provides a means to choose automatically the suitable point in time and the suitable place at which to evaluate it just once. Its value is then communicated to other places.
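Returning to tiling, the following sketch (hypothetical code; the tile size and loop bounds are arbitrary choices, not taken from the entry) shows how a one-dimensional iteration domain is partitioned into fixed-size tiles whose points are enumerated sequentially within each tile:

import numpy as np

N, T = 1024, 64           # N iterations, tile size T (both arbitrary here)
A = np.zeros(N)

# Untiled loop:  for i in range(N): A[i] = ...
# Tiled loop: the outer loop enumerates tiles (candidates for parallel or
# distributed execution); the inner loop re-sequentializes the tile's points.
for t in range(0, N, T):               # one iteration per tile
    for i in range(t, min(t + T, N)):  # points of the tile, executed in sequence
        A[i] = 2.0 * i                 # placeholder loop body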
Relaxations of Affinity The requirement of affinity enters everywhere in the polyhedron model: in the loop bounds, in the array index expressions, in the transformations. Soon after the polyhedron model had been developed, the desire arose to transcend affinity in places. Iteration domain splitting is one example. Lately, a more encompassing effort has been made to leave affinity behind. One circumstance that breaks the affinity of index expressions is that the so-called structure parameters (e.g., variables n and m in the loops of Figs. and ) enter multiplicatively as unevaluated variables, not as constants. For example, when a two-dimensional array is linearized, array subscripts are of the form n·i + j, with i, j being the loop iterators. As a consequence, subscript equations are nonlinear in the structure parameters, too. An algorithm for computing the solutions of equation systems with exactly one such structure parameter exists []. In transformations and code generation, nonlinear structure parameters, as in expressions n i, n i, or n m i, can be handled by generalizing existing algorithms (for the case without nonlinear parameters) using quantifier elimination []. Code generation can even be generalized to handle nonlinear loop indices, as in n i , n i , or i j. To this end, cylindrical algebraic decomposition (CAD) [], which corresponds to Fourier–Motzkin elimination in the basic model, is used for computing loop nests which enumerate the points in the transformed domains efficiently. This extends the frontier of code generation to arbitrary polynomial loop bounds.

Applications Other than Loop Parallelization
Array Expansion It is easy to see that, if a loop modifies a scalar, there is a dependence between any two iterations, and the loop must remain sequential. When the modification occurs early in the loop body, before any use, the dependence can be removed by expanding the scalar to a new array, with the loop counter as its subscript. This idea can be extended to all cases in which a memory cell – be it a scalar or part of an array – is modified more than once. The transformation proceeds in two steps:
● Replace the left side of each assignment by a fresh array, subscripted by the counters of all enclosing loops.
● Inspect all the right sides and replace each reference by its source [].
The source of a use is the latest modification that precedes the use in the sequential execution order. It can be computed by parametric integer programming. The result of this transformation is a program in dynamic single-assignment form. Each memory cell is written to just once in the course of a program execution. As a consequence, the sets W(u) ∩ W(v) are always empty: the transformed program has far fewer dependences and, occasionally, much more parallelism than the original.
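A minimal sketch of the expansion idea (the loop is hypothetical, not from the entry): a scalar temporary that is overwritten in every iteration is expanded into an array subscripted by the loop counter, which puts the loop into single-assignment form and leaves its iterations independent.

import numpy as np

n = 16
a, b = np.arange(n, dtype=float), np.ones(n)

# Original loop: every iteration writes the scalar t, so the iterations are
# serialized by the output/anti-dependences on t.
c = np.empty(n)
for i in range(n):
    t = a[i] + b[i]
    c[i] = t * t

# After scalar expansion: t becomes an array subscripted by the loop counter;
# the loop is now in dynamic single-assignment form and its iterations are
# independent of each other.
t_exp = np.empty(n)
c2 = np.empty(n)
for i in range(n):
    t_exp[i] = a[i] + b[i]
    c2[i] = t_exp[i] * t_exp[i]

assert np.allclose(c, c2)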
Array Shrinking A consequence of the previous transformation is a large increase in the memory footprint of the program. In many cases, the same degree of parallelism can be achieved with less expansion, or the target architecture cannot exploit all parallelism there is, and some of the parallel loops have to be sequentialized. Another situation, in a purely sequential context, is when a careless programmer has used more memory than strictly necessary to implement an algorithm. The aim of array shrinking is to detect these situations and to reduce the memory needs by inserting modulo operators in subscripts. Suppose, for instance, that in the following code: for i = to n− do a[i] = . . . ; od one replaces a[i] by a[i mod ]. The dimension of a, which is n in the first version, is reduced to in the second version. Of course, this means that the value stored in a[i] is destroyed after iterations of the loop. This transformation may change the outcome of the program, unless one can prove that the lifetime of a[i] does not exceed iterations. Finding an automatic solution to this problem has been the subject of much work since (Darte [] offers a good discussion). The proposed solution is to construct an interference polyhedron for the elements
of a fixed array and to cover it by a maximally tight lattice such that only the lattice origin falls inside the polyhedron. The basis vectors of the lattice are taken as coordinate axes of the reduced array, and their lengths are related to the modulus of the new subscripts.
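A small sketch of array shrinking (hypothetical code, with a window of two cells chosen for illustration): if a[i] is dead after the next iteration, the array can be shrunk with a modulo subscript without changing the result.

n = 10

# Full array: a[i] depends only on a[i-1], so only two cells are ever live.
a = [0] * n
for i in range(1, n):
    a[i] = a[i - 1] + i
full_result = a[n - 1]

# Shrunk array: replace a[i] by s[i % 2]; the memory footprint drops from n to 2.
s = [0, 0]
for i in range(1, n):
    s[i % 2] = s[(i - 1) % 2] + i
shrunk_result = s[(n - 1) % 2]

assert full_result == shrunk_result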
Communication Generation When constructing programs for distributed memory architectures, be it with data distribution directives in languages like High-Performance Fortran (HPF) or under the direction of a placement function, one has to generate communication code. It so happens that this is also a problem of polyhedron scanning. It can be solved by the same techniques and the same tools that are used for code generation. Locality Enhancement Most modern processors have caches: small but fast memories that retain a copy of recently accessed memory cells. A program has locality if memory accesses are clustered such that there is a high likelihood of finding a copy of the needed information in cache rather than in main memory. Improving the locality of a program is highly beneficial for performance since caches are usually accessed in one cycle while memory latency may range from ten to a hundred cycles. Since the cache controller returns old copies to memory in order to find room for new ones, locality is enhanced by changing the execution order such that the reuse distance between successive accesses to the same cell is minimal. This can be achieved, for instance, by moving all such accesses to the innermost loop of the program []. Another approach consists of dividing a program into chunks whose memory footprints are smaller than the cache size. Conceptually, the program is executed by filling the cache with the necessary data for one chunk, executing the chunk without any cache miss, and emptying the cache for the next chunk. One can show that the memory traffic will be minimal if each datum belongs to the footprint of only one chunk. The construction of chunks is somewhat similar to scheduling []. It is enough to have asymptotic estimates of the footprint sizes. One advantage of this method is that it can be adapted easily to the management of scratchpad memories, software-controlled caches as can be found in embedded processors.
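As a rough sketch of the reuse-distance idea (hypothetical code; NumPy's default row-major layout is assumed), interchanging the two loops below changes the distance between successive accesses to the same cache line from one element to a whole row:

import numpy as np

n = 1024
B = np.zeros((n, n))

# Row-order traversal: consecutive inner-loop iterations touch adjacent memory
# cells, so each cache line is reused while it is still resident.
for i in range(n):
    for j in range(n):
        B[i, j] += 1.0

# Column-order traversal: successive accesses are a full row apart, so a cache
# line is typically evicted long before it is reused.
for j in range(n):
    for i in range(n):
        B[i, j] += 1.0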
Dynamic Optimization Dynamic optimization resulted from the observation that modern processors and compilers are so complex that building a realistic performance estimator is nearly impossible. The only way of evaluating the quality of a transformed program is to run it and take measurements. In the polyhedron model, one can define the polyhedron of all legal schedules (see the previous section on Scheduling). Usually, one selects one schedule in this polyhedron according to some simple objective function. Another possibility is to generate one program for each legal schedule, measure its performance, and retain the best one. Experience shows that, in many cases, the best program is unexpected, the proof of its legality is not obvious, and the reasons for its efficiency are difficult to fathom. As soon as the source program has more than a few statements, the size of the polyhedron of legal schedules explodes: sophisticated techniques, including genetic algorithms and machine learning are needed to restrict the exploration to “interesting” solutions [, ].
Tools There is a variety of tools which support several phases in the polyhedral parallelization process.
Mathematical Support PIP [] is an all integer implementation of the Simplex algorithm, augmented with Gomory cuts for integer programming []. The most interesting feature of PIP is that it can solve parametric problems, i.e., find the lexicographically minimal x such that Ax ≤ By + c as a function of y. Omega [] is an extension of the Fourier–Motzkin elimination method to the case of integer variables. It has been extended into a full-fledged tool for the manipulation of Presburger formulas (logical formulas in which the atoms are affine constraints on integer variables). There are many so-called polyhedral libraries; the oldest one is the PolyLib []. The core of these libraries is a tool for converting a system of affine constraints into the vertices of the polyhedron it defines, and back. The PolyLib also includes a tool for counting the number
of integer points inside a parametric polyhedron, the result being an Ehrhart polynomial []. More recent implementations of these tools, occasionally using different algorithms, are the Parma Polyhedral Library [], the Integer Set Library [], the Barvinok Library [], and the Polka Library []. This list is probably not exhaustive.
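As a toy illustration of the Ehrhart-polynomial counting mentioned above (the polytope is an arbitrary example, not from the entry), the number of integer points in a parametric triangle is a polynomial in the parameter n:

def points_in_triangle(n):
    # integer points in { (i, j) | 0 <= j <= i <= n }
    return sum(1 for i in range(n + 1) for j in range(i + 1))

# The count equals the Ehrhart polynomial (n + 1)(n + 2)/2 in the parameter n.
for n in range(10):
    assert points_in_triangle(n) == (n + 1) * (n + 2) // 2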
Code Generation CLooG [] takes as input the description of an iteration domain, in the form of a disjoint union of polyhedra, and generates an efficient loop nest that scans all the points in the iteration domain in the order given by a set of scattering functions, which can be schedules, placements, tiling functions, and more. For a detailed description of CLooG, see the Code Generation entry in this encyclopedia.

Full-Fledged Loop Restructurers LooPo [] was the first polyhedral loop restructurer. Work on it was started at the University of Passau in , and it was developed in steps over the years and is still being extended. LooPo is meant to be a research platform for trying out and comparing different methods and techniques based on the polyhedron model. It offers a number of schedulers and allocators and generates code for shared-memory and distributed-memory architectures. All of the extensions mentioned above have been implemented and almost all are being maintained. Pluto [] was developed at Ohio State University. Its main objective is to use placement functions to improve locality and to integrate tiling into the polyhedron model. Its target architectures are multicores and graphical processing units (GPUs). GRAPHITE [] is an extension of the GCC compiler suite whose ultimate aim is to apply polyhedral optimization and parallelization techniques, where possible, to run-of-the-mill programs. GRAPHITE looks for static control parts (SCoPs) in the GCC intermediate representation, generates their polyhedral representation, applies transformations, and generates target code using CLooG. At the time of writing, the set of available transformations is still rudimentary, but it is supposed to grow.

Related Entries
Banerjee’s Dependence Test
Code Generation
Dependence Abstractions
Dependence Analysis
HPF (High Performance Fortran)
Loop Nest Parallelization
OpenMP
Scheduling Algorithms
Speculative Parallelization of Loops
Task Graph Scheduling
Tiling
Bibliographic Notes and Further Reading The development of the polytope model was driven by two nearly disjoint communities. Hardware architects wanted to take a set of recurrence equations, expressing, for instance, a signal transformation, and derive a parallel processor array from it. Compiler designers wanted to take a sequential loop nest and derive parallel loop code from it. One can view the seed of the model for architecture in the seminal paper by Karp, Miller, and Winograd on analyzing recurrence equations [] and the seed for software in the seminal paper by Lamport on Fortran DO loop parallelization []. Lamport used hyperplanes (the slices in the polyhedron that make up the parallel steps), instead of polyhedra. In the early s, Quinton drafted the components of the polyhedron model [], still in the hardware context (at that time: systolic arrays). The two communities met around the end of the s at various workshops and conferences, notably the International Conference on Supercomputing and CONPAR and PARLE, the predecessors of the Euro-Par series. Two developments made the polyhedron model ready for compilers: parametric integer programming, worked out by Feautrier [], which is used for dependence analysis, scheduling and code generation, and seminal work on code generation by Irigoin et al. [, ]. Finally, Lengauer [] gave the model its name. The s saw the further development of the theory underlying the model’s methods, particularly for scheduling, placement, and tiling. Extensions and
applications other than loop parallelization came mainly in the latter part of the s and in the following decade. A number of textbooks focus on polyhedral methods. There is the three-part series of Banerjee [–], a book on tiling by Xue [], and a comprehensive book on scheduling by Darte et al. []. Collard [] applies the model to the optimization of loop nests for sequential as well as parallel execution and studies a similar model for recursive programs. In the past several years, the polyhedron model has become more mainstream. The seed of this development was an advance in code generation methods []. With the GCC community taking an interest, it is to be expected that polyhedral methods will increasingly find their way into production compilers.
Bibliography . Ancourt C, Irigoin F () Scanning polyhedra with DO loops. In: Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), ACM, pp – . Bagnara R, Hill PM, Zaffanella E () The Parma polyhedra library: toward a complete set of numerical abstractions for the analysis and verification of hardware and software systems. Sci Comput Program (–):–, http://www.cs.unipr.it/ppl . Banerjee U () Loop transformations for restructuring compilers: the foundations. Series on Loop Transformations for Restructuring Compilers. Kluwer, Norwell . Banerjee U () Loop parallelization. Series on Loop Transformations for Restructuring Compilers. Kluwer, Norwell . Banerjee U () Dependence analysis. Series on Loop Transformations for Restructuring Compilers. Kluwer, Norwell . Bastoul C () Code generation in the polyhedral model is easier than you think. In: Proceedings of the th International Conference on Parallel Architecture and Compilation Techniques (PACT), IEEE Computer Society Press, pp –, http://www. cloog.org/ . Bastoul C, Feautrier P () Improving data locality by chunking. In: Compiler Construction (CC), Lecture Notes in Computer Science, vol . Springer, Berlin, pp – . Bernstein AJ () Analysis of programs for parallel processing. In: IEEE Transactions on Electronic Computers, EC-:– . Bondhugula U, Hartono A, Ramanujam J, Sadayappan P () A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Notices, ():–, . http://pluto-compiler. sourceforge.net/ . Clauss P () Counting solutions to linear and non-linear constraints through Ehrhart polynomials. In: Proceedings of the ACM/IEEE Conference on Supercomputing, ACM, pp –
. Clauss P, Loechner V () Parametric analysis of polyhedral iteration spaces, extended version. J VLSI Signal Process ():–, http://icps.u-strasbg.fr/polylib/ . Collard J-F () Reasoning about program transformations – imperative programming and flow of data. Springer, Berlin . Collard J-F, Griebl M () A precise fixpoint reaching definition analysis for arrays. In: Carter L, Ferrante J (eds) Languages and Compilers for Parallel Computing (LCPC), Lecture Notes in Computer Science, vol , Springer, Berlin, pp – . Collard J-F () Automatic parallelization of while-loops using speculative execution. Int J Parallel Program ():– . Darte A, Robert Y, Vivien F () Scheduling and automatic parallelization. Birkhäuser, Boston . Darte A, Schreiber R, Villard G () Lattice-based memory allocation. IEEE Transaction on Computers TC-():– . D’Hollander EH () Partitioning and labeling of loops by unimodular transformations. IEEE Trans Parallel Distrib Syst ():– . Faber P () Code optimization in the polyhedron model – improving the efficiency of parallel loop nests. PhD thesis, Department of Informatics and Mathematics, University of Passau, . http://www.fim.unipassau.de/cl/publications/docs/ Faber.pdf . Feautrier P () Parametric integer programming. Oper Res ():–, http://www.piplib.org . Feautrier P () Dataflow analysis of scalar and array references. Parallel Program ():– . Feautrier P () Automatic parallelization in the polytope model. In: Perrin G-R, Darte A (eds) The data parallel programming model. Lecture Notes in Computer Science, vol , Springer, Berlin, pp – . Griebl M () The mechanical parallelization of loop nests containing WHILE loops. PhD thesis, Department of Mathematics and Informatics, University of Passau, January . http://www. fim.unipassau.de/cl/publications/docs/Gri.pdf . Griebl M () Automatic parallelization of loop programs for distributed memory architectures. Habilitation thesis, Department of Informatics and Mathematics, University of Passau, June . http://www.fim.unipassau.de/cl/publications/docs/Gri. pdf . Griebl M, Feautrier P, Lengauer C () Index set splitting. Int J Parallel Process ():– Special Issue on the International Conference on Parallel Architectures and Compilation Techniques (PACT’) . Griebl M, Lengauer C () On the space-time mapping of WHILE loops. Parallel Process Lett ():– . Griebl M, Lengauer C () The loop parallelizer LooPo – announcement. In: Sehr D (ed) Languages and Compilers for Parallel Computing (LCPC). Lecture Notes in Computer Science, vol , pp –. Springer, Berlin, http://www.infosun.fim. uni-passau.de/cl/loopo/ . Größlinger A () The challenges of non-linear parameters and variables in automatic loop parallelisation. PhD thesis, Department of Informatics and Mathematics, University of
Polytope Model
Passau, December . http://nbnresolving.de/urn:nbn:de:bvb: -opus- Größlinger A, Griebl M, Lengauer C () Quantifier elimination in automatic loop parallelization. J Symbolic Computation ():– Größlinger A, Schuster S () On computing solutions of linear diophantine equations with one non-linear parameter. In: Proceedings of the th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), IEEE Computer Society Press, pp – Guerdoux-Jamet P, Lavenier D () SAMBA: hardware accelerator for biological sequence comparison. Comput Appl Biosci ():– Irigoin F, Triolet R () Dependence approximation and global parallel code generation for nested loops. In: Cosnard M, Robert Y, Quinton P, Raynal M (eds) Parallel and distributed algorithms. Bonas, North-Holland, pp – Jeannet B, Miné A () APRON: a library of numerical abstract domains for static analysis. In: Computed Aided Verification (CAV). Lecture Notes in Computer Science, vol , Springer, pp –, http://apron.cri.ensmp.fr/library Kahn G () The semantics of simple language for parallel programming. In: Proceedings of the IFIP Congress, Stockholm, pp – Karp RM, Miller RE, Winograd S () The organization of computations for uniform recurrence equations. J ACM ():– Kienhuis B, Rijpkema E, Ed Deprettere F () Compaan: deriving process networks from matlab for embedded signal processing architectures. In: Vahid F, Madsen J (eds) Proceedings of the Eighth International Workshop on Hardware/Software Codesign (CODES ), ACM pp – Lamport L () The parallel execution of DO loops. Comm ACM ():– Lengauer C () Loop parallelization in the polytope model. In: Best E (ed) CONCUR’. Lecture Notes in Computer Science, vol , pp –, Springer, Loos R, Weispfenning V () Applying linear quantifier elimination. The Computer J ():– Pouchet L-N, Bastoul C, Cohen A, Cavazos J () Iterative optimization in the polyhedral model: Part II, multidimensional time. SIGPLAN Notices ():– Pouchet L-N, Bastoul C, Cohen A, Vasilache N () Iterative optimization in the polyhedral model: Part I, one-dimensional time. In IEEE/ACM Fifth International Symposium on Code Generation and Optimization (CGO’), IEEE Computer Society Press, pp – Pugh W () The Omega test: a fast and practical integer programming algorithm for dependence analysis. In: Proceedings of the th International Conference on Supercomputing, ACM, pp –. . http://www.cs.umd.edu/projects/omega Quinton P () The systematic design of systolic arrays. In: Soulié FF, Robert Y, Tchuenté M (eds) Automata networks in computer science, chapter , pp –. Manchester University
. . .
.
.
.
.
.
Press, . Also: Technical Reports and , IRISA (INRIARennes) Schrijver A () Theory of linear and integer programming. Wiley, New York Smith TF, Waterman MS () Identification of common molecular subsequences. J Molecular Biology ():– Trifunovic K, Cohen A, Edelsohn D, Feng L, Grosser T, Jagasia H, Ladelsky R, Pop S, Sjödin J, Upadrasta R () GRAPHITE two years after. In: Proceedings of the nd International Workshop on GCC Research Opportunities (GROW), pp –, January . http://gcc.gnu.org/wiki/GROW- Verdoolaege S () An integer set library for program analysis. In: Advances in the theory of integer linear optimization and its extensions. AMS Western Section, , http://freshmeat. net/projects/isl/ Verdoolaege S, Nikolov H, Todor N, Stefanov P () Improved derivation of process networks. In Proceedings of the th International Workshop on Optimization for DSP and Embedded Systems (ODES), , http://www.ece.vill.edu/~deepu/odes/ odes-_digest.pdf Verdoolaege S, Seghir R, Beyls K, Loechner V, Bruynooghe M () Counting integer points in parametric polytopes using Barvinok’s rational functions. Algorithmica ():–, http:// freshmeat.net/projects/barvinok Wolf ME, Lam MS () A data locality optimizing algorithm. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), ACM, pp – Xue J () Loop tiling for parallelism. Kluwer, Boston
Polytope Model
Polyhedron Model

Position Tree
Suffix Trees
POSIX Threads (Pthreads) POSIX Threads [] are those created, managed, and synchronized following the POSIX standard API. POSIX stands for “Portable Operating System Interface [for Unix].”
Bibliography
. Nichols B, Buttlar D, Farrell JP () Pthreads programming. O’Reilly & Associates, Inc., Sebastopol

Power Wall
Pradip Bose
IBM Corp. T.J. Watson Research Center, Yorktown Heights, NY, USA

Definition
The “Power Wall” refers to the difficulty of scaling the performance of computing chips and systems at historical levels, because of fundamental constraints imposed by affordable power delivery and dissipation. The single biggest factor that has led the industry to encounter this wall in the past decade is the significant change in traditional CMOS chip design evolution, which was driven previously by Dennard scaling rules [, ].

Discussion
Introduction
Power delivery and dissipation limits have emerged as a key constraint in the design of microprocessors and associated systems, even for those targeted for the high-end server product space. At the low end of the performance spectrum, power has always dominated over performance as the primary design constraint. However, while battery life expectancies have shown modest increases, the larger demand for increased functionality and speed has increased the severity of the power constraint in the world of handheld and mobile systems. At the high end, where performance was always the primary driver, we are witnessing a trend (dictated primarily by CMOS technology scaling constraints) where, increasingly, energy and power limits are dictating the high-level processing paradigms, as well as the lower-level issues related to clocking and circuit design. Thus, regardless of the application domain, power consumption constitutes a primary barrier or “wall” when it comes to achieving cost-effective performance growth in future systems. In this entry, we will examine the fundamental causes that have led us up to this power wall, and discuss the key solution approaches at hand to circumvent this barrier.

Power Trends
Figure shows the expected maximum chip power (for high performance processors) through the year . The data plotted is based on the (Fig. a) and then updated projections (Fig. b) made by the International Technology Roadmap for Semiconductors (ITRS) []. The projection indicated that beyond the continued growth period (through year ) for high-end microprocessors, there will be a saturation in the maximum chip power (with a projected cap of around W from years all the way through ). This is due to thermal/packaging and die size limits that had already started to kick in by . Single-thread performance scaling had actually started to get limited by chip-level power density since the latter part of the s, as depicted in Fig. (adapted from a keynote speech by David Yen [], who was then the executive VP of scalable systems at Sun Microsystems). Beyond a certain power (and power density) regime, air cooling is not sufficient to dissipate the heat generated; on the other hand, use of liquid cooling and refrigeration causes a sharp increase of the cost–performance ratio. Thus, power-aware design techniques, methodologies, and tools are of the essence at all levels of design. The ITRS projection shows that the long-term power cap for high-performance, server-class microprocessors has been revised downward to W. This has to do with factors like the current delivery limits in typical blade server form factors that have dominated the midrange server products since .
Power Wall. Fig. Maximum chip power projection – high performance with heatsink ((a) 2006 roadmap; (b) 2009 roadmap)

Power Wall. Fig. Performance/power-density (SPEC/[W/cm2]) trends from to []

CMOS Technology Determinants
In this subsection, we present a summary of the technological trends that have caused chip-level power and power density to reach the levels at which the power wall has become a fundamental barrier to achieving cost-effective performance growth in future computing systems. At the elementary transistor gate (e.g., an inverter) level, total power dissipation can be formulated as the
sum of three major components: switching loss, leakage, and short-circuit loss [, , , ]:

Power_device = (1/2) · C · V_dd · V_swing · a · f + I_leakage · V_dd + I_sc · V_dd
()
where, C is the output capacitance, V dd is the supply voltage, f is the chip clock frequency, and a is the activity factor ( ⩽ a ⩽ ) which determines the device switching frequency; V swing is the maximum voltage swing across the output capacitor, which in general can be less than V dd ; I leakage is the leakage current and I sc is the short-circuit current. In the literature, V swing is often approximated to be equal to V dd (or simply V
for short), making the switching loss ∼ (1/2) · C · V² · a · f. Also, as discussed in [], for a prior generation range of V_dd (say V– V) the switching loss (1/2)·C·V²·a·f was the dominant component, assuming the activity factor to be above a reasonable minimum. So, as a first-order approximation, for chips belonging to the previous generation (e.g., CMOS nm and before), we may ignore the leakage and short-circuit components in Eq. . In other words, in the context of switching-power-dominated technology generations, we may formulate the power dissipation to be:

Power_chip = (1/2) [∑_i C_i · V_i² · a_i · f_i]
()

Power Wall. Fig. Active and major leakage power component trends []

where C_i, V_i, a_i, and f_i are unit- or block-specific average values in the most general case; the summation is taken over all blocks or units i at the microarchitecture level (e.g., icache, dcache, integer unit, floating point unit, load-store unit, register files, and buses [if not included in individual units], etc.). Also, for the voltage range considered, the operating frequency is roughly proportional to the supply voltage; and the capacitance C remains roughly the same if we keep the same design but scale the voltage. If a single voltage and clock frequency are used for the whole chip, the above reduces to:

Power_chip = V³ · (∑_i K_i^v · a_i) = f³ · (∑_i K_i^f · a_i)    ()

If we consider the very worst-case activity factor for each unit i, i.e., if a_i = 1 for all i, then an upper bound on the maximum chip power may be formulated as:

MaxPower_chip = K_V · V³ = K_F · f³    ()
where KV and KF are design-specific constants. Note that an estimation of peak or maximum power is important, for the purposes of determining the packaging and cooling solution required. The larger the maximum power, the more expensive is the net cooling solution. Note that the formulation in Eq. is overly conservative, as stated. In practice, it is possible to estimate the worst-case achievable maximum for the activity factors. This allows the designers to come up with a tighter bound on maximum power before the packaging decision is made. The last Eq. is what leads to the so-called cuberoot rule [], where redesigning a chip to operate at / the voltage (and frequency) results in the power dissipation being lowered to (/) or / of the original. This implies the single-most efficient method for reducing power dissipation for a processor that has already been designed to operate at high frequency, namely, reduce the voltage (and hence the frequency). There is a limit, however, of how low V dd can be reduced (for a given technology), which has to do with manufacturability and circuit reliability issues. Thus, a combination of microarchitecture and circuit techniques to reduce power consumption, without necessarily employing multiple or variable supply voltages is of special relevance in the design of robust systems.
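A small numerical sketch of these relations (all values are invented for illustration): with the activity factor held fixed and f proportional to V, halving the supply voltage and frequency cuts the switching power by a factor of eight, which matches the cube-root rule discussed above.

def switching_power(C, V, a, f):
    # Dynamic (switching) power: (1/2) * C * V^2 * a * f
    return 0.5 * C * V * V * a * f

C, a = 1.0e-9, 1.0          # illustrative capacitance (F) and activity factor
V1, f1 = 1.0, 3.0e9         # made-up nominal operating point (V, Hz)
V2, f2 = 0.5 * V1, 0.5 * f1 # scaled point: half the voltage and frequency

p1 = switching_power(C, V1, a, f1)
p2 = switching_power(C, V2, a, f2)
print(p2 / p1)              # 0.125 = (1/2)**3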
Figure shows the analytically projected trend [] of escalation in three of the major components of microprocessor power consumption, namely, active or capacitive switching power, subthreshold leakage power and gate leakage power. (The leakage current referred to in Eq. above consists of two major components: subthreshold leakage and gate leakage. There are other components of leakage as well, as discussed below. The short-circuit loss referred to in Eq. is not a technology-dependent component, and so is ignored in this discussion). In post- nm technologies, static (i.e., leakage or standby) power has increasingly become a major (if not the dominating) component of chip power. As discussed in [], the three major types of leakage effects are (a) sub-threshold, (b) gate, and (c) reverse-biased, drain- and source-substrate junction band-to-band tunneling (BTBT). With technology scaling, each of these leakage components tends to increase drastically. For example, as technology scales downward, the supply voltage (V dd ) must also scale down to reduce dynamic power and maintain device reliability. However, this requires the scaling down of the threshold voltage (V th ) to maintain reasonable gate overdrive (and therefore performance), which is a function of (V dd − V th ). However, lowering the threshold voltage causes substantial increases in leakage current, and therefore standby power, inspite of the lower V dd . The subthreshold channel leakage current in an MOS
device is governed by an equation that looks like []: I_leakage = K_w · W · 10^(−V_th/S)
()
where K_w, measured in units of micro-amps per micron (μA/μm), can be thought of as the width-sensitivity coefficient; W is the device width and S is the subthreshold swing (measured in millivolts, like the threshold voltage V_th). (In [], the value of K_w is quoted to be .) S is a parameter that is defined to characterize the efficiency of a device in turning on or off. It can be shown that the turn-off characteristic of a device is proportional to the thermal voltage (kT/q) and the ratio of junction capacitance (C_j) to oxide capacitance (C_ox) []. The parameter S can be formulated as: S = 2.3 · (kT/q) · (1 + C_j/C_ox)
()
This parameter is usually specified in units of millivolts per decade and it defines how many millivolts (mV) the gate voltage must drop before the drain current is reduced by one decade. The thermal voltage kT/q is equal to mV at room temperature. Thus, at room temperature, the minimum value of S is about mV per decade. This means that an ideal device at room temperature would experience a X reduction in drain current for every mV reduction of the gate voltage V gs in the subthreshold region. In the deep submicron era, a typical transistor device has an S value in the range of – mV per decade. Note also, that the threshold voltage Vth (Eq. ) is itself a function of temperature (T); in fact Vth decreases by . mV/K as temperature increases. Also, Kw itself is a strong function of temperature (∼ T ). Thus, as T increases, leakage current goes up dramatically, both because of its dependence on T and because V th goes down. The delay of an inverter gate is given by the alpha-power model [] as: Tg ∼
(L_eff · V_dd) / (μ(T) · (V_dd − V_th)^α)
()
where, α is typically around . and μ is the mobility of carriers (which is a function of temperature T, μ(T) ∼ T −. ). As V th decreases, (V dd −V th ) increases so the inverter becomes faster. As T increases, (V dd − V th ) increases, but μ(T) decreases []. This latter effect dominates; so, with higher temperatures, the logic gates in a processor generally become slower.
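A back-of-the-envelope sketch of the subthreshold relation above (the values of K_w, W, and S are placeholders, not taken from the entry):

def subthreshold_leakage(Kw, W, Vth, S):
    # I_leakage = Kw * W * 10**(-Vth / S); Vth and S expressed in volts here
    return Kw * W * 10.0 ** (-Vth / S)

Kw, W, S = 1.0e-6, 1.0, 0.090      # width coefficient, width, 90 mV/decade
for Vth in (0.40, 0.35, 0.30):     # lowering the threshold voltage ...
    print(Vth, subthreshold_leakage(Kw, W, Vth, S))
# ... raises leakage by roughly 3.6x for every 50 mV of Vth reduction here.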
Power-Performance Efficiency Metrics The most common (and perhaps obvious) metric to characterize the power-performance efficiency of a microprocessor is a simple ratio, like mips/watt. This attempts to quantify the efficiency by projecting the performance achieved or gained (measured in millions of instructions per second) for every watt of power consumed. Clearly, the higher the number, the “better” the machine is. Dimensionally, mips/watt equates to the inverse of the average energy consumed per instruction. This seems a reasonable choice for some domains where battery life is important. However, there are strong arguments against it in many cases, especially when it comes to characterizing higher end processors. Performance has typically been the key driver of such server-class designs and cost or efficiency issues have been of secondary importance. Specifically, a design team may well choose a higher frequency design point (which meets maximum power budget constraints) even if it operates at a much lower mips/watt efficiency compared to one that operates at better efficiency but at a lower performance level. As such, (mips) /watt or even (mips) /watt may be the metric of choice at the high end. On the other hand, at the lowest end, where battery-life (or energy consumption) is the primary driver, one may want to put an even greater weight on the power aspect than the simplest mips/watt metric, i.e., one may just be interested in minimizing the watts for a given workload run, irrespective of the execution time performance, provided the latter does not exceed some specified upper limit. The “mips” metric for performance and the “watts” value for power may refer to average or peak values, derived from the chip specifications. For example, for a GHz (= cycles/s) processor which can complete up to instructions per cycle, the theoretical peak performance is , mips. If the average completion rate for a given workload mix is p instructions per cycle, then the average mips would equal , times p. However, when it comes to workload-driven evaluation and characterization of processors, metrics are often controversial. Apart from the problem of deciding on a “representative” set of benchmark applications, there are fundamental questions which persist about how to boil down “performance” into a single (“average”) rating that is meaningful in comparing a set of machines. Since power consumption varies, depending on the
program being executed, the issue of benchmarking is also relevant in assigning an average power rating. In measuring power and performance together for a given program execution, one may use a fused metric like power-delay product (PDP) or energy-delay product (EDP) [, ]. In general, the PDP-based formulations are more appropriate for low-power, portable systems, where battery-life is the primary index of energy efficiency. The mips/watt metric is an inverse PDP formulation, where delay refers to average execution time per instruction. The power-delay product, being dimensionally equal to energy, is the natural metric for such systems. For higher end systems (e.g., workstations) the EDP-based formulations are deemed to be more appropriate, since the extra delay factor ensures a greater emphasis on performance. The (mips) /watt metric is an inverse EDP formulation. For the highest performance, server-class machines, it may be appropriate to weight the “delay” part even more. This would point to the use of (mips) /watt, which is an inverse ED P formulation. Alternatively, one may use (cpi) .watt as a direct ED P metric, applicable on a “per instruction” basis (see []). The energy*(delay) metric, or perf /power formula is analogous to the cube-root rule [] which follows from constant voltage scaling arguments (see Section “CMOS Technology Determinants”, Eq. ). Clearly, to formulate a voltage-invariant power-performance characterization metric, we need to think in terms of perf /(power). When we are dealing with the SPEC benchmarks, one may therefore evaluate efficiency as (SPECrating)x /watt, or (SPEC)x /watt for short; where the exponent value x (=, , or ) may depend on the class of processors being compared. Brooks et al. [] discuss the power-performance efficiency data for a range of commercial processors of approximately the same generation (circa year ). SPEC/watt, SPEC /watt, and SPEC /watt are used as the alternative metrics, where SPEC stands for the processor’s SPEC rating []. The data validates our assertion that depending on the metric of choice, and the target market (determined by workload class and/or the power/cost) the conclusion drawn about efficiency can be quite different. For performance-optimized, high-end processors, the SPEC /watt metric seems to be fairest. For “power-first” processors, SPEC /watt seems to be the fairest.
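The fused metrics can be compared with a few lines of arithmetic (the two operating points below are invented for illustration; they are not measurements of real processors):

# Two hypothetical design points: (average MIPS, average watts).
designs = {"high-frequency": (12000.0, 120.0), "efficiency-tuned": (9000.0, 60.0)}

for name, (mips, watts) in designs.items():
    pdp  = mips / watts           # inverse power-delay product (MIPS/W)
    edp  = mips ** 2 / watts      # inverse energy-delay product (MIPS^2/W)
    ed2p = mips ** 3 / watts      # inverse energy-delay-squared product (MIPS^3/W)
    print(name, round(pdp), round(edp), round(ed2p))
# The efficiency-tuned point wins on MIPS/W, while the high-frequency point
# wins on MIPS^3/W: the ranking depends on how heavily delay is weighted.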
Recently, there has been a strong motivation made for energy-proportional computing [], in the context of future data centers and cloud computing. In this style of measuring system-level energy efficiency, it is not enough to assess the efficiency at the peak utilization levels; rather, how well the system is able to adjust the power level downward, as the workload demand decreases, is also of interest. The recently announced SPEC power benchmark [] is intended, in part, to measure the degree to which a given server system is energy proportional. The currently used workload in SPECpower is the specjbb application, and the evaluation modality is to measure the power and performance across a range of system-level utilizations, from %, down through % ( data points). The benchmark calls for adding up the corresponding power and performance numbers at those data points and then computing the ratio of the summed performance value and the summed power number. The higher the number, the better is the energy proportional characteristic of the measured system.
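A sketch of the overall ratio computation described above (the load levels and the power/performance readings are fabricated for illustration):

# (performance, watts) measured at a series of target load levels, from full
# load down to idle; the exact levels and values here are made up.
measurements = [
    (100_000, 260.0), (80_000, 225.0), (60_000, 190.0),
    (40_000, 155.0), (20_000, 120.0), (0, 90.0),      # last point: active idle
]

total_perf = sum(perf for perf, _ in measurements)
total_power = sum(watts for _, watts in measurements)
overall = total_perf / total_power   # ratio of summed performance to summed power
print(round(overall, 1))             # higher = closer to energy proportional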
A Review of Key Ideas in Power-Aware Architectures In this section, a brief review of power-efficient design concepts will be covered. The motivation, of course, is to examine solution approaches for avoiding the power wall, while preserving performance growth in next generation systems. The initial attention will be on dynamic (also known as “active” or “switching”) power governed by the CV af formula. Recall that C refers to the switching capacitance, V is the supply voltage, a is the activity factor ( < a < ), and f is the operating clock frequency. Power reduction ideas must therefore focus on one or more of these basic parameters. Reducing active power generally results in reduction of on-chip temperatures, and this indirectly causes leakage power to go down as well. Similarly, any increase in efficiency directed at lowering the latch count (e.g., by reducing the basic pipeline depth, or by reducing the number of back-end execution pipes within a given functional unit) also results in area and leakage reduction as a side benefit. However, later in this section, we also deal with the problem of mitigating leakage power directly, by providing microarchitectural support to what are primarily circuit-level mechanisms.
Power Efficiency at the Processor Core Level In this sub-section, we examine the key ideas that have been proposed in terms of (micro)architectural support for power-efficiency, at the level of a single processor core. Early research ideas started being presented since at a few key conference workshops (e.g., [–]); later the field of power-aware microarchitectures matured to the point where all major computer architecture conferences now all have specific sessions devoted to ideas for performance growth in the current power-constrained design era. The effective (average) value of C can be reduced by using: (a) area-efficient designs for various macros; and (b) adaptive structures, that change in effective size, latency, or communication bandwidth depending on the needs of the input workload. The average value of V can be reduced via dynamic voltage scaling, i.e., by reducing the voltage as and when required or possible. Microarchitectural support, in this case, is not required, unless the mechanisms to detect “idle” periods or temperature overruns make use of counter-based “proxies,” specially architected for this purpose. Note again, however, that since reducing V also requires (or results in) reduction of the operating frequency, f , net power reduction has a cubic effect; thus, dynamic voltage and frequency scaling (DVFS) is one of the most effective way of power reduction). Deciding when and how to apply DVFS, as a function of the input workload characteristics and overall operating environment, is very much a microarchitectural issue. It is a problem that is increasingly relevant in the era of variabilitytolerant, power-efficient multi-core chip design, and we will touch on it briefly in section “Conclusions.” The average value of the activity factor, a, can be reduced by: (a) the use of clock-gating, where the normally free-running, synchronous clock is disabled in selected units or sub-units within the system based on information or predictions about current or future activity in those regions; (b) reducing unnecessary “speculative waste” resulting from executing instructions in mis-speculated branch paths or prefetching useless instructions and data into caches, based on wrong or ill-timed guesses; and (c) the use of data representations and instruction schedules that result in reduced switching.
Microarchitectural support is provided in the form of added mechanisms to: detect, predict, and control the generation of the applied gating signals; or aid in power-efficient data and instruction encodings. Compiler support for generating power-efficient instruction scheduling and data partitioning or special instructions for “nap/doze/sleep” control, if applicable, must also be considered under this category. Power-efficient task scheduling at the system software (i.e., OS and hypervisor) level is also an example of dynamic load balancing that can make the activity distribution uniform over a multi-core processor system, and thereby help reduce power density (and hence temperature and overall power). While clock-gating helps eliminate (or drastically reduce) active or switching power when a given macro, sub-unit or unit is idle, power-gating can be used to also eliminate the residual leakage power of that idle entity. In this case, as described in detail later on, the power supply voltage V is itself gated off from the target circuit block, with the help of a header or footer transistor. Here, the need for microarchitectural support in the form of predictive control of the gating signal is even stronger because of the relatively large performance overheads that would be incurred without such support. There are other techniques, like adaptive body-biasing that are also targeted at leakage power control; and these too require some degree of microarchitectural support. However, these techniques are most relevant to bulk-CMOS designs (as opposed to siliconon-insulator [SOI]-CMOS technology), and are predominantly device- and circuit-level methods. As such, we do not dwell on them in this entry. Lastly, the average value of the design frequency, f , can be controlled or reduced by using: (a) variable, multiple or locally asynchronous (self-timed) clocks – e.g., in GALS [] designs; (b) clock-throttling, where the frequency is reduced dynamically in response to power or temperature overrun indicators; (c) reduced pipeline depths in the baseline microarchitecture definition. Henceforth, the focus of consideration is: poweraware microarchitectural constructs that use C, a, or f as the primary lever for reducing active power; and those that use the supply voltage V as the primary lever for reducing leakage power. In any such proposed processor architecture, the efficacy of the particular
power reduction method that is used must be assessed by understanding the net performance impact. Here, depending on the application domain (or market), a PDP, EDP, or ED²P metric for evaluating and comparing power-performance efficiencies must be used. (See the earlier discussion in section “Power-Performance Efficiency Metrics.”)

Optimal Pipeline Depth
A fundamental question that is asked has to do with pipeline depth. Is a deeply pipelined, high-frequency (“speed demon”) design better than an IPC-centric, lower-frequency (“brainiac”) design? In the context of the topic of this entry, “better” must be judged in terms of power-performance efficiency. Let us consider, first, a simple, hazard-free, linear pipeline flow process with k stages. Let the time for the total logic (without latches) to compute one answer be T. Assuming that the k stages into which the logic is partitioned are of equal delay, the time per stage and thus the time per computation becomes (see [], Chap. ):

t = T/k + D    ()

where D is the delay added due to the staging latch. The inverse of t determines the clocking rate or frequency of operation. Similarly, if the energy spent (per cycle, per second, or over the duration of the program run) in the logic is W and the corresponding energy spent per level of staging latches is L, then the total energy equation for the k-stage pipelined version is roughly

E = L·k + W    ()

The energy equation assumes that the clock is free-running, i.e., on every cycle, each level of staging latches is clocked to enable the advancement of operations along the pipeline. (Later, we shall consider the effect of clock-gating.) Equations and , when plotted as a function of k, are depicted in Fig. a and b, respectively.

Power Wall. Fig. Power and performance curves for idealized pipeline flow

As the number of stages increases, the energy or power consumed increases linearly, while the performance also increases, but not as fast. In order to consider the PDP-based power-performance efficiency, we compute the ratio:

Power/Performance = (L·k + W)·(T/k + D) = L·T + W·D + L·D·k + W·T/k    ()

Figure shows the general shape of this ratio curve as a function of k. Differentiating the right-hand side expression in Eq. and setting it to zero, one can solve for the optimum value of k for which the power-performance efficiency is maximized, i.e., the minimum of the curve in Fig. can be shown to occur when

k(opt.) = √((W·T)/(L·D))    ()

Power Wall. Fig. Power–performance ratio curve for idealized pipeline flow

Larson [] first published the above analysis, albeit from a cost/performance perspective. This analysis shows that, at least for the simplest, hazard-free pipeline flow, the highest frequency operating point achievable in a given technology may not be the most energy-efficient! Rather, the optimal number of stages (and hence operating frequency) is expected to be at a point which increases for greater W or T and decreases for greater L or D. For a prior generation POWER-class (∼. μ) superscalar processor operating at around GHz [, ], the floating point arithmetic unit is estimated to yield values of T = . ns, D = . ns, W = . W, and L = . W. This yields a k(opt.) ∼ (rounded down from .), if we use the idealized formalism (Eq. ) above. For real superscalar machines, the number of latches in the overall design tends to go up much more sharply with k than the linear assumption in the above model. This tends to make k(opt) even smaller. Also, in real pipeline flow with hazards, e.g., in the presence of branch-related stalls and disruptions, performance actually peaks at a certain value of k before decreasing [, ] (instead of the asymptotically increasing behavior shown in Fig. a). This effect would also lead to decreasing the effective value of k(opt). (However, k(opt) increases if we use EDP or ED²P metrics instead of the PDP metric used here.) As the number of pipeline stages (k) is increased for a given computation data path, we say that the depth of the pipeline increases. This also implies that the levels of combinational logic within each pipeline stage decrease; so, the combinational logic delay per stage decreases as k increases. In detailed simulation-based analysis of a POWER-class superscalar machine, it has been shown [, ] that the optimal pipeline depth using an ED²P metric like (BIPS)³/W (where BIPS is the standard performance metric of billions of instructions completed per second) corresponds to around FO4 per pipe stage for SPEC workloads. (Fan-out-of-four (FO4) delay is defined as the delay of one inverter driving four copies of an equally sized inverter. The amount of logic and latch overhead per pipeline stage is often measured in terms of FO4 delay. Decreasing logic FO4 delay per pipeline stage means deeper pipelines, i.e., larger values of k, and vice versa.) For commercial workloads like TPC-C, the optimal point is shown to shift to shallower pipelines (– FO4). In contrast, note that if one considered a power-unaware, performance-only metric, like BIPS, the optimal pipeline depth for SPEC is around FO4 per stage. For TPC-C, the performance-only optimal point is reported to be [, ] pretty flat across the – FO4 points. For scientific workloads, loop tuning for performance optimization [] can alter measured power-performance efficiency metrics significantly in some cases. Compiler techniques for superscalar efficiency enhancements are omitted for brevity in this entry.

Exploiting Parallelism to Scale the Power Wall
As single-thread frequency and performance growth stalls (driven by technology trends), multi-core parallelism is the new trend. Figure  explains the fundamentals of why parallelism helps power efficiency. If a single task, executed at a given voltage-frequency point, can be split into two independent tasks, each operating at half the voltage and frequency, the net throughput performance remains the same, but the active power density scales down by a factor of 8. A particular application of this concept, even at the level of a single processor core, is SIMD acceleration, as described in the next subsection. Figure  depicts the current trend [] of growth in the number of cores over technology generations in the current regime of power-density-constrained microprocessor design.

Power Wall. Fig.  The classic argument of how parallelism can be exploited to reduce power (single logic block at Vdd: Freq = 1, Vdd = 1, Throughput = 1, Power = 1, Area = 1, Power Density = 1; two parallel logic blocks at Vdd/2: Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.125, Area = 2, Power Density = 0.125)

Power Wall. Fig.  Microprocessor core count over time [] (core count on a logarithmic scale, roughly 0.1–100, plotted over the years 2000–2010)

Vector/SIMD Processing and Hybrid Architectures
Vector/SIMD modes of parallelism present in current architectures afford a power-efficient method of extending performance for vectorizable codes. Fundamentally, this is because, for the work of fetching and processing a single (vector) instruction, a large amount of data is processed in a parallel or pipelined manner. If we consider a SIMD machine with p k-stage functional pipelines, then looking at the pipelines alone, one sees a p-fold increase in performance with a p-fold increase in power, assuming full utilization and hazard-free flow, as before. Thus, an SIMD pipeline unit offers the potential of scalable growth in performance, with commensurate growth in power, i.e., at constant power-performance efficiency. If, however, one includes the front-end instruction cache and fetch/dispatch unit that are shared across the p SIMD pipelines, then power-performance efficiency actually grows with p. This is because the power dissipation behavior of the instruction cache (memory) and the fetch/decode path remains essentially invariant with p, while net performance grows linearly with p. In terms of net energy, the front-end consumption actually decreases significantly in SIMD mode, since the number of instructions executed is much smaller than in scalar mode; and overall, since the execution time decreases, the net savings in leakage energy usually results in a significant net positive benefit for the full machine. The SIMD extension is actually an example of the general concept of using power-efficient specialized hardware (or accelerators) as part of processor design. Such accelerators can be turned off (e.g., power-gated off) when not in use, while significant pieces of the general-purpose ("scalar") core can be switched off when the program encounters a long burst of code that can be offloaded to an active accelerator. Such accelerators can be fixed-function, "table-lookup"-type logic at one extreme, all the way through to programmable sub-cores at the other end of the spectrum. The concept of hybrid or heterogeneous core architectures [] is an example of the latter extreme. In a superscalar machine with a vector/SIMD (or other accelerator) extension, the overall power-efficiency increase is limited by the fraction of code that runs in vector/SIMD (or other accelerator) mode (per Amdahl's Law).
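As a back-of-the-envelope recap of the two arguments above (a sketch only: the dynamic-power relation P ∝ C · V² · f and the accelerated code fraction f_v are standard textbook assumptions, not quantities taken from this entry):

    P_dyn ∝ C · V² · f
    P_2-way ∝ 2 · C · (V/2)² · (f/2) = (1/4) · C · V² · f    (same throughput; with twice the area, power density drops by a factor of 8)
    S_overall = 1 / ((1 − f_v) + f_v/s) ≤ 1 / (1 − f_v)    (Amdahl bound for an accelerator giving speedup s on a fraction f_v of the code)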
Clock-Gating and Microarchitectural Support
Clock-gating refers to circuit-level control (e.g., see [, ]) for disabling the clock to a given set of latches, a macro, a bus, a cache or register file access path, or an entire unit, on a particular machine cycle. In current-generation server-class microprocessors, about –% of the active (switching) power is consumed by the clock distribution network and its latch load alone. As reported in [], the major part of the clock power is dissipated close to the leaf nodes of the clock tree that drive latch banks. Since a clock-gated latch keeps its current data value stable, clock gating prevents signal transitions of invalid data from propagating down the pipeline, thereby reducing switching power in the combinational logic between latches. In addition to reducing dynamic power, clock gating can also reduce static (leakage) power. As already explained, leakage current in CMOS devices is exponentially dependent on temperature. The temperature reduction brought on by clock-gating can therefore significantly reduce the leakage power as well.
(Micro)architectural support for conventional clock-gating can be provided in at least three ways: (a) dynamic detection of idle modes in various clocked units or regions within a processor or system; (b) static or dynamic prediction of such idle modes; (c) using "data valid" bits within a pipeline flow path to selectively enable/disable the clock applied to the pipeline stage latches. If static prediction is used, the compiler inserts special "nap/doze/sleep/wake" type instructions
where appropriate, to aid the hardware in generating the necessary gating signals. Methods (a) and (b) result in coarse-grain clock-gating, where entire units, macros, or regions can be gated off to save power; method (c) results in fine-grain clock-gating, where unutilized pipe segments can be gated off during normal execution within a particular unit, like the FPU. The detailed circuit-level implementation of gated clocks, the potential performance degradation, inductive noise problems, etc., are not discussed in this entry. However, these are very important issues that must be dealt with adequately in today's power-constrained designs. Referring back to section "A Review of Key Ideas in Power-Aware Architectures" and Fig. , note that since (fine-grain) clock-gating effectively causes a fraction of the latches to be "gated off," we may model this by assuming that the effective value of L decreases when such clock-gating is applied. This has the effect of increasing k(opt), i.e., the operating frequency for the most power-efficient pipeline operation can be increased in the presence of clock-gating. This is an added benefit. In recently reported work [], the limits of clock-gating efficiency have been examined and then stretched by adding a couple of new advances: transparent pipeline clock-gating (TCG) [] and elastic pipeline clock-gating (ECG) []. TCG introduces a new way of clock-gating pipelines. In traditional clock-gating, latches are held opaque to avoid data races between adjacent latch stages; thus, N clock pulses are needed to propagate a single data item through an N-stage pipeline, even if at a given clock cycle all other (i.e., N − 1) stages have invalid input data. In a transparent clock-gated pipeline, latches are held transparent by default. TCG is based on the concept of data separation. Assume that a pair of data items A and B simultaneously move through a TCG pipeline. A data race between A and B is avoided by separating the two data items by clocking or gating a latch stage opaque, such that the opaque latch stage acts as a barrier separating the two data items from each other. The number of clock pulses required for a data item A to move through an N-stage pipeline is no longer dependent only on N, but also on the number of clock cycles that separate A from the closest upstream data item B. For an N-stage pipeline, where B follows n clock cycles behind A, only ⌊N/n⌋ clock pulses have to be generated to move A safely through the pipeline. Elastic pipeline clock-gating (ECG) is a different technique that achieves further efficiency by exploiting the inherent storage redundancy afforded by a traditional master–slave latch pair. ECG allows the designer to let stall signals propagate backward in the pipeline flow logic in a stage-by-stage fashion, without incurring the leakage power and area overhead of explicitly inserted stall buffers. Logic-level details of TCG and ECG are available in the originally published papers [–]. As reported there, TCG enables clock power reduction to the tune of % over traditional stage-level clock gating under commercial (TPC-C) class workloads. Even under heavy floating-point workloads, where fewer bubbles are available in the pipeline, the clock power in the floating-point pipeline can be reduced by %. The significant reduction in dynamic stall power (%) and leakage power (%) afforded by ECG in a floating-point unit design has also been reported in the published literature.
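As a small worked instance of the ⌊N/n⌋ rule above (the numbers are illustrative only, not taken from the cited studies): with N = 8 pipeline stages and the nearest upstream data item trailing A by n = 2 cycles, only ⌊8/2⌋ = 4 clock pulses are needed to move A through the transparent pipeline, versus 8 pulses under conventional stage-level clock-gating.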
Predictive Power Gating
As previously indicated, leakage power is a major (if not dominant) component of total power dissipation in current and future CMOS microprocessors. Cutting off the power supply (Vdd) to major circuit blocks to conserve idle power ("sleep mode") is not a new concept, especially for battery-powered mobile systems. However, dynamically effecting such gating, on a unit-by-unit basis as a function of input workload demand, is not a design technique that has seen widespread usage yet, especially in server-class microprocessors. The main reasons have been the perceived risks or negative effects arising from: (a) performance and area overheads; (b) inductive noise on the power supply grid; and (c) potential design tools and verification concerns. Advances in circuit design have minimized the area and cycle-time delay overhead concerns in recent industrial practices. Microarchitectural predictive techniques (e.g., [, , , ]) have recently advanced to the point where they now equip designers with the tools needed to minimize any architectural performance overheads as well. The inductive noise concerns do persist, but there are known solution approaches in the realm of power distribution networks and package design that will no doubt mature to help mitigate those concerns. The design tools and verification challenge, of course,
will prevail as a difficult roadblock – but again, solutions will eventually emerge to address that concern. Per-core power gating within a multi-core setting has recently gained acceptance within the chip design community (e.g., see Intel's Nehalem processor design []). The power-efficiency benefits of per-core power gating have been quantified recently for server and data center class workloads [, ].
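A small C sketch of the kind of prediction-driven gating decision such techniques rely on (the break-even threshold and the last-value predictor below are illustrative assumptions, not details of any cited design):

#include <stdio.h>
#include <stdbool.h>

/* Sketch: decide whether to power-gate an execution unit during a
 * predicted idle interval.  Gating pays off only if the unit stays idle
 * longer than the break-even time needed to recover the energy cost of
 * switching the power supply off and back on. */
#define BREAKEVEN_CYCLES 100          /* illustrative value */

typedef struct { unsigned last_idle; } predictor_t;

static unsigned predict_idle(const predictor_t *p) {
    return p->last_idle;              /* last-value prediction */
}

static bool should_gate(const predictor_t *p) {
    return predict_idle(p) > BREAKEVEN_CYCLES;
}

static void observe_idle(predictor_t *p, unsigned observed) {
    p->last_idle = observed;          /* train on the observed interval */
}

int main(void) {
    predictor_t p = { 0 };
    unsigned idle_trace[] = { 20, 250, 300, 15, 400 };
    for (int i = 0; i < 5; i++) {
        printf("gate=%d (predicted %u idle cycles)\n",
               should_gate(&p), predict_idle(&p));
        observe_idle(&p, idle_trace[i]);
    }
    return 0;
}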
Variable Bit-Width Operands
One of the techniques proposed for reducing dynamic power consists of exploiting the behavior of data in programs, which is characterized by the frequent presence of small values. Such values can be represented as, and operated upon as, short bit-vectors. Thus, by using only a small part of the processing datapath, power can be reduced without loss of performance. Brooks and Martonosi [] analyzed the potential of this approach in the context of -bit processor implementations (e.g., the Compaq Alpha architecture). Their results show that roughly % of the instructions executed had both operands whose length was less than or equal to  bits. Brooks and Martonosi proposed an implementation that exploits this by dynamically detecting the presence of narrow-width operands on a cycle-by-cycle basis.
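A C sketch of dynamic narrow-width detection (the 16-bit threshold is an illustrative assumption; the entry's specific operand widths are not reproduced here):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* A value is "narrow" if all of its upper bits are zero. */
static bool is_narrow(uint64_t v) {
    return (v >> 16) == 0;
}

/* If both operands are narrow, only a reduced-width slice of the
 * datapath needs to be exercised (the rest can be clock-gated). */
static bool narrow_datapath_ok(uint64_t a, uint64_t b) {
    return is_narrow(a) && is_narrow(b);
}

int main(void) {
    printf("%d\n", narrow_datapath_ok(12, 40000));        /* prints 1 */
    printf("%d\n", narrow_datapath_ok(12, 1u << 20));     /* prints 0 */
    return 0;
}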
Adaptive Microarchitectures
Another method of reducing power is to adjust the size of various storage resources within a processor or system with the changing needs of the workload. Albonesi [] proposed a dynamically reconfigurable caching mechanism that reduces the cache size (and hence power) when the workload is in a phase that exhibits a reduced cache footprint. Such downsizing also results in improved latency, which can be exploited (from a performance viewpoint) by increasing the cache cycling frequency on a local clocking or self-timed basis. Maro et al. [] have suggested adapting the functional unit configuration within a processor in tune with changing workload requirements. Reconfiguration is limited to "shutting down" certain functional pipes or clusters, based on utilization history or IPC performance. In that sense, the work by Maro et al. is not too different from coarse-grain clock-gating support, as discussed earlier. In early work done at IBM Watson, Buyuktosunoglu et al. [] designed an adaptive issue
queue that can result in (up to) % power reduction when the queue is sized down to its minimum size. This is achieved with a very small IPC performance hit. Another example is the idea of adaptive register files (e.g., see []), where the size and configuration of the active storage are changed via a banked design or through hierarchical partitioning techniques. A recent tutorial article by Albonesi et al. [] provides an excellent coverage of advances in the field of adaptive architectures.
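A C sketch of interval-based downsizing in the spirit of such adaptive structures (the interval length, thresholds, and size bounds are illustrative assumptions):

#include <stdio.h>

#define MIN_SIZE   8
#define MAX_SIZE  64

typedef struct {
    int  size;       /* currently enabled entries           */
    long occ_sum;    /* accumulated occupancy this interval */
    int  samples;
} adaptive_queue_t;

static void sample(adaptive_queue_t *q, int occupancy) {
    q->occ_sum += occupancy;
    q->samples++;
}

/* Called at the end of each sampling interval: shrink the enabled
 * portion (saving its power) when it is underused, grow it back when
 * it is nearly full. */
static void resize(adaptive_queue_t *q) {
    double avg = (double)q->occ_sum / q->samples;
    if (avg < 0.4 * q->size && q->size > MIN_SIZE)      q->size /= 2;
    else if (avg > 0.9 * q->size && q->size < MAX_SIZE) q->size *= 2;
    q->occ_sum = 0;
    q->samples = 0;
}

int main(void) {
    adaptive_queue_t q = { 32, 0, 0 };
    for (int i = 0; i < 1000; i++) sample(&q, 6);  /* lightly used interval */
    resize(&q);
    printf("new size = %d\n", q.size);             /* prints 16 */
    return 0;
}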
Dynamic Thermal Management
Most clock-gating techniques are geared toward the goal of reducing average chip power. As such, these methods do not guarantee that the worst-case (maximum) power consumption will not exceed safe limits. The processor's maximum power consumption dictates the choice of its packaging and cooling solution. In fact, as discussed in [], the net cooling solution cost increases in a piecewise linear manner with respect to the maximum power, and the cost gradient increases rather sharply in the higher power regimes. This necessitates the use of mechanisms to limit the maximum power to a controllable ceiling, one defined by the cost profile of the market for which the processor is targeted. Most recently, in the high-performance world, Intel's Pentium processor is reported to use an elaborate on-chip thermal management system to ensure reliable operation []. At the lower end, the G and G PowerPC microprocessors [, ] include a Thermal Assist Unit (TAU) to provide dynamic thermal management. In recently reported academic work, Brooks and Martonosi [] discuss and analyze the potential reduction in "maximum power" ratings, without significant loss of performance, through the use of specific dynamic thermal management (DTM) schemes. The use of DTM requires the inclusion of on-chip sensors to monitor actual temperature, or proxies of temperature [] estimated from on-chip counters of various events and rates.
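A C sketch of a threshold-based DTM control loop with hysteresis (all constants are illustrative; real designs act on sensor readings or counter-based proxies as described above):

#include <stdio.h>

#define TRIGGER_C  85.0   /* throttle above this temperature  */
#define RELEASE_C  78.0   /* restore below this temperature   */

static int throttled = 0;

/* Returns the fraction of cycles in which instruction fetch is enabled. */
static double dtm_step(double temp_c) {
    if (!throttled && temp_c >= TRIGGER_C) throttled = 1;
    if (throttled && temp_c <= RELEASE_C)  throttled = 0;
    return throttled ? 0.5 : 1.0;   /* fetch every other cycle when hot */
}

int main(void) {
    double trace[] = { 70, 80, 86, 84, 79, 77, 76 };
    for (int i = 0; i < 7; i++)
        printf("T=%.0fC fetch duty=%.1f\n", trace[i], dtm_step(trace[i]));
    return 0;
}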
Dynamic Throttling of Communication Bandwidths
This idea has to do with reducing the width of a communication bus dynamically, in response to reduced needs or in response to temperature overruns. Examples of on-chip buses that can be throttled are:
instruction fetch bandwidth, instruction dispatch/issue bandwidths, register renaming bandwidth, instruction completion bandwidths, memory address bandwidth, etc. In the G and G PowerPC microprocessors [, ], the TAU invokes a form of instruction cache throttling as a means to lower the temperature when a thermal emergency is encountered.
Speculation Control
In current-generation, high-performance microprocessors, branch mispredictions and mis-speculative prefetches end up wasting a lot of power. Manne et al. [, ] and Karkhanis et al. [] have described means of detecting or anticipating an impending misprediction and using that information to prevent mis-speculated instructions from entering the pipeline. These methods have been shown to reduce energy waste with minimal impact on performance.
Power-Efficient Microarchitecture Paradigms
Now that we have examined specific microarchitectural constructs that aid power-efficient design, let us briefly examine the inherent power-performance scalability and efficiency of selected paradigms that are currently established or emerging in the high-end processor roadmap. In particular, we consider: (a) wide-issue, speculative superscalar processors; (b) multicluster superscalars; (c) simultaneously multithreaded (SMT) processors; and (d) chip multiprocessors (CMP), both those that use single-program speculative multithreading and those that are general multi-core SMP or throughput engines.
Single-Core Superscalar Processor Paradigm
One school of thought anticipates a continued progression along the path of wider, aggressively superscalar paradigms. Researchers continue to innovate in an attempt to extract the last “ounce” of IPC-level performance from a single-thread instruction-level parallelism (ILP) model. Value prediction advances (pioneered by Lipasti et al. []) promise to break the limits imposed by true data dependencies. Trace caches (Smith et al. []) ease the fetch bandwidth bottleneck, which can otherwise impede scalability. However, increasing the superscalar width beyond a certain limit tends to yield diminishing gains in net performance. At
the same time, the power-performance efficiency metric (e.g., performance per watt or (performance) /watt, etc.) tends to degrade beyond a certain complexity point in the single-core superscalar design paradigm. The microarchitectural trends beyond the current superscalar regime are effectively targeted toward the goal of extending the power-performance efficiency factors. Whenever we reach a maximum in the power-performance efficiency curve, it is time to invoke the next paradigm shift. Next, we examine some of the promising new trends in microarchitecture that can serve as the next platform for designing power-performance scalable machines.
Multicluster Superscalar Processors
Zyuban et al. [, ] studied the class of multicluster superscalar processors as a means of extending the power-efficient growth of the basic superscalar paradigm. One way to address the energy growth problem at the microarchitectural level is to replace a classical superscalar CPU with a set of clusters, so that all key energy consumers are split among clusters. Then, instead of accessing centralized structures as in the traditional superscalar design, instructions scheduled to an individual cluster would access local structures most of the time. The main advantage of accessing a collection of local structures instead of a centralized one is that the number of ports and entries in each local structure is much smaller. This reduces the latency and energy per access. If the non-accessed sub-structures of a resource can be "gated off" (e.g., in terms of the clock), then the net energy savings can be substantial. According to the results obtained in Zyuban's work, the energy dissipated per cycle in every unit or subunit within a superscalar processor can be modeled to vary (approximately) as IPC_unit × (IW)^g, where IW is the issue width, IPC_unit is the average IPC performance at the level of the unit or structure under consideration, and g is the energy growth parameter for that unit. Then, the energy-delay product (EDP) for the particular unit would vary as:
EDP_unit ∝ (IPC_unit × (IW)^g) / IPC_overall    ()

Zyuban shows that for real machines, where the overall IPC always increases with issue width in a sublinear manner, the overall EDP of the processor can be bounded as:

(IPC)^(g−) ≤ EDP ≤ (IPC)^(g)    ()
where g is the energy-growth factor of a given unit and IPC refers to the overall IPC of the processor, which is assumed to vary as (IW)^0.5. Thus, according to this formulation, superscalar implementations that minimize g for each unit or structure will result in energy-efficient designs. The eliot/elpaso tool does not model the effects of multi-clustering in detail; however, from Zyuban's work, we can infer that a carefully designed multicluster architecture has the potential of extending the power-performance efficiency scaling beyond what is possible using the classical superscalar paradigm. Of course, such extended scalability is achieved at the expense of reduced IPC performance for a given superscalar machine width. This IPC degradation is caused by the added inter-cluster communication delays and other power management overhead in a real design. Some of the IPC loss (if not all) can be offset by a clock frequency boost which may be possible in such a design, due to the reduced resource latencies and bandwidths. High-performance processors (e.g., the Compaq Alpha and the IBM POWER/) certainly have elements of multi-clustering, especially in terms of duplicated register files and distributed issue queues. Zyuban proposed and modeled a specific multicluster organization in his work. This simulation-based study determined the optimal number of clusters and their configurations for the EDP metric.
Simultaneous Multithreading (SMT)
Let us examine the SMT paradigm [] to understand how this may affect our notion of power-performance efficiency. With SMT, assume that we can fetch from two threads (simultaneously, if the icache is dual-ported, or in alternate cycles if the icache remains single-ported). The back-end execution engine is shared across the two threads, although each thread has its own architected register space. This facility allows the utilization factors and the net throughput performance to go up without a commensurate increase in the maximum clocked power. This is because the issue width W is not increased, but the execution and resource stages or
slots can be filled up simultaneously from both threads. The added complexity in the front end (maintaining two program counters/fetch streams) and the increase in global register space add a bit to the power, but the throughput gain is significantly higher. SMT is clearly a very area-efficient (and hence leakage-efficient) microarchitectural paradigm. IBM's POWER processor family has progressed from -way SMT per core in POWER [] to -way SMT per core in POWER []. Seng and Tullsen [] presented analysis to show that, using a suitably architected SMT processor, the per-thread speculative waste can be reduced while increasing the utilization of the machine resources by executing simultaneously from multiple threads. This was shown to reduce the average energy per instruction by %.
Chip Multiprocessing
In a multiscalar-like speculative multithreading machine [], different iterations of a single loop program could be initiated as separate tasks or threads on different core processors on the same chip. Iterations beyond the first one are speculatively executed, assuming no loop-carried dependencies. Register values set in one task are forwarded in sequence to dependent instructions in subsequent tasks. Execution on each processor proceeds speculatively, assuming the absence of load-store address conflicts between tasks; dynamic memory address disambiguation hardware is required to detect violations and restart task executions as needed. In this paradigm, if the performance can be shown to scale well with the number of tasks, and if each processor is designed as a limited-issue, limited-speculation (low-complexity) core, it is possible to achieve better overall scalability of performance-power efficiency. A key real trend in high-end microprocessors is classical chip multiprocessing (CMP), where multiple user programs (or sub-tasks of a single program) execute separately (and non-speculatively) on different processors on the same chip. A commonly used paradigm in this case is that of (shared-memory) symmetric multiprocessing (SMP) on a chip (see Hammond et al. in []). Larger SMP server nodes can be built from such chips. Server system vendors like IBM have relied on such CMP paradigms as the scalable paradigm for the immediate future. IBM's POWER and POWER designs [, , ] were the first examples of this industry-wide trend; later examples include
Intel's Montecito [] and Sun's Niagara []. Most recent server-class multi-core/multi-threaded processors include Intel's Nehalem [] and IBM's POWER []. Such CMP designs offer the potential of convenient coarse-grain clock-gating and "power-down" modes, where one or more processors on the chip may be "turned off" or "slowed down" to save power when needed. In general, in a design era where technological constraints like power dissipation have caused a significant slowdown in the growth of single-thread execution clock frequency (and performance), parallelism via multiple cores on a die, each operating at lower than achievable single-core frequency (i.e., at larger FO4 per stage than prior-generation processors), is the natural paradigm of choice for power-efficient, scalable performance growth through the next several generations of processor design. Heterogeneity in the form of different types of cores, accelerators, and mixed-signal components [, , ], as well as variable per-core frequency support, is already prevalent. Such heterogeneity is needed to support the diverse set of workloads that will enable the next generation of computing systems in a power-constrained design era.
Conclusions
In this entry, we first defined the "power wall" and examined the root technological causes behind the onset of the current power-constrained design era. We then discussed issues related to power-performance efficiency and metrics from an architect's viewpoint. We showed that, depending on the application domain (or market), it may be necessary to adopt one metric over another in comparing processors within that segment. Next, we described some of the promising new ideas in power-aware microarchitecture design. This discussion included circuit-centric solutions like clock gating and power gating, where microarchitectural support is needed to make the right decisions at run time. We then looked at the trends in core- and chip-level microarchitecture design paradigms at a high level, given the reality of the power wall. We limited our focus to a few key ideas and paradigms of interest in future power-aware processor design. Many new ideas to address various aspects of power reduction have been presented in recent research papers. All of these could not be discussed in this entry; but the
interested reader should certainly refer to the cited references and recent conference publications for further detailed study.
Bibliography . Agrawal A, Mukhopadhyay S, Raychowdhury A, Roy K, Kim CH () Leakage power analysis and reduction in nanoscale circuits. IEEE Micro ():– . Albonesi D () Dynamic IPC/clock rate optimization. In: Proceedings of the th annual international symposium on computer architecture (ISCA), ACM/IEEE Computer Society, Barcelona, pp – . Albonesi DH, Balasubramonian R, Dropsho SG, Dwarkadas S, Friedman EG, Huang MC, Kursun V, Magklis G, Scott ML, Semeraro G, Bose P, Buyuktosunoglu A, Cook PW, Schuster SE () Dynamically tuning processor resources with adaptive processing. IEEE Comput ():–, Special Issue on Power-Aware Computing . Barroso L, Holzle U () The case for energy proportional computing. IEEE Comput ():– . Bohr M () A year retrospective on dennard’s MOSFET scaling paper. P Solid St Circ Soc ():– . Borkar S () Design challenges of technology scaling. P IEEE Micro ():– . Bose P, Kim S, O’Connell FP () Ciarfella WA Bounds modeling and compiler optimizations for superscalar performance tuning. J Syst Architect :–; Elsevier . Brews JR () Subthreshold behavior of uniformly and nonuniformly doped long-channel MOSFET. IEEE T Electron Dev :– . Brooks D, Bose P, Srinivasan V, Gschwind MK, Emma PG, Rosenfield MG () New methodology for earlystage, microarchitecture-level power-performance analysis of microprocessors. IBM J Res Dev (/):– . Brooks D, Martonosi M () Dynamically exploiting narrow width operands to improve processor power and performance. In: Proceedings of the th international symposium on highperformance computer architecture (HPCA-), IEEE Computer Society, Orlando . Brooks D, Martonosi M () Dynamic thermal management for high-performance microprocessors. In: Proceedings of the th internatinal symposium On high performance computer architecture, IEEE Computer Society, Nuevo Leone, pp – . Brooks DM, Bose P, Schuster SE, Jacobson H, Kudva PN, Buyuktosunoglu A, Wellman J-D, Zyuban V, Gupta M, Cook PW () Power-aware microarchitecture: design and modeling challenges for next-generation microprocessors. P IEEE Micro ():– . Buyuktosunoglu A et al () An adaptive issue queue for reduced power at high performance. In: Proceedings ISCA Workshop on complexity-effective design (WCED), Vancouver . Cruz J-L, Gonzalez A, Valero M, Topham NP () Multiplebanked register file architectures. In: Proceedings of the international symposium on computer architecture (ISCA), Vancouver, pp –
. Dennard R et al () Design of ion-implanted MOSFETs with very small physical dimensions. IEEE J Solid St Circ SC():– . Diefendorff K () POWER focuses on memory bandwidth. Microprocessor Report ():– . Dubey PK, Flynn MJ () Optimal pipelining. J Parallel Distr Com ():– . Flautner K, Kim NS, Martin S, Blaauw D, Mudge T () Drowsy Caches: simple techniques for reducing leakage power. In: Proceedings of the international symposium on computer architecture (ISCA), IEEE Computer Society, Anchorage . Flynn MJ, Hung P, Rudd K () Deep-submicron microprocessor design issues. P IEEE Micro ():– . Gara A et al () Overview of the blue gene/L system architecture. IBM J Res Dev (/):– . Gonzalez R, Horowitz M () Energy dissipation in general purpose microprocessors. IEEE J Solid-St Circ ():– . Grunwald D, Klauser A, Manne S, Pleszkun, A () Confidence estimation for speculation control. In: Proceedings th annual international symposium on computer architecture (ISCA), ACM/IEEE Computer Society, Barcelona, pp – . Gunther SH, Binns F, Carmean DM, Hall JC () Managing the impact of increasing microprocessor power consumption. In: Proceedings of the Intel Technology Journal . Hartstein A, Puzak TR () The optimum pipeline depth for a microprocessor. Proceedings of the th international symposium on computer architecture (ISCA-), IEEE Computer Society, Anchorage . http://www.specbench.org . Hu C () Device and technology impact on low power electronics. In: Rabaey J (ed) Low power design methodologies. Kluwer, Boston, pp – . Hu Z, Buyuktosunoglu A, Srinivasan V, Zyuban V, Jacobson H, Bose P () Microarchitectural techniques for power gating of execution units. In: Proceedings of the international symposium on low power electronics and design (ISLPED), ACM, Newport Beach . ISSCC (International Solid State Circuits Conference) () Trends Report, http://isscc.org/doc//ISSCC_ TechTrends.pdf . Iyer A, Marculescu D () Power-performance evaluation of globally asynchronous, locally synchronous processors. In: Proceedings of the international symposium on computer architecture (ISCA), IEEE Computer Society, Anchorage . Jacobson H, Bose P, Hu Z, Eickemeyer R, Eisen L, Griswell J () Stretching the limits of clock-gating efficiency in serverclass processors. In: Proceedngs of the international symposium on high performance computer architecture (HPCA), IEEE Computer Society, San Francisco . Jacobson HM () Improved clock-gating through transparent pipelining. In: Proceedings of the international symposium on low Power electronics and design (ISLPED), ACM, California . Jacobson HM et al () Synchronous interlocked pipelines. In: Proceedings of the international symposium on advanced research in asynchronous circuits and systems, IEEE Computer Society, Manchester
. Kahle JA et al () Introduction to the Cell multiprocessor. IBM J Res Dev (/):– . Kalla R, Sinharoy B, Starke WJ, Floyd MJ () Power: IBM’s next generation server processor. IEEE Micro ():– . Kalla R, Sinharoy B, Tendler J () IBM POWER chip: a dualcore multithreaded processor, IEEE Micro ():– . Kanda K et al () Design impact of positive temperature dependence on drain current in sub--V CMOS VLSIs. IEEE JSSC, ():– . Karkhanis T, Bose P, Smith J () Saving energy with just in time instruction delivery. In: Proceedings of the intenational symposium on low power electronics and design (ISLPED), ACM, Monterey . Kaxiras S, Hu Z, Martonosi M () Cache Decay: exploiting generational behavior to reduce cache leakage power. In: Proceedings of the international symposium on computer architecture (ISCA), Goteborg . Kogge PM () The architecture of pipelined computers. Hemisphere Publishing Corporation, New York . Kongetira P () A -way multithreaded SPARC processor. Presented at Hot Chips . Kumar R, Tullsen D, Jouppi N, Ranganathan P () Heterogeneous chip multiprocessors. IEEE Comput ():– . Larson AG () Cost-effective processor design with an application to fast fourier transform computers. Digital systems laboratory report SU-SEL--, Stanford University, Stanford; see also, Larson and Davidson () Cost-effective design of special purpse processors: a fast fourier transform case study. In: Proceedings th annual allerton conference on circuits and system theory. University of Illinois, Champaihn-Urbana, pp – . Leverich J, Monchiero M, Talwar V, Ranganathan P, Kozyrakis C () Power management of datacenter workloads using per-core power gating. IEEE Comput Archit Lett (): – . Lungu A, Bose P, Buyuktosunoglu A, Sorin D () Dynamic power gating with quality guarantees. In: Proceedings of the international symposium on low power electronics and design (ISLPED), ACM, New York . Madan NS, Buyuktosunoglu A, Bose P, Annavaram M () Guarded power gating in a multi-core setting, presented at ISCA workshop on energy-efficient design (WEED), June ; to appear as a Lecture notes on computer science (LNCS) volume in ; see also full paper in Proceedings of the th international Symposium on high performance computer architecture (HPCA) . Manne S, Klauser A, Grunwald D () Pipeline gating: speculation control for energy reduction. In: Proceedings of the th annual international symposium on computer architecture (ISCA), ACM/IEEE Computer Society, Barcelona, pp – . Maro R, Bai Y, Bahar RI () Dynamically reconfiguring processor resources to reduce power consumption in highperformance processors. In: Proceedings of power aware computer systems (PACS) Workshop, held in conjunction with ASPLOS, Cambridge . McNairy C, Bhatia R () Montecito: a dual-core, dual-thread Itanium Processor. IEEE Micro ():– (see also, Hot Chips )
. Oklobdzija VG () Architectural tradeoffs for low power. In: Proceedings of the ISCA Workshop on Power-Driven Microarchitectures, Barcelona . Reed P et al () MHz W RISC microprocessor with onchip L cache controller. Dig Tech Pap IEEE Int Solid St Circ Conf : . Sakurai T, Newton R () Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas. IEEE JSSC ():– . Sanchez H et al () Thermal management system for high performance PowerPC microprocessors. In: Digest of papers, IEEE COMPCON, p . Seng JS, Tullsen DM, Cai G () The power efficiency of multithreaded architectures. In: Invited talk presented at: ISCA Workshop on Complexity-Effective Design (WCED), Vancouver . Singhal R () Inside intel core microarchitecture (Nehalem). Presented at Hot Chips-, Stanford . Sohi G, Breach SE, Vijaykumar TN () Multiscalar Processors. In: Proceedings of the nd annual international symposium on computer architecture, IEEE CS Press, Los Alamitos, pp – . Srinivasan V, Brooks D, Gschwind M, Bose P, Zyuban V, Strenski PN, Emma PG () Optimizing pipelines for power and performance. In: Proceedings of the th annual IEEE/ACM symposium on microarchitecture (MICRO-), ACM/IEEE, Istanbul . Talks presented at the ISCA workshops on complexity effective design (WCED- through WCED-), http://www.csl. cornell.edu/~albonesi/wced.html . Talks presented at the kool chips workshops () http://www. cs.colorado.edu/~grunwald/LowPowerWorkshop . Talks presented at the power aware computer systems (PACS) Workshops; e.g. the offering: http://www.ece.cmu.edu/ ~pacs/ . Tendler JM, Dodson JS, Fields JS Jr, Le H, Sinharoy B () POWER system microarchitecture. IBM J Res Dev :():– . The International Technology Roadmap for Semiconductors. http://www.itrs.net/reports.html . Theme issue () The future of processors. IEEE Comput ():– . Tiwari V et al () Reducing power in high-performance microprocessors. In: Proceedings of the IEEE/ACM Design Automation Conference. ACM, New York, pp – . Tullsen DM, Eggers SJ, Levy HM () Simultaneous Multithreading: Maximizing On-Chip parallelism. In: Proceedings of the nd annual internatonal symposium on computer architecture, Santa Margherita Ligure, pp – . Yen D () Chip multithreading processors enable reliable high throughput computing. Keynote speech at international symposium on reliability physics (IRPS) . Zyuban V () Inherently lower-power high performance super scalar architectures. PhD thesis, Department of Computer Science and Engineering, University of Notre Dame . Papers in the special issue of IBM Journal of Research and Development, March/May . Zyuban V, Brooks D, Srinivasan V, Gschwind M, Bose P, Strenski P, Emma P () Integrated analysis of power and
performance for pipelined microprocessors. IEEE T Comput ():– . Zyuban V, Kogge P () Optimization of high-performance superscalar architectures for energy efficiency. In: Proceedings of the IEEE symposium on low power electronics and design, ACM, New York
PRAM (Parallel Random Access Machines)
Joseph F. JaJa
University of Maryland, College Park, MD, USA
Definition
The Parallel Random Access Machine (PRAM) is an abstract model for parallel computation which assumes that all the processors operate synchronously under a single clock and are able to randomly access a large shared memory. In particular, a processor can execute an arithmetic, logic, or memory access operation within a single clock cycle.
Discussion
Introduction
Parallel Random Access Machines (PRAMs) were introduced in the late s as a natural generalization to parallel computation of the Random Access Machine (RAM) model. The RAM model is widely used as the basis for designing and analyzing sequential algorithms. The PRAM model assumes the presence of a number of processors, each identified by a unique id, which have access to a single unbounded shared memory. The processors operate synchronously under a single clock such that each processor can execute an arithmetic or logic operation or a memory access operation within a single clock cycle. In general, each processor under the PRAM model can execute its own program, which can be stored in some type of a private program memory. However, almost all the known PRAM algorithms are of the SPMD (Single Program Multiple Data) type in which a single program is executed by all the processors such that an instruction can refer to the id of the processor, which is responsible for executing the corresponding instruction. A processor may or may not be active during any given clock cycle. In fact, most of the
PRAM algorithms are of the SIMD (Single Instruction Multiple Data) type in which, during each cycle, a processor is either idle or executing the same operation as the remaining active processors. As a simple example of a PRAM algorithm, consider the computation of the sum S of the elements of an array A[1 : n] using an n-processor PRAM, where the processors are indexed by pid = 1, 2, . . . , n. A PRAM algorithm can be organized as a balanced binary tree with n leaves such that each internal node represents the sum computation applied to the values available at the children. The PRAM algorithm proceeds level by level, executing all the computations at each level in a single parallel step. Hence the processors complete the process in ⌈log n⌉ parallel steps. The algorithm is illustrated in Fig. . The PRAM algorithm written in the SPMD style is given next, where for simplicity n = 2^k for some positive integer k. The array B[1 : n] is used to store the intermediate results.
Algorithm 
PRAM SUM Algorithm
Input: An array A of size n = 2^k stored in shared memory.
Output: The sum of the elements of A.
begin
  B[pid] = A[pid]
  d = n
  for h = 1 to k do
    d = d/2
    if pid ≤ d then
      B[pid] = B[2 · pid − 1] + B[2 · pid]
    end if
  end for
  return(B[1])
end
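The following C sketch (not part of the original entry) mirrors the tree-based computation on a conventional shared-memory machine: the OpenMP loops and their implicit barriers stand in for the PRAM's synchronous steps, and the tree is stored explicitly so that within a level no read overlaps a write.

#include <stdio.h>
#include <stdlib.h>

/* Tree-based parallel sum in the spirit of the PRAM SUM algorithm.
 * T has size 2n: leaves live in T[n .. 2n-1], internal node i is
 * T[i] = T[2i] + T[2i+1].  n is assumed to be a power of two. */
double parallel_sum(const double *A, int n) {
    double *T = malloc(2 * (size_t)n * sizeof *T);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        T[n + i] = A[i];                       /* load the leaves */

    for (int d = n / 2; d >= 1; d /= 2) {      /* one tree level per step */
        #pragma omp parallel for
        for (int i = d; i < 2 * d; i++)        /* writes T[d..2d-1], reads T[2d..4d-1] */
            T[i] = T[2 * i] + T[2 * i + 1];
        /* the implicit barrier here plays the role of the PRAM's lockstep clock */
    }
    double sum = T[1];
    free(T);
    return sum;
}

int main(void) {
    double A[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%g\n", parallel_sum(A, 8));        /* prints 36 */
    return 0;
}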
Given that the processors can simultaneously access the shared memory within the same clock cycle, several variants of the PRAM model exist depending on the assumptions made for simultaneous access to the same memory location.
● Exclusive Read Exclusive Write (EREW) PRAM. No simultaneous accesses to the same location are allowed under the EREW PRAM model.
PRAM (Parallel Random Access Machines). Fig.  A balanced binary tree for the sum computation (leaves A[1]–A[8]; each internal node is a + operation). The PRAM algorithm proceeds from the leaves to the root while executing the operations at each level within a single clock cycle
● Concurrent Read Exclusive Write (CREW) PRAM. Simultaneous read accesses to the same location are allowed, but no concurrent write operations to the same location are permitted.
● Concurrent Read Concurrent Write (CRCW) PRAM. Simultaneous read or write operations to the same memory location are allowed. This model has several subversions depending on how the concurrent write operations to the same location are resolved. The Common CRCW PRAM model assumes that the processors are writing the same value, while the Arbitrary CRCW PRAM model assumes that one of the processors attempting a concurrent write into the same location succeeds. The Priority CRCW assumes that the indices of the processors are linearly ordered, and allows the processor with the highest priority to succeed in case of a concurrent write.

The different assumptions on the concurrent accesses to a single location lead to parallel algorithms that can differ significantly in performance. Consider, for example, the problem of determining whether all the entries of a binary array M[1 : n] are equal to 1. The output A of this computation is in fact the logical AND of all the entries in M. An n-processor Common CRCW PRAM algorithm can perform this computation as follows. Initially, A is set equal to 1. For all 1 ≤ pid ≤ n, processor pid tests whether the entry M[pid] = 0, and in the affirmative, executes the operation A = 0. Clearly this algorithm correctly computes the AND of the n elements
of M in constant time on the Common CRCW PRAM. However, it can be shown, under a very general model, that a CREW PRAM will require Ω(log n) parallel steps regardless of the number of processors available. On the other hand, the computational gap between the various versions of the PRAM model is limited in the sense that the strongest p-processor PRAM model, namely the Priority CRCW from our list above, can be simulated by a p-processor EREW (the weakest PRAM model) with a slowdown of a factor of O(log p). From a theoretical perspective, the design of a PRAM algorithm amounts to the development of a strategy that achieves the highest level of parallelism, that is, the smallest number of parallel steps, using a "reasonable" number of processors. Note that the number of processors can depend on the input size. In particular, the class of "well-parallelizable" problems under the PRAM model can be defined as the class of problems that can be solved in a polylogarithmic number of parallel steps (i.e., O(log^k n) for some fixed constant k), using a polynomial number of processors. Such a class is referred to as the NC complexity class, which has been related to the circuits model used in traditional computational complexity. See the bibliographic notes for references to related work.
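A shared-memory C sketch of the Common CRCW AND computation described above (not from the original entry; OpenMP's atomic write stands in for the model's concurrent same-value writes):

#include <stdio.h>

/* Every "processor" that sees a 0 writes the same value (0) to the
 * shared result, mirroring the Common CRCW rule that concurrent
 * writers store identical values. */
int crcw_and(const int *M, int n) {
    int A = 1;
    #pragma omp parallel for
    for (int pid = 0; pid < n; pid++) {
        if (M[pid] == 0) {
            #pragma omp atomic write
            A = 0;
        }
    }
    return A;
}

int main(void) {
    int M[6] = {1, 1, 1, 0, 1, 1};
    printf("%d\n", crcw_and(M, 6));   /* prints 0 */
    return 0;
}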
Complexity Measures and Work–Time Framework
A PRAM algorithm can be evaluated by several complexity measures, where a complexity measure is a worst-case asymptotic estimate of the performance as a function of the input length n. Given a p-processor PRAM algorithm to solve a problem with input size n, let TP(p, n) be the number of parallel steps used by the algorithm. The optimal (or often the best known) sequential complexity is denoted by TS(n). The speedup is defined to be TS(n)/TP(p, n), which represents the speedup of the parallel algorithm relative to the best sequential algorithm. Clearly the best possible speedup is Θ(p). On the other hand, the relative speedup is defined to be TP(1, n)/TP(p, n),
which refers to the speedup achieved by the algorithm with p processors relative to the same algorithm running with one processor. Another related measure is the relative efficiency of a PRAM algorithm, defined as the ratio TP(1, n)/(p · TP(p, n)). The two PRAM algorithms presented earlier assume the presence of n processors, where n is the input size. To derive the complexity measures defined above, these algorithms can be mapped onto p-processor (p < n) PRAM algorithms by distributing the concurrent operations carried out at each parallel step as evenly among the p processors as possible. In particular, the SUM algorithm can be executed in ⌈n/(2p)⌉ + ⌈n/(4p)⌉ + ⋯ + 1 = O(n/p + log n) parallel time on a p-processor PRAM. Hence the speedup is Θ(p) whenever p = O(n/log n). However, writing the corresponding algorithm for each processor (using processor ids) is in general a tedious process that distracts from the simplicity of the PRAM model. An alternative is to describe a PRAM algorithm in the so-called Work–Time (or simply WT) framework, which is also related to the data-parallel paradigm. The algorithm does not assume any specific number of processors, nor does it refer to any processor index, but rather it is expressed as a sequence of steps, where each step is either a typical sequential operation or a set of concurrent operations that can be executed in parallel. A set of parallel operations is specified by the pseudo-instruction pardo or forall. For example, the previous PRAM SUM algorithm can be expressed in this framework as shown in Algorithm . Under the WT framework, the complexity of a PRAM algorithm can be measured by two parameters – work W(n) and time TP(n). The work W(n) is the total number of operations required by the algorithm as a function of n, and the time TP(n) is defined as the number of parallel steps needed by the algorithm assuming an unlimited number of processors. The corresponding functions for the SUM algorithm are W(n) = O(n) and TP(n) = O(log n), respectively.
Algorithm 
Parallel SUM Algorithm in the WT Framework
Input: An array A of size n = 2^k residing in memory.
Output: The sum of the elements of A.
begin
  d = n
  for all 1 ≤ i ≤ n pardo
    B[i] = A[i]
  end for
  for h = 1 to k do
    d = d/2
    for all 1 ≤ i ≤ d pardo
      B[i] = B[2i − 1] + B[2i]
    end for
  end for
  return(B[1])
end
In general, a WT-parallel algorithm with complexity W(n) and TP(n) can be simulated on a p-processor PRAM using Brent's scheduling principle (see bibliographic notes) in time
TP(p, n) = O(W(n)/p + TP(n)).
The algorithm is called work optimal if W(n) = Θ(TS(n)). In this case, the corresponding p-processor PRAM algorithm achieves optimal speedup as long as p = O(TS(n)/TP(n)). Under the WT framework, a typical goal is to develop an algorithm that achieves the fastest parallel time among the work-optimal algorithms. That is, the main goal is to develop a work-optimal algorithm that is as fast as possible.
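As a concrete, purely illustrative instance of this bound applied to the SUM algorithm above: W(n) = O(n) and TP(n) = O(log n) give TP(p, n) = O(n/p + log n); for n = 2^20 elements and p = 2^10 processors this is on the order of 2^10 + 20 ≈ 10^3 parallel steps, compared with roughly 10^6 operations sequentially, i.e., a speedup of Θ(p).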
Basic PRAM Techniques
Basic PRAM techniques are illustrated through the description of PRAM algorithms for the following computations: matrix multiplication, prefix sums or scan, list ranking, fractional independent set, and computing the maximum. The WT framework is used to express these algorithms.
Matrix Multiplication
Given two matrices A and B of dimensions m × n and n × q, respectively, the product C = AB is a matrix of
dimension m × q such that
C[i, j] = ∑_{k=1}^{n} A[i, k] B[k, j],  1 ≤ i ≤ m, 1 ≤ j ≤ q.
Computing the matrix C on the PRAM model is quite simple since both matrices A and B are initially stored in shared memory, and their entries can each be accessed randomly within a single clock cycle. The algorithm in the WT framework amounts to computing in parallel all the products A[i, k]B[k, j], for 1 ≤ i ≤ m, 1 ≤ k ≤ n, and 1 ≤ j ≤ q. The PRAM SUM algorithm can then be used to compute all the entries C[i, j] in parallel, 1 ≤ i ≤ m and 1 ≤ j ≤ q. Thus, the parallel complexity of the resulting algorithm is TP = O(log n) and the total work is W = O(nmq). This algorithm is based on the standard sequential algorithm. Should the initial sequential algorithm be one of the faster sequential matrix multiplication algorithms, the corresponding PRAM algorithm will then run in logarithmic parallel time using the same number of operations as the initial algorithm.
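A C sketch of this computation on a conventional shared-memory machine (an illustration, not part of the entry): the loop over output entries is parallel, while each inner sum is accumulated sequentially, which is the usual practical compromise rather than the O(log n)-depth PRAM SUM step.

#include <stdio.h>

/* C = A * B, with A of size m x n and B of size n x q (row-major). */
void matmul(const double *A, const double *B, double *C,
            int m, int n, int q) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < m; i++)
        for (int j = 0; j < q; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i * n + k] * B[k * q + j];
            C[i * q + j] = s;
        }
}

int main(void) {
    double A[2 * 3] = {1, 2, 3, 4, 5, 6};
    double B[3 * 2] = {7, 8, 9, 10, 11, 12};
    double C[2 * 2];
    matmul(A, B, C, 2, 3, 2);
    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  /* 58 64 / 139 154 */
    return 0;
}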
Prefix Sums or Scan
Given a set of elements stored in an array A[1 : n] and a binary associative operation ⊗, the prefix sums of A consist of the n partial sums defined by PS[i] = A[1] ⊗ A[2] ⊗ ⋯ ⊗ A[i], 1 ≤ i ≤ n. The straightforward sequential algorithm computes PS[i] = PS[i − 1] ⊗ A[i] for 2 ≤ i ≤ n, starting with PS[1] = A[1], and hence the sequential complexity is TS(n) = Θ(n). A simple fast PRAM algorithm can be designed using a balanced binary tree built on top of the n elements of A. The algorithm consists of a forward sweep through the tree in a way similar to that carried out by the SUM algorithm. A backward sweep will then compute the prefix sums of the set of elements available at each level. A recursive version is described in Algorithm . The algorithm is illustrated in Fig. , where the left part illustrates the forward sweep and the right part illustrates the backward sweep. Since the algorithm involves a forward and a backward sweep through a balanced binary tree of height O(log n), the parallel complexity of the algorithm satisfies TP(n) = O(log n). Also, since the number of internal nodes of a binary tree on n leaves is n − 1, the total work is clearly W(n) = O(n), and hence this algorithm is work optimal. It follows that a p-processor PRAM can compute the prefix sums of n elements in time TP(p, n) = O(n/p + log n).

PRAM (Parallel Random Access Machines). Fig.  Computation of prefix sums through a forward sweep (panel a) and a backward sweep (panel b) of the balanced binary tree algorithm. The notation Σi,j represents the sum of elements from A[i] through A[j], and at each level an entry with two arrows pointing to it represents a sum operation

Algorithm 
Prefix Sums Algorithm
Input: An array A of size n = 2^k stored in shared memory.
Output: The prefix sums of A stored in the array PS.
begin
  if n = 1 then
    return(PS[1] = A[1])
  end if
  for all 1 ≤ i ≤ n/2 pardo
    Y[i] = A[2i − 1] ⊗ A[2i]
  end for
  Recursively compute the prefix sums of Y[1 : n/2] in place
  for all 1 ≤ i ≤ n pardo
    i even: set PS[i] = Y[i/2]
    i = 1:  set PS[1] = A[1]
    else:   set PS[i] = Y[(i − 1)/2] ⊗ A[i]
  end for
end
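A C sketch of the recursive pairwise scheme above, specialized to + as the associative operation (illustrative only; the temporary arrays are stack-allocated, so it is intended for modest n):

#include <stdio.h>

/* Recursive pairwise prefix sums: combine adjacent pairs, recurse on
 * the half-sized array, then expand back.  n must be a power of two;
 * arrays are 1-indexed and PS must not alias A. */
void prefix_sums(const double *A, double *PS, int n) {
    if (n == 1) { PS[1] = A[1]; return; }

    double Y[n / 2 + 1], YS[n / 2 + 1];
    #pragma omp parallel for
    for (int i = 1; i <= n / 2; i++)
        Y[i] = A[2 * i - 1] + A[2 * i];     /* combine adjacent pairs */

    prefix_sums(Y, YS, n / 2);              /* recurse on half the input */

    #pragma omp parallel for
    for (int i = 1; i <= n; i++) {
        if (i == 1)          PS[i] = A[1];
        else if (i % 2 == 0) PS[i] = YS[i / 2];
        else                 PS[i] = YS[(i - 1) / 2] + A[i];
    }
}

int main(void) {
    double A[9] = {0, 1, 2, 3, 4, 5, 6, 7, 8}, PS[9];
    prefix_sums(A, PS, 8);
    for (int i = 1; i <= 8; i++) printf("%g ", PS[i]);  /* 1 3 6 10 15 21 28 36 */
    printf("\n");
    return 0;
}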
List Ranking
Consider a linked list L of n nodes whose successor relationship is represented by an array S[1 : n] such that S[i] is equal to the index of the successor node of i. For the last node k, its successor is denoted by S[k] = 0. The list ranking problem is to determine for each node i the distance R[i] of the node i to the end of the list. While an optimal sequential algorithm is relatively straightforward – start by inverting the list and proceed to compute the ranks incrementally following the predecessor links – the problem may seem at first sight to be quite difficult to parallelize. It turns out that a fast PRAM algorithm can be developed through the introduction of the pointer jumping technique, as illustrated in Algorithm . Note that T is used for the intermediate manipulation of the pointers, while the initial pointers stored in S remain intact. The list ranking algorithm is illustrated in Fig. , where the pointer jumping technique is applied on each node i satisfying T[i] ≠ 0 and T[T[i]] ≠ 0. Since the execution of the last parallel loop involves replacing the successor by its successor, the distance between a node and its successor doubles after each iteration, and hence the T pointer of each node will point to the last node after ⌈log n⌉ iterations. Therefore, the algorithm terminates correctly after O(log n) iterations, and each iteration involves O(n) operations. The algorithm has parallel complexity O(log n) using O(n log n) total operations. The work can be made optimal, but the corresponding PRAM algorithm is more complex.
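A C sketch of the pointer-jumping computation (illustrative, not from the entry; the pseudocode appears below). Because a real shared-memory machine does not provide the PRAM's synchronous read-then-write semantics, the sketch double-buffers R and T so that each round reads only the previous round's values.

#include <stdio.h>
#include <string.h>

#define N 8   /* number of list nodes, 1-indexed; illustrative size */

/* Pointer-jumping list ranking.  S[i] is the successor of node i
 * (0 for the last node).  On return, R[i] is the distance of i from
 * the end of the list. */
void list_rank(const int S[N + 1], int R[N + 1]) {
    int T[N + 1], Tnew[N + 1], Rnew[N + 1];
    for (int i = 1; i <= N; i++) {
        R[i] = (S[i] != 0) ? 1 : 0;
        T[i] = S[i];
    }
    for (int step = 1; step < N; step *= 2) {   /* ceil(log2 N) rounds */
        #pragma omp parallel for
        for (int i = 1; i <= N; i++) {
            if (T[i] != 0) {
                Rnew[i] = R[i] + R[T[i]];       /* jump over the successor */
                Tnew[i] = T[T[i]];
            } else {
                Rnew[i] = R[i];
                Tnew[i] = T[i];
            }
        }
        memcpy(R, Rnew, sizeof Rnew);
        memcpy(T, Tnew, sizeof Tnew);
    }
}

int main(void) {
    int S[N + 1] = {0, 2, 3, 4, 5, 6, 7, 8, 0};  /* list 1 -> 2 -> ... -> 8 */
    int R[N + 1];
    list_rank(S, R);
    for (int i = 1; i <= N; i++) printf("%d ", R[i]);  /* 7 6 5 4 3 2 1 0 */
    printf("\n");
    return 0;
}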
Algorithm 
List Ranking Algorithm
Input: An array S of size n representing the successor relationship of a linked list.
Output: The distance R[i] of node i to the end of the list, 1 ≤ i ≤ n.
begin
  for all 1 ≤ i ≤ n pardo
    if S[i] ≠ 0 then R[i] = 1 else R[i] = 0
  end for
  for all 1 ≤ i ≤ n pardo
    T[i] = S[i]
  end for
  repeat ⌈log n⌉ times do {
    for all 1 ≤ i ≤ n pardo
      if T[i] ≠ 0 and T[T[i]] ≠ 0 then {
        R[i] = R[i] + R[T[i]]; T[i] = T[T[i]]
      }
    end for
  }
end

PRAM (Parallel Random Access Machines). Fig.  Application of the list ranking algorithm on a list with eight elements

Fractional Independent Set
The purpose of introducing the next problem is to illustrate the use of randomization in the design of PRAM
algorithms to break symmetry. Given a directed cycle C = ⟨v1, v2, . . . , vn⟩, a fractional independent set is a subset U of the vertices such that: (i) U is an independent set, that is, no two vertices in U are connected by a directed edge; and (ii) the size of U is a constant fraction of the size of V. The fractional independent set problem is to determine such a set.
It is trivial to develop a sequential algorithm to solve this problem: starting from any vertex, place every other vertex in U. A simple and fast PRAM algorithm can be designed using randomization as follows. Each vertex, in parallel, is randomly assigned a label 0 or 1 with equal probability. Clearly, the expected number of vertices of each label is n/2. However, this does not necessarily solve the problem since there is no guarantee that the vertices with the same label form an independent set. To ensure that this is indeed the case, another parallel step is carried out, which involves changing the label of a vertex to 0 if the labels of this vertex and its successor are both equal to 1. The remaining vertices of label 1 are now guaranteed to form an independent set. It can be shown that, with high probability, the size of such an independent set is a constant fraction of the size of the original vertex set V. On the PRAM model, this algorithm runs in O(1) parallel time using O(n) operations. It is worth noticing that this algorithm can in particular be used to make the list ranking algorithm more efficient (i.e., using a total number of operations which is asymptotically less than n log n).
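A C sketch of this randomized labeling step (illustrative only; each loop below corresponds to one parallel PRAM step, the surviving labels are written to a separate array to mimic the synchronous semantics, and rand() with a fixed seed is used just for the example):

#include <stdio.h>
#include <stdlib.h>

#define N 10   /* illustrative cycle length */

/* Randomized fractional independent set on a directed cycle.
 * S[v] is the successor of vertex v (1-indexed).  A vertex keeps its
 * 1-label only if its successor did not also draw 1, so the surviving
 * 1-labeled vertices form an independent set of expected size ~ N/4. */
int fractional_independent_set(const int S[N + 1], int in_U[N + 1]) {
    int label[N + 1], count = 0;
    for (int v = 1; v <= N; v++)
        label[v] = rand() & 1;                  /* random 0/1 label */
    for (int v = 1; v <= N; v++)                /* reads old labels only */
        in_U[v] = label[v] && !label[S[v]];
    for (int v = 1; v <= N; v++)
        count += in_U[v];
    return count;
}

int main(void) {
    int S[N + 1], U[N + 1];
    for (int v = 1; v <= N; v++) S[v] = v % N + 1;   /* cycle 1->2->...->N->1 */
    srand(42);
    printf("|U| = %d\n", fractional_independent_set(S, U));
    return 0;
}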
Algorithm 
Randomized Fractional Independent Set Algorithm
Input: A directed cycle with a vertex set V whose arcs are specified by an array S[1 : n], i.e., ⟨i, S[i]⟩ is an arc.
Output: A fractional independent set U ⊂ V.
begin
  for all v ∈ V pardo
    Randomly assign label(v) = 0 or 1 with equal probability
    if label(v) = 1 and label(S[v]) = 1 then label(v) = 0
  end for
  return(U = {v ∣ label(v) = 1})
end

Superfast Maximum Algorithm
The algorithm presented in this section illustrates an important PRAM technique, called accelerated cascading, which combines two strategies. The first amounts to a work-optimal algorithm (possibly a sequential algorithm) to reduce the size of the problem below a certain threshold. The second strategy uses a very fast PRAM algorithm that involves a nonoptimal number of operations. The maximum algorithm will be used to illustrate such a technique and to also demonstrate the extra power of the PRAM model when concurrent write operations are allowed. The PRAM strategy introduced earlier to compute the sum of n elements can be used to compute the maximum of n elements, resulting in the parallel time complexity TP(n) = O(log n) and total work of W(n) = O(n). However, it is possible to develop a constant-time algorithm on the Common CRCW PRAM model, as illustrated in Algorithm .

Algorithm 
Constant Time Maximum Algorithm
Input: An array A[1 : n] consisting of n distinct elements.
Output: A Boolean array M[1 : n] such that M[i] = 1 if, and only if, A[i] is the maximum element.
begin
  for all 1 ≤ i, j ≤ n pardo
    if A[i] ≥ A[j] then B[i, j] = 1 else B[i, j] = 0
  end for
  for all 1 ≤ i ≤ n pardo
    M[i] = ⋀_{j=1}^{n} B[i, j]
  end for
end

The Boolean AND of n binary variables can be performed in O(1) parallel steps using O(n) operations on
the Common CRCW PRAM. Therefore, the above algorithm achieves constant parallel time but uses O(n²) operations, and thus it is extremely inefficient. To remedy this problem, and still achieve faster than O(log n) parallel time, a doubly logarithmic depth tree is used instead of the balanced binary tree. Essentially, the root of a doubly logarithmic depth tree has Ω(√n) children, and each subtree is defined similarly. It is not hard to see that such a tree has height O(log log n), where n is the number of leaves. Using the doubly logarithmic depth tree and the constant-time maximum algorithm at each node of the tree, the maximum can be computed in O(log log n) time using O(n log log n) operations. The accelerated cascading strategy is now used to turn this fast but nonoptimal-work algorithm into a fast and work-optimal algorithm as follows:
● Partition the array A into n/log log n blocks, such that each block contains approximately log log n elements, and use the sequential algorithm to compute the maximum element in each block.
● Use the doubly logarithmic depth tree on the maxima computed in the first step.
The resulting algorithm has a parallel time complexity of O(log log n) using only O(n) operations, and hence is work optimal. Therefore, the maximum can be computed in O(log log n) parallel time using a work-optimal strategy on the Common CRCW PRAM.
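A two-phase C sketch in the spirit of accelerated cascading (illustrative; the OpenMP max-reduction in the second phase merely stands in for the doubly-logarithmic-depth CRCW tree described above, and the block size is an arbitrary small constant):

#include <stdio.h>

#define N      1024
#define BLOCK  16          /* plays the role of ~log log n */

int main(void) {
    static int A[N];
    for (int i = 0; i < N; i++) A[i] = (i * 37) % 1000;

    int nblocks = N / BLOCK;
    static int blockmax[N / BLOCK];

    #pragma omp parallel for
    for (int b = 0; b < nblocks; b++) {      /* phase 1: sequential pass per block */
        int m = A[b * BLOCK];
        for (int j = 1; j < BLOCK; j++)
            if (A[b * BLOCK + j] > m) m = A[b * BLOCK + j];
        blockmax[b] = m;
    }

    int best = blockmax[0];
    #pragma omp parallel for reduction(max : best)
    for (int b = 1; b < nblocks; b++)        /* phase 2: combine block maxima */
        if (blockmax[b] > best) best = blockmax[b];

    printf("max = %d\n", best);              /* prints 999 */
    return 0;
}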
Bibliographic Notes and Further Reading
The PRAM model was initially introduced through a number of papers, most notably in the papers by Fortune and Wyllie [], Goldschlager [], and Ladner and Fisher []. The Work–Time framework, which ties Brent's scheduling principle [] and the PRAM model, was first observed by Shiloach and Vishkin [] and used extensively in the book by JaJa []. Related data parallel algorithms were introduced by Hillis and Steele []. Substantial work dealing with the design and analysis of PRAM algorithms and the theoretical underpinnings of the model has been carried out in the s and s. An early survey is the paper by Karp and Ramachandran [], and an extensive coverage of the topic can be found in JaJa's book [].
An intriguing relationship exists between the PRAM model and the circuits model used in traditional computational complexity. The NC class was formally introduced by Pippenger [] for circuits. An important related theoretical direction is based on the notion of P-completeness, which tries to shed light on problems that do not seem to be highly parallelizable under the PRAM model with polynomial numbers of processors. Interested readers can consult the reference [] for a good overview of this topic. Some recent efforts have been devoted to advocating the PRAM as a practical parallel computation model, both in terms of developing prototype hardware that supports the model and software that enables the writing of PRAM programs. The book of Keller, Kessler, and Traff [] and the recent paper by Wen and Vishkin [] are illustrative of such efforts.
Bibliography . Brent R () The parallel evaluation of general arithmetic expressions. JACM ():– . Fortune S, Wyllie J () Parallelism in random access machines. In: Proceedings of the tenth ACM symposium on theory of computing. San Diego, CA, pp – . Goldschlager L () A unified approach to models of synchronous parallel machines. In: Proceedings of the tenth ACM symposium on theory of computing. San Diego, CA, pp – . Greenlaw R, Hoover HJ, Ruzzo WL () Limits to Parallel Computation: P-Completeness Theory. In: Topics in parallel computation. Oxford University Press, Oxford . Hillis WD, Steele GL () Data parallel algorithms. Commun ACM ():– . JaJa J () An introduction to parallel algorithms. Addison Wesley Publishing Co., Reading, MA . Karp RM, Ramachandran V () Parallel algorithms for shared-memory machines. In: van Leeuwen J (ed) Handbook of theoretical computer science, North Holland, Amsterdam, The Netherlands, Chapter , pp – . Keller J, Kessler C, Traff J () Practical PRAM programming. Wiley, New York . Ladner R, Fisher M () Parallel prefix computations. JACM ():– . Pippenger N () On simultaneous resource bounds. In: Proceedings twentieth annual IEEE symposium on foundations of computer science. San Juan, Puerto Rico, pp – . Shiloach Y, Vishkin U () An O(n log n) parallel max-flow algorithm. J Algorithms ():– . Wen X, Vishkin U () FPGA-based prototype of a PRAM-onchip processor. In: Proceedings of the ACM conference on computing frontiers. Ischia, Italy, pp –
Preconditioners for Sparse Iterative Methods Anshul Gupta IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Synonyms Linear equations solvers
Definition
Iterative methods for solving sparse systems of linear equations are potentially less memory and computation intensive than direct methods, but often experience slow convergence or fail to converge at all. The robustness and the speed of Krylov subspace iterative methods are improved, often dramatically, by preconditioning. Preconditioning is a technique for transforming the original system of equations into one with an improved distribution (clustering) of eigenvalues so that the transformed system can be solved in fewer iterations. A key step in preconditioning a linear system Ax = b is to find a nonsingular preconditioner matrix M such that the inverse of M is as close to the inverse of A as possible and solving a system of the form Mz = r is significantly less expensive than solving Ax = b. The system is then solved by solving M⁻¹Ax = M⁻¹b. This particular example shows what is known as left preconditioning. There are two other formulations, known as right preconditioning and split preconditioning. The basic concept, however, is the same. Other practical requirements for successful preconditioning are that the cost of computing M itself must be low and the memory required to compute and apply M must be significantly less than that for solving Ax = b via direct factorization.
Discussion Preconditioning methods are being actively researched and have been for a number of years. There are several classes of preconditioners; some are more amenable to being computed and applied in parallel than others. This chapter gives an overview of the generation (i.e., computing M in parallel) and application (i.e., solving a system of the form Mz = r in parallel) of the most commonly used parallel preconditioners.
While solving a sparse linear system Ax = b in parallel, the matrix A and the vectors x and b are typically partitioned. The partitions are assigned to tasks that are executed by individual processes or threads in a parallel processing environment. Both the creation and the application of a preconditioner in parallel is affected by the underlying partitioning of the data. A commonly used effective and natural way of partitioning the data involves partitioning the graph, of which the coefficient matrix is an adjacency matrix. Other than partitioning the problem for parallelization, the graph view of the matrix plays a useful role in many aspects of solving sparse linear systems. Figure illustrates a partitioning of the rows of a matrix among four tasks based on a four-way partitioning of its graph.
Simple Preconditioners Based on Stationary Methods
Stationary iterative methods are relatively simple algorithms that start with an initial guess of the solution (like all iterative methods) and attempt to converge toward the actual solution by repeated application of a correction equation. The correction equation uses the current residual and a fixed (stationary) operator or matrix, which is an approximation of the original coefficient matrix. While stationary iterative methods themselves have poor convergence properties, the approximating matrix can serve as a preconditioner for Krylov subspace methods.
Jacobi and Block-Jacobi Preconditioners
One of the simplest preconditioners is the point-Jacobi preconditioner, which is nothing but the diagonal D of the matrix A of coefficients. Applying the preconditioner in parallel is straightforward. It simply involves division with the entries (or multiplication with their inverses) of the part of the diagonal corresponding to the portion of the matrix that each thread or process is responsible for. In fact, scaling the coefficient matrix by the diagonal so that the scaled matrix has all 1’s on the diagonal is equivalent to Jacobi preconditioning. A block-Jacobi preconditioner is made up of nonoverlapping square diagonal blocks of the coefficient matrix. These blocks may be of the same or different sizes. These blocks are usually factored or inverted (independently, in parallel) during the preconditioner construction phase so that the preconditioner can be applied inexpensively during the Krylov solver’s iterations.
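As an illustration of how cheap the point-Jacobi preconditioner is to apply, the sketch below supplies D⁻¹ as a linear operator to SciPy's conjugate gradient solver. The 1-D Laplacian test matrix and all names are assumptions made for the example, not something taken from the text.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Illustrative SPD test system: a 1-D Laplacian.
n = 1000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Point-Jacobi preconditioner: apply D^{-1}, i.e., divide by the diagonal.
# In a parallel code each task would own a slice of this diagonal.
d_inv = 1.0 / A.diagonal()
M = spla.LinearOperator((n, n), matvec=lambda r: d_inv * r)

x, info = spla.cg(A, b, M=M)
print(info)  # 0 indicates convergence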
(a) A 16 × 16 symmetric sparse matrix  (b) The associated graph and its four partitions
Preconditioners for Sparse Iterative Methods. Fig. A 16 × 16 sparse matrix with symmetric structure and its associated graph partitioned among four tasks
Gauss–Seidel Preconditioner
Let the coefficient matrix A be represented by a three-way splitting as L + D + U, where L is the strictly lower triangular part of A, D is a diagonal matrix that consists of the principal diagonal of A, and U is the strictly upper triangular part of A. The Gauss–Seidel preconditioner is defined by
M = (D + L) D⁻¹ (D + U).    (1)
A system Mz = r is then trivially solved as y = (D + L)⁻¹r, w = Dy, and z = (D + U)⁻¹w. Thus, applying the Gauss–Seidel preconditioner in parallel involves solving a lower and an upper triangular system in parallel. In general, equation i of a lower triangular system can be solved for the i-th unknown when equations 1 . . . i − 1 have been solved. Similarly, equation i of an N × N upper triangular system can be solved after equations i + 1 . . . N have been solved. Since the matrices L and U in our case are sparse, the i-th equation while computing y = (D + L)⁻¹r depends only on those unknowns that have nonzero coefficients in the i-th row of L. Similarly, the i-th equation while computing z = (D + U)⁻¹w depends only on those unknowns that have nonzero coefficients in the i-th row of U. As a result, while solving both these systems, multiple unknowns may be computed simultaneously – those that do not depend on any unsolved unknowns. This is an obvious source of parallelism. This parallelism can be maximized by reordering the rows and columns of A (and hence those of L and U) in a way that maximizes the number of independent equations at each stage of the solution process. Figure illustrates one such ordering, known as red-black ordering, that can be used to parallelize the application of the Gauss–Seidel preconditioner for a matrix arising from a finite-difference discretization. The vertices of the graph corresponding to the matrix are assigned colors such that no two neighboring vertices have the same color. All vertices of one color, and the corresponding rows and columns of the matrix, are numbered first, followed by those of the other color. Assignment of matrix rows to tasks is based on a partitioning of the graph. With red-black ordering, each triangular solve is performed in two phases. During the lower triangular solve, first, all red unknowns are computed in parallel because they are all independent. After this step, all black unknowns can be computed. The order
is reversed during the upper triangular solve. On a distributed memory platform, each computation phase is followed by a communication phase. During a communication phase, each process communicates the values of the unknowns corresponding to graph vertices on the partition boundaries with its neighboring processes. The idea of red-black ordering can be extended to general sparse matrices and their graphs, for which more than two colors may be required to ensure that no neighboring vertices have the same color. The triangular solves are then performed in as many parallel phases as the number of colors. A multicolored ordering with four colors is illustrated in Fig. . For improved cache performance, block variants of multicolored ordering can be constructed by assigning colors to clusters of graph vertices and ensuring that no two clusters that have an edge connecting them are assigned the same color. The reader is cautioned that often the convergence of a Krylov subspace method is sensitive to the ordering of matrix rows and columns. While red-black and multicolor orderings enhance parallelism, they may result in a deterioration of the convergence rate in some cases.
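The following sketch applies the Gauss–Seidel preconditioner M = (D + L)D⁻¹(D + U) by the two triangular solves described above. It is a sequential illustration only; a parallel implementation would first reorder A (e.g., red-black) so that many rows of each triangular solve are independent. The helper names and the small test matrix are assumptions for the example.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def gauss_seidel_preconditioner(A):
    """Return a function z = M^{-1} r for M = (D + L) D^{-1} (D + U),
    built from the strictly lower/upper parts and the diagonal of A."""
    D = sp.diags(A.diagonal())
    L = sp.tril(A, k=-1, format="csr")
    U = sp.triu(A, k=1, format="csr")
    DL = (D + L).tocsr()
    DU = (D + U).tocsr()
    d = A.diagonal()

    def apply(r):
        y = spla.spsolve_triangular(DL, r, lower=True)   # y = (D + L)^{-1} r
        w = d * y                                         # w = D y
        z = spla.spsolve_triangular(DU, w, lower=False)   # z = (D + U)^{-1} w
        return z

    return apply

# Example on a small 1-D Laplacian
n = 6
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
M_inv = gauss_seidel_preconditioner(A)
print(M_inv(np.ones(n)))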
SOR Preconditioner
A significant increase in convergence rate may be obtained by a modification of Eq. (1) as follows, with 0 < ω < 2:
M = (D/ω + L) (ω D⁻¹ / (2 − ω)) (D/ω + U),    (2)
although determining an optimal value of ω can be expensive. The preconditioner specified by Eq. (2) is known as the successive overrelaxation or SOR preconditioner. Its symmetric formulation, when L = Uᵀ, is known as the symmetric SOR or SSOR preconditioner. The issues in the parallel application of the SOR preconditioner are identical to those for the Gauss–Seidel preconditioner.
Preconditioners Based on Incomplete Factorization A class of highly effective but conceptually simple preconditioners is based on incomplete factorization methods. Recall that iterative methods are used in applications where a factorization of the form A = LU is not feasible because the triangular factor matrices L and U
(a) A 4 × 4 grid with red-black ordering  (b) Corresponding coefficient matrix
Preconditioners for Sparse Iterative Methods. Fig. The sparse matrix corresponding to a 4 × 4 finite-difference grid with red-black ordering, partitioned among four parallel tasks
Preconditioners for Sparse Iterative Methods. Fig. Multicolored ordering of a hypothetical finite-element graph using four colors
are much denser than A, and therefore, too expensive to compute and store. The idea behind incomplete factorization is to perform a factorization of A along the lines of a regular Gaussian elimination or Cholesky factorization, but dropping a large proportion of nonzero entries from the triangular factors along the way. Depending on the underlying factorization method, incomplete factorization is referred to as ILU (incomplete LU) or IC (incomplete Cholesky) factorization. Due to
dropping, the resulting triangular factors L̃ and Ũ are much sparser than L and U and are computed with significantly less computing effort. Entries are chosen for dropping using some criteria that strive to keep the inverse of M = L̃Ũ as close to the inverse of A as possible. Devising effective dropping criteria has been an active area of research. Incomplete factorization methods can be broadly classified as follows based on the dropping criteria.
Static-Pattern Incomplete Factorization
A static-pattern incomplete factorization method is one in which the locations of the entries that are kept in the factors and those that are dropped are determined a priori, based only on the structure of A. The simplest forms of incomplete factorization are ILU(0) and IC(0), in which the structure of L̃ + Ũ is identical to the structure of A; i.e., only those factor entries whose locations coincide with those of nonzero entries in the original matrix are saved. A generalization of static-pattern incomplete factorization is a level-κ incomplete factorization, referred to as ILU(κ) or IC(κ) factorization in the literature. The structure of a level-κ incomplete factorization is computed symbolically as follows. Initially, level(i, j) = 0 if a_ij ≠ 0; otherwise, level(i, j) = ∞. This is followed by an emulation of factorization where each numerical update step of the form a_ij = a_ij − a_ik · a_kj is replaced by updating level(i, j) = min(level(i, j), level(i, k) + level(k, j) + 1). During the subsequent numerical factorization phases, entries in locations with a level greater than κ are dropped. In graph terms, upon the completion of symbolic factorization, level(i, j) is one less than the length of the shortest path between vertices i and j. During the computation and application of an ILU(κ) or IC(κ) preconditioner, the computation corresponding to a vertex in the graph requires data associated with vertices that are up to κ + 1 edges away from it. For example, in the graph shown in Fig. , data exchange is required among vertex pairs (p, q) and (q, r) for κ = 0. For κ = 1, additional exchanges among vertex pairs such as (s, t) and (u, v) are required. The figure also illustrates that in a parallel environment, data associated with κ + 1 layers of vertices adjacent to a partition boundary needs to be exchanged with a neighboring task. Both the computation per vertex of the graph and the data exchange overhead per task in each iteration of a Krylov solver using a level-κ incomplete factorization preconditioner increase as κ increases. On the other hand, the overall number of iterations typically declines. The optimum value of κ is problem dependent.
Threshold-Based Incomplete Factorization
Although it permits a relatively easy and fast parallel implementation, static-pattern incomplete factorization is robust for a few classes of problems only, including those with diagonally dominant coefficient
Preconditioners for Sparse Iterative Methods. Fig. Illustration of data exchange across a task boundary when level-0 and level-1 fill is permitted in incomplete factorization
matrices. It can, and often does, delete fill entries of large magnitude that do not happen to be located in the predetermined locations. The resulting large error can make the preconditioner ineffective. Threshold-based incomplete factorization rectifies this problem by dropping entries from the factors as they are computed. Regardless of their locations, entries greater in magnitude than a user-defined threshold τ are kept and the others are dropped. Typically, a second threshold γ is also used to limit the factors to a predetermined size. If n_i is the number of nonzeros in row (or column) i of the coefficient matrix, then at most γ·n_i entries (those with the largest magnitudes) are permitted in row (or column) i of the incomplete factor. Successful and scalable parallel implementations of threshold-based incomplete factorization preconditioning use graph partitioning and graph coloring for balancing computation and minimizing communication
among parallel tasks. The use of these two techniques in the context of sparse matrix computations has already been discussed earlier. Graph partitioning enables parallel tasks to independently compute and apply (i.e., perform forward and back substitution) the preconditioner independently for matrix rows and columns corresponding to the internal vertices of the graph. An internal vertex and all its neighbors belong to the same partition. Graph coloring permits parallel incomplete factorization and forward and back substitution of matrix rows and columns corresponding to the boundary vertices. In the context of incomplete factorization, coloring is applied to a graph that includes only the boundary vertices (i.e., after the internal vertices have been eliminated) but includes the additional edges (fillin) created as a result of the elimination of the internal vertices. The reason is that the dependencies among the rows and columns of the matrix are determined by all the nonzeros in the incomplete factors, both original and those created as a result of fill-in.
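As a concrete, hedged illustration, the sketch below builds an incomplete LU preconditioner with SciPy's spilu (a dual-dropping ILU in which the drop tolerance and fill bound play roles analogous to the thresholds τ and γ above) and uses it inside GMRES. The 2-D Laplacian test matrix and all names are assumptions made for the example.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Illustrative test matrix: 2-D Laplacian (5-point stencil) on a 32 x 32 grid.
n = 32
I = sp.identity(n)
T = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n))
S = sp.diags([-1.0, -1.0], [-1, 1], shape=(n, n))
A = (sp.kron(I, T) + sp.kron(S, I)).tocsc()
b = np.ones(A.shape[0])

# Threshold-based ILU: small entries are dropped, and fill_factor bounds memory.
ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
M = spla.LinearOperator(A.shape, matvec=ilu.solve)

x, info = spla.gmres(A, b, M=M)
print(info)  # 0 indicates convergence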
Incomplete Factorization Based on Inverse-Norm Estimate This is a relatively new class of incomplete factorization preconditioners in which the dropping criterion takes into account and seeks to minimize the growth of the norm of the inverse of the factors. These preconditioners have been shown to be more robust and effective than incomplete factorization with dropping based solely on the position or absolute value of the entries. The issues in the parallel generation and application of these preconditioners are very similar to those in threshold-based incomplete factorization – in both cases, the location of nonzeros in the factors cannot be determined a priori.
Sparse Approximate Inverse Preconditioners
While the incomplete factorization preconditioners seek to compute L̃ and Ũ as approximations of the actual factors L and U of the coefficient matrix A, sparse approximate inverse preconditioners seek to compute M⁻¹ as an approximation to its inverse A⁻¹. The problem of computing M⁻¹ is framed as the problem of minimizing the norm ∥I − AM⁻¹∥ or ∥I − M⁻¹A∥. To support parallelism, these minimization problems can be reduced to
independent subproblems for computing the rows and columns of M⁻¹. Note that the actual inverse of a sparse A is dense, in general. It is therefore imperative that a number of entries be dropped in order to keep M⁻¹ sparse. Just like incomplete factorization, the dropping can be structural (static), or based on values (dynamic), or both. In graph terms, while computing the rows or columns of M⁻¹ in parallel, dropping is typically orchestrated in a way that confines the interaction to pairs of vertices that are either immediate neighbors or have short paths connecting them in the graph of A. Graph partitioning is used to facilitate load balance and minimize interaction among parallel tasks – both during the computation and the application of the preconditioner. There are some important advantages to explicitly using an approximation of A⁻¹ for preconditioning, rather than using A’s approximate factors. First, the preconditioner computation avoids the kind of breakdowns that are possible in incomplete factorization due to small or zero (or negative, in the case of incomplete Cholesky) diagonals. Secondly, the application of the preconditioner involves a straightforward multiplication of a vector with the sparse matrix M⁻¹, which may be simpler and more easily parallelizable than the forward and back substitutions with L̃ and Ũ. However, just like L̃ and Ũ, M⁻¹ may be denser than A. Therefore, the graph of M⁻¹ may have many more edges than the graph of A and multiplying a vector with M⁻¹ may require more communication than multiplying a vector with A.
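A minimal sketch of a static-pattern sparse approximate inverse is shown below: each column of the approximate inverse is obtained from an independent least-squares problem restricted to the sparsity pattern of the corresponding column of A, which is exactly what makes the construction easy to parallelize. The function names and the choice of pattern are assumptions made for illustration; practical codes use richer patterns and dropping rules.

import numpy as np
import scipy.sparse as sp

def spai_static_pattern(A):
    """For each column j, minimize || A m_j - e_j ||_2 over vectors m_j whose
    nonzeros are restricted to the pattern of column j of A (assumes a nonzero
    diagonal). The columns are mutually independent subproblems."""
    A = A.tocsc()
    n = A.shape[1]
    cols = []
    for j in range(n):
        pattern = A[:, j].nonzero()[0]                  # allowed rows of m_j
        rows = np.unique(A[:, pattern].nonzero()[0])    # rows touched by them
        Ahat = A[rows, :][:, pattern].toarray()
        ehat = np.zeros(len(rows))
        ehat[np.where(rows == j)[0]] = 1.0
        mj, *_ = np.linalg.lstsq(Ahat, ehat, rcond=None)
        col = sp.csc_matrix((mj, (pattern, np.zeros_like(pattern))), shape=(n, 1))
        cols.append(col)
    return sp.hstack(cols).tocsc()                      # approximate inverse of A

A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(8, 8), format="csc")
M_inv = spai_static_pattern(A)
print(np.linalg.norm(np.eye(8) - (A @ M_inv).toarray()))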
Multigrid Preconditioners Multigrid methods are a class of iterative algorithms for solving partial differential equations (PDEs) efficiently, often by exploiting more problem-specific information than a typical Krylov subspace method. Each iteration of a multigrid solver is a somewhat complex recursive procedure. In many applications, the effectiveness of a multigrid algorithm can be substantially enhanced by using it to precondition a Krylov subspace method rather than using it as the solver. This is done by replacing the preconditioning step of the Krylov subspace solver with one iteration of the multigrid algorithm; i.e., treating the approximate solution obtained by an iteration of the multigrid algorithm as the solution with respect to a hypothetical preconditioner matrix.
Geometric Multigrid
Multigrid methods were originally designed for solving elliptic PDEs by discretizing them using a hierarchy of regular grids of varying degrees of fineness over the same domain. For example, consider a domain D and a sequence of successively finer discretizations G0, G1, . . ., Gm. Here G0 is the coarsest discretization and Gm is the finest discretization over which the eventual solution to the PDE is desired. Figure shows a square domain with m = 3. As the figure shows, the grid points in Gi are a subset of the grid points in Gi+1. A simple formulation of multigrid would work as follows: 1. First, the linear system corresponding to discretization G0 is solved. Since G0 has a small number of points, the associated linear system is small and can be solved inexpensively by an appropriate direct or iterative method. 2. The solution at G0 is interpolated to obtain an initial guess of the solution of the system corresponding to G1, which is four times larger. Among various ways of interpolation (also known as prolongation),
a simple one involves approximating the value of the solution at a point that is in G1 but not in G0 by the average of the values of its neighbors. 3. Starting with the initial guess obtained by interpolation, a few steps of relaxation (also referred to as smoothing) are used to refine the solution at G1. Often, the relaxation steps are simply a few iterations of a relatively inexpensive stationary method such as Jacobi or Gauss–Seidel. 4. The process of relaxation and interpolation continues from Gi to Gi+1, until i + 1 = m. After the relaxation at Gm, a first approximation x_m^1 to the solution x_m of the linear system A_m x_m = b_m corresponding to Gm is obtained. Successively more accurate approximations x_m^2, x_m^3, . . . are obtained by x_m^{i+1} = x_m^i + d_m^i, where d_m^i is obtained by solving A d_m^i = r_m^i ≡ (b_m − A x_m^i). 5. The system A d_m^i = r_m^i in the i-th multigrid iteration is solved by a recursive process, in which the residual r_m^i corresponding to Gm is projected onto Gm−1 and so on. The process of projection (also known as restriction) is the reverse of interpolation. At the end
(a) Discretization G0  (b) Discretization G1  (c) Discretization G2  (d) Discretization G3
Preconditioners for Sparse Iterative Methods. Fig. Successively finer discretizations of a domain
of the recursive projection steps, a relatively small system of equations A d_0^i = r_0^i corresponding to G0 is obtained, which is readily solved by an appropriate iterative or direct method. 6. The cycle of interpolation and relaxation is then repeated to obtain d_m^i. 7. The process is stopped after k multigrid iterations if r_m^k is smaller than a user-defined threshold. When the multigrid method is used as a preconditioner, then instead of repeating the multigrid cycle k times to solve the problem, the approximate solution after one cycle is substituted as the solution with respect to the preconditioner inside a Krylov subspace algorithm. Whether multigrid is used as a solver or a preconditioner, the parallelization process is the same. The first step in implementing a parallel multigrid procedure is to partition the domain among the tasks such that each task is assigned roughly the same number of points of the finest grid. For example, Fig. shows the partitioning of the domain of Fig. and its discretizations G0–G3 into four parallel tasks. Unlike Figs. and , the domain in a real problem may be irregular and the partitioning may not be trivial. The partitioning of the domain implicitly defines a partitioning of the grids at all levels of discretization. It is easily seen that
Preconditioners for Sparse Iterative Methods. Fig. The domain and discretizations G0–G3 of Fig. mapped onto four parallel tasks
the parallel interpolation, relaxation, and projection at any grid level require a task to exchange information corresponding to the grid points along the partition boundary with its neighboring tasks. All computations corresponding to each partition’s interior points can be performed independently.
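The sketch below implements one V-cycle of the procedure just described for a 1-D Poisson problem, with weighted Jacobi as the smoother; when multigrid is used as a preconditioner, one such cycle would replace the preconditioning step of a Krylov solver. The 1-D setting, the damping factor, and all names are illustrative assumptions.

import numpy as np

def laplacian(n):
    """1-D Poisson matrix on n interior points of the unit interval."""
    h = 1.0 / (n + 1)
    return (np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1)
            + np.diag(-np.ones(n - 1), -1)) / h**2

def restrict(r):
    """Full-weighting restriction (projection) to the next coarser grid."""
    return 0.25 * (r[:-2:2] + 2.0 * r[1:-1:2] + r[2::2])

def prolong(dc):
    """Linear interpolation (prolongation) from the coarse grid to the finer grid."""
    n = 2 * len(dc) + 1
    d = np.zeros(n)
    d[1::2] = dc                              # coincident points
    d[2:-1:2] = 0.5 * (dc[:-1] + dc[1:])      # midpoints between coarse points
    d[0] = 0.5 * dc[0]
    d[-1] = 0.5 * dc[-1]
    return d

def v_cycle(n, b, x, nu=3, coarsest=3):
    """One multigrid V-cycle: pre-smoothing, coarse-grid correction, post-smoothing."""
    A = laplacian(n)
    if n <= coarsest:
        return np.linalg.solve(A, b)          # solve the coarsest system directly
    d = np.diag(A)
    for _ in range(nu):                        # relaxation (pre-smoothing)
        x = x + 0.67 * (b - A @ x) / d
    r = b - A @ x                              # residual on this grid
    rc = restrict(r)                           # projection to the coarser grid
    dc = v_cycle((n - 1) // 2, rc, np.zeros_like(rc), nu, coarsest)
    x = x + prolong(dc)                        # interpolate and apply the correction
    for _ in range(nu):                        # relaxation (post-smoothing)
        x = x + 0.67 * (b - A @ x) / d
    return x

n = 63
b = np.ones(n)
x = np.zeros(n)
for _ in range(10):
    x = v_cycle(n, b, x)
print(np.linalg.norm(b - laplacian(n) @ x))    # residual norm after 10 cycles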
Algebraic Multigrid The algebraic multigrid (AMG) method is a generalization of the hierarchical approach of the geometric multigrid method so that it is not dependent on the availability of the meshes used for discretizing a PDE, but can be used as a black-box solver for a given linear system of equations. In the geometric multigrid method, successively finer meshes are constructed by a geometric refining of the coarser meshes. On the other hand, the starting point for AMG is the final system of equations (analogous to the finest grid), from which successively smaller (coarser) systems are constructed. The coefficients of a coarse system in AMG are only algebraically related to the coefficients of the finer systems, which is in contrast to the geometric relationship between successive grids in geometric multigrid. Just like geometric multigrid, an AMG method can be used either as a solver or a preconditioner. In either case, the method involves two phases. In the first setup phase, the hierarchy of coarse systems is constructed from the original linear system and the prolongation and restriction operators are defined. The second solution phase consists of the prolongation, relaxation, and restriction cycles. Efficient parallelization of AMG is much harder than that of geometric multigrid. As usual, the basis of parallelization is a good partitioning of the graph corresponding to the coefficient matrix and assigning the partitions to individual tasks. The solution phase, in principle, can then be parallelized with computation corresponding to the interior nodes remaining independent and that involving the nodes at or close to the boundary involving exchange of data with neighboring partitions. The boundary communication can be more involved than in the case of geometric multigrid because of the irregularity of the graph and the fact that the sets of interacting boundary nodes and the interaction pattern can be different for prolongation, relaxation, and restriction. However, the main difficulty in parallelizing AMG is in the set-up phase.
Effective parallelization of the set-up phase is essential for the overall scalability of parallel AMG because this phase can account for up to one-fourth of the total execution time. The process of construction of a coarser linear system in the classical AMG approach relies on a notion of the “strength” of dependencies among coefficients. This process is not only sequential in nature, but also involves a highly nonlocal pattern of interaction between the vertices of the graph of the coefficient matrix. Therefore, the algorithm must be adapted for parallelization, which can make the convergence rate and per iteration cost of AMG dependent on the number of parallel tasks. Care must be taken to ensure that parallelization does not adversely affect the convergence rate or the iteration complexity of AMG. Typical approaches to parallelizing coarsening in AMG rely on decoupling the partitions, using parallel independent sets (similar to multicoloring described earlier), or performing subdomain blocking, which starts the coarsening at the partition boundaries and then proceeds to the interior nodes. Note that the coarsening scheme has an impact on the inter-task interaction during the prolongation and restriction steps of the solution phase. When parallelizing AMG on large parallel machines, the number of parallel tasks may exceed the number of points in some of the grids at the coarsest levels. This situation requires special treatment. A commonly used work-around to this problem is agglomeration, in which neighboring domains are coalesced leaving some tasks idle during the processing of the coarsest levels. Another approach is to stop the coarsening when the number of coefficients per task becomes too small, which makes the behavior of the overall algorithm dependent on the number of parallel tasks.
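For concreteness, the following sequential sketch shows a classical strength-of-connection test and a greedy coarse/fine splitting of the kind that parallel AMG must replace with parallel independent sets or subdomain blocking. The threshold value, the function names, and the test matrix are assumptions made for illustration only.

import numpy as np

def strong_connections(A, theta=0.25):
    """Classical strength test (a sketch): i depends strongly on j if
    -a_ij >= theta * max_k(-a_ik) over k != i."""
    n = A.shape[0]
    S = [[] for _ in range(n)]
    for i in range(n):
        off = [-A[i, j] for j in range(n) if j != i]
        threshold = theta * max(off) if off and max(off) > 0 else np.inf
        for j in range(n):
            if j != i and A[i, j] != 0 and -A[i, j] >= threshold:
                S[i].append(j)
    return S

def select_coarse_points(S, n):
    """Greedy, inherently sequential C/F splitting: pick an undecided point as
    coarse and mark its strong neighbors as fine."""
    undecided = set(range(n))
    coarse, fine = set(), set()
    for i in range(n):
        if i in undecided:
            coarse.add(i)
            undecided.discard(i)
            for j in S[i]:
                if j in undecided:
                    fine.add(j)
                    undecided.discard(j)
    return coarse, fine

n = 9
A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
C, F = select_coarse_points(strong_connections(A), n)
print(sorted(C), sorted(F))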
Stochastic Preconditioners There has been a fair amount of research on algorithms for approximating the solution of linear systems based on random sampling of the coefficient matrix or on random walks in the graph corresponding to it. Most of these methods have been proven to work on limited classes of linear systems only, such as symmetric diagonally dominant systems. A few practical solvers have been developed recently by using some of these techniques for preconditioning Krylov subspace methods. An attractive property of these methods is that they are usually trivially parallelizable. The quest for scalable massively parallel sparse linear solvers may prompt
more active research into statistical techniques for preconditioning.
Matrix-Free Methods and Physics-Based Preconditioners Note that this chapter discusses preconditioners derived explicitly from the coefficient matrix A of the system Ax = b that needs to be solved. In some applications, A is never constructed explicitly to save time and storage. Instead, it is applied implicitly to compute the matrix-vector products required in the Krylov subspace solver. In some of these cases, preconditioning is also applied implicitly, or the knowledge of the physics of the application is utilized to construct the preconditioner, which cannot be derived from the coefficient matrix in a matrix-free method. Such preconditioners are called physics-based preconditioners. Due to the highly application-specific nature of matrix-free methods and physics-based preconditioners, these topics are not covered in further detail in this chapter.
Related Entries
Graph Partitioning
Linear Algebra Software
Rapid Elliptic Solvers
Bibliographic Notes
The readers are referred to Saad’s book [] and the survey by Benzi [] for a fairly comprehensive introduction to various preconditioning techniques. These do not cover parallel preconditioners and may not include some of the most recent work in preconditioning. However, they are excellent resources for gaining an insight into the state of the art of preconditioning at the time of their publication. Hysom and Pothen [] and Karypis and Kumar [] cover the fundamentals of scalable parallelization of incomplete factorization–based preconditioners. The work of Grote and Huckle [] and Edmond Chow [, ] is the basis of modern parallel sparse approximate inverse preconditioners. Chow et al.’s survey [] should give the readers a good overview of parallelization techniques for geometric and algebraic multigrid methods. Last, but not least, almost all parallel preconditioning techniques rely on effective parallel heuristics for two critical combinatorial problems – graph partitioning and graph coloring. The readers are referred
to papers by Karypis and Kumar [, ] and Bozdag et al. [] for an overview of these.
Bibliography . Benzi M () Preconditioning techniques for large linear systems: a survey. J Computat Phys ():– . Bozdag D, Gebremedhin AH, Manne F, Boman EG, Catalyurek UV () A framework for scalable greedy coloring on distributed memory parallel computers. J Parallel Distrib Comput ():– . Chow E () A priori sparsity patterns for parallel sparse approximate inverse preconditioners. SIAM J Sci Comput (): – . Chow E () Parallel implementation and practical use of sparse approximate inverse preconditioners with a priori sparsity patterns. Int J High Perform Comput appl ():– . Chow E, Falgout RD, Hu JJ, Tuminaro RS, Yang UM () A survey of parallelization techniques for multigrid solvers. In Heroux MA, Raghavan P, Simon HD (eds) Parallel processing for scientific computing. SIAM, Philadelphia . Grote MJ, Huckle T () Parallel preconditioning with sparse approximate inverses. SIAM J Sci Comput (): – . Hysom D, Pothen A () A scalable parallel algorithm for incomplete factor preconditioning. SIAM J Sci Comput (): – . Karypis G, Kumar V () Parallel threshold-based ILU factorization. Technical report TR -, Department of Computer Science, University of Minnesota, Minnesota . Karypis G, Kumar V () ParMETIS: parallel graph partitioning and sparse matrix ordering library. Technical report TR , Department of Computer Science, University of Minnesota, Minnesota . Karypis G, Kumar V () Parallel algorithms for multilevel graph partitioning and sparse matrix ordering. J Parallel Distrib Comput :– . Saad Y () Iterative methods for sparse linear systems, nd edn. SIAM, Philadelphia
Prefix
Reduce and Scan
Scan for Distributed Memory, Message-Passing Systems

Prefix Reduction
Reduce and Scan
Scan for Distributed Memory, Message-Passing Systems

Problem Architectures
Computational sciences
Process Algebras Rocco De Nicola Universita’ di Firenze, Firenze, Italy
Synonyms Process calculi; Process description languages
Definition Process Algebras are mathematically rigorous languages with well-defined semantics that permit describing and verifying properties of concurrent communicating systems. They can be seen as models of processes, regarded as agents that act and interact continuously with other similar agents and with their common environment. The agents may be real-world objects (even people), or they may be artifacts, embodied perhaps in computer hardware or software systems. Many different approaches (operational, denotational, algebraic) are taken for describing the meaning of processes. However, the operational approach is the reference one. By relying on the so-called Structural Operational Semantics (SOS), labeled transition systems are built and composed by using the different operators of the many different process algebras. Behavioral equivalences are used to abstract from unwanted details and identify those systems that react similarly to external experiments.
Introduction
The goal of software verification is to assure that developed programs fully satisfy all the expected requirements. Providing a formal semantics of programming languages is an essential step toward program verification. This activity has received much attention over the years. At the beginning, the interest was mainly in sequential programs; it then turned also to concurrent programs, where subtle errors can arise in very critical activities. Indeed, most computing systems today are concurrent and interactive. Classically, the semantics of a sequential program has been defined as a function specifying the
induced input–output transformations. This setting becomes, however, much more complex when concurrent programs are considered because they exhibit nondeterministic behaviors. Nondeterminism arises from programs interaction and cannot be avoided. At least, not without sacrificing expressive power. Failures do matter, and choosing the wrong branch might result in an “undesirable situation.” Backtracking is usually not applicable because the control might be distributed. Controlling nondeterminism is very important. In sequential programming, it is just a matter of efficiency, in concurrent programming it is a matter of avoiding getting stuck in a wrong situation. The approach based on process algebras has been very successful in providing formal semantics of concurrent systems and proving their properties. The success is witnessed by the Turing Award given to two of their pioneers and founding fathers: Tony Hoare and Robin Milner. Process algebras are mathematical models of processes, regarded as agents that act and interact continuously with other similar agents and with their common environment. Process algebras provide a number of constructors for system descriptions and are equipped with an operational semantics that describes systems evolution in terms of labeled transitions. Models and semantics are built by taking a compositional approach that permits describing the “meaning” of composite systems in terms of the meaning of their components. Moreover, process algebras often come equipped with observational mechanisms that permit identifying (through behavioral equivalences) those systems that cannot be taken apart by external observations (experiments or tests). In some cases, process algebras have also algebraic characterizations in terms of equational axiom systems that exactly capture the relevant identifications induced by the behavioral operational semantics. The basic component of a process algebra is its syntax as determined by the well-formed combination of operators and more elementary terms. The syntax of a process algebra is the set of rules that define the combinations of symbols that are considered to be correctly structured programs in that language. There are many approaches to providing a rigorous mathematical understanding of the semantics of syntactically correct process terms. The main ones are those also used for describing the semantics of sequential systems, namely, operational, denotational, and algebraic semantics.
An operational semantics models a program as a labeled transition system (LTS) that consists of a set of states, a set of transition labels and a transition relation. The states of the transition system are just process algebra terms, while the labels of the transitions between states represent the actions or the interactions that are possible from a given state and the state that is reached after the action is performed by means of visible and invisible actions. The operational semantics, as the name suggests, is relatively close to an abstract machinebased view of computation and might be considered as a mathematical formalization of some implementation strategy. A denotational semantics maps a language to some abstract model such that the meaning/denotation (in the model) of any composite program is determinable directly from the meanings/denotations of its subcomponents. Usually, denotational semantics attempt to distance themselves from any specific implementation strategy, describing the language at a level intended to capture the “essential meaning” of a term. An algebraic semantics is defined by a set of algebraic laws which implicitly capture the intended semantics of the constructs of the language under consideration. Instead of being derived theorems (as they would be in a denotational semantics or operational semantics), the laws are the basic axioms of an equational system, and process equivalence is defined in terms of what equalities can be proved using them. In some ways it is reasonable to regard an algebraic semantics as the most abstract kind of description of the semantics of a language. There has been a huge amount of research work on process algebras carried out during the last years that started with the introduction of CCS [, ], CSP [], and ACP []. In spite of the many conceptual similarities, these process algebras have been developed starting from quite different viewpoints and have given rise to different approaches (for an overview see, e.g., []). CCS takes the operational viewpoint as its cornerstone and abstracts from unwanted details introduced by the operational description by taking advantage of behavioral equivalences that allow one to identify those systems that are indistinguishable according to some observation criteria. The meaning of a CCS term is a labeled transition system factored by a notion of observational equivalence. CSP originated as the
theoretical version of a practical language for concurrency and is still based on an operational intuition which, however, is interpreted w.r.t. a more abstract theory of decorated traces that model how systems react to external stimuli. The meaning of a CSP term is the set of possible runs enriched with information about the interactions that could be refused at intermediate steps of each run. ACP started from a completely different viewpoint and provided a purely algebraic view of concurrent systems: processes are the solutions of systems of equations (axioms) over the signature of the considered algebra. Operational semantics and behavioral equivalences are seen as possible models over which the algebra can be defined and the axioms can be applied. The meaning of a term is given via a predefined set of equations and is the collection of terms that are provably equal to it. At first, the different algebras have been developed independently. Slowly, however, their close relationships have been understood and appreciated, and now a general theory can be provided and the different formalisms (CCS, CSP, ACP, etc.) can be seen just as instances of the general approach. In this general approach, the main ingredients of a specific process algebra are: . A minimal set of carefully chosen operators capturing the relevant aspect of systems behavior and the way systems are composed in building process terms . A transition system associated with each term via structural operational semantics to describe the evolution of all processes that can be built from the operators . An equivalence notion that allow one to abstract from irrelevant details of systems descriptions Verification of concurrent system within the process algebraic approach is carried out either by resorting to behavioral equivalences for proving conformance of processes to specifications or by checking that processes enjoy properties described by some temporal logic formulae [, ]. In the former case, two descriptions of a given system, one very detailed and close to the actual concurrent implementation, the other more abstract (describing the sequences or trees of relevant actions the system has to perform) are provided and tested for equivalence. In the latter case, concurrent systems are specified as process terms, while properties are specified
as temporal logic formulae, and model checking is used to determine whether the transition systems associated with terms enjoy the property specified by the formulae. In the next section, many of the different operators used in process algebras will be described. By relying on the so-called structural operational semantic (SOS) approach [], it will be shown how labeled transition systems can be built and composed by using the different operators. Afterward, many behavioral equivalences will be introduced together with a discussion on the induced identifications and distinctions. Next, the three most popular process algebras will be described; for each of them a different approach (operational, denotational, algebraic) will be used. It will, however, be argued that in all cases, the operational semantics plays a central rôle.
Process Operators and Operational Semantics To define a process calculus, one starts with a set of uninterpreted action names (that might represent communication channels, synchronization actions, etc.) and with a set of basic processes that together with the actions are the building blocks for forming newer processes from existing ones. The operators are used for describing sequential, nondeterministic, or parallel compositions of processes, for abstracting from internal details of process behaviors and, finally, for defining infinite behaviors starting from finite presentations. The operational semantics of the different operators is inductively specified through SOS rules: for each operator, there is a set of rules describing the behavior of a system in terms of the behaviors of its components. As a result, each process term is seen as a component that can interact with other components or with the external environment. In the rest of this section, most of the operators that have been used in some of the best-known process algebras will be presented with the aim of showing the wealth of choices that one has when deciding how to describe a concurrent system or even when defining one’s “personal” process algebra. A new calculus can, indeed, be obtained by a careful selection of the operators while taking into account their interrelationships with respect to the chosen abstract view of process and thus of the behavioral equivalence one has in mind.
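As a hedged illustration of how SOS rules turn process terms into a labeled transition system, the Python sketch below implements a tiny CCS-like fragment with prefix, choice, and parallel composition, where complementary actions synchronize into the invisible action tau. The constructor names, the "a~" complement convention, and the code itself are inventions for this example rather than part of any standard tool; the operators themselves are described in the remainder of this section.

from dataclasses import dataclass

@dataclass(frozen=True)
class Nil:            # the inactive process: no transitions
    pass

@dataclass(frozen=True)
class Prefix:         # mu.E : performs mu, then behaves like E
    action: str
    cont: object

@dataclass(frozen=True)
class Choice:         # E + F : behaves like E or like F
    left: object
    right: object

@dataclass(frozen=True)
class Par:            # E | F : interleaving plus synchronization into tau
    left: object
    right: object

def comp(a):
    """Complement of a visible action ('a' <-> 'a~')."""
    return a[:-1] if a.endswith("~") else a + "~"

def transitions(p):
    """Derive the outgoing transitions of p by structural induction,
    mirroring the SOS rule(s) associated with each operator."""
    if isinstance(p, Prefix):
        return [(p.action, p.cont)]
    if isinstance(p, Choice):
        return transitions(p.left) + transitions(p.right)
    if isinstance(p, Par):
        out = [(a, Par(e, p.right)) for a, e in transitions(p.left)]
        out += [(a, Par(p.left, f)) for a, f in transitions(p.right)]
        out += [("tau", Par(e, f))
                for a, e in transitions(p.left)
                for b, f in transitions(p.right)
                if a != "tau" and b == comp(a)]
        return out
    return []         # Nil has no transitions

# a.nil | a~.b.nil : can interleave, or synchronize on a/a~ yielding tau
P = Par(Prefix("a", Nil()), Prefix("a~", Prefix("b", Nil())))
for label, q in transitions(P):
    print(label, q)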
A set of operators is the basis for building process terms. A labeled transition system (LTS) is associated to each term by relying on structural induction by providing specific rules in for each operator. Formally speaking, an LTS is a set of nodes (corresponding to process terms) and (for each action a in some set) a relation a :→ between nodes, corresponding to processes transitions. Often LTSs have a distinguished node n from which computations start; when defining the semantics of a process term, the state corresponding to that term is considered as the initial state. To associate an LTS to a process term, inference systems are used, where the collection of transitions is specified by means of a set of syntax-driven inference rules. Inference systems: An inference system is a set of inference rule of the form: p , . . . , pn q where p , . . . , pn are the premises and q is the conclusion. Each rule is interpreted as an implication: if all premises are true, then also the conclusion is true. Sometimes, rules are decorated with predicates and/or negative premises that specify when the rule is actually applicable. A rule with an empty set of premises is called axiom and written as: q Transition rules: In the case of an operational semantics, the premises and the conclusions will be triples of the α form (P, α, Q), often rendered as P :→ Q, and thus the rules for each operator op of the process algebras will be of the following form, where {i , ⋯, im } ⊆ {, ⋯, n} and E′i = Ei when i ∈/ {i , . . . , im }: α
αm
Ei :→ E′i ⋯ Eim :→ E′im α
op(E , . . . , En ) :→ C[E′ , . . . , E′n ] In the rule above, the target term C[ ] indicates the new context in which the new subterms will be operating after the reduction and α represents the action performed by the composite system when some of the components perform actions α , . . . , α m . Sometimes, these rules are enriched with side conditions that determine their applicability. By imposing syntactic constraints on the form of the allowed rules, rule formats are obtained
that can be used to establish results that hold for all process calculi whose transition rules respect the specific rule format. A small number of SOS inference rules is sufficient to associate an LTS to each term of any process algebra. The set of rules is fixed once and for all. Given any process, the rules are used to derive its transitions. The transition relation of the LTS is the least one satisfying the inference rules. It is worth remarking that structural induction allows one to define the LTS of complex systems in terms of the behavior of their components. Basic actions: An elementary action of a system represents the atomic (non-interruptible) abstract step of a computation that is performed by a system to move from one state to the next. Actions represent various activities of concurrent systems, like sending or receiving a message, updating a memory cell, and synchronizing with other processes. In process algebras two main types of atomic actions are considered, namely, visible or external actions and invisible or internal actions. In the sequel, visible actions will be denoted by Latin letters a, b, c, . . . , invisible actions will be denoted by the Greek letter τ. Generic actions will be denoted by μ or other, possibly indexed, Greek letters. In the following, A will be used to denote the set of visible actions while Aτ will denote the collection of generic actions. Basic processes: Process algebras generally also include a null process (variously denoted as nil, , stop) which has no transition. It is inactive, and its sole purpose is to act as the inductive anchor on top of which more interesting processes can be generated. The semantics of this process is characterized by the fact that there is no rule to define its transition: it has no transition. Other basic processes are also used: stop denotes √ a deadlocked state, denotes, instead, successful termination. √
√
:→ stop
Sometimes, uninterpreted actions μ are considered basic processes themselves: μ
μ :→
√
Sequential composition: Operators for sequential composition are used to temporally order processes execution and interaction. There are two main operators
for this purpose. The first one is action prefixing, μ.-, that denotes a process that executes action μ and then behaves like the following process. μ
μ. E :→ E The alternative form of sequential composition is obtained by explicitly requiring process sequencing , - ; -, that requires that the first operand process be fully executed before the second one. μ
√ (μ ≠ )
E :→ E′ μ
√
μ
E :→ E′
F :→ F ′ μ
E; F :→ E′ ; F E; F :→ F ′ Nondeterministic composition: The operators for nondeterministic choice are used to express alternatives among possible behaviors. This choice can be left to the environment (external choice) or performed by the process (internal choice) or can be mainly external, but leaving the possibility to the process to perform an internal move to prevent some of the choices by the environment (mixed choice). The rules for mixed choice are the ones below. They offer both visible and invisible actions to the environment; however, only the former kind of actions can be actually controlled. μ
E :→ E′
μ
F :→ F ′
μ
μ
E + F :→ E′ E + F :→ F ′ The rules for internal choice are very simple, they are just two axioms stating that a process E ⊕ F can silently evolve into one of its subcomponents. τ
E ⊕ F :→ E
τ
E ⊕ F :→ F
process algebras from sequential models of computation. Parallel composition allows computation in E and F to proceed simultaneously and independently. But it also allows interaction, that is synchronization and flow of information between E and F on a shared channel. Channels may be synchronous or asynchronous. In the case of synchronous channels, the agent sending a message waits until another agent has received the message. Asynchronous channels do not force the sender to wait. Here, only synchronous channels will be considered. The simplest operator for parallel composition is interleaving, - ∣∣∣ -, that aims at modeling the fact that two parallel processes can progress by alternating at any rate the execution of their actions. E :→ E′
μ
F :→ F ′
μ
μ
E ∣∣∣ F :→ E ∣∣∣ F ′
μ
E ∣∣∣ F :→ E′ ∣∣∣ F
Another parallel operator is binary parallel composition, - ∣ -, that not only models the interleaved execution of the actions of two parallel processes, but also the possibility that the two partners synchronize whenever they are willing to perform complementary visible actions (below represented as a and a). In this case, the visible outcome is a τ-action that cannot be seen by other processes that are acting in parallel with the two communication partners. This is the parallel composition used in CCS. E :→ E′
μ
F :→ F ′
μ
μ
E∣F :→ E∣F ′
μ
E∣F :→ E′ ∣F α
α
E :→ E′
F :→ F ′ τ
The rules for external choice are more articulate. This operator behaves exactly like the mixed choice in case one of the components executes a visible action; however, it does not discard any alternative upon execution of an invisible action. α
E :→ E′ α
E ◻ F :→ E′
α
(α ≠ τ)
F :→ F ′ α
E ◻ F :→ F ′
(α ≠ τ)
E∣F :→ E′ ∣F ′
(α ≠ τ)
Instead of binary synchronization, some process algebras, like CSP, make use of operators that permit multiparty synchronization, - ∣[L]∣ -. Some actions, those in L, are deemed to be synchronization actions and can be performed by a process only if all its parallel components can execute those actions at the same time. μ
E :→ E′
τ
F :→ F ′
τ
E :→ E′
τ
E ◻ F :→ E ◻ F ′
τ
E ∣[L]∣ F :→ E′ ∣[L]∣ F
E ◻ F :→ E′ ◻ F
μ
(μ /∈ L)
μ
Parallel composition: Parallel composition of two processes, say E and F, is the key primitive distinguishing
F :→ F ′ μ
E ∣[L]∣ F :→ E ∣[L]∣ F ′
(μ /∈ L)
a
E :→ E′ E ∣[L]∣
F :→ F ′ a
F :→ E′
∣[L]∣
F′
(a ∈ L)
It is worth noting that the result of a synchronization, in this case, yields a visible action, and that by setting the synchronization alphabet to / the multiparty synchronization operator ∣/∣ can be used to obtain pure interleaving, ∣∣∣. A more general composition is the merge operator, − ∥ − that is used in ACP. It permits executing two process terms in parallel (thus freely interleaving their actions), but also allows for communication between its process arguments according to a communication function γ : A × A :→ A, that, for each pair of atomic actions a and b, produces the outcome of their communication γ(a, b), a partial function that states which actions can be synchronized and the outcome of such a synchronization. E :→ E′
μ
F :→ F ′
μ
μ
E ∥ F :→ E ∥ F ′
μ
E ∥ F :→ E′ ∥ F a
Value passing: The above parallel combinators can be generalized to model not only synchronization, but also exchange of values. As an example, below, the generalization of binary communication is presented. There are complementary rules for sending and receiving values. The first axiom models a process willing to input a value and to base its future evolutions on it. The second axiom models a process that evaluates an expression (via the valuation function val(e)) and outputs the result. a(v)
a(x). E :→ E{v/x}
γ(a,b)
ACP has also another operator called left merge, - -, that is similar to ∥ but requires that the first process to perform an (independent) action be the left operand. μ
E :→ E′ μ
EF :→ E′ ∥ F The ACP communication merge, - ∣c -, requires instead that the first action be a synchronization action. a
b
E :→ E′
F :→ F ′
γ(a,b)
a(v)
av
E :→ E′
F :→ F ′ τ
E∣F :→ E′ ∣F ′
F :→ F ′
E ∥ F :→ E′ ∥ F ′
Disruption: An operator that is between parallel and nondeterministic composition is the so-called disabling operator, −[> −, that permits interrupting the evolution of a process. Intuitively, E [> F behaves like E, but can be interrupted at any time by F, once E terminates F is discarded.
E/L :→ E′ /L
μ
E [> F :→ E′ [> F
√ (μ ≠ ) μ
F :→ F ′ μ
E [> F :→ F ′
μ
√
E :→ E′ τ
E [> F :→ E′
a(v)
av
E :→ E′
F :→ F ′ τ
E∣F :→ E′ ∣F ′
Abstraction: Processes do not limit the number of connections that can be made at a given interaction point. But interaction points allow interference. For the synthesis of compact, minimal, and compositional systems, the ability to restrict interference is crucial. The hiding operator, −/L, hides (i.e., transforms into τ-actions) all actions in L to forbid synchronization on them. However, it allows the system to perform the transitions labeled by hidden actions. E :→ E′
μ
a val(e)
a e. E ::::→ E
In case the exchanged values are channels, this approach can be used to provide also models for mobile systems.
E∣c F :→ E′ ∥ F ′
E :→ E′
(v is a value)
The next rule, instead, models synchronization between processes. If two processes, one willing to output and the other willing to input, are running in parallel, a synchronization can take place and the perceived action will just be a τ-action.
b
E :→ E′
μ
μ
(μ ∈/ L)
E :→ E′ τ
E/L :→ E′ /L
(μ ∈ L)
The restriction operator, −∖L, is a unary operator that restricts the set of visible actions a process can perform. Thus, process E∖L can perform only actions not in L. Obviously, invisible actions cannot be restricted.

E –μ→ E′
──────────────────   (μ, μ̄ ∉ L)
E∖L –μ→ E′∖L
The operator [f], where f is a relabeling function from A to A, can be used to rename some of the actions a process can perform, to make it compatible with new environments.

E –a→ E′
──────────────────
E[f] –f(a)→ E′[f]

Modeling infinite behaviors: The operations presented so far describe only finite interaction and are consequently insufficient for providing full computational power, in the sense of being able to model all computable functions. In order to reach full power, one certainly needs operators for modeling non-terminating behavior. Many operators have been introduced that allow finite descriptions of infinite behavior. However, it is important to remark that most of them do not fit the formats used so far and cannot be defined by structural induction. One of the most used is the construct rec x. −, well known from the sequential world. If E is a process that contains the variable x, then rec x. E represents the process that behaves like E once all occurrences of x in E are replaced by rec x. E. In the rule below, that models the operational behavior of a recursively defined process, the term E[rec x. E/x] denotes exactly the above-mentioned substitution.

E[rec x. E/x] –μ→ E′
──────────────────────
rec x. E –μ→ E′

The notation rec x. E for recursion sometimes makes the process expressions more difficult to parse and less pleasant to read. A suitable alternative is to allow for the (recursive) definition of some fixed set of constants, that can then be used as some sort of procedure calls inside processes. Assuming the existence of an environment (a set of process definitions) Γ = {X₁ ≜ E₁, X₂ ≜ E₂, . . . , Xn ≜ En}, the operational semantics rule for a process variable becomes:

X ≜ E ∈ Γ    E –μ→ E′
──────────────────────
X –μ→ E′

Another operator used to describe infinite behaviors is the so-called bang operator, ! −, or replication. Intuitively, !E represents an unlimited number of instances of E running in parallel. Thus, its semantics is rendered by the following inference rule:

E ∣ !E –μ→ E′
────────────────
!E –μ→ E′

Three Process Algebras: CCS, CSP and ACP
A process algebra consists of a set of terms, an operational semantics associating LTSs to terms, and an equivalence relation equating terms exhibiting “similar” behavior. The operators for most process algebras are those described above. The equivalences can be trace, testing, or bisimulation equivalences, or variants thereof, possibly ignoring invisible actions. Below, three of the most popular process algebras are presented. First the syntax, i.e., the selected operators, will be introduced; then their semantics will be provided by following the three different approaches outlined before: operational (for CCS), denotational (for CSP), and algebraic (for ACP). For CSP and ACP, the relationships between the proposed semantics and the operational one, to be used as a yardstick, will be mentioned. To denote the LTS associated to a generic CSP or ACP process p via the operational semantics, the notation LTS(p) will be used. Reference will be made to specific behavioral equivalences over LTSs; these differ in their standing about which states of an LTS have to be considered equivalent. Three main criteria have been used to decide when two systems can be considered equivalent:
1. The two systems perform the same sequences of actions.
2. The two systems perform the same sequences of actions and after each sequence are ready to accept the same sets of actions.
3. The two systems perform the same sequences of actions and after each sequence exhibit, recursively, the same behavior.
These three different criteria lead to three groups of equivalences that are known as traces equivalences, decorated-traces equivalences (testing and failure equivalence), and bisimulation-based equivalences (strong bisimulation, weak bisimulation, branching bisimulation).
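The third criterion can be checked mechanically on finite LTSs. The sketch below is only an illustration under an assumed encoding of states and transitions (it is not taken from the tools cited in this entry); it computes strong-bisimulation equivalence classes by naive partition refinement.

```python
# A small sketch of strong bisimulation checking on a finite LTS via naive
# partition refinement: two states are bisimilar exactly when they end up in
# the same block of the coarsest stable partition.

def bisimulation_classes(states, transitions):
    """transitions: set of (source, label, target). Returns a set of blocks."""

    def signature(state, blocks):
        # For each outgoing label, record which block the target lies in.
        return frozenset(
            (a, blk) for (s, a, t) in transitions if s == state
            for blk in blocks if t in blk
        )

    blocks = {frozenset(states)}                 # start with one big block
    while True:
        new_blocks = set()
        for blk in blocks:
            groups = {}
            for s in blk:                        # split blocks by signatures
                groups.setdefault(signature(s, blocks), set()).add(s)
            new_blocks.update(frozenset(g) for g in groups.values())
        if new_blocks == blocks:
            return blocks
        blocks = new_blocks

# a.(b + c)  versus  a.b + a.c : trace-equivalent but not bisimilar.
T = {("p0", "a", "p1"), ("p1", "b", "p2"), ("p1", "c", "p3"),
     ("q0", "a", "q1"), ("q1", "b", "q3"), ("q0", "a", "q2"), ("q2", "c", "q4")}
S = {"p0", "p1", "p2", "p3", "q0", "q1", "q2", "q3", "q4"}
classes = bisimulation_classes(S, T)
print(any({"p0", "q0"} <= blk for blk in classes))   # False: p0 and q0 differ
```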
CCS: Calculus of Communicating Systems The Calculus of Communicating Systems (CCS) is a process algebra introduced by Robin Milner around . Its actions model indivisible communications
between exactly two participants, and the set of operators includes primitives for describing parallel composition, choice between actions, and scope restriction. The basic semantics is operational and permits associating an LTS to each CCS term. The set A of basic actions used in CCS consists of a set Λ of labels and of a set Λ̄ of complementary labels. Aτ denotes A ∪ {τ}. The syntax of CCS, used to generate all terms of the algebra, is the following:

P ::= nil ∣ x ∣ μ. P ∣ P∖L ∣ P[f] ∣ P + P ∣ P ∣ P ∣ rec x. P

where μ ∈ Aτ; L ⊆ Λ; f : Aτ → Aτ with f(ᾱ) = f̄(α) and f(τ) = τ. The above operators are taken from those presented above:
– The atomic process (nil)
– Action prefixing (μ. P)
– Mixed choice (+)
– Binary parallel composition (∣)
– Restriction (P∖L)
– Relabeling (P[f])
– Recursive definitions (rec x. E)
The operational semantics of the above operators is exactly the same as the one of those operators with the same name described before, and it is thus not repeated here. CCS has been studied with bisimulation and testing semantics that are used to abstract from unnecessary details of the LTS associated to a term. Also denotational and axiomatic semantics for the calculus have been extensively studied. A denotational semantics in terms of so-called acceptance trees has been proved to be in full agreement with the operational semantics abstracted according to testing equivalences. Different algebraic semantics have been provided that are based on sound and complete axiomatizations of bisimilarity, testing equivalence, weak bisimilarity, and branching bisimilarity.
CSP: A Theory of Communicating Sequential Processes The first denotational semantics proposed for CSP associates to each term just the set of the sequences of actions the term could induce. However, while suitable to model the sequences of interactions a process could have with its environment, this semantics is unable to model situations that could lead to deadlock. A new approach, basically denotational but with a strong operational intuition, was proposed next. In this approach,
the semantics is given by associating a so-called refusal set to each process. A refusal set is a set of failure pairs ⟨s, F⟩, where s is a finite sequence of visible actions in which the process might have been engaged, and F is a set of action the process is able to reject on the next step. The semantics of the various operators is defined by describing the transformation they induce on the domain of refusal sets. The meaning of processes is then obtained by postulating that two processes are equivalent if and only if they cannot be distinguished when their behaviors are observed and their reactions to a finite number of alternative possible synchronization is considered. Indeed, the association of processes to refusal sets is not one-to-one; the same refusal set can be associated to more than one process. A congruence is then obtained that equates processes with the same denotation. The set of actions is a finite set of labels, denoted by Λ ∪ {τ} . There is no notion of complementary action. The syntax of CSP is reported below, and for the sake of simplicity, only finite terms (no recursion) are considered: E ::= Stop ∣skip∣a → E ∣E ⊓ E ∣E ◻ E ∣E ∣[L]∣ E ∣E/a – Two basic processes: successful termination (skip), null process (Stop) – Action prefixing here denoted by a → E – Internal choice (⊕) here denoted by ⊓ and external choice ( ◻ ) – Parallel composition with synchronization on a fixed alphabet (∣[L]∣, L ⊆ Λ) – Hiding (/a, an instance of the more general operator /L with L ⊆ Λ) Parallel combinators representing pure interleaving and parallelism with synchronization on the full alphabet can be obtained by setting the synchronization alphabet to / or to Λ, respectively. The denotational semantics of CSP compositionally associates a set of failure pairs to each CSP term generated by the above syntax. A function F [[−]] maps each CSP process (say P) to set of pairs (s, F), where s is one of the sequences of actions P may perform, and F represents the set of actions that P can refuse after performing s. As anticipated, there is a strong correspondence between the denotational semantics of CSP and the operational semantics that one could define by
relying on the one presented in the previous section for the specific operators. – F [[P]] = F [[Q]] if and only if LTS(P) ≃test LTS(Q).
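For a finite, acyclic LTS, the failure pairs described above can be enumerated directly. The following sketch is only an illustration of that idea; the dictionary encoding of transitions and the function name failures are assumptions, and τ-transitions are ignored for simplicity.

```python
# A rough sketch of the failures of a finite, acyclic LTS: a failure pair
# (s, F) records a trace s after which the process can be in a state that
# refuses every action in F.

from itertools import combinations

def failures(start, transitions, alphabet):
    """transitions: dict state -> list of (action, next_state); acyclic LTS."""
    pairs = set()

    def explore(state, trace):
        enabled = {a for (a, _) in transitions.get(state, [])}
        refusable = sorted(alphabet - enabled)
        # after 'trace' the process may refuse any subset of the disabled actions
        for r in range(len(refusable) + 1):
            for refusal in combinations(refusable, r):
                pairs.add((trace, frozenset(refusal)))
        for a, nxt in transitions.get(state, []):
            explore(nxt, trace + (a,))

    explore(start, ())
    return pairs

# a -> (b -> Stop [] c -> Stop): after trace <a>, neither b nor c can be refused.
lts = {"s0": [("a", "s1")], "s1": [("b", "s2"), ("c", "s3")]}
print((("a",), frozenset({"b"})) in failures("s0", lts, {"a", "b", "c"}))  # False
```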
ACP: An Algebra of Communicating Processes The methodological concern of ACP was to present “first a system of axioms for communicating processes … and next study its models” ([], p. ). The equations are just a means to realize the real desideratum of abstract algebra, which is to abstract from the nature of the objects under consideration. In the same way as the mathematical theory of rings is about arithmetic without relying on a mathematical definition of number, ACP deals with process theory without relying on a mathematical definition of process. In ACP, a process algebra is any mathematical structure, consisting of a set of objects and a set of operators, like, e.g., sequential, nondeterministic, or parallel composition, that enjoys a specific set of properties as specified by given axioms. The set of actions Λτ consists of a finite set of labels Λ ∪ {τ}. There is no notion of complementary action. The syntax of ACP is reported below, and for the sake of simplicity, only finite terms (no recursion) are considered:

P ::= √ ∣ δ ∣ a ∣ P + P ∣ P ⋅ P ∣ P ∥ P ∣ P ⌊⌊ P ∣ P ∣c P ∣ ∂H(P)

– Three basic processes: successful termination (√), the null process, here denoted by δ, and atomic action (a)
– Mixed choice (+)
– Sequential composition (;), here denoted by ⋅
– Hiding (/H with H ⊆ Λ), here denoted by ∂H(−)
– Three parallel combinators: merge (∥), left merge (⌊⌊), and communication merge (∣c)
The system of axioms of ACP is presented as a set of formal equations, and some of the operators, e.g., left merge (⌊⌊), have been introduced exactly for providing finite equational presentations. Below, the axioms relative to the terms generated by the above syntax are presented. Within the axioms, x and y denote generic ACP processes.
(A1) x + y = y + x
(A2) (x + y) + z = x + (y + z)
(A3) x + x = x
(A4) (x + y) ⋅ z = x ⋅ z + y ⋅ z
(A5) (x ⋅ y) ⋅ z = x ⋅ (y ⋅ z)
(A6) x + δ = x
(A7) δ ⋅ x = δ
The set of axioms considered above induces an equality relation, denoted by =. A model for an axiomatization is a pair ⟨M, ϕ⟩, where M is a set and ϕ is a function (the unique isomorphism) that associates elements of M to ACP terms. This leads to the following definitions:
1. A set of equations is sound for ⟨M, ϕ⟩ if s = t implies ϕ(s) = ϕ(t).
2. A set of equations is complete for ⟨M, ϕ⟩ if ϕ(s) = ϕ(t) implies s = t.
Any model of the axioms seen above is an ACP process algebra. The simplest model for ACP has as elements the equivalence classes induced by =, i.e., all ACP terms obtained starting from atomic actions, sequentialization, and nondeterministic composition, mapping each term t to its equivalence class [[t]] as determined by =. This model is correct and complete and is known as the initial model for the axiomatization. Different, more complex, models can be obtained by first using the SOS rules to give the operational semantics of the operators, building an LTS in correspondence of each ACP term, and then using bisimulation to identify some of them. This construction leads to establishing a strong correspondence between the axiomatic and the operational semantics of ACP. Indeed, if we consider the language with the null process, sequential composition, and mixed choice, we have:
– Equality = as induced by (A1)–(A7) is sound relative to bisimilarity ∼, i.e., if p = q then LTS(p) ∼ LTS(q);
– Equality = as induced by (A1)–(A7) is complete relative to bisimilarity ∼, i.e., if LTS(p) ∼ LTS(q) then p = q.
Similar results can be obtained when new axioms are added and weak bisimilarity or branching bisimilarity are used to factorize the LTSs.
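Viewed operationally, some of these axioms can be read as left-to-right simplification rules. The toy Python sketch below is only an illustration of that reading (not code from the ACP literature cited here); it applies A3, A6, and A7 to closed terms built from atomic actions, δ, + and ⋅, with a tuple encoding of terms that is an assumption of the example.

```python
# A toy sketch of using some axioms as rewrite rules to simplify closed terms
# over +, ., atomic actions, and "delta". Only A3 (x + x = x), A6 (x + delta = x)
# and A7 (delta . x = delta) are applied, bottom-up.

def simplify(t):
    if isinstance(t, str):                      # atomic action or "delta"
        return t
    op, x, y = t
    x, y = simplify(x), simplify(y)
    if op == "+":
        if x == y:                              # A3: idempotence of choice
            return x
        if y == "delta":                        # A6: delta is a unit for +
            return x
        if x == "delta":                        # A6 together with A1
            return y
        return ("+", x, y)
    if op == ".":
        if x == "delta":                        # A7: delta is left-absorbing
            return "delta"
        return (".", x, y)
    raise ValueError(op)

# (a + a) . b + delta  simplifies to  a . b
print(simplify(("+", (".", ("+", "a", "a"), "b"), "delta")))   # ('.', 'a', 'b')
```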
Future Directions The theory of process algebra is by now well developed. The reader is referred to [] to learn about its developments since its inception in the late s to the early . Currently, in parallel with the exploitation of the developed theories in classic areas such as protocol verification and in new ones such as biological systems, there is much work going on concerning: – Extensions to model mobile, network aware systems – Theories for assessing quantitative properties – Techniques for controlling state explosion In parallel with this, much attention is dedicated to the development of software tools to support specification and verification of very large systems and to the development of techniques that permit controlling the state explosion phenomenon that arise as soon as one considers the possible configurations resulting from the interleaved execution of (even a small number of) processes. Mobility and network awareness: Much of the ongoing work is relative to the definition of theories and formalisms to naturally deal with richer classes of systems, like, e.g., mobile systems and network aware applications. The π-calculus [] the successor of CCS, developed by Milner and coworkers with the aim of describing concurrent systems whose configuration may change during the computation has attracted much attention. It has laid the basis for research on process networks whose processes are mobile and the configuration of communication links is dynamic. It has also lead to the development of other calculi to support network aware programming: Ambient [], Distributed π [], Join [], Spi [], Klaim [], etc. There is still no unifying theory and the name process calculi is preferred to process algebras because the algebraic theories are not yet well assessed. Richer theories than LTS (Bi-graph [], Tiles [], etc.) have been developed and are still under development to deal with the new dimensions considered with the new formalisms. Quantitative extensions: Formalisms are being enriched to consider not only qualitative properties, like correctness, liveness, or safety, but also properties related to performance and quality of service. There has been much research to extend process algebra to deal with a quantitative notion of time and probabilities and
integrated theories have been considered. Actions are enriched with information about their duration, and formalisms extended in this way are used to compare systems relatively to their speed. For a comprehensive description of this approach, the reader is referred to []. Extensions have been considered also to deal with systems that in their behavior depend on continuously changing variables other than time (hybrid systems). In this case, systems descriptions involve differential algebraic equations, and connections with dynamic control theory are very important. Finally, again with the aim of capturing quantitative properties of systems and of combining functional verification with performance analysis, there have been extensions to enrich actions with rates representing the frequency of specific events and the new theories are being (successfully) used to reason about system performance and system quality. Tools: To deal with non-toy examples and apply the theory of process algebras to the specification and verification of real systems, tool support is essential. In the development of tools, LTSs play a central role. Process algebra terms are used to obtain LTSs by exploiting operational semantics and these structures are then minimized, tested for equivalence, model checked against formulae of temporal logics, etc. One of the most known tools for process algebras is CADP (Construction and Analysis of Distributed Processes) []: together with minimizers and equivalence and model checkers, it offers many others functionalities ranging from step-by-step simulation to massively parallel model checking. CADP has been employed in an impressive number of industrial projects. CWB (Concurrency Workbench) [] and CWB-NC (Concurrency Workbench New Century) [] are other tools that are centered on CCS, bisimulation equivalence, and model checking. FDR (Failures/Divergence Refinement) [] is a commercial tool for CSP that has played a major role in driving the evolution of CSP from a blackboard notation to a concrete language. It allows the checking of a wide range of correctness conditions, including deadlock and livelock freedom as well as general safety and liveness properties. TAPAs (Tool for the Analysis of Process Algebras) [] is a recently developed software to support teaching of the theory of process algebras; it maintains a consistent double representation as term and as graph of each system.
Moreover, it offers tools for the verification of many behavioral equivalences, possibly with counterexamples, minimization, step-by-step execution, and model checking. TwoTowers is instead a versatile tool for the functional verification, security analysis, and performance evaluation of computer, communication, and software systems modeled with the stochastic process algebra EMPA []. μCRL [] is a toolset that offers an appropriate treatment of data and relies also on theorem proving. Moreover, it makes use of interesting techniques for visualizing large LTSs.
Relationships to Other Models of Concurrency In a private communication, in , Robin Milner, one of the founding fathers of process algebras, wrote: ▸ The concept of process has become increasingly important in computer science in the last three decades and more. Yet we still don’t agree on what a process is. We probably agree that it should be an equivalence class of interactive agents, perhaps concurrent, perhaps non-deterministic.
This quote summarizes the debate on possible models of concurrency that has taken place during the last thirty years and has been centered on three main issues: – Interleaving vs true concurrency – Linear-time vs branching-time – Synchrony vs asynchrony Interleaving vs True Concurrency
The starting point of the theory of process algebras has been automata theory and regular expressions, and the work on the algebraic theory of regular expressions as terms representing finite state automata [] has significantly influenced its developments. Given this starting point, the underlying models of all process algebras represent possible concurrent executions of different programs in terms of the nondeterministic interleaving of their sequential behaviors. The fact that a system is composed of independently computing agents is ignored and behaviors are modeled in terms of purely sequential patterns of actions. It has been demonstrated that many interesting and important properties of distributed systems may be expressed and proved by relying on interleaving models. However, there are situations in which it is important to keep the information that a system is composed of independently computing components. This possibility is offered by the so-called non-interleaving or true-concurrency models, with Petri nets [] as the prime example. These models describe not only temporal ordering of actions, but also their causal dependences. Non-interleaving semantics of process algebras have also been provided, see, e.g., []. Linear-time vs Branching-time
Another issue, again ignored in the initial formalization of regular expressions, is how the concept of nondeterminism in computations is captured. Two possible views regarding the nature of nondeterministic choice induce two types of models giving rise to the linear-time and branching-time dichotomy. A linear-time model expresses the full nondeterministic behavior of a system in terms of the set of possible runs; time is treated as if each moment there is a unique possible future. Major examples of structures used to model sets of runs are Hoare traces (captured also by traces equivalence) for interleaving models [], and Mazurkiewicz traces [] and Pratt’s pomsets [] for non-interleaving models. The branching-time model is the main one considered in process algebras and considers the set of runs structured as a computation tree. Each moment in time may split into various possible futures, and semantic models are computation trees. For non-interleaving models, event structures [] are one of the best-known models taking into account both nondeterminism and true concurrency. Synchrony vs Asynchrony
There are two basic approaches to describing interaction between a sender and a receiver of a message (signal), namely, synchronous and asynchronous interaction. In the former case, before proceeding, the sender has to make sure that a receiver is ready. In the latter case, the sender leaves track of its action but proceeds without any further waiting. The receiver has to wait in both cases. Process algebras are mainly synchronous, but asynchronous variants have been recently proposed and are receiving increasing attention. However, many other successful asynchronous models have been developed. Among these, it is important to mention Esterel
[], a full-fledged programming language that allows the simple expression of parallelism and preemption and is very well suited for control-dominated model designs; Actors [], a formalism that does not necessarily record messages in buffers and puts no requirement on the ordering of message delivery; Linda [], a model of coordination and communication among several parallel processes operating upon objects stored in and retrieved from a shared, virtual, associative memory; and, to conclude, Klaim [], a distributed variant of Linda with a strong process algebraic flavor.
Related Entries Actors Behavioral Equivalences Bisimulation CSP (Communicating Sequential Processes) Pi-Calculus
Bibliographic Notes and Further Reading A number of books describing the different process algebras can be consulted to obtain deeper knowledge of the topics sketched here. Unfortunately most of them are concentrating only on one of the formalisms rather than on illustrating the unifying theories. CCS: The seminal book on CCS is [], in which sets of operators equipped with an operational semantics and the notion of observational equivalence have been presented for the first time. The, by now, classical text book on CCS and bisimulation is []. A very nice, more recent, book on CCS and the associated Hennessy Milner Modal Logic is []; it also presents timed variants of process algebras and introduces models and tools for verifying properties also of this new class of systems. CSP: The seminal book on CSP is [], where all the basic theory of failure sets is presented together with many operators for processes composition and basic examples. In [], the theory introduced in [] is developed in full detail, and a discussion on the different possibilities to deal with anomalous infinite behaviors is considered together with a number of well-thought examples. Moreover the relationships between operational and denotational semantics are fully investigated.
Another excellent text book on CSP is [] that also considers timed extensions of the calculus. ACP: The first published book on ACP is [], where the foundations of algebraic theories are presented and the correspondence between families of axioms, and strong and branching bisimulation are thoroughly studied. This is a book intended mainly for researchers and advanced students, a gentle introduction to ACP can be found in []. Other approaches: Apart from these books, dealing with the three process algebras presented in these notes, it is also worth mentioning a few more books. LOTOS, a process algebra that was developed and standardized within ISO for specifying and verifying communication protocols, is the central calculus of a recently published book [] that discusses also the possibility of using different equivalences and finer semantics for the calculus. A very simple and elegant introduction to algebraic, denotational, and operational semantics of processes, which studies in detail the impact of the testing approach on a calculus obtained from a careful selection of operators from CCS and CSP, can be found in []. The text [] is the book on the π-calculus. For studying this calculus, the reader is, however, encouraged to consider first reading [].
P Bibliography . Abadi M, Gordon AD () A calculus for cryptographic protocols: the spi calculus. Inform Comput ():– . Aceto L, Gordon AD (eds) () Proceedings of the workshop “Essays on algebraic process calculi” (APC ), Bertinoro, Italy. Electronic notes in theoretical computer science vol . Elsevier, Amsterdam . Agha G () Actors: a model of concurrent computing in distributed systems. MIT Press, Cambridge . Aldini A, Bernardo M, Corradini F () A process algebraic approach to software architecture design. Springer, New York . Baeten JCM, Weijland WP () Process algebra. Cambridge University Press, Cambridge . Bergstra JA, Klop JW () Process algebra for synchronous communication. Inform Control (–):– . Bergstra JA, Ponse A, Smolka SA (eds) () Handbook of process algebra. Elsevier, Amsterdam . Berry G, Gonthier G () The esterel synchronous programming language: design, semantics, implementation. Sci Comput Program ():–
. Bettini L, Bono V, Nicola R, Ferrari G, Gorla D, Loreti M, Moggi E, Pugliese R, Tuosto E, Venneri B () The klaim project: theory and practice. In: Global computing: programming environments, languages, security and analysis of systems, Lecture notes in computer science, vol . Springer-Verlag, Heidelberg, pp – . Bowman H, Gomez R () Concurrency theory: calculi and automata for modelling untimed and timed concurrent systems. Springer, London . Brookes SD, Hoare CAR, Roscoe AW () A theory of communicating sequential processes. J ACM ():– . Calzolai F, De Nicola R, Loreti M, Tiezzi F () Tapas: a tool for the analysis of process algebras. In: Transactions on Petri nets and other models of concurrency, vol , pp – . Cardelli L, Gordon AD () Mobile ambients. Theor Comput Sci ():– . Clarke EM, Emerson EA () Design and synthesis of synchronization skeletons using branching-time temporal logic. In: Proceedings of logic of programs, Lecture notes in computer science, vol . Springer-Verlag, Heidelberg, pp – . Cleaveland R, Sims S () The ncsu concurrency workbench. In: CAV, Lecture notes in computer science, vol . Springer-Verlag, Heidelberg, pp – . Conway JH () Regular algebra and finite machines. Chapman and Hall, London . De Nicola R, Ferrari GL, Pugliese R () Klaim: a kernel language for agents interaction and mobility. IEEE Trans Software Eng ():– . Fokkink W () Introduction to process algebra. SpringerVerlag, Heidelberg . Fournet C, Gonthier G () The join calculus: a language for distributed mobile programming. In: Barthe G, Dybjer P, Pinto L, and Saraiva J (eds) APPSEM, Lecture notes in computer science, vol , Springer, Heidelberg, pp – . Gadducci F, Montanari U () The tile model. In: Plotkin G, Stirling C, Tofte M (eds) Proof, language and interaction: essays in honour of Robin Milner. MIT Press, Cambridge, pp – . Garavel H, Lang F, Mateescu R () An overview of CADP . In: European Association for Software Science and Technology (EASST), vol . Newsletter, pp – . Gelernter D, Carriero N () Coordination languages and their significance. Commun ACM ():– . Groote JF, Mathijssen AHJ, Reniers MA, Usenko YS, van Weerdenburg MJ () Analysis of distributed systems with mcrl. In: Alexander M, Gardner W (eds) Process algebra for parallel and distributed processing. Chapman Hall, Boca Raton, FL, pp – . Hennessy M () Algebraic theory of processes. The MIT Press, Cambridge . Hennessy M () A distributed pi-calculus. Cambridge University Press, Cambridge . Hoare CAR () A calculus of total correctness for communicating processes. Sci Comput Program (–):– . Hoare CAR () Communicating sequential processes. Prentice-Hall, Upper Saddle River
. Kozen D () Results on the propositional -calculus. Theor Comput Sci :– . Larsen KG, Aceto L, Ingolfsdottir A, Srba J () Reactive systems: modelling, specification and verification. Cambridge University Press, Cambridge . Mazurkiewicz A () Introduction to trace theory. In: Rozenberg G, Diekert V (ed) The book of traces. World Scientific, Singapore, pp – . Milner R () A calculus of communicating systems. Lecture notes in computer science, vol . Springer-Verlag, Heidelberg . Milner R () Communication and concurrency. Prentice-Hall, Upper Saddle River . Milner R () Communicating and mobile systems: the picalculus. Cambridge University Press, Cambridge . Milner R () The space and motion of communicating agents. Cambridge University Press, Cambridge . Moller F, Stevens P () Edinburgh Concurrency Workbench user manual. Available from http://homepages.inf.ed.ac. uk/perdita/cwb/. Accessed January . Olderog ER () Nets, terms and formulas. Cambridge University Press, Cambridge . Plotkin GD () A structural approach to operational semantics. J Log Algebr Program –:– . Pratt V () Modeling concurrency with partial orders. Int J Parallel Process : . Reisig W () Petri nets: an introduction. Monographs in theoretical computer science. An EATCS Series, vol . Springer-Verlag, Berlin . Roscoe AW () Model-checking csp. In: A classical mind: essays in honour of C.A.R. Hoare. Prentice-Hall, Upper Saddle River . Roscoe AW () The theory and practice of concurrency. Prentice-Hall, Upper Saddle River . Sangiorgi D, Walker D () The π-Calculus: a theory of mobile processes. Cambridge University Press, Cambridge . Schneider SA () Concurrent and real time systems: the CSP approach. John Wiley, Chichester . Winskel G () An introduction to event structures. In: de Bakker JW, de Roever WP, Rozenberg G (eds) Linear time, branching time and partial order in logics and models for concurrency – Rex workshop, Lecture notes in computer science, vol . Springer, Heidelberg, pp –
Process Calculi
Process Algebras

Process Description Languages
Process Algebras

Process Synchronization
Path Expressions
Synchronization

Processes, Tasks, and Threads
Processes, Tasks, and Threads are programs or parts of programs under execution. A program does not define a process, task, or thread since the same code may underlie different processes, tasks, or threads. At any given time, the state of a process, task, or thread includes the location of the instruction being executed and the value stored in all memory locations accessible to the program. An important characteristic of processes, tasks, and threads is that they must execute their instructions at a speed greater than zero when they are not explicitly blocked by synchronization operations. Although the notion of processes, tasks, and threads differs across systems, typically a process may contain multiple threads, and threads share memory while processes communicate via messages.

Processor Allocation
Job Scheduling

Processor Arrays
Systolic Arrays

Processors-in-Memory
Processors-in-memory (PIM) are devices that tightly integrate, for example, on a single chip, both processing logic and memory with the objective of reducing memory latency and increasing memory bandwidth at a low cost in terms of power, complexity, and space.
Bibliography
. Patterson D, Anderson T, Cardwell N, Fromm R, Keeton K, Kozyrakis C, Thomas R, Yelick K () A case for intelligent RAM. IEEE Micro ():–
. Kogge PM, Brockman JB, Sterling T, Gao G Processing in memory: chips to petaflops. In: Workshop on mixing logic and DRAM: chips that compute and remember at ISCA‘

Profiling
Intel Thread Profiler
OpenMP Profiling with ompP
Performance Analysis Tools
Scalasca
TAU

Profiling with OmpP, OpenMP
OpenMP Profiling with ompP

Program Graphs
Data Flow Graphs

Programmable Interconnect Computer
Blue CHiP

Programming Languages
Array Languages
C*
Chapel (Cray Inc. HPCS Language)
Cilk
CoArray Fortran
Concurrent ML
Connection Machine Fortran
Connection Machine Lisp
Deterministic Parallel Java
Fortran and Its Successors
Fortress (Sun HPCS Language)
Glasgow Parallel Haskell (GpH)
HPF (High Performance Fortran)
Linda
*Lisp
Multilisp
NESL
OpenMP
Functional Languages
Logic Languages
PGAS (Partitioned Global Address Space) Languages
Sisal
Stream Programming Languages
Titanium
UPC
ZPL

Programming Models
Actors
BSP (Bulk Synchronous Parallelism)
Concurrent Collections Programming Model
CSP (Communicating Sequential Processes)
SPMD Computational Model

Prolog
Logic Languages

Prolog Machines
Prolog machines [, ] include instructions and devices for the efficient implementation of Prolog programs. During the years of Japan's fifth generation project, several data flow machines were designed to implement KL, a parallel variant of Prolog.
Related Entries
Data Flow Computer Architecture
Logic Languages
Bibliography
. Warren DH () A view of the fifth generation and its impact. AI Mag ():–
. Holmer BK, Sano B, Carlton M, Van Roy P, Despain AM () Design and analysis of hardware for high performance prolog. J Logic Program ():–

Promises
Futures

Protein Docking
Roger S. Armen, Eric R. May, Michela Taufer
Thomas Jefferson University, Philadelphia, PA, USA
University of Michigan, Ann Arbor, MI, USA
University of Delaware, Newark, DE, USA

Definition
Predict the final atomic resolution structure of a protein–protein complex starting from the coordinates of the unbound conformation of each component protein.

Discussion
Introduction
Most successful approaches to protein–protein docking use multistage hierarchical methods typically consisting of an initial global rigid body search (or a simplified reduced search based on known information) to identify a number of candidate protein–protein complexes followed by a more sophisticated refinement procedure. After searching, filtering, and refining, the predicted complexes are rescored based on a sophisticated scoring function for selecting the lowest energy complex. The native protein–protein complex is at the global free energy minimum; therefore those models with the lowest (free) energy should be the most similar to the native complex.
Approaches to Conformational Searching Exhaustive Rigid Body Searching The first step in docking two known protein structures is to generate configurations of protein–protein complexes. Traditionally, the generation of conformations
consists of large-scale simulations of independent jobs and thus can be performed in parallel on parallel or distributed systems such as clusters, volunteer computing systems, grid computing systems, or cloud computing platforms. Treating the proteins as rigid bodies is the most computationally efficient method and will be discussed in this section (incorporation of increasing levels of protein flexibility will be covered later). There are six degrees of freedom (three rotation and three translation) that must be explored for an exhaustive search of complex geometries. The interactions that dominate the favorable binding energy of a protein–protein complex are interactions between the surface residues of the proteins. Therefore, one can naively assume that the most favorable complex geometries are those that bury the largest amount of surface area at the interface. This concept leads to the notion of surface complementarity (i.e., bulges on the surface of Protein A align with a valley on the surface of the Protein B), which is a phenomena observed in many experimentally determined protein complexes. To identify those complexes that maximize the interfacial surface area, a method of describing the individual protein surfaces must first be addressed. An effective method is to project each protein onto a D grid and identify which grid points lie at the protein surface and which in the protein interior, as was first described by Katchalski-Katzir et al. []. The grid can be represented by a function a when Protein A is placed in the grid, and function b when Protein B is placed in the grid, where
al,m,n is 1 at surface points, large and negative (≪ 0) at interior points, and 0 at exterior points, and bl,m,n is 1 at surface points, small and positive (> 0) at interior points, and 0 at exterior points, where l, m, and n are the grid indices. With these definitions, a correlation function can be calculated between the functions a and b as follows:

cα,β,γ = ∑l ∑m ∑n al,m,n ⋅ bl+α,m+β,n+γ    ()
where {α, β, γ} is a shift vector applied to Protein B. The value of c is zero when there is no contact between the proteins, is negative when there is overlap of protein interior regions, and is positive when the surface regions overlap. Complexes with strong positive values have the largest surface complementarity and are retained for further assessment. The evaluation of c for a given configuration requires O(N³) computations (for a grid with dimensions N × N × N); searching all translations of Protein B (all shift vectors) requires O(N⁶) computations. A discrete Fourier transform (DFT) of Eq. () can be performed and the inverse transform allows c to be reexpressed as

cα,β,γ = (1/N³) ∑o=1..N ∑p=1..N ∑q=1..N exp[2πi(oα + pβ + qγ)/N] ⋅ Co,p,q    ()

where C is the DFT of c, defined by Co,p,q = A∗o,p,q ⋅ Bo,p,q, where B is the DFT of b and A∗ is the complex conjugate of the DFT of a. Using a fast Fourier transform algorithm (FFT) reduces the O(N⁶) computations to O(N³ ln[N³]). Calculating c for all values of the shift vector on Protein B only accounts for translational degrees of freedom for a given relative rotation. To fully search the six-dimensional space, Protein B must be incrementally rotated through the three Euler angles that define its relative rotation; the calculation in Eq. () is therefore repeated for every discrete setting of the three Euler angles, i.e., a number of times that grows as 1/D³ for an angular increment of D degrees. Another method for finding surface complementarity uses spherical harmonics as the basis functions for describing the protein surfaces, instead of a Cartesian 3D grid, and was introduced by Ritchie and Kemp []. In this representation, the natural degrees of freedom are five Euler angles (two for Protein A, three for Protein B) and the center-of-mass separation of the two proteins. The shape complementarity can be computed via spherical polar Fourier correlations. This method exploits the property of spherical harmonics that they transform among themselves under rotation, which results in a more computationally efficient algorithm than the 3D Cartesian grid FFT method. The correlation function above only accounts for the surface complementarity and does not account for the chemical nature of the interaction. However, one can use a more complex function that takes into account
electrostatics, van der Waals, and/or desolvation effects, which can be mapped onto the grid as well. The details of these different “scoring functions” are covered in the subsequent sections.
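The translational part of this search maps directly onto array FFTs. The short numpy sketch below illustrates the idea under simplifying assumptions (toy cubic grids, no rotational scan, and invented function and variable names); it is not the implementation used by any of the programs mentioned in this entry.

```python
# A minimal numpy sketch of the FFT-based translational scan described above:
# both proteins are rasterized onto the same N x N x N grid and the correlation
# for every shift vector comes from one forward/inverse FFT pair instead of an
# explicit O(N^6) loop.

import numpy as np

def translational_scan(grid_a, grid_b):
    """Return c[alpha, beta, gamma] for all integer shifts of protein B."""
    A = np.fft.fftn(grid_a)
    B = np.fft.fftn(grid_b)
    C = np.conj(A) * B                      # C_opq = A*_opq . B_opq
    return np.real(np.fft.ifftn(C))         # inverse transform gives c

# Toy occupancy grids standing in for the two rasterized proteins.
N = 32
grid_a = np.zeros((N, N, N)); grid_b = np.zeros((N, N, N))
grid_a[8:12, 8:12, 8:12] = 1.0
grid_b[2:4, 2:4, 2:4] = 1.0
scores = translational_scan(grid_a, grid_b)
best = np.unravel_index(np.argmax(scores), scores.shape)
print("best shift vector:", best)
```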
Reduced Search Space Methods Both the FFT and the spherical polar Fourier method are designed for an exhaustive search of the D space defined by the rigid body rotations and translations of one protein about the other fixed protein. In many instances, additional information about the system may be available which can narrow the search space. For example, when the binding site region of one protein is known, a targeted search can be performed about that site, significantly reducing the search space. A rigid body search is still performed in the initial phase, but the reduced space makes methods such as Monte Carlo and genetic algorithms (GA) viable options. In the Monte Carlo methods, an interaction potential is used to calculate an interaction energy; trial moves are accepted or rejected based on the Metropolis criterion. The programs RosettaDock [] and ICM-DISCO [] utilize Monte Carlo in their docking procedures. Neither RosettaDock nor ICM-DISCO use a GA, but in other approaches either a traditional GA or variations (e.g., Lamarckian GA) have been used to drive the search for new low-energy conformations in which the “chromosomes” of the GA consist of the relative positions, rotations, and orientations of one or both proteins. For example, GA can be used to move the surface of one protein relative to the other and locate the area of greatest surface complementarity between the two. Search methods using GA have been widely evaluated in several protein–ligand docking programs, e.g., AutoDock and DOCK. The incorporation of NMR Nuclear Overhauser Effect (NOE) data, which indicates which residues are interacting, makes solving the docking problem much more tractable. Other more ambiguous data, such as chemical shift perturbations, residual dipolar couplings, and mutagenesis (i.e., alanine scanning), can also be incorporated. The program HADDOCK [] uses these data sets in the form of Ambiguous Interaction Restraints (AIR)s. An AIR restraint is satisfied if an interface residue on one protein interacts with any interface residue on the other protein. The HADDOCK search method generates random orientations and does
a rigid body minimization, followed by a torsion angle– simulated annealing protocol that introduces both side chain and backbone flexibility. Cartesian space molecular dynamics are then conducted in explicit solvent. A newer approach toward reducing the search space is the prediction of protein active sites using bioinformatics techniques. While it is still difficult to reliably predict a protein “hot spot,” further development of these techniques provides a promising avenue of future study.
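The Metropolis step at the heart of such Monte Carlo searches is simple to state in code. The sketch below is a generic illustration: the functions energy and perturb, the temperature kT, and all other names are placeholders, not RosettaDock or ICM-DISCO internals.

```python
# A hedged sketch of the Metropolis acceptance step used by Monte Carlo
# searches over rigid-body placements of one protein around a binding site.

import math, random

def metropolis_accept(e_old, e_new, kT=1.0):
    """Accept downhill moves always, uphill moves with Boltzmann probability."""
    if e_new <= e_old:
        return True
    return random.random() < math.exp(-(e_new - e_old) / kT)

def mc_search(energy, perturb, pose, steps=1000, kT=1.0):
    e = energy(pose)
    best_pose, best_e = pose, e
    for _ in range(steps):
        trial = perturb(pose)                 # random small rotation/translation
        e_trial = energy(trial)
        if metropolis_accept(e, e_trial, kT):
            pose, e = trial, e_trial
            if e < best_e:
                best_pose, best_e = pose, e
    return best_pose, best_e

# usage: mc_search(score_fn, random_rigid_move, initial_pose)
```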
Side-Chain Refinement In some protein–protein complexation events, the interprotein contacts can induce large-scale conformation changes in the individual protein configurations. These complexes are the most difficult to predict through docking methods and require the proteins to be fully flexible, greatly increasing the configurational search space that must be sufficiently sampled, all of which comes at a great computational cost. In most complexes, just small local deviations from the individual protein crystal structures occur. The majority of these small configurational changes occur in the residue side chains, while the protein backbone remains (nearly) rigid. In multistage docking methods, after a favorable binding geometry is determined from rigid docking, the residue side chains can be (re)built and optimized. The side-chain configurations can be limited to a discrete set of states, derived from known protein structures. From this precalculated rotamer library of low-energy states, the global minimum of side-chain configurations is sought. A combinatorial approach can be used to sample all states, or more efficient algorithms can be used. Dead end elimination is one such algorithm, which removes possible rotamer states that are determined as not being present in the global energy minimum state. Full flexibility of the side chains can be introduced after rotamer optimization, via a molecular mechanics method, to sample deviations from the rotamer library configuration.
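The dead end elimination idea can be summarized by a single inequality: a rotamer is discarded if its best possible interactions are still worse than the worst possible interactions of some competing rotamer at the same position. The sketch below illustrates this criterion under an assumed table layout for precomputed self and pair energies; it is not taken from any docking package.

```python
# A compact sketch of the classic dead-end elimination (DEE) criterion used
# during rotamer optimization. E_self[i][r] and E_pair[i][r][j][s] are assumed
# precomputed energy tables (an assumption of this illustration).

def dee_eliminate(E_self, E_pair, residues, rotamers):
    """Flag rotamers that provably cannot be part of the global minimum."""
    doomed = set()
    for i in residues:
        for r in rotamers[i]:
            for t in rotamers[i]:
                if t == r:
                    continue
                # r is dead-ending against t if even its best-case pair energies
                # are worse than t's worst-case pair energies at every residue j.
                bound = E_self[i][r] - E_self[i][t]
                bound += sum(
                    min(E_pair[i][r][j][s] for s in rotamers[j]) -
                    max(E_pair[i][t][j][s] for s in rotamers[j])
                    for j in residues if j != i
                )
                if bound > 0:
                    doomed.add((i, r))
                    break
    return doomed
```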
Incorporation of Protein Flexibility in Protein–Protein Docking Under physiological solution conditions, it is well known that proteins are dynamic rather than being entirely static and exhibit flexible motions that are thermally accessible within their thermodynamically favorable native-state topology. The range of accessible
motions scales from minor side-chain rearrangements and minor flexible loop backbone motions to more substantial backbone deviations (usually within .–. Å Ca -RMSD) including larger-scale “shear motions” and “hinge motions.” The most significant and complicated rearrangements upon complex formation seem to involve partial unfolding and refolding of flexible segments of the protein (usually termini). Although many protein–protein complexes exhibit minor deviations in the unbound to bound conformation of the component proteins, it appears that a modest number of protein– protein complexes undergo significant rearrangement upon complex formation. Therefore, incorporation of protein flexibility for each protein in a given complex is an important aspect required for accurate predictions of these challenging complexes. Incorporating protein flexibility into the docking search is a significant computational challenge; many different approaches and algorithms are currently being explored and evaluated. Most successful approaches incorporate some level of backbone flexibility at the level of the global search. This is because even modest backbone rearrangements have been shown to have a significant outcome on the identification of the native protein–protein interface. Therefore, most approaches first generate an ensemble of discrete flexible conformations, use this ensemble for the global search phase, and identify the most likely solutions before moving forward to more detailed model refinements and final scoring of complexes. One way of reducing the search space is to consider only one of the two binding partners to be flexible and only a small ensemble of the more flexible binding partner docking to a representative structure of the other binding partner. There are numerous approaches to generate an ensemble of structures for ensemble docking that incorporate backbone deviations. The ensemble can be taken from an experimental NMR ensemble or from multiple diverse conformations from X-ray crystallography. In many cases, where there is experimental structural information, there remains insufficient structural diversity in the known structures. If some conformational diversity does exist in experimental structures, it is then possible to use sophisticated tools (e.g., DynDom, HingeFind, FlexProt) to compare the known structure for the identification of flexible loops, as well as shear and hinge motions. It is also possible to use methods
like targeted molecular dynamics to generate a continuous trajectory of conformations that connect the known experimental conformations. In the more generic case, where only one unbound conformational state is known, it is necessary to generate an ensemble of discrete flexible conformations using computational methods that aim to accurately represent the assessable states of the flexible protein. To this end, it is possible to predict flexible segments using molecular framework approach algorithms (FIRST) and hingedetection algorithms. Following this approach, conformers can then be generated by sampling a limited set of flexible degrees of freedom to reduce the search space. Alternatively, diverse conformations can be generated directly from physical approaches that more fully sample the available degrees of freedom to the entire protein. These approaches include Molecular Dynamics (MD), MD methods that employ advanced conformational sampling techniques, Essential Dynamics (Principle Component Analysis), and variations of Normal Mode Analysis (NMA). Standard MD techniques are based on describing the all-atom structure of the system with a detailed force field and modeling dynamics by numerical integration of Newton’s equations of motion. MDsimulated annealing conformational searches have been shown to be quite effective in protein–ligand docking (CDOCKER). However, using MD in this way for protein–protein docking is not feasible due to the huge differences in the search-space (geometry of a proteinligand binding site vs. protein-protein interface) and the number of flexible degrees of freedom. Standard MD methods can explore local minor conformational changes of a protein including side-chain rearrangements and minor backbone relaxations on the picoseconds to nanosecond timescale. However, large-scale changes in protein conformational space occur on longer experimental timescales (microseconds and milliseconds to seconds) and are separated by large energy barriers. Other advanced sampling techniques (simulated annealing, biased methods, torsion angle dynamics, replica exchange) can be coupled with MD to allow for more rapid crossing of energy barriers and sampling alternative low-energy stable conformations. The conformational space sampled by these MD methods can be clustered in Cartesian space to identify low-energy representative conformations.
Alternatively, any given set of generated conformations (such as the conformational space explored by MD) can also be analyzed using Essential Dynamics that is also known as Principle Component Analysis (PCA). In PCA, a square covariance matrix is constructed from the conformational space describing the deviation of each atom coordinate from the average position. When this matrix is diagonalized, the largest eigenvalues represent the “principle components” of the proteins flexibility, where the direction of motion is described by the eigenvector and the amplitude of motion by the eigenvalue. Linear combinations of a subset of the most important eigenvectors can be used to generate an ensemble of “eigenstructures” for docking (CONCORD and Dynamite). Normal Mode Analysis (NMA) is a method for analytically describing the thermally available deviations from a given equilibrium reference structure within the harmonic approximation. For a given potential energy function describing the system around a minimum energy conformation, a Hessian matrix can be constructed from the mass-weighted second derivatives of the potential energy. When this matrix is diagonalized, the eigenvectors of the matrix are the “normal modes” of the proteins flexibility, where the direction of motion is described by the eigenvector and the amplitude of motion by the eigenvalue. Linear combinations of a subset of the most important eigenvectors can be used to generate “eigenstructures” for docking. Several studies with both normal mode analysis and elastic network normal mode analysis have demonstrated that a few of the low-frequency modes are able to successfully describe experimentally observed large-scale motions of proteins. The set of low-frequency normal modes can also be used for predicting hinge motions (HingeProt) and for estimating the conformational energy penalty that should be associated with a given conformational change. Tama and Sanejouand have pointed out one potential issue which is that in some cases, the normal modes calculated from open conformations of proteins correlate better with observed conformational changes than those calculated from closed and compact conformations []. To avoid this problem, May and Zacharias calculate normal modes from the starting structure, generate new initial conformations along deformations of normal mode eigenvectors, and then recalculate the normal modes assuming the new structure is in
equilibrium in an iterative fashion []. Another potential issue is that conformational changes involving loop movements are associated with high-frequency rather than low-frequency modes, so new approaches aim to explore and identify conformationally relevant “normal modes” following the observations of Abagyan and coworkers [].
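The PCA step described above reduces to a covariance eigendecomposition of the Cartesian coordinates. The numpy sketch below illustrates it under simplifying assumptions (a flattened coordinate array and invented function names); real Essential Dynamics implementations differ in details such as mass-weighting and prior structural alignment.

```python
# A short numpy sketch of the PCA / Essential Dynamics step: build the
# covariance matrix of Cartesian coordinates over an ensemble, diagonalize it,
# and displace the average structure along the top eigenvectors to produce
# candidate "eigenstructures" for ensemble docking.

import numpy as np

def eigenstructures(ensemble, n_modes=3, amplitudes=(-1.0, 1.0)):
    """ensemble: array of shape (n_frames, 3 * n_atoms)."""
    mean = ensemble.mean(axis=0)
    centered = ensemble - mean
    cov = np.cov(centered, rowvar=False)            # covariance of coordinates
    evals, evecs = np.linalg.eigh(cov)              # ascending eigenvalues
    top = evecs[:, ::-1][:, :n_modes]               # largest-variance modes
    structures = [mean + a * np.sqrt(np.abs(evals[::-1][k])) * top[:, k]
                  for k in range(n_modes) for a in amplitudes]
    return np.array(structures)

# e.g. eigenstructures(md_coordinates).reshape(-1, n_atoms, 3) for docking input
```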
Scoring Functions for Protein–Protein Docking As most protein–protein docking approaches use multistage hierarchical methods, scoring and ranking of putative conformations also typically occur at different steps, and have different requirements for accuracy and speed. The first step of scoring is usually simultaneous with a global or reduced rigid body search where a specified number of favorable complexes are selected for additional refinement. Following this refinement, the most accurate scoring functions are applied to select the most favorable complexes for the final prediction. If there is insufficient accuracy in the first step of discrimination, then native-like complexes are never refined and are missed in the final step of more accurate scoring of refined complexes. On the other hand, if the first step of scoring is too computationally expensive, it limits the amount of conformational space that can be searched in the initial phases. In the global rigid body search phase (D search of translational/rotational space), experience has shown that allowing some steric overlap between the surfaces of the two proteins is necessary for success; thus some minor or significant rearrangement of the binding partners may be required. This is handled differently in geometric complementarity–based searching strategies like geometric hashing depending on how the surfaces match or geometric complementarity is calculated. Although not all of the most common search strategies employ surface-matching/geometric complementarity as the primary scoring metric during the sampling phase, many successful protocols (BIGGER, ClusPro, D-Dock, DOT, Molfit, PatchDock, SKE-DOCK, SmoothDock, and ZDOCK) use various surface-matching/geometric complementarity metrics in combination with other metrics such as electrostatics and desolvation to determine which complexes should move into the refinement phase. For molecular mechanics–based force fields, allowing some steric
overlap is usually accomplished by using a “soft” or smoothed force field with reduced vdW repulsion. In small-molecule protein–ligand docking, it is extremely common to use a soft vdW potential early in the conformational search, and then in the refinement procedure introduce a more “hard” or standard Lennard–Jones potential. This practice has also been incorporated in some variations of protein–protein docking (ROTAFIT). Although it is true that exposed hydrophobic patches of residues on a protein surface often indicate a protein–protein interaction surface, several studies that have analyzed experimental protein–protein interfaces show that there is no one single determining feature for prediction of the correct native-like interface. Although some propensities for certain residues can be detected at a sequence level, there is no single interface property (i.e., total buried surface area, number of hydrogen bonds, hydrophobicity, electrostatic complementarity, predicted desolvation energy, residue sidechain rotamers) that is a strong predictor of the native interface over a variety of protein complexes. For this reason, most successful scoring functions designed for correct identification of native protein–protein interfaces are increasingly complex and sophisticated, and involve a combination of these types of interface properties (e.g., shape complementarity, hydrophobic interactions and electrostatics). In general, there are three generic categories of scoring functions: () molecular mechanics, () knowledge-based, and () empirical scoring functions. Research groups have demonstrated success and improvements in native-like geometry discrimination using various flavors and combinations of all three of these types of scoring functions. Most molecular mechanics force fields are quite similar in their overall generic form. The total potential energy of the system is the pair-wise sum of van der Waal interactions modeled from the Lennard–Jones potential, electrostatic interactions, and the sum of all the internal degrees of freedom (bonds, angles, and torsions) that describe the molecular geometry of the bonded atoms:
V = ∑bonds kb (b − b₀)² + ∑angles kθ (θ − θ₀)² + ∑dihedrals kϕ [1 + cos(nϕ − δ)] + ∑nonbonded εvdW [(Rminij /rij)¹² − 2 (Rminij /rij)⁶] + qi qj /(εelec rij)    ()

The most common molecular mechanics force field in the protein–protein docking field is CHARMM. Several forms of implicit solvation potential energy can also be included in the CHARMM potential, which are based on continuum electrostatics theories including Poisson–Boltzmann, generalized Born, and more simplified representations that are analytical approximations of these. Including these terms can allow for a more rigorous calculation of the desolvation energy upon formation of a protein–protein interface. The electrostatics of a protein–protein interface is a complicated balance between favorable and unfavorable electrostatic complementarity (e.g., H-bonds) and (un)favorable desolvation energy. Vajda and coworkers have determined that at the interface it is more important to put a larger weight on the vdW terms []. The field of knowledge-based scoring functions has its origin in the protein-folding field (for the correct identification of native-like folds), and is based on forming statistical potentials from inverse Boltzmann weighting of what is statistically observed in native-like structures. One of the most basic forms of a knowledge-based potential for protein–protein docking is a potential based on the statistical preference of certain residue–residue pairs across the interface. Additional variations of this include a wide range of sophisticated atom–atom contact potentials and also distance-dependent atom pair potentials. One widely used knowledge-based potential of this type is the “atomic contact potential” (ACP) [], which is an atom–atom extension of the widely used Miyazawa and Jernigan potential []. Delisi and coworkers developed the ACP by considering protein atom types and a contact radius of . Å, where the total contact energy of the interface Ec is defined as

Ec = ∑i ∑j Eij nij    ()
V = ∑ kb (b − b ) + ∑ kθ (θ − θ ) bonds
+
angles
∑ dihedrals
kϕ [ + cos(nϕ − δ)]
where Eij is the absolute contact energy between atoms i and j shown in Eq. : Eij = eij + Ei + Ej − E .
()
P
P
Protein Docking
The effective contact energies eij can be written in terms of the observed contact numbers in a database where n¯ ij is the number of i–j contacts in the most probable distribution in an unbiased sample: eij = − ln(
n¯ ij n¯ ). n¯ i n¯ j
()
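The inverse-Boltzmann construction can be made concrete with a small sketch. The C code below is only an illustration, not the published ACP parameterization: the number of atom types and the reference-state normalization are simplified assumptions.

```c
#include <math.h>
#include <stdio.h>

#define NTYPES 4  /* hypothetical number of atom (or residue) types */

/* Inverse-Boltzmann style contact energies: e_ij = -ln( n_ij * n_total / (n_i * n_j) ),
   where n[i][j] counts observed i-j contacts in a reference set of native interfaces. */
void contact_energies(const double n[NTYPES][NTYPES], double e[NTYPES][NTYPES]) {
    double row[NTYPES] = {0.0}, total = 0.0;
    for (int i = 0; i < NTYPES; i++)
        for (int j = 0; j < NTYPES; j++) { row[i] += n[i][j]; total += n[i][j]; }
    for (int i = 0; i < NTYPES; i++)
        for (int j = 0; j < NTYPES; j++)
            e[i][j] = -log((n[i][j] * total) / (row[i] * row[j]));
}

/* Total interface score in the spirit of E_c = sum_ij E_ij * n_ij,
   where nmodel counts the contacts present in the docked model. */
double interface_score(const double e[NTYPES][NTYPES], const double nmodel[NTYPES][NTYPES]) {
    double ec = 0.0;
    for (int i = 0; i < NTYPES; i++)
        for (int j = 0; j < NTYPES; j++) ec += e[i][j] * nmodel[i][j];
    return ec;
}
```

In a real potential the reference state is chosen much more carefully; the point here is only the inverse-Boltzmann mapping from contact statistics to pairwise energies.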
It has also been possible to express variations of such atomic contact potentials and variations of distance-dependent pair potentials in functional forms that allow their incorporation into FFT methods such as ZDOCK and PIPER. Incorporation of such potentials has been shown to improve the discrimination of native-like geometries over decoys.
Empirical scoring functions attempt to approximate the relative statistical weights of various parameters, optimizing performance through regression-based training and validation studies. One of the best examples of these in protein–protein docking is the sophisticated scoring function employed in RosettaDock. As outlined in Gray et al. [], the same functional form has different optimized weights during the three phases of the algorithm, i.e., packing, minimization, and final discrimination. In the potential for the discrimination phase, there are terms that all have optimized weights (repulsive vdW, attractive vdW, surface-area solvation, Gaussian solvent-exclusion, rotamer probability, hydrogen bonding, residue pair probability, electrostatic short-range repulsion, electrostatic short-range attraction, electrostatic long-range repulsion, and electrostatic long-range attraction), so the final version of the functional form is

$$\text{score} = W_{\text{vdw-atr}} S_{\text{vdw-atr}} + W_{\text{vdw-rep}} S_{\text{vdw-rep}} + W_{\text{sol}} S_{\text{sol}} + W_{\text{sasa}} S_{\text{sasa}} + W_{\text{hb}} S_{\text{hb}} + W_{\text{rotamer}} S_{\text{rotamer}} + W_{\text{pair}} S_{\text{pair}} + W_{\text{elec,sr-rep}} S_{\text{elec,sr-rep}} + W_{\text{elec,sr-atr}} S_{\text{elec,sr-atr}} + W_{\text{elec,lr-atr}} S_{\text{elec,lr-atr}} + W_{\text{elec,lr-rep}} S_{\text{elec,lr-rep}}$$

In the RosettaDock algorithm, this empirical functional form is accurate enough to successfully repack the side chains on the protein–protein interface during the refinement and discrimination phases. Many other empirical scoring functions are employed at the final discrimination step in other docking approaches. Another example is FastContact, which rapidly estimates the vdW, electrostatic, and desolvation components of the free energy using an empirical contact potential for the desolvation contribution.
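Mechanically, such empirical functions are weighted linear combinations of precomputed score terms, with a different weight vector per phase. The sketch below illustrates only that structure; the term count and weight values are placeholders, not the published RosettaDock parameters.

```c
#include <stddef.h>

/* Empirical scoring as score = sum_k W[k] * S[k]; the same functional form is
   reused with a different weight vector in each phase of the search. */
enum { N_TERMS = 11 };
enum phase { PACKING, MINIMIZATION, DISCRIMINATION, N_PHASES };

static const double W[N_PHASES][N_TERMS] = {
    [PACKING]        = { 0.5, 1.0, 0.8 },   /* placeholder weights, rest default to 0 */
    [MINIMIZATION]   = { 0.7, 0.9, 1.1 },
    [DISCRIMINATION] = { 1.0, 0.6, 1.2 },
};

double empirical_score(enum phase p, const double S[N_TERMS]) {
    double score = 0.0;
    for (size_t k = 0; k < N_TERMS; k++)
        score += W[p][k] * S[k];   /* weighted sum of the individual score terms */
    return score;
}
```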
Critical Assessments of Protein–Protein Docking Methods The protein–protein docking field has significantly benefited from the Critical Assessment of Predicted Interactions (CAPRI) experiment. CAPRI is a community-wide blind prediction experiment, where experimentalists provide newly solved protein–protein complexes and withhold the coordinates. CAPRI participants are given the 3D coordinates of each unbound subunit and predictions are due – weeks later. Using available experimental data in the literature (e.g., mutations that may indicate the location of the binding site) is also allowed, so some targets offer assessments of protein–protein docking where some information is known which can limit the search space, while for most of the targets no information is available. Each participating group is allowed to submit top-ranked models of the protein–protein complex. The accuracies of the predictions are categorized as (1) high, (2) medium, (3) acceptable, and (4) incorrect according to several evaluation metrics that are summarized in Table . The three primary metrics are: (1) Cα RMSD of the ligand (smaller protein) from the native complex structure after superposition of the receptor (larger protein), (2) Cα RMSD of the backbone of the interface residues, and (3) the percentage of native contacts on the interaction interface. The initial rounds – of CAPRI occurred from to and the results have been published in the literature [, ]. The next rounds – occurred from to and these results have also been published []. The most recent rounds – were recently evaluated in a CAPRI meeting held in Barcelona, Spain, in December .
Protein Docking. Table  CAPRI evaluation metrics for predicted protein–protein complexes

Rank             Ligand (Cα RMSD)    Interface (Cα RMSD)    Interface fNat. Cont.
High ***         ≤ .                 ≤ .                    ≥ .
Medium **        .–.                 .–.                    ≥ .
Acceptable *     .–.                 .–.                    ≥ .
Incorrect                                                   < .
Over the rounds of CAPRI so far, targets have been released and predicted by various groups, although some of the targets were canceled during the evaluation period. Over rounds –, the best performing methods were: ICM-DISCO, SmoothDock, Molfit, 3D-Dock, and DOT []. Over rounds – the best overall performance was from ICM-DISCO, PatchDock, ZDOCK, FT-Dock, RosettaDock, and SmoothDock []. Over rounds –, the best overall performance was from ZDOCK, HADDOCK, Molfit, 3D-Dock, ClusPro, RosettaDock, and ICM-DISCO []. From the most recent rounds –, which were evaluated at the Barcelona meeting, it was shown that some of the Web servers are also now performing as well as some of the other participating groups. The best performing Web servers over rounds – were ClusPro and PatchDock. The performance of the best performing Web servers that participated in rounds – is shown in Table , along with some of the other most popular Web servers. The improved performance of some of the Web servers is very important to the experimental community, as many groups use these servers to predict likely binding modes of protein complexes of interest. These top-ranked binding modes are then often used by experimental groups to design experiments aimed at validating the complex interface using experimental methods such as site-directed mutagenesis, enzymatic proteolysis, mass spectrometry, crosslinking, and FRET.
The most recent CAPRI rounds have shown that some of the easier targets that did not involve significant conformational changes could be predicted by many groups and Web servers using a variety of methods. This suggests that there is an emerging consensus among the various docking methods for adequate predictions for the easiest of the test cases. The most challenging targets were clearly those where backbone flexibility was very important, as well as targets where one of the protein complex components needed to be predicted by homology modeling. Another clear lesson is that some limited knowledge from biochemical information, when used to restrict the search to a certain area of the search space, dramatically improves predictions for several of the methods. The greatest challenge in the field is clearly the issue of incorporating backbone flexibility into the predictions. One reason why this is challenging is that incorporation of backbone flexibility can further increase the number of false positives from nonnative complexes that score better than the native interface.
Applications of Protein–Protein Docking and Large-Scale Predictions Due to the importance of protein–protein interactions in understanding the connections and regulation between biochemical pathways,
Protein Docking. Table  Protein–protein docking Web servers and performance in CAPRI rounds – (CAPRI meeting, Barcelona, Dec )

Web servers participating in CAPRI rounds –:
CLUSPRO (Vajda group, Boston University, Boston, MA), http://cluspro.bu.edu
HADDOCK (Bonvin group, Utrecht University, the Netherlands), http://haddock.chem.uu.nl
GRAMM-X (Vakser group, U. of Kansas, Lawrence, KS), http://vakser.bioinformatics.ku.edu/resources/gramm/grammx
SKE-DOCK (Takeda-Shitaka group, Kitasato U., Japan), http://www.pharm.kitasato-u.ac.jp/bmd/files/SKE_DOCK.html
PatchDock, FiberDock, FireDock (Wolfson group, Tel Aviv U.), http://bioinfod.cs.tau.ac.il/PatchDock/
TOP DOWN (Nakamura group, Inst. for Prot. Res., Osaka, Japan), http://pdbjs.pdbj.org/TopDown/

Other Web servers:
ZDOCK (Z. Weng group, U. of Massachusetts, Worcester, MA), http://zdock.bu.edu/
RosettaDock (J. Gray group, Johns Hopkins U., Baltimore, MD), http://rosettadock.graylab.jhu.edu/
HEX (Ritchie group, Nancy, France), http://www.loria.fr/~ritchied/hex/
FiberDock (Wolfson group, Tel Aviv U.), http://bioinfod.cs.tau.ac.il/FiberDock
3D-Garden (Sternberg group, Imperial College, London, UK), http://www.sbg.bio.ic.ac.uk/~dgarden/
many high-throughput proteomics efforts have also been aimed at creating large-scale comprehensive catalogues (an Interactome) of experimentally verified protein–protein interactions. Significant progress has been made in this area in the model organism yeast, and similar efforts are underway in many other areas, including important pathogenic organisms and human cancer cell types. Experimental studies of these protein–protein interaction networks in cells have shown that some proteins may interact with up to other proteins. Despite these ongoing experimental efforts, it is clear that structural genomics efforts alone (which systematically determine the experimental 3D structures of high-interest proteins) will not be able to determine the structures of all of the detected and important protein–protein interactions, and reliable computational prediction of these complexes will be more and more important in the future. To successfully perform large-scale protein–protein docking predictions, it will be necessary to employ homology-based modeling methods. The necessity to rely on homology models as the initial input to protein–protein docking increases the need for improved accuracy both for the initial homology modeling steps and for the reliable incorporation of flexibility in the protein–protein docking steps. The European 3D-Repertoire project (http://www.drepertoire.org) aims to experimentally determine the 3D structures of some high-interest protein–protein complexes from yeast that are amenable to X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy methods. Still, a large fraction of the total number of yeast protein–protein interactions (, high confidence complexes from a total of , genes) will be predicted by computational methods. It is also true that for some fraction of the protein–protein targets, the entire complex itself including the interface may be modeled reliably based on homology to other known complexes that are sufficiently similar in sequence. However, for the vast majority of the targets, this approach is not possible and it becomes necessary to perform protein–protein docking of homology models for each individual component of the complex (given that sufficient templates are available for each component of the complex). The first publication describing this effort came out in , describing the prediction of , protein–protein complexes starting from experimental structures and ,
homology models []. The authors used ZDOCK . together with the pyDOCK scoring scheme and provided results for a widely used protein–protein docking benchmark dataset []. Despite the fact that the accuracy on the benchmark is only in the range of –% (depending on whether the top , , or complexes are considered), these predictions of , complexes from the yeast Interactome are still a very exciting preliminary step toward more large-scale prediction efforts like this in the future. These preliminary efforts highlight the ongoing need for improved accuracy in protein–protein docking methods.
Bibliographic Notes and Further Reading More on genetic algorithms and protein–protein docking: () Morris GM, Goodsell DS, Halliday RS, Huey R, Hart WE, Belew RK, Olson AJ () Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J Comput Chem ():–. () Taylor JS, Burnett RM () DARWIN: A program for docking flexible molecules. Proteins :–. () Gardiner EJ, Willett P, Artymiuk PJ () Protein docking using a genetic algorithm. Proteins :–. Introductions to molecular dynamics: () Leach A () Molecular modelling: principles and applications, nd edn. Prentice Hall, Englewood Cliffs. () Rapaport DC () The art of molecular dynamics simulation, nd edn. Cambridge University Press, Cambridge. Key recent reviews on protein docking: () Zacharias M () Accounting for conformational changes during protein-protein docking. Curr Opin Struct Biol :–. () Vajda S, Kozakov D () Convergence and combination of methods in protein-protein docking. Curr Opin Struct Biol :–. () Andrusier N, Mashiach E, Nussinov R, Wolfson HJ () Principles of flexible protein-protein docking. Proteins :–.
Bibliography . Katchalski-Katzir E, Shariv I, Eisenstein M, Friesem AA, Aflalo C, Vakser IA () Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc Natl Acad Sci :– . Ritchie DW, Kemp GJL () Protein docking using spherical polar Fourier correlations. Proteins :–
. Gray JJ, Moughon S, Wang C, Schueler-Furman O, Kuhlman B, Rohl CA, Baker D () Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J Mol Biol :– . Fernandez-Recio J, Totrov M, Abagyan R () ICM-DISCO Docking by global energy optimization with fully flexible sidechains. Proteins :– . Dominguez C, Boelens R, Bonvin MJJ () HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J Am Chem Soc :– . Tama F, Sanejouand YH () Conformational change of proteins arising from normal mode calculations. Protein Eng :– . May A, Zacharias M () Energy minimization in lowfrequency normal modes to efficiently allow for global flexibility during systematic protein-protein docking. Proteins :– . Cavasotto CN, Kovacs JA, Abagyan RA () Representing receptor flexibility in ligand docking through relevant normal modes. J Am Chem Soc :– . Camacho CJ, Vajda S () Protein docking along smooth association pathways. Proc Natl Acad Sci :– . Zhang C, Vasmatzis G, Cornette JL, DeLisi C () Free energy landscapes of encounter complexes in protein-protein association. Biophys J ():– . Miyazawa S, Jernigan RL () Estimation of effective interresidue contact energies from protein crystal structures: quasichemical approximation. Macromolecules :– . Mendez R, Leplae R, De Maria L, Wodak SJ () Assessment of blind predictions of protein–protein interactions: current status of docking methods. Proteins :– . Mendez R, Leplae R, Lensink MF, Wodak SJ () Assessment of CAPRI predictions in rounds – shows progress in docking procedures. Proteins :– . Lensink MF, Wodak SJ, Mendez R () Docking and scoring protein complexes: CAPRI rd edition. Proteins :– . Mosca R, Pons C, Fernandez-Recio J, Aloy P () Pushing structural information into the yeast interactome by highthroughput protein docking experiments. PLOS Comp Biol ():–
Pthreads (POSIX Threads) POSIX Threads (Pthreads)
PVM (Parallel Virtual Machine) Al Geist Oak Ridge National Laboratory, Oak Ridge, TN, USA
Synonyms Message passing
Definition Parallel Virtual Machine (PVM) is a software package that permits a heterogeneous collection of Unix and/or Windows computers hooked together by a network to be used as a single large parallel computer. Thus large computational problems can be solved more cost-effectively by using the aggregate power and memory of many computers. The software is very robust and portable. The source, which is available free from the PVM website, has been compiled on everything from laptops to CRAYs. Because PVM supports fault tolerance and dynamic adaptability, hundreds of sites around the world continue using PVM to solve important scientific, industrial, and medical problems. PVM has also been popular as an educational tool to teach parallel programming because of its simple intuitive user interface.
Discussion Introduction PVM (Parallel Virtual Machine) is often lumped together with the Message Passing Interface (MPI) standard, because PVM was the precursor to MPI and the PVM developers, most notably, Jack Dongarra started and lead the initial MPI forum that defined the MPI . standard. But message passing is only a small part of the PVM package. PVM is an integrated set of software tools and libraries that emulates a general-purpose, fault-tolerant, heterogeneous parallel computing framework on interconnected computers of varied architectures and operating systems. As a demonstration at the Supercomputing conference, PVM was used to hook together all the supercomputers on the exhibit floor into one giant parallel virtual machine. PVM is a byproduct of the heterogeneous-distributed computing research project at Oak Ridge National Laboratory, Emory University, and the University of Tennessee. The programming model upon which PVM is based differs substantially from MPI. The PVM programming model has: ●
Dynamic, user configured, host pool. The set of computers that make up the virtual machine may themselves be parallel or serial computers and PVM can add or delete machines while programs are running. For example, to add more computing power, or delete hosts that have failed.
● PVM tasks are dynamic and asynchronous. The tasks that make up a parallel program do not have to start all together. New tasks can be spawned off in the middle of a computation, tasks can be killed, and programs started outside the parallel virtual machine can join PVM and become another task in the dynamic set of tasks that make up the parallel program.
● Dynamic groups. Users can define any number of subgroups of PVM tasks that can then perform collective operations such as broadcast data to all members of the group or set a barrier. The members of the groups can change dynamically during a computation (an important feature for fault tolerance). PVM provides calls to manage dynamic groups.
● Translucent access to hardware. Application programs may view the hardware environment as an attributeless collection of hosts or may choose to exploit the capabilities of specific machines in the host pool by positioning certain computational tasks on the most appropriate computers. PVM provides calls to convey host attributes to a running application.
● Transparent heterogeneity support. A single PVM job can run across a collection of Linux, Windows, and Unix computers that themselves can be multicore, serial, or parallel computers. PVM transparently takes care when transferring data between computers that have different data representations, for example, big Endian and little Endian and different word lengths.
● Explicit message-passing. Data is transferred between PVM tasks using a small number of point-to-point and collective message-passing calls. When a sender or receiver is detected to be dead, PVM automatically deletes it from the virtual machine and notifies the application.
The PVM system is composed of two parts. The first part is a daemon, called pvmd, that resides on all computers making up the virtual machine. PVM is designed so that anyone with a valid login can install this daemon on a machine. When a user wishes to run a PVM application, he first creates a parallel virtual machine by starting PVM with a list of hosts upon which he has already installed pvmd. Multiple users can configure overlapping virtual machines and each user can execute
several PVM applications simultaneously on the same parallel virtual machine. The second part of the system is a library of PVM interface routines that is linked with the users PVM application. The library contains routines for resource management (modifying the pool of hosts), process control (spawning and killing tasks), managing dynamic task groups, passing messages, and fault tolerance. The basic PVM system supports C, C++, and Fortran languages, but the PVM user community has extended PVM so it can be used with Java, Python, Perl, Lisp, S-lang, R, tk/tcl, and IDL languages.
Resource Management The underlying set of hosts that make up a particular PVM instantiation is a dynamic resource. Typically, a host list is just fed into the PVM startup routine to create a custom parallel virtual machine tailored to the problem being solved, but PVM provides much more resource management flexibility. PVM provides pvm_addhosts and pvm_delhosts routines that allow any PVM task to add or delete a set of hosts in the parallel virtual machine at any time. These routines are sometimes used to set up a virtual machine, but more often they are used to increase the flexibility and fault tolerance of a large scientific simulation. These routines allow a simulation to increase the available computing power (adding hosts) if it determines that the problem is getting harder to solve, then to shrink back in size when the problem does not need all the extra resources. Another use of these routines is to increase the fault tolerance of a simulation by having it detect the failure of a host and replacing the failed host on-the-fly with pvm_addhosts. PVM provides routines to return information about the present set of hosts in the parallel virtual machine. The information includes the host names, their architecture types (from which the individual data representations can be determined), and their relative CPU speed (which can be useful to an intelligent load balancing program). A PVM task can also find out what host it is running on and use this architectural awareness to run algorithms tuned for that architecture. The PVM resource management system also includes a way to set/get various options that determines how the underlying PVM system works. For example, if the user prefers speed to flexibility, the communication can be set so that it is as fast as MPI. Similarly, if the user
knows that the task groups are going to be static then the underlying system can be told and it will utilize local cache information to speed up all the group routines. If tracing or debugging are desired there are options in PVM to set which tasks to trace, and where to send the trace output. PVM has over a dozen options that can be set and the list is extensible to add additional system hints or options in the future.
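As a sketch of how the resource-management calls are used (the host name and the error handling are illustrative choices, not code from the PVM distribution):

```c
#include <stdio.h>
#include "pvm3.h"   /* PVM 3 library header */

/* Grow the virtual machine by one host, then print the current configuration. */
int main(void) {
    char *newhosts[] = { "nodeB" };       /* hypothetical host running pvmd */
    int info, nhost, narch;
    struct pvmhostinfo *hosts;

    pvm_mytid();                          /* enroll this process in PVM */

    if (pvm_addhosts(newhosts, 1, &info) < 1)
        fprintf(stderr, "could not add host (is pvmd installed there?)\n");

    pvm_config(&nhost, &narch, &hosts);   /* query the present host pool */
    for (int i = 0; i < nhost; i++)
        printf("host %s  arch %s  relative speed %d\n",
               hosts[i].hi_name, hosts[i].hi_arch, hosts[i].hi_speed);

    pvm_exit();
    return 0;
}
```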
Process Control PVM provides routines to dynamically spawn and kill tasks and allows tasks to join a running PVM system and to exit a PVM system. The pvm_spawn routine is the central function in the PVM process control. The pvm_spawn routine starts up one or more copies of an executable file. The executable is usually a PVM program, but it does not have to be. Spawn can launch any command or executable program that the user has permission to execute on the hosts. The spawn routine can pass an argument list to the launched executable(s) and can specify where the tasks should be started. If unspecified, then PVM will spread the tasks evenly across the entire virtual machine. Pvm_spawn can be called multiple times to start many different types of tasks, which is needed to support functional parallelism. It can be called anytime during program execution to add additional computation tasks, or to replace a failed task, or to add a different type of task in order to create a program that adapts on-the-fly to the problem being solved. Besides being able to place tasks on specific hosts, PVM has a plug-in interface that allows users to plug-in their own task placement/load balancing routines into the parallel virtual machine. The flexibility of the pvm_spawn routine allows PVM to support all types of programming methodologies, not just SPMD. To complement the ability to create new tasks dynamically, PVM provides a routine that allows any task to kill any other task, and a routine to allow a task to exit the PVM system, i.e., to cleanly kill itself. The key to these routines is to be sure that all pending parallel virtual machine operations that involve the particular tasks are taken care of before they are killed. The PVM process control system has a routine that returns information about the tasks running on PVM at any given moment. The routine, called pvm_tasks, has three options: it allows a user’s PVM program to
determine information about all tasks in the parallel virtual machine, information about all tasks running on a specific host (useful if there is a fault or a load balance issue), or information about a particular task.
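A minimal sketch of the process-control interface follows; the executable name "worker" and the task count are hypothetical:

```c
#include <stdio.h>
#include "pvm3.h"

/* Spawn workers, list the tasks PVM knows about, then shut one worker down. */
int main(void) {
    int tids[4];
    int started = pvm_spawn("worker", NULL, PvmTaskDefault, NULL, 4, tids);
    printf("started %d of 4 copies of 'worker'\n", started);

    int ntask;
    struct pvmtaskinfo *tasks;
    pvm_tasks(0, &ntask, &tasks);         /* 0 = ask about the whole virtual machine */
    for (int i = 0; i < ntask; i++)
        printf("tid %x  running %s on host %x\n",
               tasks[i].ti_tid, tasks[i].ti_a_out, tasks[i].ti_host);

    if (started > 0)
        pvm_kill(tids[0]);                /* any task may kill any other task */
    pvm_exit();                           /* cleanly leave the virtual machine */
    return 0;
}
```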
Message Passing The PVM communication model provides pointto-point, and collective message-passing routines. It also provides the concept of persistent messages and message handlers. The number of communication routines is quite small compared to the MPI standard, which has over functions. For point-to-point communication PVM only provides asynchronous blocking send, asynchronous blocking receive, and nonblocking receive functions. There is also a nonblocking probe routine to check if a particular message has arrived. For collective communication PVM provides multicast to a set of tasks, broadcast to a user-defined group of tasks, reduce operation across a group, i.e., global max, min, sum, etc., and barrier across a group. There is a four-step process in sending a PVM message. First the message is “packed.” Each message can contain an arbitrary structure of data types and arrays. For example, a message may have an integer defining how long a vector is, a vector of integers defining the indices of a sparse array, and a vector of doubleprecision floating point numbers. Each of these data types could be packed into a single message. In the second step the message is sent to its destination. In the third step the message is received by one or more PVM tasks. In the fourth step the message is unpacked. PVM supplies a pack and unpack routine for every data type supplied by computer manufacturers. Persistent messages were added to the PVM interface in version .. Persistent messages are packed as usual with an arbitrarily complex or simple data structure, but instead of being sent to a destination, they are stored and retrieved by “name.” The sender of the message can specify if other tasks are allowed to update the message, or delete it. The sender also specifies if the message should persist beyond the sender exiting PVM or if it should be cleaned up on exit. Persistent messages have many powerful uses in the dynamic, faulttolerant environment of PVM’s programming model. It allows a message to be sent to a task that does not exist yet. An example of this use is rendezvous – the first task leaves a persistent message defining where it
can be found and how to attach, another example of this use case is in-memory checkpointing. The feature also allows a message to be stored persistently and retrieved by a set of tasks that will not be defined until sometime in the future. An example of this is a fault in the system affects a set of tasks and they need the stored information to recover. Persistent messages provide a distributed information database for dynamic programs inside PVM. For example, a monitoring or computational steering application can attach to a PVM program and leave information in a named space without having to know anything about the tasks in the PVM program, which will retrieve and act on the information. The concept is similar to tuple space used by the Linda system []. Persistent messages are implemented in PVM . using only four additional routines: pvm_putinfo(), pvm_recvinfo(), pvm_delinfo() to delete a persistent message, and lastly, pvm_getmboxinfo() that allows a user to search the persistent message namespace with a regular expression. User-defined message handlers were also added to PVM in version .. There are no restrictions on the handler function that can be set up. The handler is triggered whenever a message arrives matching the specified context, message tag, and source. This feature has many uses. For examples, it allows users to define new control features inside a parallel virtual machine and it provides the ability to implement active messages to increase communication performance.
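The four-step pattern can be sketched as follows; the message tag and the idea of sending a length-prefixed vector are illustrative choices, not anything mandated by PVM:

```c
#include "pvm3.h"

#define MSG_VECTOR 7   /* arbitrary application-chosen message tag */

/* Steps 1-2 on the sending side: pack an arbitrary structure, then send it. */
void send_vector(int dest_tid, double *v, int n) {
    pvm_initsend(PvmDataDefault);    /* step 1: start packing (portable encoding) */
    pvm_pkint(&n, 1, 1);             /*         pack the length ...               */
    pvm_pkdouble(v, n, 1);           /*         ... then the data                 */
    pvm_send(dest_tid, MSG_VECTOR);  /* step 2: send to the destination task      */
}

/* Steps 3-4 on the receiving side: receive, then unpack in the packing order. */
int recv_vector(double *v, int maxn) {
    int n;
    pvm_recv(-1, MSG_VECTOR);        /* step 3: blocking receive, any sender      */
    pvm_upkint(&n, 1, 1);            /* step 4: unpack length ...                 */
    if (n > maxn) n = maxn;
    pvm_upkdouble(v, n, 1);          /*         ... then data                     */
    return n;
}
```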
Dynamic Task Groups The dynamic process group functions are built on top of the core PVM routines. There was some debate about how groups should be handled in PVM. The issues include efficiency, fault tolerance, and robustness. There are tradeoffs between static versus dynamic groups and tradeoffs if use of the function calls is restricted to only tasks in a group. In keeping with the PVM philosophy, the group functions are designed to be very general and transparent to the user, at some cost in efficiency. Any PVM task can join or leave any group at any time without having to inform any other task in the affected groups. Tasks can broadcast messages to groups of which they are not members. Tasks can be members of multiple groups. In general, any PVM task can call any of the group functions at any time with only two exceptions: pvm_lvgroup() and pvm_barrier() which
by their nature require the calling task to be a member of the specified group. To boost performance, PVM provides a user function to freeze a group, which allows the underlying PVM system to treat that group as static from that point forward and exploiting all the efficiency gains afforded by having a static group structure. Dynamic group functions in PVM include joining a group, leaving a group, getting information about a group such as size and members, broadcasting to all members of a group, doing a reduce across all members of a group with predefined or arbitrary functions, and lastly setting a barrier across a group where tasks in the group wait until all tasks in the group have called the barrier routine.
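A sketch of the dynamic group calls; the group name, expected member count, and tag are assumptions made for the example:

```c
#include <stdio.h>
#include "pvm3.h"

/* Join a named dynamic group, synchronize, and broadcast a parameter to it. */
int main(void) {
    int inst = pvm_joingroup("solvers");   /* instance number within the group */
    int nmembers = 8;                      /* assumed number of cooperating tasks */

    pvm_barrier("solvers", nmembers);      /* wait until all members have arrived */

    if (inst == 0) {                       /* one member broadcasts a parameter */
        double tolerance = 1e-6;
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(&tolerance, 1, 1);
        pvm_bcast("solvers", 11);          /* tag 11, delivered to the other members */
    } else {
        double tolerance;
        pvm_recv(-1, 11);
        pvm_upkdouble(&tolerance, 1, 1);
        printf("member %d got tolerance %g\n", inst, tolerance);
    }

    pvm_lvgroup("solvers");                /* leave the group before exiting */
    pvm_exit();
    return 0;
}
```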
Fault Tolerance The parallel virtual machine constantly monitors itself even when no PVM applications are running. If a fault is detected the set of pvmd coordinate with each other and automatically reconfigure the parallel virtual machine to remove the fault and keep running. The PVM interface also has a notify function that allows a running PVM task to request to be notified of changes in the system including where and what type of fault was detected. Typical notifications are for task exit, host deletion, and a host being added. Note that the latter notification is useful during a recovery phase when replacement resources are being brought into the virtual machine. A fault-tolerant PVM application has many choices about how to utilize the notification function in PVM. The application may have all the tasks notified if a fault occurs (SPMD style), it could have one specially written task dedicated to monitoring and recovery of the application if a fault occurs anywhere (MIMD). It could have a set of tasks that get spawned dynamically with a recovery expertise specific to the notification. It could have a set of monitoring and recovery tasks that back each other up. The flexibility of PVM and its dynamic programming model make it very attractive to groups who need fault-tolerant applications. One real life example – PVM was used to create a monitoring system across a hospital’s operating rooms with information from individual patient monitors being fed to a central observation/control room, the system needed to be tolerant as patients were constantly being plugged and unplugged from various resources as they went from prep to operating room to recovery room, etc.
The PVM notification feature is not limited to fault recovery. It can also be used by load-balancing programs that increase and decrease the host pool to meet the changing load of the application over time. It can be used by logging and accounting tasks to keep track of system usage by PVM users. As with many of the PVM functions, they are designed to provide the maximum use and flexibility to the user.
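A sketch of how an application might use pvm_notify for simple self-repair; the tag, the executable name "worker", and the recovery policy are illustrative:

```c
#include <stdio.h>
#include "pvm3.h"

#define TAG_EXIT 99   /* arbitrary tag chosen for exit notifications */

/* Ask PVM to send a message whenever one of the listed workers dies. */
void monitor_workers(int *tids, int nworkers) {
    pvm_notify(PvmTaskExit, TAG_EXIT, nworkers, tids);
}

/* Block until some worker exits, then spawn and watch a replacement. */
void recover_one(void) {
    int dead_tid, new_tid;
    pvm_recv(-1, TAG_EXIT);                 /* notification arrives as a message */
    pvm_upkint(&dead_tid, 1, 1);            /* it carries the tid of the dead task */
    fprintf(stderr, "task %x exited; respawning\n", dead_tid);
    pvm_spawn("worker", NULL, PvmTaskDefault, NULL, 1, &new_tid);
    pvm_notify(PvmTaskExit, TAG_EXIT, 1, &new_tid);   /* watch the replacement too */
}
```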
Related Entries
Broadcast
Clusters
Collective Communication
Distributed-Memory Multiprocessor
Fault Tolerance
MPI (Message Passing Interface)
Parallel Computing
Bibliographic Notes and Further Reading The PVM project began in the summer of at Oak Ridge National Laboratory when Vaidy Sunderam, a visiting faculty from Emory University, and Al Geist designed and built the prototype PVM . to explore the concept of heterogeneous parallel computing. This prototype was only used internally at the lab and was not released. Jack Dongarra joined the project in and saw the value of PVM for the larger community. A robust and hardened version of the prototype was written at the University of Tennessee and publicly released
as PVM . in March . During the next years PVM rapidly grew in popularity. After user feedback and a number of evolutionary changes (PVM .–.), a complete rewrite was undertaken, and PVM . was released in February . PVM . continued to evolve and incorporate new features and capabilities to meet the needs of the exponentially growing number of users and applications. In , years after its inception, the PVM feature set was frozen at PVM .. Since then PVM has continued to be supported and new versions released to port to new architectures and operating systems, for example, in , years after PVM started, PVM .. was released and distributed freely as has every other version of PVM. For further reading the PVM website (http://www.csm.ornl.gov) contains tutorials, tools, and information about the PVM development team. For the past years there has been an annual European conference dedicated to advances to PVM and MPI. The topics and papers from all these conferences can be found at http://pvmmpi.ucd.ie/ previous.
Bibliography . Bjornson R, Carriero N, Gelernter D, Leichter J (January ) Linda, the portable parallel. Research Report Yale/DCS/RR- . Geist A, Beguelin A, Dongarra J, Jiang W, Manchek R, Sunderam V () PVM: parallel virtual machine. MIT Press, Cambridge, Massachusetts . Snir M, Otto S, Huss-Lederman S, Walker D, Dongarra J () MPI: the complete reference. MIT Press, Cambridge, Massachusetts
Q QCD apeNEXT Machines Olivier Pène University de Paris-Sud-XI, Orsay Cedex, France
Definition apeNEXT is a massively parallel, fully custom computer conceived and designed by an Italian–German–French collaboration and delivered since . It is the latest computer of the APE series. It is a 3D array of computing nodes with periodic boundary conditions. It is dedicated to the calculations of lattice quantum chromodynamics, which is the “ab initio” computational method for the theory of the strong interactions in subnuclear matter.
Discussion Introduction All the matter surrounding us is made of atoms, which are made up of electrons surrounding an atomic nucleus. The latter nuclei are made up of protons and neutrons, which themselves are made up of quarks. The force responsible for binding quarks into protons, neutrons, and many other such particles, generically named “hadrons,” is the “strong interaction” of subnuclear matter. The same strong force binds the protons and neutrons together to form atomic nuclei. This force is so strong that quarks are “confined” into hadrons, which means that one cannot find in nature free quarks, not bound into hadrons. Only at incredibly high temperature do the quarks become deconfined. The theory of this force was discovered in the s and this discovery was awarded the Nobel Prize in Physics in . It is called “quantum chromodynamics (QCD).” This theory demonstrates that the strong force is mediated by new fundamental particles named “gluons.” However, computing within this theory is a
formidable task. The most rigorous method uses a four-dimensional lattice which provides a discretised description of space-time. This technique is called lattice QCD (LQCD). It makes it possible to compute ab initio the masses of particles such as the proton, the neutron, the pion, etc., and their properties. This method is the best since it relies uniquely on the theory itself. However, it is highly demanding in terms of needed computing power. This is why the LQCD community has developed since the mid-s a series of dedicated computers, in order to be able to perform these computations at an affordable price. This has been done in the USA, in Japan, and in Europe. There have been several generations of these computers. The last generation is the QCDOC designed by a US–Britain collaboration with IBM. This is the ancestor of the BlueGene series of IBM. In continental Europe, the APE series started in Italy in . Nicola Cabibbo [] and Giorgio Parisi were among the initiators of APE. The first APE was operative in . The next generation, the APEcento, started in . The APEmille, the third generation, became operative in . The APE series has been supported in Italy by the Italian institute for nuclear and particle physics (INFN). The third generation of APE, APEmille, was done in collaboration with a German group in DESY-Zeuthen. The fourth APE was named apeNEXT, with the additional participation of a French group (Université Paris-Sud and IRISA/Rennes). The apeNEXT collaboration has published a series of papers and conference contributions [–]. More details can be found on the documentation Web sites [, ].
The apeNEXT Hardware The Computing Node Since the operations performed in LQCD are mainly multiplicatons of complex matrices, the computing units are harware devised to work on complex numbers.
For every clock cycle, the arithmetic unit performs the APE normal operation: a∗b+c
()
where a, b, c are complex numbers. Expressed in terms of floating point operations it corresponds to eight flops per clock cycle. The peak computing power is thus eight times the clock frequency per processor. This allows for a low clock frequency (about MHz) for a peak performance of . Gflop per processor, and consequently a low power consumption and a reasonable heat dissipation: Air cooling is enough. It uses intensively “Very Large Scale Integration.” A word of bits is processed every clock cycle, corresponding either to a double precision complex number or to a very long instruction word. There are no cache, but there role is provided by a large register file ( words of bits). Each processor is fully independent with its private memory bank of – Mbytes based on standard DDR-SDRAM. The memory stores both data and program. The memory controller receives the input from an address generation unit that allows to perform integer operations independently of the arithmetic unit. Error correction is implemented. The memory access has a latency of clock cycles. Since most often the calculation proceeds via large matrices, i.e., in large arrays, the call for numbers in an array is pipe-lined in such a way that, after the arrival of the first complex number of the array, which costs the latency, the following ones come at the speed of one per cycle. A data buffer close to the arithmetic unit allows for a prefetch of the data, thus minimizing the effect of the latency. Since both data and instructions are imported from memory there is a risk of conflict. This is minimized by a compression of the microcode containing the instructions, and by an instruction buffer allowing prefetch and storing the critical kernels for repeated operations. Instruction decompression is performed on the flight by the hardware.
The Torus Network The network is a custom one, designed for apeNEXT. The computing nodes are organized according to a D torus. Every node is directly connected to six neighbors, up and down in three directions. The last one along one direction is connected to the first one, thus producing a
circle. This is fully adapted to LQCD. Indeed the LQCD calculations are performed on a discrete lattice which is a D torus. Every lattice site communicates with its eight neighbors and the last one along one line communicates with the first one. Thus it is natural to decompose the LQCD lattice into sublattices located on every computing node. For example, let us consider a LQCD lattice of size ∗ (where is in the time direction). An apeNEXT rack has processors organized in a D torus. It is thus possible to decompose the LQCD lattice into sublattices of size ∗ . Other combinations are of course possible. Thus every lattice site is located on one computing node, and its neighbors are either on the same computing node or on neighbouring ones. The hardware allows an access to the memory banks located on the considered computing nodes and to those of the neighboring computing nodes. Each link is bidirectional and moves one byte per clock cycle and the start-up latency is about cycles, i.e., about ns, which is very short. To minimize the effect of the latency via a prefetch of the data from neighboring nodes, six dedicated buffers are used for the data from the six neighboring nodes. The maximum throughput of the network is reached for rather small bursts of words ( bytes) of data. This is convenient for the LQCD algorithms. This torus organization allows for a high scalability when the number of processors increases. The parallel implementation is Single Program Multiple Data (SPMD) which means that all the computing nodes perform the same operations simultaneously, including their communication with neighbors. There is no data transfer bottleneck during the computation except for a limited number of cases (e.g., when a global sum has to be performed).
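As an illustration of the addressing implied by such a torus, the sketch below (a plain C assumption about layout, not apeNEXT system code) computes the identities of a node's six neighbors with periodic boundary conditions; the grid dimensions are placeholders.

```c
#include <stdio.h>

#define NX 8
#define NY 8
#define NZ 8

/* Periodic wrap: the "last" coordinate along a direction maps back to the "first". */
static int wrap(int c, int n) { return (c % n + n) % n; }

/* Linear node id from 3D torus coordinates. */
int node_id(int x, int y, int z) {
    return wrap(x, NX) + NX * (wrap(y, NY) + NY * wrap(z, NZ));
}

int main(void) {
    int x = 0, y = 3, z = 7;
    /* one step up and down in each of the three directions */
    printf("x+ %d  x- %d\n", node_id(x + 1, y, z), node_id(x - 1, y, z));
    printf("y+ %d  y- %d\n", node_id(x, y + 1, z), node_id(x, y - 1, z));
    printf("z+ %d  z- %d\n", node_id(x, y, z + 1), node_id(x, y, z - 1));
    return 0;
}
```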
The Global Features Every board contains computing nodes. A crate contains boards assembled thanks to a backplane which provides all the communication links. A crate thus contains nodes. A rack contains two crates. It has a peak performance of . Tflop, a footprint of m only, a power consumption less than kW which allows for air cooling. The apeNEXT system is accessed from a master front-end PC which controls slave PCs (typically
one for two boards). The PCs communicate via custom designed host-interface PCI boards. It uses a simple IC network for operating system requests, to know the state of the machine, handle errors and exceptions. Input/output operations are handled via a highspeed link, called “th link,” and proceed via the PCs. Several racks can also be connected together for a large scale calculation.
The apeNEXT Software The TAO Language Details on the software can be found on the Web site []. The series of APE computers use a special high-level language named “TAO.” It is a rather simple language, somehow in the spirit of Fortran. It contains some improvements. For example one can define a mathematical object which is a SU() matrix (i.e., a * unitary complex matrix of determinant ). Once defined all the operations on these matrices, one can overload the “multiply” symbol: C =A∗B
()
where C, B, and A are SU() matrices, means that C is the matrix product of A and B. This is extremely useful, knowing that the SU() matrices are basic objects in LQCD. The TAO language contains a few additional commands related to the SPMD parallelism. For example, an “if” behaves differently depending on whether it applies to the full system (in which case it induces a branching) or to one node (in which case it will go through both branches with an appropriate mask). The transfer of data between nodes is simply performed by address numbers which are understood as being on the remote node. Finally, TAO is associated with a preprocessing language, “ZZ.” It allows one to define new data types, operations between these types, and to write loops which are unrolled at compile time.
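TAO itself is little documented outside the APE community, so the sketch below uses C (not TAO) to illustrate the operation the overloaded “*” performs on SU() matrices: a 3 × 3 complex matrix product whose innermost step is the complex a*b + c “normal operation” of the apeNEXT arithmetic unit.

```c
#include <complex.h>

/* C = A * B for 3x3 complex matrices.  Each inner step is one complex
   multiply-add, i.e., the eight-flop normal operation described above. */
typedef double complex su3[3][3];

void su3_mul(su3 c, const su3 a, const su3 b) {
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double complex acc = 0.0;
            for (int k = 0; k < 3; k++)
                acc = a[i][k] * b[k][j] + acc;   /* a*b + c pattern */
            c[i][j] = acc;
        }
}
```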
The Compiler The compiler also has been written by the apeNEXT collaboration. It contains several successive modules. The Tao code is first translated into the user level Assembly language named SASM. The latter is then converted into the microcode level Assembly, MASM. An optimizer
named SOFAN [] then eliminates the useless instructions or combines several instructions. It turns the original MASM code into an improved MASM code. The “shaker,” then, converts the MASM into the microcode and reorganizes the program to eliminate, as much as possible, useless cycles, e.g., while waiting for data from memory. The compiler decides how to use the register files in order to minimize the latency of data retrieval from memory or from other nodes. The programmer can also take some decisions concerning the use of the registers, e.g., demanding some often reused data to stay in registers. More generally it is noticeable that, the architecture being rather simple and transparent, the programmer can know very exactly what happens during a calculation including every step of data transfer. Having thus a complete and clear understanding of how the computing time is used, one can efficiently improve the code’s efficiency. It also helps when debugging. Of course, the software also contains a custom operating system which is adapted to the architecture.
The Theoretical Physics Research Thanks to apeNEXT The apeNEXT is the latest computer of the APE series, after APE, APEcento, and APEmille. The apeNEXT racks have been installed in in three places: In an international computational laboratory in the university of Rome “La Sapienza” ( racks belonging to the INFN and used by the majority of Italian LQCD groups, plus racks supported by the French CNRS and National Research Agency (ANR)). Six racks are in the Bielefeld university in Germany and four racks in DESY-Zeuthen (close to Berlin). During these years, the apeNEXT computers have been intensively used for research in theoretical physics. The major difficulty in LQCD calculations is the presence of very light quarks in nature. The computing time increases very fast when the quarks become lighter. apeNEXT and other computers of that generation allowed the LQCD community to really perform calculation with quarks significantly lighter than what was possible before, although, it was not enough to reach a quark mass as light as in nature. This is left to the next generation of computers.
The LQCD group in Bielefeld has used its apeNEXT’s to study the properties of QCD at finite temperature, a situation which has existed after the big bang and is still observable in accelerators in which very heavy ions collide at very large energy. They have among other quantities, computed at which temperature a phase transition takes place which deconfines the quarks and gluons, i.e., liberates them from being bound into hadrons. The apeNEXT computers in Zeuthen and Rome have been mainly devoted to the zero temperature case. They have studied the properties of the proton, neutron, and the pion which constitute the atomic nuclei and represent more than % of the visible mass of our universe. They have also devoted many studies to states containing the heavy quarks named “charm quarks” and “beauty quarks.” These quarks are not frequent in our everyday surrounding, they have to be produced by accelerators of particles. They have been extensively studied experimentally in dedicated accelerators because they allow an accurate test of the standard model of particle physics, which is our present understanding of the fundamental laws of nature. They also open a window toward an understanding of the laws of nature at still smaller scales than the ones we already understand. These experiments in combination with theoretical studies, mainly based on LQCD, have confirmed the “Cabbibo–Kobayashi–Maskawa (CKM)” model which describes the mixing between different quark species and a mechanism for CP violation (asymmetry between particles and their antiparticles). Thanks to this vast experimental and theretical work, the CKM model has been awarded by the Nobel Prize in . It is impossible of course to quote all the publications stemming from calculations performed thanks to the apeNEXT computer. They have provided a great help to LQCD in continental Europe. In Italy, they are still today the major tool to perform the heavy LQCD calculations. In France, it was so until fall when a BlueGene was installed in the CNRS computing center, but they are still under very active use.
Conclusion The apeNEXT, the latest computer of the APE series, was, during the last years, the only fully custom highpower computer designed and produced in Europe. All APE computers were created without the help of any
major computer company. The Italian SME “Eurotech” company has built the prototypes of apeNEXT and the market products. The apeNEXT team was essentially an accademic collaboration of physicists and computer scientists from Italy, Germany, and France. The funding came from different scientific institutions in these three countries. This created a rich know-how which was recently reused in the QPACE project building a LQCD dedicated supercomputer using IBM-CELL [, ] processors and an apeNEXT inspired network. This know-how is also used in the recent Aurora project [, ] of a supercomputer using Intel/Nehalem processors. It is finally used in the petaQCD project of the “Agence Nationale de la Recherche” about software and hardware problems of the petaflop era for LQCD []. Beyond the know-how, the community which was created around the apeNEXT project is still alive and may be fruitful in the future. Finally, the apeNEXT computers played an essential role in allowing the European LQCD community to enter really in the era of lattice calculations with light dynamical quarks.
Related Entries
IBM Blue Gene Supercomputer
QCDSP and QCDOC Computers
Bibliographic Notes and Further Reading A popular and deep presentation of QCD can be found in Frank Wilczek’s Nobel lecture [] or paper in Physics Today named “QCD made simple” []. The history of the APE series has been recently summarized by Nicola Cabbibo []. The apeNEXT collaboration has published a series of paper or conference contributions [–]. More details can be found on the documentation Web sites [, ] and [, ].
Bibliography . http://www.ba.infn.it/~apenext/PRESENTATIONS_FILES/ Cabibbo.pdf . Bodin F et al () Nucl Phys Proc Suppl :, arXiv:heplat/ . Alfieri R et al. APE Collaboration, arXiv:hep-lat/
. Ammendola R et al () Nucl Phys Proc Suppl :, arXiv:hep-lat/ . Bodin F et al () In: The proceedings of conference for computing in high-energy and nuclear physics (CHEP ), La Jolla, – Mar , pp THIT, arXiv:hep-lat/ . Bodin F et al () ApeNEXT collaboration. In: The proceedings of rd international conference on physics in collision (PIC ), Zeuthen, – Jun , pp FRAP, arXiv:heplat/ . Bodin F et al () ApeNEXT collaboration. Nucl Phys Proc Suppl : . http://www.fe.infn.it/fisicacomputazionale/ricerca/ape/ NXTPROJECT/doc/root.html . http://apegate.roma.infn.it/APE/index.php?go=/ documentation.tpl . http://hpc.desy.de/ape/software/sofan/ . Baier H et al () Pos LAT :. http:arxiv.org/abs/. . Baier H et al. arXiv:., hep-lat . Aurora Science Collaboration () Pos LATTICE : . Abramo A et al. Tailoring petascale computing. In: Proceedings of ISC, June – (Hamburg) . https://www.petaqcd.org/ . http://nobelprize.org/nobelprizes/physics/laureates// wilczek-lecture.html . http://www.frankwilczek.com/Wilczek_Easy_Pieces/ . Belletti F et al () ApeNEXT collaboration. Nucl Instrum Meth A : . Belletti F et al () ApeNEXT collaboration. Comput Sci Eng :
QCD (Quantum Chromodynamics) Computations Karl Jansen NIC, DESY Zeuthen, Zeuthen, Germany
Definition Quantum Chromodynamics (QCD) is the prime and generally accepted theory of the strong interactions between quarks and gluons. In QCD, there appear intrinsically nonperturbative effects such as the confinement of quarks, chiral symmetry breaking, and topology. These effects can be analyzed by formulating the theory on a four-dimensional space-time lattice and solving it by large-scale numerical simulations. An overview of the present simulation landscape, a physical example and a description of simulation and parallelization aspects of QCD on the lattice is given.
Discussion Introduction When experiments at large accelerators such as HERA at DESY and LEP and LHC at CERN are performed, in the detectors one observes hadrons such as pions, protons, or neutrons. On the other hand, from the particular signature of these particles as seen in the detectors, it is known that the hadrons are not fundamental particles but that they must have an inner structure. It is strongly believed nowadays that the constituent particles of all hadrons are the quarks with the gluons being the interaction particles that “glue” the quarks together to perform the bound hadron states, i.e., the particles that are seen in the experiments. The force that binds the quarks and gluons together – the strong interaction – is theoretically described by quantum chromodynamics (QCD). The postulation of QCD is that at very short distances much below fm, the quarks behave as almost free particles that interact only very weakly, a phenomenon called asymptotic freedom. On the other hand, at large distances, at the order of fm, the quarks interact extremely strongly, in fact so strong that they will never be seen as free particles but rather form the observed hadron-bound spectrum. The latter phenomenon is called confinement. Since the interaction between quarks becomes so strong at large distances, analytical methods such as perturbation theory fail to analyze QCD. A method to nevertheless tackle the problem is to formulate QCD on a four-dimensional, euclidean space-time lattice [] with a nonzero lattice spacing a. This setup first of all allows for a rigorous definition of QCD and to address theoretical questions in a conceptually clean and fundamental way. On the other hand, the lattice approach enables theorists to perform numerical simulations [] and therefore to investigate also problems of nonperturbative nature. In the past, lattice physicists had to work with a number of limitations when performing numerical simulations because these simulations are extremely computer time expensive, needing petaflop computing and even beyond, a regime of computing power which is just reached today. Therefore, for a long time the quarks were treated as infinitely heavy, since in this situation the computational costs are orders of magnitude smaller. However, this is a crude approximation given
QCD (Quantum Chromodynamics) Computations. Fig.  The values of the lattice spacing a (vertical axis, a [fm], 0.00–0.15) and pseudo scalar masses mPS (horizontal axis, 100–600 MeV) as employed in typical QCD simulations by various collaborations as (incompletely) listed in the legend (ETMC Nf = 2, QCDSF Nf = 2, CERN-ToV Nf = 2, CLS Nf = 2, JLQCD Nf = 2, JLQCD Nf = 2+1, PACS-CS Nf = 2+1, RBC-UKQCD Nf = 2+1, MILC Nf = 2+1, JLQCD (2001) Nf = 2, experiment). The blue dot indicates the physical point where in the continuum limit the pseudo scalar meson assumes its experimentally measured value of mPS = MeV. The black cross represents a state of the art simulation by the JLQCD collaboration in .
that the up and down quarks have masses of only O(MeV). In a next step, only the lightest quark doublets, the up and down quarks, were taken into consideration, although their mass values as used in the simulation have been unphysically large, about two to three times their physical values. Nowadays, besides the up and down quarks, also the strange quark is included in the simulations. In addition, these simulations are performed in almost physical conditions, having the quark masses close to their physical values, large lattices with about fm linear extent and small values of the lattice spacing such that a continuum limit can be performed. The situation [] of the change of the simulation landscape is illustrated in Fig. . In the figure, the blue dot indicates the physical point, i.e., a zero value of the lattice spacing and a pion mass reaching its physical value of MeV. The black cross represents a state-of-the-art simulation in the year . As can be seen in the graph, most of the simulations currently go well beyond what could be reached in , clearly demonstrating the progress in performing realistic simulations. This quite significant change in the situation is due to three main developments: (a) algorithmic breakthroughs [–]; (b) machine development; the computing power of the present BlueGene P systems of about Petaflops is even outperforming Moore’s law (c) conceptual developments, such as the use of improved actions which reduce lattice artefacts and the development of nonperturbative renormalization.
As one example of a physics result, the computation of the light hadron spectrum from first principles using lattice simulations [] is shown in Fig. . Using only two input masses, the ones for the pion and the Kaon, all other masses are a result from these ab initio calculations. Note that the lattice data reach in general a few percent accuracy while the experimentally obtained values are partly at the sub-percent level. In addition, some of the states are resonances which would need a special treatment on the lattice. Nevertheless, the good agreement between the experimentally observed and the numerically computed spectrum is thus a nice confirmation that QCD is indeed the correct theoretical description of the strong interaction, although further tests using more physical observables are necessary to eventually consolidate and verify this result. Results of lattice computations, the progress in the field, and a discussion of many technical aspects are documented in the proceedings of the lattice symposia taking place worldwide on a yearly basis.
The Principle of Lattice Gauge Theory Calculations
The main target of lattice calculations is to compute expectation values ⟨O⟩ of physical observables O, the hadron masses discussed above being one example. The general problem of such calculations is to solve a specific, extremely high-dimensional integral.
[Figure: the light hadron spectrum, masses M [MeV] from 0 to 2,500; legend: Experiment, Width, QCD]
QCD (Quantum Chromodynamics) Computations. Fig. The light hadron spectrum as obtained in ref. []
To illustrate the problem, let us consider a one-dimensional problem in only one variable x. The eventual aim is the computation of the "expectation value" ⟨f(x)⟩ of a function f(x). In analogy to QCD, this amounts to calculating the integral ∫ dx f(x) e^(−S(x)), where S(x) is the so-called action which defines the physical system under consideration. If a very simple form of such an action is taken, namely S(x) = x²/2, then the integral can be solved by generating Gaussian random numbers of unit variance as integration points x_i. This is the principle of "importance sampling." With this choice of integration points x_i, ⟨f(x)⟩ is simply given by the average over all x_i, ⟨f(x)⟩ ≈ (1/N) ∑_i f(x_i), where N is the total number of integration points.
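To make the importance-sampling recipe above concrete, the following minimal C sketch estimates ⟨f(x)⟩ for the Gaussian action by averaging f over normally distributed sample points; the choice f(x) = x² and the sample size are arbitrary illustrations, not taken from the text.

```c
/* One-dimensional importance sampling: estimate <f(x)> with weight
 * exp(-x*x/2) by averaging f over Gaussian random numbers of unit
 * variance.  f(x) = x*x and N are arbitrary illustrative choices. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Box-Muller: one standard normal random number. */
static double gaussian(void) {
    const double two_pi = 6.283185307179586;
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(two_pi * u2);
}

static double f(double x) { return x * x; }   /* observable; <x^2> = 1 here */

int main(void) {
    const int N = 1000000;                    /* number of integration points */
    double sum = 0.0;
    for (int i = 0; i < N; i++) sum += f(gaussian());
    printf("<f(x)> ~ %f\n", sum / N);         /* should come out close to 1 */
    return 0;
}
```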
In lattice QCD the same principle of importance sampling is used to compute physical observables, but the action used is much more complicated. The form of the fermionic part of the action is given as S = ∑_(x,y) Ψ̄(x) M(x,y) Ψ(y). Here x and y denote four-dimensional, integer lattice points, and the quark fields Ψ (antiquark fields Ψ̄) are defined at these discrete lattice points. The matrix M(x,y) describes the interaction between the quark fields. It has to be remarked already at this point that M is a very large matrix, of the order of a hundred million times a hundred million for the most advanced simulations nowadays. However, a big advantage is that the general structure of the matrix M is such that only quark fields at nearest-neighbour points are coupled. The general form of the interaction is

Ψ̄^(i,α)(x) U^(i,j)(x,μ) Γ^(α,β) Ψ^(j,β)(x + aμ̂).    ()
Here, the field Ψ^(i,α)(x) is equipped with internal indices i = 1, 2, 3, representing colour, and α = 1, ..., 4, representing the Dirac structure. The so-called gauge field U^(i,j)(x,μ) is an SU(3) matrix describing the gluon degrees of freedom. It goes beyond the scope of this article to discuss in detail the physical motivation and interpretation of this coupling structure of neighbouring fields; for our purposes it is more important to just look at the mathematical structure of this coupling. Let us also remark that the gauge fields U interact with themselves, which can be expressed through another action Sg(U), the particular form of which is not important here. The expectation value of a physical observable now corresponds to solving the integral

⟨O⟩ = ∫ DΨ DΨ̄ DU O e^(−S(U,Ψ,Ψ̄))    ()
where the integration is to be performed over all quark and gluon fields and where the action is given by

S(U, Ψ, Ψ̄) = Sg(U) + ∑_(x,y) Ψ̄(x) M(x,y) Ψ(y).    ()
The action in Eq. contains two free parameters, the quark mass and the gauge coupling. These parameters have to be tuned in order to achieve a particular physical situation. Before proceeding, another technical difficulty has to be discussed. The quark fields are not represented by normal, commuting numbers but by so-called Grassmann variables, which anti-commute. As a consequence, the quark fields cannot be dealt with on a computer directly, since computers deal with normal numbers only. However, the Grassmann fields can be integrated over, leading to the expression (assuming that the matrix M is such that all the integrations discussed here can be performed, which is, of course, the case for real lattice QCD)

⟨O⟩ = ∫ DU O det[M(U)] e^(−Sg(U)).    ()
However, given the large size of the matrix M, computing its determinant is hopeless. The solution of this difficulty is to rewrite the determinant as an integral over ordinary, bosonic fields Φ, leading to the final expression for a physical observable,

⟨O⟩ = ∫ DΦ† DΦ DU O e^(−S(U,Φ†,Φ))    ()
with

S(U, Φ†, Φ) = Sg(U) + Φ† M^(−1) Φ.    ()
This shows that for the evaluation of the action S(U, Φ†, Φ) the vector X = M^(−1) Φ is needed. The basic numerical problem in lattice QCD is thus to solve a set of linear equations

M X = Φ    ()
with the fermion matrix M having the particular interaction form given in Eq. and the aforementioned size of a hundred million times a hundred million. Since M connects only nearest-neighbor points on the lattice, it is a sparse matrix with only the diagonal and a few sub-diagonals occupied, and the techniques for solving such sparse systems of linear equations [] can be applied. Moreover, what is actually (hard-)coded is only the application of the fermion matrix M to a vector, and hence M need not be stored in memory. Nevertheless, the matrix M is large, and a solver such as, e.g., conjugate gradient (CG) [] needs several hundred to thousands of iterations to converge to the solution with the desired relative precision. Moreover, the computation of the integral of Eq. over the gauge fields U needs several thousand of these gauge fields, and hence Eq. needs to be solved hundreds of thousands of times to obtain just one physical result at just one value of the action parameters, i.e., the quark mass and
the gauge coupling. In order to have a complete result, simulations at many of these action parameters have to be performed.
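The linear solve M X = Φ is typically done matrix-free: only a routine applying M to a vector is coded, as described above. The following is a minimal conjugate gradient sketch in C for a generic symmetric positive definite operator supplied as such a routine; in real lattice codes the Dirac operator is first brought into a suitable Hermitian positive definite form (e.g., via the normal equations), a detail omitted here, and the names (apply_M, cg_solve) are illustrative.

```c
/* Matrix-free conjugate gradient sketch (real, symmetric positive definite
 * operator).  The operator is supplied only as a routine that applies M to
 * a vector; M itself is never stored. */
#include <stdlib.h>

typedef void (*operator_t)(double *out, const double *in, int n);

static double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Solve M x = phi iteratively; returns the number of iterations used. */
int cg_solve(operator_t apply_M, double *x, const double *phi,
             int n, double tol, int max_iter) {
    double *r  = malloc(n * sizeof *r);   /* residual  r = phi - M x */
    double *p  = malloc(n * sizeof *p);   /* search direction        */
    double *Mp = malloc(n * sizeof *Mp);  /* M applied to p          */
    apply_M(Mp, x, n);
    for (int i = 0; i < n; i++) { r[i] = phi[i] - Mp[i]; p[i] = r[i]; }
    double rr = dot(r, r, n);
    double target = tol * tol * dot(phi, phi, n);
    int k = 0;
    while (rr > target && k < max_iter) {
        apply_M(Mp, p, n);
        double alpha = rr / dot(p, Mp, n);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Mp[i]; }
        double rr_new = dot(r, r, n);
        double beta = rr_new / rr;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        k++;
    }
    free(r); free(p); free(Mp);
    return k;
}
```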
Parallelization
The description of the computational elements given above leads to a very high total computational cost. For example, for a moderately sized lattice, one rack of a BG/P system yields only a limited number of gauge field configurations per day, while a simulation leading to a sufficiently precise estimate of ⟨O⟩ requires several thousand such configurations and would thus need many months. Clearly, without any parallelization, it would be totally unrealistic to solve this problem. Fortunately, the structure of the fermion interaction, with only nearest-neighbor couplings, renders the problem of parallelization rather easy. There are essentially two approaches. The first is to use MPI. Here, the total lattice is distributed over the number of desired processors (domain decomposition). The total lattice is divided into local sub-lattices of size lt ⋅ lx ⋅ ly ⋅ lz on an Nt ⋅ Nx ⋅ Ny ⋅ Nz processor grid. The parallelization strategy is then to equip the fields on the sub-lattice with "borders" which hold the neighboring lattice points needed by the sub-lattice on a particular processor.
[Figure: double-logarithmic plot of sustained performance in Tflops versus the number of processors [10³], for global lattice volumes 32³ × 64, 64³ × 128, and 128³ × 256, plus a single point at 128³ × 288]
QCD (Quantum Chromodynamics) Computations. Fig. Strong scaling of the lattice fermion matrix on the Jugene BG/P system at the Jülich supercomputer center for three different global lattice volumes
Before any computation is performed, each processor sends the necessary neighbor points to the corresponding adjacent processors, where these points are then stored in the borders. This concept works well for machines with a very fast communication network that has, however, a rather large latency. Another concept is direct memory access to neighboring processors. Here, a single datum is sent to a neighboring processor whenever it is needed. In an application code this looks like Φ(x + aμ̂ + mn), where "mn" stands for a so-called magic number which addresses the memory location of the field Φ(x + aμ̂) on a neighbouring processor. Realizations of this concept have been made on the dedicated lattice QCD machines APE [] and QCDOC []. Other, newer developments are machines with multicore processors [] and GPUs. Here, new parallelization concepts have to be developed, and the problem of working efficiently on many GPUs is in general still under investigation. Figure shows an example of the (strong) scaling behaviour of a particular implementation of a lattice QCD code []. The total performance in Teraflops is shown for the application of the fermion matrix on three different volumes, 32³ × 64, 64³ × 128, and 128³ × 256, as a function of the number of processors in a double-logarithmic plot. A clear linear scaling is observed for all volumes used. The single point in Fig. represents the performance measurement for a 128³ × 288 volume, which was chosen to use the full installation of the BG/P system at the supercomputer center in Jülich. The graph shows that lattice QCD is very well suited to massively parallel machines and exhibits almost perfect strong scaling behaviour. Note that lattice QCD applications typically sustain a substantial fraction of the peak performance of the supercomputers used. The total simulation cost depends strongly on the volume, the pion mass, and the lattice spacing used in the simulations. For a discussion of the scaling of the algorithms employed in lattice QCD applications, I refer to refs. [, ]. For quantities such as some of the hadron masses and decay constants, it can be expected that machines in the multi-Petaflops regime are sufficient to match the experimental accuracy. For other, more complicated observables, for example those connected to scattering states or unstable particles, clearly exaflops computing capabilities are required.
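As an illustration of the border ("halo") exchange described in this section, here is a minimal MPI sketch in C for one direction of the processor grid; the Cartesian communicator, the packed boundary buffer, and all names are assumptions for the example, not taken from an actual lattice QCD code.

```c
/* One step of the halo (border) exchange for a single direction mu of the
 * processor grid.  Assumes a Cartesian communicator "cart" and a packed
 * buffer holding the boundary sites; the layout is illustrative. */
#include <mpi.h>

/* Send the local boundary sites to the neighbour in the +mu direction and
 * receive the matching boundary of the -mu neighbour into the local halo. */
void exchange_boundary(MPI_Comm cart, int mu,
                       const double *send_buf, double *halo_buf, int count) {
    int rank_minus, rank_plus;
    /* Neighbour ranks one step along direction mu of the torus. */
    MPI_Cart_shift(cart, mu, 1, &rank_minus, &rank_plus);
    MPI_Sendrecv(send_buf, count, MPI_DOUBLE, rank_plus,  mu,
                 halo_buf, count, MPI_DOUBLE, rank_minus, mu,
                 cart, MPI_STATUS_IGNORE);
}
```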
Bibliography
1. Wilson KG () Confinement of quarks. Phys Rev D:–
2. Creutz M () Monte Carlo study of quantized SU() gauge theory. Phys Rev D:–
3. Jansen K () PoS LATTICE, , Lattice QCD: a critical status report. [arXiv:. [hep-lat]]
4. Lüscher M () Schwarz-preconditioned HMC algorithm for two-flavour lattice QCD. Comput Phys Commun :
5. Urbach C, Jansen K, Shindler A, Wenger U () HMC algorithm with multiple time scale integration and mass preconditioning. Comput Phys Commun :–
6. Clark MA, Kennedy AD () Accelerating dynamical fermion computations using the rational hybrid Monte Carlo (RHMC) algorithm with multiple pseudofermion fields. Phys Rev Lett :
7. Lüscher M () Deflation acceleration of lattice QCD simulations. JHEP :
8. Dürr S et al () Ab initio determination of light hadron masses. Science :–
9. Saad Y () Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia, PA
10. Belletti F et al () Computing for LQCD: apeNEXT. Comput Sci Eng :–
11. Boyle PA et al () QCDOC: project status and first results. J Phys Conf Ser :–
12. Baier H et al. QPACE – a QCD parallel computer based on Cell processors. arXiv:.
13. Jansen K, Urbach C () tmLQCD: a program suite to simulate Wilson twisted mass lattice QCD. Comput Phys Commun :–
14. Jung C () PoS LAT, , Status of dynamical ensemble generation. [arXiv:. [hep-lat]]
QCD Machines
Norman H. Christ
Columbia University, New York, NY, USA
Definition
QCD machines are parallel computers whose design and construction are specially tailored to efficiently carry out lattice QCD calculations. Quantum chromodynamics (QCD) is that part of the standard model of particle physics which describes the interactions of quarks and gluons. Lattice QCD is a discrete formulation of this theory which permits its study by numerical methods.
Discussion

Introduction
The neutrons and protons which compose the atomic nucleus are themselves made of quarks and gluons, which are accurately described by the fundamental theory of quantum chromodynamics (QCD). QCD is a natural generalization of Maxwell's theory of electromagnetism. To a large degree, the theory of electromagnetism can be understood using classical physics, with quantum phenomena appearing as important refinements of the classical theory, revealing quantum granularity when only a few quanta are present. This theory is well described by Maxwell's four partial differential equations, with quantum corrections that can be computed using quantum perturbation theory. In contrast, the phenomena of QCD have no relevant classical limit and must be understood by solving the full quantum mechanical problem without recourse to perturbation theory. This challenging theory is best formulated as a Feynman sum over histories or a Wiener integral, which in present applications requires the evaluation of an integral over many millions of variables, a problem that demands the power of highly parallel computing. This large number of variables is only an approximation to the infinite number that would be needed if an integration variable were assigned to each point in the space-time continuum. By approximating the space-time continuum by a four-dimensional grid or lattice of points, the number of integration variables is made finite. Lattice QCD exploits such a grid and evaluates the resulting integral by Monte Carlo methods. While many important problems require the enormous computing power that potentially results from a large degree of parallelism, lattice QCD has particular features which make it a natural first target for parallelization and support the construction of parallel computers specialized for its solution. In order of importance, these features are:

Homogeneity. As a fundamental description of the physics of quarks and gluons in space-time, QCD offers enormous natural parallelism. Since each lattice site must be treated symmetrically, the parallel computation that must be performed on the variables at one site must be the same for every other. This high degree of
homogeneity makes the code simple and leads to perfect load balancing if each processor in a parallel system is assigned the same number of lattice sites.

Locality. An important consequence of Einstein's theory of special relativity is that all interactions are local. Only nearby variables in space-time are coupled, and long-range effects come about as changes propagate from one site to the next, finally crossing many sites. Thus, communication between the homogeneous processes described in the previous paragraph need only be provided between neighboring processes. This vastly simplifies the needed communication in a parallel computer and supports the simplest of mesh communication fabrics.

Simplicity. The basic algorithms needed to stochastically evaluate the QCD path integral are intrinsically simple. The code needed for the early study of only the gluon variables requires a simple heat bath or Metropolis update and takes only a few hundred lines. If the quarks are included in the calculation, then the Dirac matrix must be inverted. This sparse matrix acts on variables defined at each lattice site and connects only neighboring sites. Its inverse can be efficiently calculated using the conjugate gradient algorithm, again with a few hundred lines of code. In this situation, rewriting QCD code to target special floating point units is a manageable task.

Stability. The most effective algorithms for lattice QCD calculations have evolved very slowly. While algorithmic improvements have provided spectacular gains in performance, often outstripping the gains coming from computer technology, these improvements have involved little change to the fundamental structure of the computational method. There is reasonable expectation that a computer architecture that is presently efficient for QCD will also be efficient in a few years.

Over the past years, the computing resources available for lattice QCD have increased by more than six orders of magnitude, from one million floating point additions or multiplications per second (1 Mflops) to more than one trillion (1 Tflops). Combined with substantial advances in algorithms, this has allowed the size of a typical space-time volume to increase enormously. With this substantial increase, QCD becomes a "multi-scale" problem in which substantial advantage follows from using increasingly sophisticated methods to treat differently
those parts of the calculation which are physical, with a scale of many lattice spacings, and the unphysical, cut-off effects associated with distances of a few lattice spacings. These new methods are not simple and are evolving quickly. Thus, the simplicity and stability of QCD code is less evident now than it was some years ago. These four features of lattice QCD have made this application an attractive target for the construction of specialized parallel computers. The first of these were the Cosmic Cube at Caltech, the Columbia -node machine, and the APE machine in Rome in the – period. The next generation of machines, constructed in the – period, were the - and -node Columbia machines, the ACPMAPS machine at Fermilab, the GF11 at IBM, and the APE-100 in Rome. Finally, the most recent group of QCD machines, completed in the – time frame, are the QCDSP, APEmille, QCDOC, apeNEXT, and QPACE computers. QCDSP, QCDOC, and apeNEXT are treated in separate entries in this volume; the others are described below. These special purpose machines have both made important contributions to the study of QCD and served as examples and inspiration for the development of commercial parallel computers.
First Examples
The story of QCD machines begins in , a time when three important opportunities converged: (1) the physics potential offered by large-scale lattice QCD calculations, (2) the possibility to exploit large-scale parallelism, and (3) the availability of complex functionality in highly integrated, easily interconnected and inexpensive VLSI (very large scale integration) computer chips. The first machine to exploit these three possibilities was the Cosmic Cube of Fox and Seitz at Caltech []. This machine was constructed both to perform lattice QCD calculations and to provide a platform on which parallel implementations of many computational science problems could be mounted. Therefore, a very general hypercube communication network was constructed. The first machine had nodes. The network was constructed to treat these nodes as the corners of a six-dimensional cube, with each node connected to its six neighboring nodes in the cube. The individual nodes were built from Intel and chips. The latter, a floating point coprocessor, provided a peak speed of Kflops. The performance for QCD on the -node
Cosmic Cube was ∼ Mflops [], and interesting physics results for the heavy quark potential were published in []. A similar effort to construct special purpose computers for lattice QCD was started by Christ and Terrano in the Physics Department of Columbia University at the same time. Here the focus was on providing high-performance floating point capability while constructing the simplest communications network required to support this floating point capability. Thus, the first Columbia QCD machine was a single floating point accelerator which was attached to a DEC PDP computer and used to perform the six triples of 3 × 3 matrix multiplication and accumulation that are required for the Monte Carlo sampling of the Feynman path integral for the theory of gluons alone []. This machine used a TRW -bit integer multiplier and cascaded four-bit adders, and increased the performance of the PDP code by a factor of , for a performance twice that of a VAX. Following this successful demonstration of the ease with which standard components could be combined to provide fast floating point arithmetic, this same team constructed the Columbia -node machine [, ]. This machine contained nodes, each on a single large printed circuit board with wire-wrap connections, arranged as a × mesh. The right and left edges of this mesh were joined, as were the top and bottom edges, changing the two-dimensional topology from a square to a torus. Each node was composed of Intel and chips, providing easily programmable control, address generation, and floating point capability. However, the real power of the node was provided by a vector processor composed of a pair of TRW chips, a floating point adder, and an integer multiplier enhanced by extra circuitry to perform floating point operations. This unit performed long sequences of pipelined floating point additions and multiplications, with the specific operations and operand addresses determined by the bits in a sequence of microcode words stored in a K × -bit wide memory. Running at MHz, this unit had a peak speed of Mflops, and the entire machine a peak speed of Mflops. By modern standards, the nearest-neighbor communication provided in this machine was somewhat primitive. Each node was connected to its four nearest neighbors by bidirectional data wires. If we locate
each node by its x and y coordinates in the mesh, then these data paths expand the memory of the node with coordinates (x, y) to include the memories of its two neighbors in the positive x and y directions, with the usual mod (wrap-around) operation applied at the mesh edges. No address lines were included in these network connections, and off-node communication was carried out in a synchronous SIMD (single instruction multiple data) mode in which all nodes functioned in lock-step and the needed data in the off-node memories was addressed by the local processors (either micro- or vector processor) on those nodes. The programming model for this machine followed naturally from the homogeneous character of both lattice QCD and the computer's architecture. The code was written from the perspective of a single node, with an off-node access treated no differently from one which was local. Such single-node code was loaded and run on each node. Those portions of the code that were strictly local could run in MIMD (multiple instruction multiple data) mode without synchronization. Metropolis accept–reject decisions could be easily made in this mode. However, during off-node communications, all code had to be executing the same branch and the Intel processors had to be synchronous on a cycle-by-cycle level. Creating code for the / processors was straightforward with good compiler support. This single-node execution model and memory-mapped nearest-neighbor communication is relatively easy to understand and follow. Programming the vector unit was more difficult, even with a rudimentary assembler. Perhaps the greatest software challenge was created by the heterogeneous character of the computer node. Much effort was required to reach an optimal balance between slow but easily created and modified code running on the microprocessors and that executing on the vector processor. Performance improved as bottlenecks in the / portions of the code were identified and replaced by vector processor routines. This machine executed QCD code at a speed % faster than its Cray contemporary, sustaining ∼ Mflops or % efficiency for critical codes. It began physics production in March of and was used primarily to study the QCD phase transition for a theory of only gluons. Calculations on this machine were able to
reach far enough into the weak coupling regime to verify the dependence of the transition temperature on the coupling strength expected in perturbation theory []. The third of these early QCD machines was the APE computer designed and built by an Italian group led by Parisi and centered at Rome []. The first physics results [] were obtained on a machine of four relatively powerful nodes connected in a one-dimensional ring. This machine later grew into the full APE computer with a larger ring of these nodes. In contrast to the two machines described above, the APE architecture was entirely designed by the APE collaboration and used no commercial microprocessor. The machine was driven by compiler-generated microcode, which controlled the floating point units (eight Weitek chips on each node), and a sequencer, which determined the next microcode instruction to be executed. The ring architecture was highly programmable. If the nodes are numbered consecutively and the separate memory units are labeled in the same way, then in a given cycle processor number k could be joined to memory number k + l (modulo the ring length) for every k, with a fixed offset l determined by program control. The machine operated in SIMD mode with all nodes controlled by common microcode. Because of the homogeneous character of the machine, a high-level compiler was created to support a Fortran-like "APE language." The output of the compiler could directly control the computer and could also be written out as symbolic tables which could inform further hand optimization of the input code. A four-node APE computer had a peak speed of Mflops and boasted a % efficiency. Two -node machines and a later -node machine enabled the extensive and influential physics program of the APE collaboration during the late 1980s.
Second Generation QCD Machines
This next period of QCD machines spans the – time frame and includes significant evolutions of the Caltech and Columbia projects. The full -node APE machine, with a peak speed of Gflops (10⁹ floating point operations per second), was brought into operation in []. In addition, new QCD machines were built in Japan, at Fermilab, and by IBM. The Cosmic Cube described above led to a sequence of larger hypercube machines at Caltech.
The -node Mark IIIfp computer added a Weitek chip set to each node and achieved a Mflops performance on QCD code. While the principal purpose of this machine was to enable parallel computation for a broad range of scientific applications, an interesting state-of-the-art study of the static quark potential was performed on this machine []. This series of Caltech machines was instrumental in stimulating Intel to begin producing parallel computers targeting the scientific market. The Columbia -node machine was followed by a -node machine which began operation in August of with a peak speed of Gflops and a sustained performance of ∼ Mflops. This machine used the same Intel microprocessor, mesh architecture, and host computer communication as did its -node predecessor. The only changes were an increased memory size (from Mbyte to Mbytes/node) and a change to -bit IEEE Weitek floating point chips. Thus, only the microcode needed to be refreshed. This was followed in by a third, -node machine. Again the same microprocessor was used, the network bandwidth was doubled, and a large enhancement of the floating point unit was realized by using a pair of more advanced Weitek chips, each with two external Weitek register files, all running at instead of MHz. This gave an increase in performance of × over the -node machine, for a peak performance of Gflops and a sustained QCD performance of . Gflops for SU(3) conjugate gradient inversion. It should be recognized that this performance was greater than or equal to that of the fastest commercially available machines. For example, a sustained speed of . Gflops was achieved a year later on a K-node CM-2 machine []. The new project begun at Fermilab, called ACPMAPS (Advanced Computer Program Multiarray Processor System) [], targeted a machine of nodes based on a unified processor architecture constructed from a Weitek chip set. This provided a peak speed of Mflops per node and Gflops for the full machine. The Fermilab approach was crafted to give both cost-effective performance for QCD codes and a user-friendly programming environment that would be well suited to the development of new algorithms. The communications network was a hierarchy of × crossbar switches, providing a greater degree
of interconnectivity than the one-dimensional and two-dimensional networks of the APE and Columbia machines. Each of the crates contained eight processor boards interconnected with one crossbar switch. The crates were themselves interconnected as the nodes in a hypercube. The first physics results from ACPMAPS were reported at the Lattice Field Theory Symposium in Japan [, ]. This machine was upgraded to one of Gflops peak speed in , with a substantial enhancement of the node architecture (the Weitek chip set was replaced by a pair of Intel i860 processors) but with no change in the network []. The flexibility of this machine was supported by an innovative software environment which included C-level programming for the individual nodes and a system called Canopy [], which provided a grid-aware framework that allowed parallel programs to be created with the details of grid communication managed by the software. During this same time frame, the QCDPAX computer was designed [] and put into operation [] at Tsukuba in Japan. This machine contained nodes arranged in a two-dimensional mesh with the edges connected to form a torus. The individual nodes were constructed from a Motorola processor and a gate-array controlled LSI Logic L floating point chip. The complete machine had a peak speed of Gflops and a sustained speed of . Gflops, with – Gflops for the inversion of the Wilson Dirac operator. First physics results were presented in []. The last of the machines built during this period is the IBM GF11 of Beetem, Denneau, and Weingarten []. This project had begun in , and first physics results were presented in []. Like the APE machines, the GF11 was microcode controlled by a standard computer, in this case an RT/PC, and the resulting sequence of microcoded instructions was broadcast to the processors in the system as well as to inputs that controlled a large × switch, which could realize an arbitrary permutation between its data inputs and outputs. During actual program execution, the microcode instructions could choose between , preloaded switch configurations. The extra ports on the switch were connected to disks, so this switch also provided a very flexible connection between the processors and these disks. Each processor contained a hierarchy of memory ( MB DRAM, K SRAM, and
word register file) and a Weitek floating point adder and multiplier, giving each node a peak speed of Mflops. The typical sustained performance reported on a -node version of the machine was . Gflops []. While the -year time-to-completion suggests the difficult struggle needed to complete the GF11, once online this machine was very successful. It produced the first calculation in which a systematic effort was made to remove the errors associated with finite lattice spacing, large quark mass, and finite volume []. Later influential work studied the masses and widths of the lightest glueball states (particles formed from gluons rather than quarks) [].
Toward One Teraflops
The projects described above succeeded in bringing a few Gigaflops of sustained computational speed to bear on the problems of lattice QCD. During the period –, these speeds were dramatically increased by projects in the USA, Italy, and Japan. The APE collaboration developed the APE-100, which enlarged the one-dimensional architecture of the original APE machine to a three-dimensional mesh and incorporated an integrated arithmetic chip which contained the entire original APE node. The first -node, Gflops APE-100 machines were operational in , and the -node, Gflops version by . This machine was constructed commercially, and commercial versions, sold under the name "Quadrics," were purchased by other groups in Europe to support their own lattice QCD calculations. This was followed by a still faster APE machine: APEmille. Begun in [], this machine replaced the single central controller of the earlier APE machine with an array of controllers, one per eight nodes, all mounted on a single PC board along with memory and communications. APEmille benefited from substantial board-level integration and a × faster processor chip. Separate installations of this machine in Germany, France, Italy, and Wales were producing interesting results by . The group at Columbia began designing a new QCD machine in []. This machine, called QCDSP (QCD with Digital Signal Processors), followed a different architecture than the earlier Columbia machines, being built from credit-card-size nodes interconnected in a four-dimensional network. Machines with K and K nodes, sustaining and Gflops, were built
at Columbia and the RIKEN BNL Research Center (RBRC). This machine and the follow-on QCDOC computer are described in a related entry. Finally, in this period, the Tsukuba group, working with the Hitachi company, constructed CP-PACS (Computational Physics Parallel Array Computer System) []. Begun in , this machine was completed in and sustained Gflops []. It was constructed as a three-dimensional array of Hewlett-Packard RISC processors. These were a version of a commercial HP chip modified to include four times the usual number of floating point registers, accessed through a sliding-window architecture. The network was more powerful than a simple nearest-neighbor mesh. While arranged on a three-dimensional Cartesian grid, each node was joined to three crossbar switches which allowed direct communication with any node whose (x, y, z) coordinates differed in only one variable. This was a fully asynchronous machine with communication performed by message passing. The machine contained , nodes of Mflops each, for a peak speed of Gflops. A commercial version of this machine was then sold by Hitachi.
Beyond One Teraflops
Three projects have succeeded in creating QCD machines capable of sustained performance in excess of 1 Tflops. Two of these are described in related entries. The first is the QCDOC machines designed and constructed by a collaboration of Columbia University, the RBRC, and UKQCD. Begun in and completed in , three QCDOC machines, each with a sustained performance of Tflops, were installed at the RBRC, Edinburgh, and the Brookhaven National Laboratory. The second, apeNEXT, was begun in and led to large machines in Bielefeld, DESY/Zeuthen, and Rome with peak speeds of ., ., and Tflops, respectively. The third machine in this multi-teraflops class is the QPACE machine. This was constructed by a collaboration centered at the University of Regensburg with significant involvement from IBM. Four racks were installed in at the University of Wuppertal and four more at the Jülich Supercomputing Centre. The aggregate peak speed of the eight racks is Tflops [], with a sustained performance of –% of peak. The machine is based on an IBM/Sony Cell processor similar to that used in the PlayStation but upgraded to perform double-precision arithmetic.
The nodes are interconnected in a three-dimensional mesh network which is connected to each node by a field programmable gate array (FPGA). This machine had a remarkably short design and construction time which was achieved in part by avoiding the design of a special purpose chip (as was done in the QCDOC and apeNEXT projects) and using instead an FPGA in which initial errors could be easily corrected. Physics calculations are expected to begin on this hardware in January .
Conclusion
The theory of quantum chromodynamics is a computationally simple but highly demanding application area in which the strategy for parallelization is obvious. It is therefore a natural target for parallel computing and, as described above, a natural target for application-specific, highly parallel machines. During the period through the late nineties, QCD machines led or rivaled the most powerful parallel machines available commercially. For example, the Tflops aggregate peak speed of the QCDSP machines put into operation in was comparable to the . Tflops of the Intel ASCI Red installed at Sandia in . The later QCD machines were built at costs ranging between $ and $ million, as might be expected for large research instruments. During the present – period, the – times larger sums spent on commercial supercomputers and their increasingly effective architectures (inspired in some cases by QCD machines) have led commercial machines to outstrip by an order of magnitude or more the power available from special purpose machines. Thus, at present (), with the possible exception of the QPACE machine, the most challenging QCD problems are run on commercial computers. Because of the necessary investment of effort and money, machines built for QCD targeted important research goals requiring the largest possible computer capability. However, beginning in the late 1990s, smaller QCD problems began to be tackled with what were known as Beowulf clusters, collections of standard workstations connected by a commercial network []. While these machines, often configured in a fashion optimized for lattice QCD, are not able to efficiently manage the small lattice volumes per computer node that are required to mount the most
demanding lattice QCD problems on a very large number of processors, they have become increasingly capable, are highly cost effective, and can be easily upgraded every year. These cluster machines are now enjoying a further increase in performance by incorporating the latest graphics processing units (GPUs), which provide enormous performance boosts when a very large local volume can be used on each node. In the USA, the great majority of lattice QCD research is now performed on workstation clusters and the IBM BlueGene computers. To an even greater extent than in , the deeper study of the physics of quarks, gluons, and new strongly interacting particles likely to be discovered at the Large Hadron Collider at CERN requires vast computational resources not available from current computers. In fact, many important problems require computing performance on the exaflops (10¹⁸ flops) scale. Given the experienced and talented research groups in Germany, Great Britain, Italy, Japan, and the USA, we may hope that new technologies will be recognized that can provide this needed computational power and that can be exploited by new generations of specially built QCD machines.
Related Entries
QCD apeNEXT Machines
QCD (Quantum Chromodynamics) Computations
QCDSP and QCDOC Computers
Bibliographic Notes and Further Reading
The plans, progress, and first results from the QCD machines described here were usually reported in the annual International Symposia on Lattice Field Theory, which provide many of the bibliographic citations here. An early volume edited by B. Alder, describing both QCD and other special purpose machines [], may be of interest. Further detailed information about the first QCD machines can be found in the workshop proceedings of Li, Qiu, and Ren [], and about more recent machines in those of Iwasaki and Ukawa []. The theoretical foundations of these lattice methods are described in many monographs and textbooks, for example, those of Creutz [], Montvay and Muenster [], Smit [], Rothe [], and Degrand and Detar [].
Bibliography
1. Alder BJ (ed) () Special purpose computers. Academic, Boston
2. Otto SW () Lattice gauge theories on a hypercube computer. In: Proceedings, gauge theory on a lattice, Argonne, pp –
3. Otto SW, Stack JD () The SU() heavy quark potential with high statistics. Phys Rev Lett :
4. Christ NH, Terrano AE () Hardware matrix multiplier/accumulator for lattice gauge theory calculations. Nucl Instr Meth A:–
5. Christ NH, Terrano AE () A very fast parallel processor. IEEE T Comput C-:–
6. Christ NH, Terrano AE () A micro-based supercomputer. Byte :–
7. Christ NH, Terrano AE () The deconfining phase transition in lattice QCD. Phys Rev Lett :
8. APE Collaboration, Albanese M et al () The APE computer: an array processor optimized for lattice gauge theory simulations. Comput Phys Commun :–
9. APE Collaboration, Remiddi E () From APE to APE. In: Proceedings, lattice , Batavia, pp –
10. Ding HQ, Baillie CF, Fox GC () Calculation of the heavy quark potential at large separation on a hypercube parallel computer. Phys Rev D:
11. Liu WC. Fast QCD conjugate gradient solver on the connection machine. In: Proceedings, lattice , Tallahassee, pp – (see High Energy Physics Index () No. )
12. Fischler M et al () The Fermilab lattice supercomputer project. Nucl Phys Proc Suppl :–
13. Mackenzie PB () Charmonium with improved Wilson fermions. . A determination of the strong coupling constant. Nucl Phys Proc Suppl :–
14. El-Khadra AX () Charmonium with improved Wilson fermions. . The spectrum. Nucl Phys Proc Suppl :–
15. Fischler M, Gao M, Hockney G, Isely M, Uchima M () Reducing communication inefficiencies for a flexible programming paradigm. Nucl Phys Proc Suppl :–
16. PAX Collaboration, Iwasaki Y, Hoshino T, Shirakawa T, Oyanagi Y, Kawai T () QCD-PAX: a parallel computer for lattice QCD simulation. Comput Phys Commun :–
17. QCDPAX Collaboration, Iwasaki Y et al () QCDPAX: present status and first physical results. Nucl Phys Proc Suppl :–
18. QCDPAX Collaboration, Kanaya K et al () Pure QCD at finite temperature: results from QCDPAX. Nucl Phys Proc Suppl :–
19. Beetem J, Denneau M, Weingarten D () The GF11 supercomputer. In: The th annual international symposium on computer architecture, Boston, pp –
20. Butler F, Chen H, Sexton J, Vaccarino A, Weingarten D () Volume dependence of the valence Wilson fermion mass spectrum. Nucl Phys Proc Suppl :–
21. Weingarten DH () Parallel QCD machines. Nucl Phys Proc Suppl :–
22. Butler F, Chen H, Sexton J, Vaccarino A, Weingarten D () Hadron masses from the valence approximation to lattice QCD. Nucl Phys B:–, hep-lat/
23. Sexton J, Vaccarino A, Weingarten D () Numerical evidence for the observation of a scalar glueball. Phys Rev Lett :–, hep-lat/
24. Bartoloni A et al () The new wave of the APE project: APEmille. Nucl Phys Proc Suppl :–
25. Arsenin I et al () A .-teraflops machine optimized for lattice QCD. Nucl Phys Proc Suppl :–
26. Oyanagi Y () New parallel computer project in Japan dedicated to computational physics. Nucl Phys Proc Suppl :–
27. CP-PACS Collaboration, Iwasaki Y () The CP-PACS project. Nucl Phys Proc Suppl A:–, hep-lat/
28. Baier H et al. QPACE – a QCD parallel computer based on Cell processors. arXiv:.
29. Gottlieb SA () Comparing clusters and supercomputers for lattice QCD. Nucl Phys Proc Suppl :–, hep-lat/
30. Li X-Y, Qiu Z-M, Ren H-C (eds) () Lattice gauge theory using parallel processors. In: Proceedings, CCAST (world laboratory) symposium/workshop, Peking, May–June . Gordon & Breach, New York
31. Iwasaki Y, Ukawa A (eds) Lattice QCD on parallel computers. In: Proceedings, international workshop, Tsukuba, Mar
32. Creutz M () Quarks, gluons and lattices. Cambridge University Press, Cambridge (Cambridge monographs on mathematical physics)
33. Montvay I, Munster G () Quantum fields on a lattice. Cambridge University Press, Cambridge (Cambridge monographs on mathematical physics)
34. Smit J () Introduction to quantum fields on a lattice: a robust mate. Camb Lect Notes Phys :–
35. Rothe HJ () Lattice gauge theories: an introduction. World Sci Lect Notes Phys :–
36. DeGrand T, Detar CE () Lattice methods for quantum chromodynamics. World Scientific, New Jersey
QCDSP and QCDOC Computers
Norman H. Christ
Columbia University, New York, NY, USA
Definition
QCDSP (quantum chromodynamics on digital signal processors) and QCDOC (quantum chromodynamics on a chip) are two computer architectures optimized for large-scale calculations in lattice QCD, a first-principles, numerical formulation of the theory which
describes the interactions of quarks and gluons. Both are mesh machines, with communication networks arranged as four- and six-dimensional tori, respectively. Completed in , the largest QCDSP machine had a peak speed of . Tflops, while three QCDOC machines were completed in /, each with a peak speed of Tflops. Both architectures delivered a sustained performance for QCD of about % of peak.
Discussion

Overview
The QCDSP and QCDOC machines represent a second generation of massively parallel computers optimized for QCD. Earlier QCD machines constructed at Columbia University had exploited single-board computers to assemble large machines with or nodes, where each node was a large printed circuit board containing a microprocessor and much support electronics, including memory, internode communications, and substantial added floating point hardware. By the early 1990s, all of these functions could be contained in a very few chips. In particular, if one of those chips was a custom-designed ASIC (application-specific integrated circuit), then even the "glue" or interface logic special to the particular design could be integrated into this small number of chips. This large increase in integration capability made it possible to shrink one of the large node boards, for example, a × cm board making up one node in the Columbia -node machine, to the small, credit-card-size, . × . cm QCDSP node shown in Fig. . While the size of the node was drastically reduced, the floating point performance remained nearly the same. The Mflops peak speed of this new DSP chip compared well with the Mflops peak speed of the two MHz Weitek chips used on the earlier node. Thus, a large increase in computational power could be achieved if this miniaturization was exploited to achieve a substantial increase in the number of nodes. In fact, compact packaging allowed , QCDSP nodes to be accommodated in a m footprint, very similar to the area occupied by the nodes of the earlier machine. The result was a nearly × increase in performance. Such a large increase in the number of nodes required a change in the way the nodes were interconnected.
QCDSP and QCDOC Computers. Fig. A single node of the QCDSP computer. The chip on the right is the Texas Instruments TMSC DSP. That on the left is the ASIC which provides the interface connecting the DSP and memory and manages internode communication over the four-dimensional network interconnecting the nodes. Five memory chips (which are not visible) are mounted on the opposite side of this . × . cm DIMM card
If the two-dimensional torus interconnection were simply expanded from the × mesh of the earlier machine, the result would be a very elongated mesh for even a K-node partition of a K-node machine. Since the difficulty of the typical lattice QCD calculation increases at least as fast as the seventh power of the linear size of the system under study, a factor-of-ten increase in computer power only allows the problem size to grow by the seventh root of ten, roughly a factor of 1.4, far smaller than the corresponding increase in the linear mesh size. Of course, the linear size of a mesh of , processors can easily be reduced by increasing the dimension of the mesh. For QCDSP a four-dimensional mesh was used, because this nicely matches the four-dimensional space-time mesh of sites used in lattice QCD. With such an interconnect and the onboard wiring adopted for QCDSP, , nodes can be configured as a × × × mesh, allowing the full processing power of such a ten times more powerful K-node partition to be applied to the same size problem as could be mounted on the earlier -node machine. The QCDOC network was further expanded to a six-dimensional mesh in order to allow a more flexible mapping of four- or five-dimensional problems onto a fixed partition geometry, as discussed below.
Networks
Typically such massively parallel machines require a variety of networks. In addition to the mesh network which provides high-bandwidth data communication, a combining and broadcast network is needed to perform global sums or to identify the largest or smallest value of a variable appearing across all nodes and then to distribute the result to all nodes. A third network, often distributing interrupt signals, is needed to implement barriers for synchronization among the nodes. This or a similar network is needed to announce errors and bring all nodes to a halt as quickly as possible so that a malfunctioning node can be easily identified. Finally, a control network is needed to boot the machine and to perform and report the results of power-on diagnostics. The network most important for high computational performance is the one transmitting data between nodes as the calculation progresses. As discussed above, for the QCDSP and QCDOC computers this network is a four- or six-dimensional toroidal mesh. A four-dimensional lattice QCD problem can be easily mapped onto the four-dimensional network of QCDSP. Each processor node in an Nx × Ny × Nz × Nt network can be assigned an nx × ny × nz × nt local volume, so the resulting space-time volume contained in the entire machine is an Nx ⋅ nx × Ny ⋅ ny × Nz ⋅ nz × Nt ⋅ nt volume assembled by juxtaposing these smaller local volumes in the obvious way. The standard interactions between neighboring sites in this Nx ⋅ nx × Ny ⋅ ny × Nz ⋅ nz × Nt ⋅ nt mesh of physical space-time points can then be accomplished either within a processing node or by communication between adjacent nodes in the Nx × Ny × Nz × Nt network of nodes. While this four-dimensional mesh worked very well, the rigid coupling between the problem size and the machine size required frequent reconfiguration of the QCDSP partitions as the size of the problem changed. This reconfiguration was accomplished by a physical reconnection of the communications cables, a significant impediment that often delayed making needed changes in problem size. In the QCDOC computer, this coupling between problem size and partition geometry was weakened by increasing the number of dimensions of the network from four to six.
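A minimal sketch in C of the assignment just described, mapping a global lattice site to its node coordinates and its position in that node's local volume; the struct and function names are illustrative and not part of the actual QCDSP or QCDOC system software.

```c
/* Map a global lattice site with coordinates (x,y,z,t) to the node holding
 * the nx*ny*nz*nt local volume that contains it, and to the site's position
 * inside that local volume. */
typedef struct { int node[4]; int local[4]; } site_map_t;

site_map_t map_site(const int global[4],  /* global site coordinates     */
                    const int n[4])       /* local volume nx, ny, nz, nt */
{
    site_map_t m;
    for (int mu = 0; mu < 4; mu++) {
        m.node[mu]  = global[mu] / n[mu]; /* which node along direction mu  */
        m.local[mu] = global[mu] % n[mu]; /* site within that node's block  */
    }
    return m;
}
```

With this assignment, nearest-neighbor sites either land on the same node or on nodes whose coordinates differ by one in exactly one direction, so their interactions require at most a single hop through the mesh network, as described above.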
For both QCDSP and QCDOC, each motherboard held nodes. For QCDSP these were wired as a small four-dimensional mesh, with the first dimension connected back onto itself to form a one-dimensional torus within the motherboard. The other six faces of this four-dimensional block were wired to six external connectors. Multiconductor ribbon cables then joined the corresponding faces of logically adjacent motherboards, connecting the other three, off-board dimensions of the four-dimensional torus. For QCDOC the nodes are interconnected on the motherboard as a six-dimensional mesh. For this machine a daughter card, shown in Fig. , contains two nodes, to accommodate the long dual inline memory modules (DIMMs) provided for each node. The two nodes on a daughter card are connected to form a two-node torus, and the other network connections are made to the motherboard. The motherboard is wired to join these daughter cards into a cube. Two additional dimensions of this cube are connected as two-node tori, and connections to the remaining six faces of the cube of nodes are carried to connectors leaving the motherboard. Thus, as in QCDSP, external cables join the motherboards to assemble a three-dimensional torus of motherboards. Since it is easy to map a lower-dimensional torus within one of higher dimension, the six-dimensional network of QCDOC allows a single QCDOC partition to support a greater variety of four-dimensional mesh geometries than can be accommodated in a single four-dimensional QCDSP partition. This is illustrated in Fig. , which shows the example of a two-dimensional network in which a subset of the links is used to realize a one-dimensional mesh. Used in this way, the three-dimensional × × QCDOC network that is wired on the motherboard can be combined with the more distributed Nx × Ny × Nz network connecting Nx ⋅ Ny ⋅ Nz motherboards in a variety of ways. For example, these extra eight processors can be wired into an added one-dimensional torus of size 8 in the time dimension, giving an Nx × Ny × Nz × 8 torus. Alternatively, only two of these dimensions could be used to form a one-dimensional torus of size 4, and the third dimension could be combined with the x-direction to double its length, giving a 2Nx × Ny × Nz × 4 torus.

QCDSP and QCDOC Computers. Fig. A single daughter board of the QCDOC computer. This card holds two QCDOC nodes (the two silver colored chips), an Ethernet five-port hub (the rectangular Altima chip visible on the left), four Ethernet "PHY" chips (identifiable as the four small square Altima chips) and two DIMM sticks which varied in size between and MB

QCDSP and QCDOC Computers. Fig. An example of mapping a one-dimensional -site torus into a two-dimensional × torus
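The kind of embedding shown in the figure can be written down directly. The following C sketch folds a one-dimensional torus of 2L sites into a 2 × L two-dimensional torus so that ring neighbors remain mesh neighbors; L and the coordinate convention are illustrative assumptions, not taken from the QCDOC system software.

```c
/* Fold a ring of 2*L sites into a 2 x L torus: the first half of the ring
 * runs along row y = 0, the second half runs back along row y = 1, so that
 * consecutive ring sites (including the wrap-around) are always mesh
 * neighbors. */
void ring_to_torus(int i, int L, int *x, int *y) {
    if (i < L) {                 /* sites 0 .. L-1          */
        *x = i;
        *y = 0;
    } else {                     /* sites L .. 2L-1, reversed */
        *x = 2 * L - 1 - i;
        *y = 1;
    }
}
```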
Neither QCDSP nor QCDOC has a separate combining and broadcast network. Instead, hardware features are added to the mesh communications network which identify particular paths through the mesh that can be used to combine or broadcast data. For example, additional hardware was provided that allows data passed from one or more neighboring nodes into a given node to be added or compared
to local data and then the sum, maximum, or minimum to be passed on with minimal latency to a specially designated neighbor. The QCDSP design had a separate interrupt network which allowed the implementation of barriers and the transmission of interrupts. For QCDOC this functionality is realized in two ways. First, a simple network of three interrupt lines was connected throughout the machine, which allows a signal sent by one processor to be seen by all others; this is used during boot-up and to signal fatal errors. A larger and more flexible system of interrupts is provided by the distribution of -bit interrupt packets over the six-dimensional mesh network. The networks used for booting, diagnostics, and file transfer for QCDSP and QCDOC were quite different. For QCDSP, a tree whose individual links used the commercial SCSI bus conventions connected the host workstation, typically a SUN, to each of the motherboards. For QCDOC, this host communication and control function is provided by standard Ethernet connections. In fact, each node has two Ethernet ports. The first is connected directly to the JTAG control of the processor, allowing JTAG commands embedded in Ethernet packets to load boot code and boot the QCDOC PowerPC processor. The second port is configured
as a standard Ethernet device, supporting conventional remote procedure call (RPC) communication with the host and a standard NFS client which runs on each node and provides access to networked disk servers. This Ethernet hardware was likely the greatest source of difficulty with the QCDOC design, with frequent hang-ups requiring elaborate software work-arounds.
Mechanical Design
As shown in Fig. , the individual QCDSP nodes were mounted on small printed circuit boards with the form factor of a standard DIMM card. This design proved very reliable. Sixty-three of these daughter cards were socketed on a QCDSP motherboard. The 64th node was soldered directly to the motherboard and supplied with added connections giving it more memory, access to two SCSI interface chips, and an EEPROM. This modular design allowed easy repair of faulty daughter cards and was continued for QCDOC. In this case two nodes are mounted on a daughter card, as shown in Fig. . A QCDOC motherboard is shown in Fig. . For both machines, eight motherboards are mounted in a backplane that makes up a crate, as shown for QCDOC in Fig. . Two crates are stacked to form a rack. The racks are water cooled, with a heat exchanger installed below each crate followed by a tray of muffin fans pushing the air upward.
QCDSP and QCDOC Computers. Fig. A QCDOC motherboard holding its daughter cards, with one daughter card removed. The three large edge connectors are wired directly through the backplane to three groups of eight cable connectors. Each cable is a bundle of twisted differential pairs, each of which provides unidirectional communication for the QCDOC mesh network
The return path for the air is provided inside the cabinet holding the two crates. The racks can then be lined up in a row with cables running from one rack to the next. If desired, these racks can also be stacked two high, with a final "top hat" added which contains the Ethernet switches. Figure shows the largest QCDOC installation at the Brookhaven National Laboratory. Both machines are remarkably low power, with a single QCDSP node consuming about W and a QCDOC node slightly more. Thus, QCDSP achieved about Gflops/kW while QCDOC achieves Gflops/kW. Both machines were also highly economical at the time they were finished. The price per sustained performance for QCDSP was $/Mflops, while for QCDOC this number falls to a little more than $/Mflops.
Software
The software for these two machines is composed of application code, which performs the lattice QCD calculation, and the operating system (OS), which boots the machine, responds to error conditions, loads and passes arguments to the application code, and provides network and file system support. On both machines the custom OS is divided between the nodes of the parallel machine and the host computer. At boot-up the OS on the host is active, loading the operating system on each node, carrying out extensive diagnostic tests of the parallel machine, and loading the user code. Control is then passed to the OS on each node, which starts the application code and provides the services requested by that code. The OS running on each node is designed to provide a "C" and "C++" environment for the application code, with much of the standard UNIX functionality supported. To the extent that resources beyond those available on the node are required, for example data from a remote disk file or the ability to print to the user's screen, the node component of the OS requests these services from the OS running on the host computer. Now the host component of the OS acts as the slave to the OS running on each node. The simple and lean structure of the node OS is critical to the high performance of these machines. No multitasking or asynchronous behavior interferes with the application code on any of the nodes. The node OS ceases to execute when the application code starts running.
QCDSP and QCDOC Computers. Fig. Front and rear views of a QCDOC crate. The bundles of black cables carry the mesh communications while the gray cables leaving the lower right provide the Ethernet connections to the eight motherboards
QCDSP and QCDOC Computers. Fig. The QCDOC installation at the Brookhaven National Laboratory (BNL). The RIKEN-funded RBRC QCDOC makes up the racks on the right while the DOE-funded USQCD QCDOC computer appears on the left. Each machine contains , nodes, has a peak speed of Tflops and a sustained speed in excess of three Tflops for lattice QCD
Both machines were typically used in a mode where the same code is executing on each node. Since there was no requirement for precise synchronization, the programs on different nodes can execute different branches. However, if one node runs ahead of the rest and begins communications with its slower neighbors, that faster node will wait until needed data is sent by its neighbors or the data it has sent to its neighbors has been removed from the neighbor’s input buffer and this action acknowledged so that further data can be sent. This self-synchronization provided by the communications hardware is automatic and requires no action from the user program. The QCDSP DSP chip came with “C” and “C++” compilers although the “C++” compiler quickly became
out-of-date. Given the lack of a data cache, standard “C” code gave poor performance (typically a few percent of peak) so that frequently used routines had to ultimately be written in assembler. The PowerPC core in QCDOC is much more efficient with its KB instruction and data caches. With reasonable precautions “C” code might achieve or even % of peak. The most critical code was best written in assembler, a process greatly aided by the Boyle assembler generation library (BAGEL) which was used to generate much of the assembler used in the QCD code. Such single node code can be run directly on QCDSP or QCDOC machines with thousands of nodes if this many-node environment is kept in mind when that code is written. For example, if data is sent to a
neighboring node in the +μ direction, then the corresponding action of the neighbor in the −μ direction must be anticipated and a matching −μ direction receive instruction invoked as well. If a sum across all lattice sites is to be performed, this must be done by a call to a global sum routine which returns the sum performed across all nodes in the machine. The sends and receives of data in a particular direction are initiated by “C” function calls which begin the transmissions and test to see if they are complete. On QCDOC in addition to these special function calls, internode communications can also be accomplished by using the quantum message passing (QMP) library. This software was produced with SciDAC support by the USQCD collaboration. QMP provides communication routines which overlap in functionality with the standard message passing interface (MPI). Code which incorporates QMP calls can be run on many platforms including workstation clusters and other massively parallel supercomputers allowing portable code to be run on QCDOC. This QMP support was highly successful and resulted in most of the code used by groups in the USA studying lattice QCD being run on the QCDOC machines at BNL.
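The send/receive pairing and global sum described above can be sketched with standard MPI, whose functionality QMP overlaps; the following fragment is illustrative only (the 4-D process grid setup, buffer sizes, and tags are assumptions, not the QCDSP/QCDOC system calls).

/* Sketch of the nearest-neighbor exchange and global sum pattern described
 * above, written against standard MPI as a stand-in for the QMP calls. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[4] = {0, 0, 0, 0}, periods[4] = {1, 1, 1, 1};
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 4, dims);          /* 4-D torus of processes */

    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 4, dims, periods, 1, &grid);

    double send_buf[16], recv_buf[16];          /* surface data, size assumed */
    for (int i = 0; i < 16; i++) send_buf[i] = 1.0;

    /* A send in the +mu direction must be matched by a receive from -mu on
     * the neighboring node; MPI_Sendrecv expresses both halves at once. */
    for (int mu = 0; mu < 4; mu++) {
        int minus, plus;
        MPI_Cart_shift(grid, mu, 1, &minus, &plus);
        MPI_Sendrecv(send_buf, 16, MPI_DOUBLE, plus,  mu,
                     recv_buf, 16, MPI_DOUBLE, minus, mu,
                     grid, MPI_STATUS_IGNORE);
    }

    /* A sum over all lattice sites becomes a call to a global sum routine. */
    double local = 42.0, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, grid);

    MPI_Finalize();
    return 0;
}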
QCDSP and QCDOC Computers. Fig. A block diagram showing a QCDSP node: a TMS320C31-50 DSP, a custom gate array, and 512K × 40-bit DRAM, with serial links in the ±x, ±y, ±z, and ±t directions. The central block is the node gate array (NGA) which provided the interface between the upper DSP chip and the lower DRAM chips. It also provided serial communications to each of the eight nearest neighbors connected to this node by the four-dimensional QCDSP mesh network
QCDSP Architecture A mesh architecture permits essentially all of the computing hardware of the parallel computer (with the exception of the network wires and perhaps some transceivers) to be built into the computing node. This was realized in the QCDSP machines. Essentially all of the computer’s functions were carried out by the three basic components on the daughter card shown in Fig. . Thus, the remaining parts of the computer, the motherboard and backplane, were to a large degree passive, providing a clock signal and power to the daughter cards and routing the communication wires. This concentration of functionality on the QCDSP daughter board was achieved with the custom-designed ASIC (referred to here as the “node gate array” or NGA) that was the third component, beyond the DSP and memory, which was mounted on the daughter board. Figure shows the organization of the daughter card. The NGA provided the interface between the DSP and
the dynamic random access memory (DRAM), supplying the periodic refresh signal and error checking and correction (ECC) required by this dense, economical memory technology. The NGA also contained a -location circular buffer which prefetched and stored DRAM data according to commands from the DSP. Thus, when the DSP began to multiply an -element link matrix by a component fermion spinor, the circular buffer could be instructed by the DSP to load the next -bit data word from memory followed by the additional sequential locations corresponding to the link matrix. These data were held in the circular buffer, available for later, low-latency DSP access. The NGA also sent and received data to and from the eight neighbors of the node. It managed the low-level protocol needed to serialize and transmit the data to be sent, or receive and deserialize incoming data. The NGA would also send an acknowledgment packet after bits
of data had been received and the receive buffer had been cleared so that a further -bits could be received. Received data with bad parity was acknowledged with an error packet and would be retransmitted. The data being sent or received was fetched or stored directly from/to memory by an independent, direct memory access (DMA) controller, one for each of the eight directions. The DSP needed only to load the starting address for such a transfer and a start command in order to begin the transfer. Additional preloaded control bits determined how much data was to be transferred and whether the data should be fetched from or stored to memory as a series of blocks of fixed length separated by a fixed stride. Thus, the DSP could quickly start such a data transfer and then resume calculation using data not involved in the transfer. All eight directions could be active simultaneously at MHz for a total off-node transfer rate of MB/s.
QCDOC Architecture
The QCDOC architecture is a natural evolution of that described above for QCDSP, exploiting the much higher degree of integration that was available in when the QCDOC design was begun. Here the PowerPC processor is an IBM ASIC library component and was combined with the serial communications unit (SCU) into the single chip that makes up nearly all of a QCDOC node. The entire MB memory of the QCDSP chip was doubled and included in the QCDOC chip as embedded DRAM (EDRAM), again exploiting IBM technology. In addition two Ethernet controllers, a synchronous DRAM controller, and an . Gflops double precision attached floating point unit were incorporated in this QCDOC chip as shown in Fig. . These improvements give the QCDOC machine more than ten times the capability of QCDSP. The processor clock has increased from to MHz and the peak performance from to Mflops. The memory bandwidth has grown from MB/s to . GB/s while the total off-node bandwidth increased from . to . GB/s for send and for receive. Finally the limited DSP with its non-IEEE compliant single precision has been replaced by the IEEE, -bit precision of the IBM FPU, making QCDOC a highly capable and convenient-to-program computer.
Conclusion
Both the QCDSP and QCDOC machines have been used for a wide range of influential physics calculations. The large QCDSP machines at Columbia and the RBRC began operation in and continued until January , . A smaller -crate machine was also installed at the Supercomputer Computations Research Institute of Florida State University. The first ,-node QCDOC machine began operation in December of at the University of Edinburgh, which was followed in a few months by the RBRC and DOE ,-node machines at BNL. The two large machines at BNL are still operating at full capacity at the time of this writing (May ). The QCDSP machines were used to develop and exploit the chiral, domain wall fermion (DWF) formalism [, ]. This approach to lattice calculations with spin-/ particles preserves the independent symmetry of the left- and right-handed quarks in the limit of vanishing quark mass by adding a fifth dimension to the usual four-dimensional space-time lattice. The substantial increase in capability provided by QCDSP allowed the study of these new five-dimensional lattices with lattice extent varying between and in this fifth dimension. Among the more notable results achieved with these machines were a thorough quenched study of low-energy QCD using DWF [], the application of this new method to the weak decays of the K meson including results for the complex, CP violating amplitudes A and A [], the successful application of RI/MOM techniques to non-perturbatively renormalize the seven four-Fermi operators which appear in this calculation [, ], and the development of methods to include the effects of DWF quark loops in lattice QCD calculations at both zero [] and nonzero [] temperatures. This inclusion of quark loops, a step necessary for consistency, was at the limit of the capabilities of QCDSP. However, this preliminary work set the stage for much larger scale calculations enabled by the QCDOC machines. Thus, the RIKEN, BNL, Columbia (RBC) and UKQCD collaborations, which had worked together on the design and construction of QCDOC, began a joint program of large-scale DWF calculations using the completed QCDOC computers to study a wide range of questions important in particle and nuclear physics on × and × grids. Among
QCDSP and QCDOC Computers. Fig. A block diagram of the ASIC which makes up each QCDOC node, a complete processor node on a single QCDOC chip: a 440 core with a 1 Gflops double-precision FPU, 4 MB of embedded DRAM and EDRAM controller (2.6 GB/sec EDRAM/SDRAM DMA, 8 GB/sec memory/processor bandwidth), a DDR SDRAM controller (2.6 GB/sec interface to external memory), a serial communication unit (SCU) driving 24 off-node links with 12 Gbit/sec bandwidth through 24 link-DMA channels, 100 Mbit/sec fast Ethernet (EMAC3) with a bootable Ethernet-JTAG interface, PLB and OPB buses, an interrupt controller, GPIO, and boot/clock support. The cross-hatched blocks are those designed by the team which built the computer while the open boxes represent standard IBM ASIC library components
the many results obtained one might note those for the neutral K meson mixing parameter BK [], the nucleon axial coupling gA [] and the decay amplitude f+ of a K meson into a pion and lepton pair [], to mention a few examples. In addition a wide range of topics have been studied by members of the USQCD collaboration using the DOE-funded QCDOC. The QCDSP and QCDOC machines have also exerted influence on high-performance computing technology. The award of the Gordon Bell prize for cost performance at SC to the BNL QCDSP machine helped attract attention to this economical, low power, and yet broadly applicable architecture. This visibility, together with the recruiting by IBM of four physicists closely involved with QCDSP and earlier Columbia University machines, led to a collaboration between IBM, Columbia University, Edinburgh University, and the RIKEN laboratory in Japan on the QCDOC computer and, at the same time, the start of the separate Blue Gene Light project at IBM. Blue Gene Light evolved into the present IBM Blue Gene series of commercial computers which have transformed high-performance computing and dominated the top list.
Related Entries
QCD apeNeXT Machines
QCD (Quantum Chromodynamics) Computations
QCD Machines
Bibliographic Notes and Further Reading
A thorough account of these two computer architectures can be found in Ref. [], written after both machines were complete. Additional information about QCDSP appears in Ref. []. Earlier accounts of the design and construction of these computers appear in the proceedings of the annual International Symposia on Lattice Field Theory and other HPC conferences. For QCDSP see Refs. [–] and for QCDOC Refs. [–].
Bibliography . Kaplan DB () A method for simulating chiral fermions on the lattice. Phys Lett B:– [hep-lat/] . Shamir Y () Chiral fermions from lattice boundaries. Nucl Phys B:– [hep-lat/] . Blum T et al () Quenched lattice qcd with domain wall fermions and the chiral limit. Phys Rev D: [heplat/]
. RBC Collaboration, Blum T et al () Kaon matrix elements and cp-violation from quenched lattice qcd. i: the -flavor case. Phys Rev D: [hep-lat/] . Blum T et al () Non-perturbative renormalisation of domain wall fermions: quark bilinears. Phys Rev D: [heplat/] . Aoki Y et al () Lattice QCD with two dynamical flavors of domain wall fermions. Phys Rev D: [hep-lat/] . Chen P et al () The finite temperature qcd phase transition with domain wall fermions. Phys Rev D:, [arXiv:heplat/]. . RBC Collaboration, Antonio DJ et al () Neutral kaon mixing from + flavor domain wall QCD. Phys Rev Lett : [hep-ph/] . RBC+UKQCD Collaboration, Yamazaki T et al () Nucleon axial charge in + flavor dynamical lattice QCD with domain wall fermions. Phys Rev Lett : [.] . Boyle PA et al () KI semileptonic form factor from + flavour lattice QCD. Phys Rev Lett : [.] . Boyle P et al () Overview of the qcdsp and qcdoc computers. IBM Res J :– . Mawhinney RD () The teraflops qcdsp computer. Parallel Comput : [hep-lat/] . Arsenin I et al () A .-teraflops machine optimized for lattice QCD. Nucl Phys Proc Suppl :– . Chen D et al () The qcdsp project: a status report. Nucl Phys Proc Suppl A:– . Chen D et al () Status of the qcdsp project. Nucl Phys Proc Suppl : [hep-lat/] . Christ NH () Computers for lattice qcd. Nucl Phys Proc Suppl :– [hep-lat/] . Chen D et al () QCDOC: a -teraflops scale computer for lattice QCD. Nucl Phys Proc Suppl :– [hep-lat/] . Boyle PA et al () Status of the QCDOC project. Nucl Phys Proc Suppl :– [hep-lat/] . Boyle PA et al () Status of and performance estimates for QCDOC. Nucl Phys Proc Suppl :– [hep-lat/] . QCDOC Collaboration, Boyle PA, Jung C, Wettig T () The qcdoc supercomputer: hardware, software, and performance. ECONF C:THIT [hep-lat/] . Boyle PA et al () Hardware and software status of QCDOC. Nucl Phys Proc Suppl :– [hep-lat/] . Boyle PA et al () Qcdoc: project status and first results. J Phys Conf Ser :–
QsNet Quadrics
Quadrics Salvador Coll Universidad Politécnica de Valencia, Valencia, Spain
Synonyms QsNet
Definition Quadrics was a company that produced hardware and software for high-performance interconnection networks used in cluster-based supercomputers. Their main product was QsNet, a high-speed interconnection network that was referred to as Quadrics interconnection network. Quadrics, and QsNet, played a significant role in the supercomputing field from , when the company was created, until , when it was officially closed.
Discussion
Introduction
Quadrics was created in as a subsidiary of Finmeccanica, a large Italian industrial group, under the name Quadrics Supercomputers World (QSW). The new company inherited the architecture of the Quadrics parallel computer and the Meiko CS- supercomputer []. The Quadrics computer was originally developed at the Istituto Nazionale di Fisica Nucleare (INFN) and later commercialized by Alenia Spazio (another Finmeccanica subsidiary). The Meiko CS- was a massively parallel supercomputer based on Sun or Fujitsu processors connected by a multistage fat-tree network. The network was implemented by two hardware building blocks, a programmable network interface called Elan and an -way crossbar switch called Elite. That technology was later developed into QsNet by QSW, whose name was shortened to simply Quadrics on . QsNet proved very successful, reaching its peak on June when six out of the ten fastest supercomputers in the world used the Quadrics interconnect. During , the second generation of the Quadrics network, QsNetII, was deployed. Finally, very close to the completion of the third-generation Quadrics network, QsNetIII, the company was closed on June and the support was transferred to Vega UK Ltd.
The Quadrics Interconnection Network
The Quadrics network surpassed contemporary high-speed interconnects – such as Gigabit Ethernet [], GigaNet [], the Scalable Coherent Interface (SCI) [], the Gigabyte System Network (HiPPI-) [], and Myrinet [] – in functionality with an approach that integrated a node's local virtual memory into a globally shared, virtual-memory space; provided a programmable processor in the network interface that allowed the implementation of intelligent communication protocols; delivered integrated network fault detection and fault tolerance; and supported collective communication operations at hardware level. The Quadrics interconnection network is a bidirectional multistage interconnection network with a connection pattern among stages known as butterfly. It is based on × switches, and can be viewed as a quaternary fat tree []. QsNet is based on two custom ASICs: a communication coprocessor called Elan, integrated into a network interface card, and a high-bandwidth, low-latency communication switch called Elite. It uses wormhole switching with two virtual channels per physical link, source-based routing, and adaptive routing. The main specifications of the two commercialized QsNet generations are summarized in Table for the network interface card, and in Table for the network switches. QsNet connects Elite switches in a quaternary fat-tree topology, which belongs to the more general class of k-ary n-trees [, ]. A quaternary fat-tree of dimension n is composed of 4^n processing nodes and n · 4^{n-1} switches interconnected as a butterfly network (see below); it can be recursively built by connecting four quaternary fat trees of dimension n − 1.
Fat-Tree (k-ary n-tree)
A k-ary n-tree is composed of two types of vertices: N = k^n processing nodes and n · k^{n-1} k × k communication switches (a k-ary n-tree of dimension n = 0 is composed of a single processing node). Each node is an n-tuple in {0, 1, ..., k − 1}^n, while each switch is defined as an ordered pair (w, l), where w ∈ {0, 1, ..., k − 1}^{n-1} and l ∈ {0, 1, ..., n − 1}.
● Two switches (w_0, w_1, ..., w_{n-2}, l) and (w′_0, w′_1, ..., w′_{n-2}, l′) are connected by an edge if and only if l′ = l + 1 and w_i = w′_i for all i ≠ l. The edge is labeled with w′_l on the level l vertex and with w_l on the level l′ vertex.
Quadrics. Table Quadrics NIC specifications
QsNet: -bit RISC thread processor; -bit virtual addressing; Kbytes data cache; Mbytes ECC SDRAM; PCI bit/ MHz; × MHz byte-wide links
QsNetII: -bit RISC thread processor; -bit virtual addressing; Kbytes data cache; Mbytes DDR SDRAM; PCI-X bit/ MHz; × . GHz byte-wide links

Quadrics. Table Quadrics switch specifications
QsNet: links × virtual channels; ns unblocked latency
QsNetII: links × virtual channels; ns unblocked latency
● There is an edge between the switch (w_0, w_1, ..., w_{n-2}, n − 1) and the processing node p_0, p_1, ..., p_{n-1} if and only if w_i = p_i for all i ∈ {0, 1, ..., n − 2}. This edge is labeled with p_{n-1} on the level n − 1 switch.
The labeling scheme shown makes the k-ary n-tree a delta network [, ]: any path starting from a level 0 switch and leading to a given node p_0, p_1, ..., p_{n-1} traverses the same sequence of edge labels (p_0, p_1, ..., p_{n-1}). An example of such labeling is shown in Fig. , for a 64-node QsNet network, that is, a 4-ary 3-tree. The above-defined connection pattern among levels is also referred to as butterfly.
Quadrics. Fig. 4-ary 3-tree node, switch, and edge labels
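As a small illustration of the definition above (not part of the original description), the following C fragment computes the node and switch counts for a k-ary n-tree and walks the delta-network edge labels down to a node; the destination tuple is arbitrary.

/* Illustration of the k-ary n-tree definition: node and switch counts, and
 * the delta-network property that the path from any level-0 switch down to
 * node (p_0, ..., p_{n-1}) traverses the edge labels p_0, ..., p_{n-1}. */
#include <stdio.h>

static long ipow(long b, int e) { long r = 1; while (e-- > 0) r *= b; return r; }

int main(void)
{
    int k = 4, n = 3;                           /* the 4-ary 3-tree of the figure */
    long nodes    = ipow(k, n);                 /* k^n processing nodes           */
    long switches = (long)n * ipow(k, n - 1);   /* n * k^(n-1) switches           */
    printf("%d-ary %d-tree: %ld nodes, %ld switches\n", k, n, nodes, switches);

    int p[3] = {2, 3, 1};                       /* destination node digits        */

    /* Descending from level 0 to level n-1, the switch at level l forwards
     * the packet on the output edge labeled p[l]. */
    for (int l = 0; l < n; l++)
        printf("level %d switch: take edge labeled %d\n", l, p[l]);
    return 0;
}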
First Generation, QsNet
The Elan network interface connects the high-performance, multistage Quadrics network to a processing node containing one or more CPUs. In addition to generating and accepting packets to and from the network, the Elan provides substantial local processing power to implement high-level, message-passing protocols such as MPI. The internal functional structure of the Elan, shown in Fig. , centers around two primary processing engines: the microcode processor and the thread processor. The -bit microcode processor supports four threads of execution, where each thread can independently issue pipelined memory requests to the memory system. Up to eight requests can be outstanding at
any given time. The scheduling for the code processor enables a thread to wake up, schedule a new memory access on the result of a previous memory access, and go back to sleep in as few as two system-clock cycles. The four code threads are for:
● The inputter, which handles input transactions from the network
● The DMA engine, which generates DMA packets to write to the network, prioritizes outstanding DMAs, and time slices large DMAs to prevent adverse blocking of small DMAs
● Processor scheduling, which prioritizes and controls the thread processor's scheduling and descheduling
● Command processing, which handles requested operations (commands) from the host processor at the user level
The thread processor is a -bit RISC processor that helps implement higher-level messaging libraries without explicit intervention from the main CPU. To better support such libraries, QsNet augments the instruction set with extra instructions that help construct network packets, manipulate events, efficiently schedule threads, and block save and restore a thread's state when scheduling. The memory management unit (MMU) translates -bit virtual addresses into either -bit local SDRAM physical addresses or -bit peripheral component interconnect (PCI) physical addresses. To translate these addresses, the MMU contains a -entry, fully
associative, translation look-aside buffer (TLB) and a small data-path and state machine used to perform table walks to fill the TLB and save trap information when the MMU experiences a fault. The Elan contains routing tables that translate every virtual processor number into a sequence of tags that determine the network route. Elan has an -Kbyte memory cache (organized as sets of Kbytes) and -Mbyte SDRAM memory. The cache line size is bytes. The cache performs pipelined fills from SDRAM and can issue multiple cache fills and write backs for different units while still servicing accesses for units that hit in the cache. The SDRAM interface is bits in length with eight check bits added to provide error-correcting code. The memory interface also contains a -byte write buffer and read buffers. The link logic transmits and receives data from the network and generates bits and a clock signal on each half of the clock cycle. Each link provides buffer space for two virtual channels with a -entry, -bit FIFO RAM for flow control. The Elite provides the following features:
● Eight bidirectional links supporting two virtual channels in each direction
● An internal × full crossbar switch (the crossbar has two input ports for each input link to accommodate two virtual channels)
● Packet error detection and recovery with cyclic-redundancy-check-protected routing and data transactions
● Two priority levels combined with an aging mechanism to ensure fair delivery of packets in the same priority level
Quadrics. Fig. Elan functional units: link multiplexer and FIFOs, microcode processor, thread processor, DMA buffers, inputter, SDRAM interface, MMU and TLB with 4-way set associative cache, table walk engine, clock and statistics registers, and 66 MHz PCI interface
● Hardware support for collective communication
● Adaptive routing
Multiple Network Rails
One novel solution exploited by Quadrics-based systems to the issues of limited bandwidth availability in network connections, and of fault tolerance, is the use of multiple independent networks (also known as rails) []. Support for multirail networks has been included for all the QsNet products. In a multirail system, the network is replicated. In this way, each node can be connected to two or more networks through multiple network interface cards.
Packet Routing and Flow Control
Each user- and system-level message is chunked into a sequence of packets by the Elan. An Elan packet contains three main components. The packet starts with (1) the routing information, which determines how the packet will reach the destination. This information is followed by (2) one or more transactions consisting of some header information, a remote memory address, the context identifier, and a chunk of data, which can
Quadrics. Fig. Packet transaction format: routing tags with CRC, followed by one or more transactions (each with a packet header containing the transaction type and context, a memory address, data, and a CRC), and terminated by an EOP token
be up to bytes in the current implementation. The packet is terminated by (3) an end-of-packet (EOP) token, as shown in Fig. . Transactions fall into two categories: write block transactions and non-write block transactions. The purpose of a write block transaction is to write a block of data from the source node to the destination node, using the destination address contained in the transaction immediately before the data. A DMA operation is implemented as a sequence of write block transactions, partitioned into one or more packets (a packet normally contains five write block transactions of bytes each, for a total of bytes of data payload per packet). The non-write block transactions implement a family of relatively low-level communication and synchronization primitives. For example, non-write block transactions can atomically perform remote test-and-write or fetch-and-add and return the result of the remote operation to the source, and can be used as building blocks for more sophisticated distributed algorithms. Elite networks are source routed. The Elan network interface, which resides in the network node, attaches route information to the packet header before injecting the packet into the network. The route information is a sequence of Elite link tags. As the packet moves inside the network, each Elite switch removes the first route tag from the header and forwards the packet to the next
Elite switch in the route or to the final destination. The routing tag can identify either a single link (used for point-to-point communication) or a group of adjacent links (used for collective communication). The Elan interface pipelines each packet transmission into the network using wormhole flow control. At the link level, the Elan interface partitions each packet into smaller -bit units called flow control digits or flits []. Every packet closes with an end-of-packet token, but the source Elan normally only sends the end-of-packet token after receipt of a packet acknowledgment token. This process implies that every packet transmission creates a virtual circuit between source and destination. It is worth noting that both acknowledgment and EOP can be tagged to communicate control information. So, for example, the destination can notify the successful completion of a remote non-write block transaction without explicitly sending an extra packet. Minimal routing between any pair of nodes is accomplished by sending the message to one of the nearest common ancestor switches and from there to the destination []. In this way, each packet experiences two routing phases: an adaptive ascending phase (in forward direction) to get to a nearest common ancestor, where the switches forward the packet through the least loaded link; and a deterministic descending phase (in backward direction) to the destination.
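A minimal sketch of the two routing phases, assuming the k-ary n-tree labeling defined earlier; the example source and destination tuples are made up, and the adaptive choice of up link is only modeled, not implemented.

/* Two-phase fat-tree route: adaptive ascent to a nearest common ancestor,
 * then deterministic descent using the destination digits as edge labels. */
#include <stdio.h>

#define N 3   /* tree dimension; digit 0 is the most significant */

/* Level of the nearest common ancestor switches: the number of leading
 * digits shared by source and destination. */
static int nca_level(const int *src, const int *dst)
{
    int l = 0;
    while (l < N && src[l] == dst[l]) l++;
    return l;
}

int main(void)
{
    int src[N] = {0, 1, 2};
    int dst[N] = {0, 3, 1};

    int up_to = nca_level(src, dst);
    if (up_to == N) {                    /* same node: nothing to route */
        printf("source and destination coincide\n");
        return 0;
    }
    printf("ascend adaptively from the level %d switch up to a level-%d switch\n",
           N - 1, up_to);
    for (int l = up_to; l < N; l++)      /* deterministic descending phase */
        printf("descend: level %d switch takes edge labeled %d\n", l, dst[l]);
    return 0;
}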
Network nodes can send packets to multiple destinations using the network’s broadcast capability. For successful broadcast packet delivery, the source node must receive a positive acknowledgment from all the broadcast group recipients. All Elan interfaces connected to the network can receive the broadcast packet but, if desired, the sender can limit the broadcast set to a subset of physically contiguous Elans.
Global Virtual Memory
Elan can transfer information directly between the address spaces of groups of cooperating processes while maintaining hardware protection between these process groups. This capability (called virtual operation) is a sophisticated extension to the conventional virtual memory mechanism that is based on two concepts: Elan virtual memory and Elan context.
Elan virtual memory. Elan contains an MMU to translate the virtual memory addresses issued by the various on-chip functional units (thread processor, DMA engine, and so on) into physical addresses. These physical memory addresses can refer to either Elan local memory (SDRAM) or the node's main memory. To support main memory accesses, the configuration tables for the Elan MMU are synchronized with the main processor's MMU tables so that Elan can access its virtual address space. The system software is responsible for MMU table synchronization and is invisible to programmers. The Elan MMU can translate between virtual addresses in the main processor format (e.g., a 64-bit word, big-endian architecture, such as that of the AlphaServer) and virtual addresses written in the Elan format (a 32-bit word, little-endian architecture). A processor with a 32-bit architecture (e.g., an Intel Pentium) requires only one-to-one mapping. Figure shows a 64-bit processor mapping. The 64-bit addresses starting at 0x1FF0C808000 are mapped to the Elan's 32-bit addresses starting at 0xC808000. This means that the main processor can directly access virtual addresses in the range 0x1FF0C808000 to 0x1FFFFFFFFFF, and Elan can access the same memory with addresses in the 0xC808000 to 0xFFFFFFFF range. In our example, the user can allocate main memory using malloc, and the process heap can grow outside the region directly accessible by the Elan, which is delimited by 0x1FFFFFFFFFF. To avoid this problem, both the main and Elan memory
can be allocated using a consistent memory allocation mechanism. As shown in Fig. , the MMU tables can map a common region of virtual memory called the memory allocator heap. The allocator maps, on demand, physical pages (of either main or Elan memory) into this virtual address range. Thus, using allocation functions provided by the Elan library, the user can allocate portions of virtual memory either from main or Elan memory, and the main processor and Elan MMUs can be kept consistent. Elan context. In a conventional virtual-memory system, each user process has an assigned process identification number that selects the MMU table set and, therefore, the physical address spaces accessible to the user process. QsNet extends this concept so that the user address spaces in a parallel program can intersect. Elan replaces the process identification number value with a context value. User processes can directly access an exported segment of remote memory using a context value and a virtual address. Furthermore, the context value also determines which remote processes can access the address space via the Elan network and where those processes reside. If the user process is multithreaded, the threads will share the same context just as they share the same main-memory address space. If the node has multiple physical CPUs, then different CPUs can execute the individual threads. However, the threads will still share the same context.
Network Fault Detection and Fault Tolerance
QsNet implements network fault detection and tolerance in hardware. (It is important to note that this fault detection and tolerance occurs between two communicating Elans.) Under normal operation, the source Elan transmits a packet (i.e., route information for source routing followed by one or more transactions). When the receiver in the destination Elan receives a transaction with an ACK Now flag, it means that this transaction is the last one for the packet. The destination Elan then sends a packet acknowledgment token back to the source Elan. Only when the source Elan receives the packet acknowledgment token does it send an end-of-packet token to indicate the packet transfer's completion. The fundamental rule of Elan network operation
Quadrics. Fig. Virtual address translation between the main processor virtual address space (stack, text, data, bss, and heap, with a memory allocator heap at 0x1FF0C808000) and the Elan virtual address space (memory allocator heap at 0xC808000), backed by main memory and Elan SDRAM (200 MB regions are shown in the figure). The dotted lines in the figure signify that a segment of memory from one address space maps onto an equally sized segment of memory in another address space
is that for every packet sent down a link, an Elan interface returns a single packet-acknowledgment token. The network will not reuse the link until the destination Elan sends such a token. If an Elan detects an error during a packet transmission over QsNet, it immediately sends an error message without waiting for a packet-acknowledgment token. If an Elite detects an error, it automatically transmits an error message back to the source and the destination. During this process, the source and destination Elans and the Elites between them isolate the faulty link and/or switch via per-hop fault detection []; the
source receives notification about the faulty component and can retry the packet transmission a default number of times. If this is unsuccessful, the source can appropriately reconfigure its routing tables to avoid the faulty component.
Communication Libraries
With respect to software, QsNet provides several layers of communication libraries that trade off performance with machine independence and programmability. Figure shows the different programming libraries for the Elan network interface. Elan3lib provides the
Quadrics. Fig. Elan programming libraries: user applications use shmem and MPI over Elanlib (with tports) and Elan3lib in user space, with system calls into the Elan kernel communications layer in kernel space
lowest-level, user-space programming interface to the Elan. At this level, processes in a parallel job can communicate through an abstraction of distributed, virtual, shared memory. Each process in a parallel job is allocated a virtual process identification (VPID) number and can map a portion of its address space into an Elan. These address spaces, taken in combination, constitute a distributed, virtual, shared memory. A combination of a VPID and a virtual address can provide an address for remote memory (i.e., memory on another processing node belonging to a process). The system software and the Elan hardware use the VPID to locate the Elan context when they perform a remote communication. Since Elan has its own MMU, a process can select which part of its address space should be visible across the network, determine specific access rights (e.g., write or read only), and select the set of potential communication partners. Elanlib is a higher-level interface that releases the programmer from the revision-dependent details of Elan and extends Elan3lib with collective communication primitives and point-to-point, tagged message-passing primitives (called tagged message ports or Tports). Standard communication libraries such as that of the MPI- standard [, ] or Cray SHMEM are implemented on top of Elanlib.
Elan3lib
The Elan3lib library supports a programming environment where groups of cooperating processes can transfer data directly, while protecting process groups from each other in hardware. The communication takes place at the user level, with no copy, bypassing the operating system. The main features of Elan3lib are the memory
mapping and allocation scheme (described previously), event notification, and remote DMA transfers. Events provide a general-purpose mechanism for processes to synchronize their actions. Threads running on Elan and processes running on the main processor can use this mechanism. Processes, threads, packets, etc. can access events both locally and remotely. In this way, intranetwork synchronization of processes is possible, and events can indicate the end of a communication operation, such as the completion of a remote DMA. QsNet stores events in Elan memory to guarantee atomic execution of the synchronization primitives. (The current PCI bus implementations cannot guarantee atomic execution, so it is not possible to store events in main memory). Processes can wait for an event by busy-waiting, or polling. In addition, processes can tag an event as block copy. The block-copy mechanism works as follows: A process can initialize a block of data in Elan memory to hold a predefined value. An equivalent-sized block is located in main memory, and both blocks are in the user’s virtual address space. When the specified event is set, for example when a DMA transfer has completed, a block copy takes place. That is, the hardware in the Elan, the DMA engine, copies the block in Elan memory to the block in main memory. The user process polls the block in main memory to check its value (by, e.g., bringing a copy of the corresponding memory block into the level-two cache) without polling for this information across the PCI bus. When the value is the same as that initialized in the source block, the process knows that the specified event has occurred. The Elan supports remote DMA transfers across the network, without any copying, buffering, or operating system intervention. The process that initiates
the DMA fills out a DMA descriptor, which is typically allocated on the Elan memory for efficiency. The DMA descriptor contains source and destination process VPIDs, the amount of data, source and destination addresses, two event locations (one for the source and the other for the destination process), and other information that enhances fault tolerance. Figure outlines the typical steps of remote DMA. The command processor referred to in the figure is an Elan microcode thread that processes user commands; it is not a specific microprocessor.
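A hypothetical C rendering of the descriptor fields listed above may make the structure easier to see; the real Elan descriptor layout is not reproduced here, and all field names and types are illustrative only.

/* Hypothetical sketch of the fields a remote-DMA descriptor carries,
 * following the description above; not the actual Elan data structure. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef struct dma_desc_sketch {
    uint32_t src_vpid;    /* virtual process id of the initiating process */
    uint32_t dst_vpid;    /* virtual process id of the target process     */
    size_t   nbytes;      /* amount of data to transfer                   */
    uint64_t src_addr;    /* source virtual address                       */
    uint64_t dst_addr;    /* destination virtual address                  */
    uint64_t src_event;   /* event location notified at the source        */
    uint64_t dst_event;   /* event location notified at the destination   */
    uint32_t retry_info;  /* additional fault-tolerance information       */
} dma_desc_sketch;

int main(void)
{
    printf("descriptor sketch occupies %zu bytes\n", sizeof(dma_desc_sketch));
    return 0;
}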
Elanlib
Elanlib is a machine-independent library that integrates the main features of Elan3lib with higher-level collective communication primitives and Tports. Tports provide basic mechanisms for point-to-point message passing. Senders can label each message with a tag, sender identity, and message size. This information is known as the envelope. Receivers can receive their messages selectively, filtering them according to the sender's identity and/or a tag on the envelope. The Tports layer handles communication via shared memory for processes on the same node. The Tports programming interface is very similar to that of MPI. Tports implement message sends (and receives) with two distinct function calls: a non-blocking send that posts and performs the message communication, and a blocking send that waits until the matching start send is completed, allowing implementation of different flavors of higher-level communication primitives. Tports can deliver messages synchronously and asynchronously. They transfer synchronous messages from sender to receiver with no intermediate system buffering; the message does not leave the sender until the receiver requests it. QsNet copies asynchronous messages directly to the receiver's buffers if the receiver has requested them. If the receiver has not requested them, it copies asynchronous messages into a system buffer at the destination.
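The envelope matching described above is close to MPI point-to-point semantics; the following MPI fragment is offered only as an analogy for the Tports behavior and is not the Tports API itself (ranks, tags, and buffer sizes are arbitrary).

/* MPI analogy for a Tports-style exchange: a non-blocking posted send on
 * one side, a receive filtered by sender identity and tag on the other. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double msg[8] = {0};
    if (rank == 0) {
        MPI_Request req;
        /* non-blocking post of the send, roughly like a Tports start-send */
        MPI_Isend(msg, 8, MPI_DOUBLE, 1, 7, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* blocking completion */
    } else if (rank == 1) {
        /* receive selectively, filtering on sender identity and tag */
        MPI_Recv(msg, 8, MPI_DOUBLE, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}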
Collective Communication Support
The basic hardware mechanism that supports collective communication is provided by the Elite switches. The Elite switches can forward a packet to several output ports, with the only restriction that these ports must be contiguous. In this way, a group of adjacent nodes can be reached by using a single hardware multicast transaction. When the destination nodes are not contiguous, this hardware mechanism is not usable. Instead, a software-based implementation is used, which builds a tree and exchanges point-to-point messages between the Elans without interrupting their processing nodes. These mechanisms constitute the basic blocks to implement collective communication patterns such as barrier synchronization, broadcast, reduce, and reduce-to-all.
Hardware-Based Multicast
The routing phases for multicast packets differ from those defined for unicast packets. With multicast, during the ascending phase the nearest common ancestor switch for the source node and the destination group is reached. After that, the turnaround routing step is performed and, during the second routing phase, the packet spans the appropriate links to reach all the destination nodes. The operation of the hardware-based multicast is outlined in Fig. . A process in a node injects a multicast packet into the network (see Fig. (a)). In Fig. (b) the packet reaches the nearest common ancestor switch, and then multiple branches are propagated in parallel. All nodes are capable of receiving a multicast packet, as long as the multicast set is physically contiguous. For a multicast packet to be successfully delivered, a positive acknowledgment must be received from all the recipients of the multicast group. The Elite switches combine the acknowledgments, as pioneered by the NYU Ultracomputer [, ], returning a single one to the source (see Figs. (c) and (d)). Acknowledgments are combined in a way that the worst ack wins (a network error wins over an unsuccessful transaction, which in turn wins over a successful one), returning a positive acknowledgment only when all the partners in the collective communication complete the distributed transaction with success. The network hardware guarantees the atomic execution of the multicast: either all nodes successfully complete the operation or none. It is worth noting that the multicast packet opens a set of circuits from the source to the destination set, and that multiple transactions (up to in the current implementation) can be pipelined
Quadrics. Fig. Execution of a remote DMA. The sending process initializes the DMA descriptor in the Elan memory () and communicates the address of the DMA descriptor to the command processor (). The command processor checks the correctness of the DMA descriptor () and adds it to the DMA queue (). The DMA engine performs the remote DMA transaction (). Upon transaction completion, the remote inputter notifies the DMA engine (), which sends an acknowledgment to the source Elan’s inputter (). The inputters or the hardware in the Elan can notify source (, , ) and destination (, , ) events, if needed
within a single packet. For example, it is possible to conditionally issue a transaction based on the result of a previous transaction. This powerful mechanism allows the efficient implementation of sophisticated collective operations and high-level protocols [, ].
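The "worst ack wins" rule described above can be sketched as a small combining function; the numeric encoding of the three outcomes is an assumption made for this illustration.

/* Sketch of acknowledgment combining: a network error dominates an
 * unsuccessful transaction, which in turn dominates a successful one. */
#include <stdio.h>

typedef enum { ACK_OK = 0, ACK_FAIL = 1, ACK_NET_ERROR = 2 } ack_t;

static ack_t combine(ack_t a, ack_t b)
{
    return (a > b) ? a : b;                /* keep the worst of the two */
}

int main(void)
{
    ack_t from_children[4] = { ACK_OK, ACK_OK, ACK_FAIL, ACK_OK };
    ack_t result = ACK_OK;
    for (int i = 0; i < 4; i++)
        result = combine(result, from_children[i]);
    printf("combined acknowledgment: %d\n", result);   /* prints 1 (ACK_FAIL) */
    return 0;
}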
Second Generation, QsNetII The QsNetII network has been designed to optimize the interprocessor communication performance in systems constructed from standard server building blocks. The organization of the network remains the same while
outperforming what was achieved with QsNet. The network is based on two building blocks: Elan, a communication processor for the network interface, and Elite, an -way high-performance crossbar switch. The network interface incorporates a number of innovative features to minimize latency for short messages, and to achieve the maximum bandwidth from a standard PCI-X interface. The network interface has a full -bit virtual addressing capability and can perform RDMA operations from user space to user space in -bit architectures. As in QsNet, an embedded I/O processor,
Quadrics. Fig. Hardware-based multicast in Quadrics. (a) A process in a node injects a packet into the network (the ascending routing phase is shown); (b) the packet reaches the destinations passing through a nearest common ancestor switch (descending routing phase shown); (c) the acknowledgments are combined; (d) the issuing process receives a single acknowledgment
which is user programmable, can be used to offload asynchronous protocol-handling tasks. As an improvement over the first generation, apart from the technology evolution summarized in Table and Table , QsNetII includes specific hardware at the Elan level for processing short reads and writes and for protocol control, to provide very low latencies. The Elan architecture allows data to flow from the PCI-X bus directly to the
output link, thus providing very low latency and high bandwidth.
Related Entries
All-to-All Broadcast
Buses and Crossbars
Clusters
Collective Communication
Collective Communication, Network Support for
Flow Control
Interconnection Networks
Meiko
MPI (Message Passing Interface)
Myrinet
Network Interfaces
Network of Workstations
Networks, Fault-Tolerant
Networks, Multistage
OpenSHMEM - Toward a Unified RMA Model
Routing (Including Deadlock Avoidance)
SCI (Scalable Coherent Interface)
Scheduling Algorithms
Shared-Memory Multiprocessors
Switch Architecture
Switching Techniques
Top
Ultracomputer, NYU

Bibliographic Notes and Further Reading
As indicated in the introduction, QsNet inherited the architecture of the Meiko CS- supercomputer whose design is described in []. Many of the performance characteristics of QsNet rely on the fat-tree topology used in the network implementation. Fat-tree networks were first introduced by Charles E. Leiserson in [], and later used in the Connection Machine CM supercomputer []. In [], Fabrizio Petrini and Marco Vanneschi presented the first formal model of k-ary n-trees based on a recursive definition of fat-tree networks built with constant arity switches. A description and evaluation of point-to-point communication in QsNet is reported in [], while extensions of the previous paper with an evaluation of permutation patterns, scalability of uniform traffic, analysis of I/O traffic, and collective communication are presented in [, ]. In [], the QsNet high-performance collective communication primitives are shown to be a key factor in providing resource-management functions orders of magnitude faster than previously reported results. A novel technique introduced by Quadrics is the use of independent network rails to overcome bandwidth limitations and enhance fault tolerance. In [], various avenues for exploiting multiple rails are presented and analyzed. Finally, QsNetII is presented and evaluated in [] while QsNetIII, which was never commercialized, is described in [].
Bibliography
. Beecroft J, Addison D, Hewson D, McLaren M, Roweth D, Petrini F, Nieplocha J () QsNetII: defining high-performance network design. IEEE Micro ():– . Beecroft J, Homewood M, McLaren M () Meiko cs- interconnect elan-elite design. Parallel Comput (–):– . Bell G () Ultracomputers: a teraflop before its time. Communications of the ACM ():– . Boden NJ, Cohen D, Felderman RE, Kulawik AE, Seitz CL, Seizovic JN, Su W-K () Myrinet: a gigabit per-second local area network. IEEE Micro ():– . Coll S, Duato J, Mora F, Petrini F, Hoisie A () Collective communication patterns on the Quadrics network. In: Getov V, Gerndt M, Hoisie A, Malony A, Miller B, (eds) Performance analysis and grid computing, Chapter I, Kluwer Academic, Norwell, MA, pp – . Coll S, Frachtenberg E, Petrini F, Hoisie A, Gurvits L () Using multirail networks in high-performance clusters. Concurrency Computat Pract Exper (–):– . Dally WJ, Seitz CL () Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans Comput C-():– . Dally WJ, Towles B () Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco, CA . Duato J, Yalamanchili S, Ni L () Interconnection networks: an engineering approach. Morgan Kaufmann, San Francisco, CA . Fernández J, Frachtenberg E, Petrini F () BCS-MPI: a new approach in the system software design for large-scale parallel computers. In: ACM/IEEE SC, Phoenix, Arizona . Frachtenberg E, Petrini F, Fernandez J, Pakin S, Coll S () Storm: lightning-fast resource management. In: IEEE/ACM SC, Baltimore, MD . Gropp W, Huss-Lederman S, Lumsdaine A, Lusk E, Nitzberg B, Saphir W, Snir M () MPI - the complete reference, vol , the MPI extensions. The MIT Press, Cambridge, London . Hellwagner H () The SCI Standard and Applications of SCI. In: Hellwagner H, Reinfeld A (eds) SCI: scalable coherent interface, vol of lecture notes in computer science. Springer, Berlin, pp – . Leiserson CE () Fat-Trees: universal networks for hardware efficient supercomputing. IEEE Trans Comput C-():– . Leiserson CE et al () The network architecture of the connection machine CM-. J Parallel Distrib Comput ():– . Petrini F, Vanneschi M () k-ary n-trees: high performance networks for massively parallel architectures. In: Proceedings of the th international parallel processing symposium, IPPS’, Geneva, Switzerland, pp – April
. Petrini F, Vanneschi M () Performance analysis of wormhole routed k-ary n-trees. Int J Found Comput Sci ():– . Petrini F, Feng W-C, Hoisie A, Coll S, Frachtenberg E () The Quadrics network: high-performance clustering technology. IEEE Micro, ():– . Petrini F, Frachtenberg E, Hoisie A, Coll S () Performance evaluation of the quadrics interconnection network. Cluster Comput ():– . Pfister GF, Norton VA () “Hot Spot” contention and combining in multistage interconnection networks. IEEE Trans Comput C-():– . Quadrics Supercomputers World Ltd () Elan programming manual. Bristol, England, UK . Roweth D, Jones T () QsNetIII an adaptively routed network for high performance computing. In: Proceedings of the th IEEE symposium on high performance interconnects, HOTI ’ Washington, DC, pp – . Seifert R () Gigabit ethernet: technology and applications for high speed LANs. Addison-Wesley Boston, MA . Snir M, Otto S, Huss-Lederman S, Walker D, Dongarra J () MPI - the complete reference, vol , the MPI core. The MIT Press, Cambridge, London . Tolmie D, Boorman TM, DuBois A, DuBois D, Feng W, Philp I () From HiPPI- to HiPPI-: a changing of the guard and gateway to the future. In: Proceedings of the th international conference on parallel interconnects (PI’), Anchorage, AK
. Vogels W, Follett D, Hsieh J, Lifka D, Stern D () Treesaturation control in the AC velocity cluster. In: Proceedings of the eighth symposium on high performance interconnects (HOTI’), Stanford University, Palo Alto CA
Quantum Chemistry NWChem
Quantum Chromodynamics (QCD) Computations QCD (Quantum Chromodynamics) Computations
Quicksort Sorting
R Race Race Conditions
Race Conditions Christoph von Praun Georg-Simon-Ohm University of Applied Sciences, Nuremberg, Germany
Synonyms Access anomaly; Critical race; Determinacy race; Harmful shared-memory access; Race; Race hazard
Definition A race condition occurs in a parallel program execution when two or more threads access a common resource, e.g., a variable in shared memory, and the order of the accesses depends on the timing, i.e., the progress of individual threads. The disposition for a race condition is in the parallel program. In different executions of the same program on the same input access events that constitute a race can occur in different order, which may but does not generally result in different program behaviors (nondeterminacy). Race conditions that can cause unintended nondeterminacy are programming errors. Examples of such programming errors are violations of atomicity due to incorrect synchronization.
Discussion Introduction Race conditions are common on access to shared resources that facilitate inter-thread synchronization or communication, e.g., locks, barriers, or concurrent data structures.
Such structures are called concurrent objects [], since they operate correctly even when being accessed concurrently by multiple threads. The term concurrent refers to the fact that the real-time order in which accesses from different threads execute is not predetermined by the program. Concurrent may but does not necessarily mean simultaneous, i.e., that the execution of accesses overlaps in real time. Concurrent objects can be regarded as arbiters that decide on the outcome of race conditions. Threads that participate in a race when accessing a concurrent object will find agreement about the outcome of the race; hence access to concurrent objects is a form of inter-thread synchronization. A common and intuitive correctness criterion for the behavior of a concurrent object is linearizability [], which informally means that accesses from different threads behave as if the accesses of all threads were executed in some total order that is compatible with the program order within each thread and the real-time order. Linearizability is not a natural property of concurrent systems. Ordinary load and store operations on weakly-ordered shared memory, e.g., can expose many more behaviors than permitted by the strong rules of linearizability. Hence when implementing concurrent objects on such systems, specific algorithms or hardware synchronization instructions are required to ensure correct, i.e., linearizable, operation. Race conditions on access to resources that are not designed as concurrent objects, e.g., shared memory locations with ordinary load and store operations, are commonly considered as programming errors. An example of such errors are data races. Informally, a data race occurs when multiple threads access the same variable in shared memory; the accesses may occur simultaneously and at least one access modifies the variable.
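A minimal sketch of such a data race, using POSIX threads; the loop count is arbitrary.

/* Two threads update the same shared counter with ordinary loads and
 * stores and no synchronization: a data race, and the final value
 * depends on the timing of the two threads. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                 /* shared variable */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                       /* unsynchronized read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* often prints less than 2000000 because increments are lost */
    printf("counter = %ld\n", counter);
    return 0;
}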
Model and Formalization
The definition and formalization of the terminology is adopted from Netzer and Miller [].
Events. An event represents an execution instance of a statement or statement sequence in the program. Every event specifies the set of shared memory locations that are read and/or written. In this entry, events typically refer to individual read or write operations that access a single shared memory location.
Simultaneous events. Events do not necessarily occur instantaneously, and in parallel programs, events can be unordered. A simple model that relates the occurrence of events to wall clock time is as follows: The timing of an event is specified by its start and end instants. Events e1 and e2 are simultaneous if the start of e2 occurs after the start of e1 but before the end of e1, or vice versa. Figure illustrates the concept.
Temporal ordering relation. Events that are not simultaneous are ordered by a temporal ordering relation →T, such that e1 →T e2 :⇔ end(e1) < start(e2). In the execution of a sequential program →T is a total order; for parallel programs →T is typically a partial order. For clarity of the presentation and without limiting generality, the temporal ordering relations in the examples of this entry are total orders.
Shared data dependence relation. Another relation among events is the shared data dependence relation →D, which specifies how events communicate through memory. e1 →D e2 holds if e2 reads a value that was written by e1. This relation serves to distinguish executions
that have the same temporal ordering but differ in the communication among simultaneous events. Program execution A program execution is a triplet T
D
P = ⟨E, →, →⟩. Feasible program executions To characterize a race condition in a program execution P, it is necessary to contrast P with other possible executions P′ of the same program on the same input. P′ can differ from P in the temporal ordering or shared data dependence relation (or both). Moreover, P′ should be a prefix of P, which means that each thread in P′ performs the same or an initial sequence of events as the corresponding thread in P. In other words, the projection for each thread in P′ is the same or a prefix of the projection of the corresponding thread in P. The set of such program executions P′ is called the set of feasible program executions Xf . A feasible execution is one that adheres to the control and data dependencies prescribed by the program. It is reasonable to contrast P with execution prefixes P′ , since the focus of interest is on the earliest point of the program execution P where a race condition occurred and thus non-determinacy may have been introduced. The execution that follows from that point on may contain further race conditions or erroneous program behavior but these may merely be a consequence of the initial race condition. Xf contains, e.g., program executions P′ that differ from P only in the temporal ordering relation. In other words, the timing in P′ can deviate from the timing in P but not such that any shared data dependencies among events from different threads are resolved differently. Thus P and P′ have the same functional behavior. An execution that differs from P in Fig. in such a way is, T′
start end event e1 -------------|----------|----> start end event e2 ----|-----------|------------> start end event e3 ------|-----|----------------> Race Conditions. Fig. e is simultaneous to e and e
T′
T′
T′
e.g., P′ = ⟨{e , e , ..., e }, e → e → e → e → e , ∅⟩: Reordering the read of shared variable x does not alter the shared data dependence relation. A somewhat more substantial departure from P but still within the scope of feasible program executions, are executions where shared data dependencies among events from different threads are resolved differently. An execution that differs from P in Fig. in such a way T ′′
T ′′
T ′′
D′′
is, e.g., P′′ = ⟨{e , e , e , e }, e → e → e → e , e → e ⟩: The update of x (event e ) occurs in this execution before the read of x in event e . This induces a new data
Race Conditions
initially x = 0 thread-1 (1) (2) (3)
thread-2
r1 = x if (r1 == 0) x = r1 + 1
(4) (5)
r2 = x x = r2 + 1
Race Conditions. Fig. Feasible execution: T T T T P = ⟨{e , e , ..., e }, e → e → e → e → e , ∅⟩. Events ei correspond to the execution of statements marked (i). There are no shared data dependencies in this execution
dependence. It is sufficient that P′′ specifies a prefix of P, namely, up to the event where the change in the shared data dependence manifests. Finally, an execution of the program in Fig. that T ′′′
T ′′′
is not feasible: P′′′ = ⟨{e , e , e , e , e }, e → e → T ′′′
T ′′′
D′′′
e → e → e , e → e ⟩. This execution is not possible, since e would not be executed if e read a value different from . Conflicting events A pair of events is called conflicting, if both events access the same shared resource and at least one access modifies the resource [].
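The definitions above can be summarized in a small sketch (the type and names are assumed for illustration; they do not appear in the entry): an event carries its start and end instants together with its read and write sets, and the predicates below check temporal ordering, simultaneity, and conflicts directly from the definitions.

// A minimal sketch (assumed types and names) of the event model: events with
// start/end instants, read/write sets, and the predicates defined above.
import java.util.Set;

record Event(long start, long end, Set<String> reads, Set<String> writes) {

    // e1 ->T e2  iff  end(e1) < start(e2)
    static boolean temporallyOrdered(Event e1, Event e2) {
        return e1.end() < e2.start();
    }

    // Events are simultaneous iff neither is temporally ordered before the other.
    static boolean simultaneous(Event e1, Event e2) {
        return !temporallyOrdered(e1, e2) && !temporallyOrdered(e2, e1);
    }

    // Conflicting: both access a common location and at least one writes it.
    static boolean conflicting(Event e1, Event e2) {
        for (String loc : e1.writes()) {
            if (e2.reads().contains(loc) || e2.writes().contains(loc)) return true;
        }
        for (String loc : e2.writes()) {
            if (e1.reads().contains(loc)) return true;
        }
        return false;
    }
}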
General Races and Data Races
Given a program execution P = ⟨E, →T, →D⟩, let Xf be the set of feasible executions corresponding to P.

General race A general race exists between conflicting events a, b ∈ E, if there is a P′ = ⟨E′, →T′, →D′⟩ ∈ Xf such that a, b ∈ E′ and
1. b →T′ a if a →T b, or
2. a →T′ b if b →T a, or
3. neither a →T′ b nor b →T′ a.
In other words, there is some order among events a and b but the order is not predetermined by the program (cases 1 and 2), or executions are feasible where a and b occur simultaneously (case 3).

Data race A data race is a special case of a general race. A data race exists between conflicting memory accesses a, b ∈ E, if there is a P′ = ⟨E′, →T′, →D′⟩ ∈ Xf such that a, b ∈ E′ and neither a →T′ b nor b →T′ a. In other words, executions are feasible where a and b occur simultaneously.
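The following Java sketch (illustrative names, not from the entry) shows a general race that is not a data race: the two conflicting writes may execute in either order, and the program does not determine which, but the common lock prevents them from ever executing simultaneously.

// Sketch: a general race that is not a data race.
public class GeneralRaceNotDataRace {
    static int x = 0;
    static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { synchronized (lock) { x = 1; } });
        Thread t2 = new Thread(() -> { synchronized (lock) { x = 2; } });
        t1.start(); t2.start();
        t1.join();  t2.join();
        // Prints 1 or 2 depending on the outcome of the race: the conflicting
        // writes may occur in either order (a general race), but the lock
        // prevents them from occurring simultaneously (no data race).
        System.out.println(x);
    }
}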
Programmers commonly use interleaving semantics when reasoning about concurrent programs. Interleaving semantics means that the possible behaviors of a parallel program correspond to sequential executions of the events interleaved from the different threads. This property of an execution is also called serializability []. The resulting intuitive execution model is also known as sequential consistency []. The motivation for the distinction between data races and general races is to explicitly term those race conditions where the intuitive reasoning according to interleaving semantics may fail; these are the data races. A common situation where interleaving semantics fails in the presence of data races is concurrent access to shared memory on systems whose memory consistency model is weaker than sequential consistency []. A data race ⟨a, b⟩ means that events a and b have the potential of occurring simultaneously in some execution. Simultaneous access can be prevented by augmenting the access sites in the program code with some form of explicit concurrency control. In that way, every data race can be turned into a general race that is not a data race. Race conditions and concurrent objects For linearizable concurrent objects, the distinction between data races and general races among events is immaterial. This is because linearizability requires that events on concurrent objects that are simultaneous behave as if they occurred in some order.
Feasible and Apparent Races
Feasible races So far, the definition of a race condition in P is based on the set of feasible program executions Xf. Thus, locating a race condition requires that the feasibility of a program execution P′, i.e., P′ ∈ Xf, is assessed, and that in turn requires a detailed analysis of shared data and control dependencies. This problem is NP-hard [, ]. Race conditions defined on the basis of feasible program executions are called feasible general races and feasible data races, respectively.

Apparent races Due to the difficulty of verifying the feasibility of a program execution P, common race detection methods widen the set of program executions to consider. The set Xa ⊇ Xf contains all executions P′ that are prefixes of P with the following restriction: The temporal ordering →T′ obeys the program order and inter-thread dependencies due to explicit synchronization; this relation is also known as the happened-before relation []. It is not required that P′ be feasible with respect to shared data dependencies or control dependencies that result from those. Figure gives an example. (Some of the literature does not explicitly distinguish apparent and feasible races, i.e., the term "race" may refer to either of the two definitions, depending on the context.)

initially x,y = 0
thread-1               thread-2
(1) x = 1              (3) r1 = y
(2) y = 1              (4) if (r1 == 1)
                       (5)   r2 = x

Race Conditions. Fig. This program does not have explicit synchronization that would restrict the temporal ordering →T′. Thus the execution P′ = ⟨{e1, ..., e5}, e3 →T′ e4 →T′ e5 →T′ e1 →T′ e2, ∅⟩ is in Xa. P′ is however not feasible, since control dependence does not allow execution of e5 if the read in e3 returns 0

Race conditions that are determined using a superset of the feasible program executions, e.g., Xa, are called apparent general races and apparent data races, respectively. Note that this definition of race conditions depends on the specifics of the synchronization mechanism. The notion of feasible races is more restrictive than the notion of apparent races. This means that an algorithm that detects apparent races considers executions that are not feasible and therefore may report race conditions that could never occur in a real program execution. Such reports are called spurious races.

Figure shows another example to illustrate the notion of apparent races further. This program does not have explicit synchronization and has read and write accesses to shared variables x and y.

initially x,y = 0
thread-1               thread-2
(1) r1 = x             (4) r2 = y
(2) if (r1 == 1)       (5) if (r2 == 1)
(3)   y = 1            (6)   x = 1

Race Conditions. Fig. Program with read and write accesses to shared variables that has neither a feasible nor an apparent data race

Perhaps surprisingly, the program in Fig. has neither an apparent nor a feasible data race. The reason is that there is no program execution P that contains event e3 or e6. Hence the sets Xf and Xa, which are defined based on P, cannot contain those events either, and so neither Xf nor Xa would lead us to determine a feasible or an apparent race condition, respectively.
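For concreteness, the program of the last figure can be rendered as the following Java sketch (class and variable names are illustrative). Since each write is control dependent on a read that would have to observe the other thread's write, no execution contains e3 or e6, and thus no execution contains a pair of conflicting accesses.

// Sketch of the figure's program: neither guarded write can ever execute.
public class NoApparentRace {
    static int x = 0, y = 0;   // ordinary (non-volatile) shared variables

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            int r1 = x;          // (1)
            if (r1 == 1) {       // (2)
                y = 1;           // (3) never executes: x == 1 would require (6)
            }
        });
        Thread t2 = new Thread(() -> {
            int r2 = y;          // (4)
            if (r2 == 1) {       // (5)
                x = 1;           // (6) never executes: y == 1 would require (3)
            }
        });
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println("x=" + x + ", y=" + y);  // always prints x=0, y=0
    }
}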
Actual data races An actual data race exists in execution P = ⟨E, →T, →D⟩, if events a, b ∈ E are conflicting and both events occur simultaneously, i.e., neither a →T b nor b →T a.
Race Conditions as Programming Errors
Some race conditions are due to programming errors and some are part of the intended program behavior. The distinction between the two is commonly made along the distinction between data races and general races: data races are programming errors, whereas general races that are not data races are considered "normal" behavior. While this distinction often holds, exceptions exist and are common.

First, there are correct programs that entail the occurrence of data races at run time. In non-blocking concurrent data structures [], e.g., data races are intended and accounted for by the algorithmic design. The development and implementation of such algorithms that "tolerate" race conditions is extremely difficult, since it requires a deep understanding of the shared memory consistency model. Moreover, the implementation of such algorithms is typically not portable, since different platforms may have different shared memory semantics (see "Data-Race-Free Programs"). Data races that are not manifestations of program bugs are called benign data races.

Second, there are incorrect programs that have executions with general races that are not data races. An example of such a programming error is a violation of atomicity []: Accesses to a shared resource occur in separate critical sections, although the programmer's intent was to have the accesses in the same atomicity scope.

Non-determinacy Race conditions can, but do not necessarily, introduce non-determinacy in parallel programs. For a detailed discussion of this aspect, refer to the entry on "Determinacy".
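A common shape of the atomicity violation just described can be sketched as follows (the class and its methods are hypothetical): every access to the shared field is individually protected by a lock, so there is no data race, yet the check and the update were intended to form one atomic step and can be separated by another thread.

// Sketch: a violation of atomicity that is a general race but not a data race.
public class AtomicityViolation {
    private int balance = 100;

    synchronized int getBalance() { return balance; }
    synchronized void debit(int amount) { balance -= amount; }

    // Intended invariant: never debit more than the current balance.
    void withdraw(int amount) {
        if (getBalance() >= amount) {   // critical section 1
            // Another thread may withdraw here, between the two sections.
            debit(amount);              // critical section 2
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicityViolation account = new AtomicityViolation();
        Thread t1 = new Thread(() -> account.withdraw(100));
        Thread t2 = new Thread(() -> account.withdraw(100));
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(account.getBalance());  // may print -100
    }
}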
Data-Race-Free Programs
Data races are an important concept to pinpoint certain behaviors of parallel programs that are manifestations of programming errors. But beyond this role, the concept of data races is also important for system architects who design shared memory consistency models. Intuitively, a shared memory consistency model defines the possible values that a read may return, considering prior writes. Consider the program execution in Fig. :

thread-1                thread-2
(1) write(x, 1)         (4) write(y, 5)
(2) write(x, 2)         (5) write(y, 7)
(3) read(y, 5)          (6) read(x, 1)

Race Conditions. Fig. Program execution with a data race. This execution is not serializable

Despite the fact that this execution is correct according to the memory consistency model of common multiprocessor architectures [], such program behavior is difficult to analyze and explain. The reason is that the program execution is not serializable, i.e., there is no interleaving of events from different threads that obeys program order and that would justify the results returned by the reads []. Why should such behavior be permitted? For optimization of the memory access path, a processor or compiler can reorder and overlap the execution of memory accesses such that the temporal ordering relation →T, and with it the shared data dependence relation →D, can differ from the program order. For the program in Fig. , one of the processors hoists the read before the writes, thus permitting the result. To allow such optimization on the one hand, and to give programmers an intuitive programming abstraction on the other hand, computer architects have
phrased a minimum guarantee that shared memory consistency models should adhere to: Data-race-free programs should only have executions that could also occur on sequentially consistent hardware. In other words, the behavior of parallel programs that are data-race-free follows the intuitive interleaving semantics. So what does data-race-free mean?

Data-race-free in computer architecture Adve and Hill [] define a data race as a property of a single program execution: A pair of conflicting accesses participates in a data race if their execution is not ordered by the happened-before relation []. The happened-before relation is defined as the combination of program order and the synchronization order due to synchronization events from different threads. Unlike the definition according to Netzer and Miller [], this definition allows the presence or absence of a data race to be validated by inspecting a single program execution. An execution is data-race-free if it does not have a data race. A program is data-race-free if all possible executions are data-race-free. "Possible executions" are thereby implicitly defined as any executions permitted by the processor for which the memory model is defined.

Data-race-free in parallel programming languages The developers of parallel programming languages also specify the semantics of shared memory using a memory model [, ]. Like memory models at the hardware level, the minimum requirement is that data-race-free programs can only have behaviors that could result from sequentially consistent executions. The definition of what constitutes a data race is however more complex, since building the model entirely around the happened-before relation is not sufficient. This is because the starting point for the definition of data races is programs, not program executions. It is necessary to explicitly define which executions are feasible for a given program.

Causality Manson et al. [] define a set of requirements, called causality, that a program execution has to meet to be feasible. As an example, consider the program in Fig. . The causality requirement serves to rule out the feasibility of executions that contain event e3 or e6. Intuitively, causality requires that there is a noncyclic
chain of "facts" that justify the execution of a statement. The writes e3 and e6 may only execute if the reads e1 and e4 returned the value 1. That, in turn, requires that the writes executed. This kind of cyclic justification order is forbidden by the Java memory model. The program is hence data-race-free, since executions that include the writes e3 and e6 are not feasible due to a violation of the causality rule. More generally, the causality requirement reflects on the overall definition of "data-race-free" as follows: Programs that have no data races in sequentially consistent executions are called data-race-free. Executions of such programs are always sequentially consistent.

Alternatives to causality Causality has led to significant complexity in the description of the Java memory model, and it is not evident that the rules for causality are complete. Hence, alternative proposals have been made to specify feasible program executions, e.g., by Saraswat et al. [].
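In Java, the practical consequence of the data-race-free guarantee can be sketched as follows (illustrative names): declaring the flag volatile turns the conflicting accesses into synchronization actions, the program becomes data-race-free, and only sequentially consistent behaviors remain; removing the volatile modifier reintroduces a data race on both fields.

// Sketch: a message-passing idiom made data-race-free with a volatile flag.
public class DataRaceFreeFlag {
    static int data = 0;
    static volatile boolean ready = false;   // without 'volatile': data race

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            data = 42;
            ready = true;      // volatile write: ordered before the read below
        });
        Thread reader = new Thread(() -> {
            while (!ready) { } // spin until the volatile flag is observed
            System.out.println(data);  // guaranteed to print 42
        });
        writer.start(); reader.start();
        writer.join();  reader.join();
    }
}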
Effects of Data Races on Program Optimization
Midkiff and Padua [] observed that program transformations that preserve the semantics of sequential programs may not be correct for parallel programs with data races. Thus, the presence of data races inhibits the optimization potential of compilers for parallel programs.
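The following sketch (illustrative code, not taken from Midkiff and Padua) shows why: hoisting a loop-invariant load is correct for a sequential program, but applied to a racy flag it can prevent the loop from ever observing the other thread's write.

// Sketch: a sequentially valid transformation that breaks a racy program.
public class RacyTerminationFlag {
    static boolean stop = false;   // accessed without synchronization: a data race

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            // Original loop:   while (!stop) { /* work */ }
            // For sequential code, a compiler may hoist the racy read of 'stop'
            // out of the loop, effectively producing
            //     boolean s = stop; while (!s) { /* work */ }
            // which never terminates if 'stop' is only set by another thread.
            while (!stop) {
                Thread.onSpinWait();
            }
        });
        worker.start();
        Thread.sleep(100);
        stop = true;               // racy write; may never be observed
        worker.join();
    }
}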
Race Conditions in Distributed Systems Race conditions can also occur in distributed systems, e.g., when multiple client computers communicate with a shared server. Conceptually, the shared server is a concurrent object that gives certain guarantees on the behavior of concurrent accesses. Typically, concurrency control at the level of the communication protocol or at the level of the server implementation will ensure that semantics follow certain intuitive and well-understood correctness conditions such as linearizability. In MPI programs [], e.g., a race condition may occur at a receive operation (MPI_Recv) where the source of the message to be received can be arbitrary (MPI_ANY_SOURCE). The order in which concurrent messages from different sources are received can be different in different program runs. The receive operation
serializes the arrival of consecutive messages and hence exercises concurrency control.
Related Entries
Determinacy
Memory Models
Race Detection Techniques
Bibliographic Notes and Further Reading Race conditions and data races have been discussed in early literature about the analysis and debugging of parallel programs [, , , , ]. An informal characterization of the term “race” was, e.g., given by Emrath and Padua []: A race exists when two statements in concurrent threads access the same variable in a conflicting way and there is no way to guarantee execution ordering. A formal definition is given by Netzer and Miller [], which is based on the existence of program executions with certain properties as discussed by this entry. This definition however does not lend itself to the design of a practical algorithm that validates the presence or absence of a race condition in a given program execution (see the related entry on Race Detection Techniques). Hence for practical applications, slightly different definitions of data races are common. In the context of shared memory models, the interest is mainly in characterizing the absence of data races, hence the definition of data-race-free program executions and data-race-free programs in [, , ]. Most definitions for data races in the context of static analysis tools make a conservative approximation of the program’s data-flow and control-flow and thus are akin to Netzer and Miller’s concept of apparent data races. For dynamic data-race detection, the approximations made for the definition of what constitutes a data race may not be conservative any longer: While the goal of dynamic race detection tools is to detect all feasible races on the basis of a single-input-single-execution (SISE) [], there are feasible data races according to [] that may be overlooked according to these definitions.
Bibliography . Adve S, Gharachorloo K () Shared memory consistency models: a tutorial. IEEE Comput :–
. Adve S, Hill M (June ) A unified formalization of four sharedmemory models. IEEE Trans Parallel Distrib Syst ():– . Arvind, Maessen J-W () Memory model = instruction reordering + store atomicity. SIGARCH Comput Archit News ():– . Dinning A, Schonberg E (December ) Detecting access anomalies in programs with critical sections. In: Proceedings of the ACM/ONR workshop on parallel and distributed debugging, pp – . Emrath PA, Padua DA (January ) Automatic detection of nondeterminacy in parallel programs. In: Proceedings of the ACM workshop on parallel and distributed debugging, pp – . Flanagan C, Freund SN () Atomizer: a dynamic atomicity checker for multithreaded programs. In: POPL’: Proceedings of the st ACM SIGPLAN-SIGACT symposium on principles of programming languages, ACM, New York, pp – . Herlihy MP, Wing JM (July ) Linearizability: a correctness condition for concurrent objects. ACM Trans Program Lang Syst (TOPLAS) :– . Lamport L (July ) Time, clock and the ordering of events in a distributed system. Commun ACM ():– . Lamport L (July ) How to make a correct multiprocess program execute correctly on a multiprocessor. IEEE Trans Comput ():– . Manson J, Pugh W, Adve S () The Java memory model. In: Proceedings of the symposium on principles of programming languages (POPL’). ACM, New York, pp – . Mellor-Crummey J (May ) Compile-time support for efficient data race detection in shared-memory parallel programs. In: Proceedings of the workshop on parallel and distributed debugging. ACM, New York, pp – . Message Passing Interface Forum (June ) MPI: a message passing interface standard. http://www.mpi-forum.org/ . Michael MM, Scott ML () Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: PODC’: Proceedings of the th annual ACM symposium on principles of distributed computing. ACM, New York, pp – . Midkiff S, Padua D (August ) Issues in the optimization of parallel programs. In: Proceedings of the international conference on parallel processing, pp – . Nelson C, Boehm H-J () Sequencing and the concurrency memory model (revised). The C++ Standards Committee, Document WG/N J/- . Netzer R, Miller B () On the complexity of event ordering for shared-memory parallel program executions. In: Proceedings of the international conference on parallel processing, Pennsylvania State University, University Park. Pennsylvania State University Press, University Park, pp – . Netzer R, Miller B (July ) Improving the accuracy of data race detection. Proceedings of the ACM SIGPLAN symposium on principles and practice of parallel programming PPOPP, published in ACM SIGPLAN NOTICES ():– . Netzer R, Miller B (March ) What are race conditions? Some issues and formalizations. ACM Lett Program Lang Syst (): –
. Nudler I, Rudolph L () Tools for the efficient development of efficient parallel programs. In: Proceedings of the st Israeli conference on computer system engineering . Saraswat VA, Jagadeesan R, Michael M, von Praun C () A theory of memory models. In: PPoPP’: Proceedings of the th ACM SIGPLAN symposium on principles and practice of parallel programming. ACM, New York, pp – . Shasha D, Snir M (April ) Efficient and correct execution of parallel programs that share memory. ACM Trans Program Lang Syst ():– . Sterling N (January ) WARLOCK: a static data race analysis tool. In: USENIX Association (ed) Proceedings of the USENIX winter conference, San Diego, pp – . Weaver DL, Germond T () The SPARC architecture manual (version )
Race Detection Techniques
Christoph von Praun
Georg-Simon-Ohm University of Applied Sciences, Nuremberg, Germany

Synonyms
Anomaly detection

Definition
Race detection is the procedure of identifying race conditions in parallel programs and program executions. Unintended race conditions are a common source of error in parallel programs and hence race detection methods play an important role in debugging such errors.
Discussion
Introduction
The following paragraphs summarize the notion of race conditions and data races. An elaborate discussion and detailed definitions are given in the corresponding related entry on "Race Conditions."

Race conditions. A widely accepted definition of race conditions is given by Netzer and Miller []: A general race, or race for short, is a pair of conflicting accesses to a shared resource for which the order of accesses is not guaranteed by the program, i.e., the accesses may execute in either order or simultaneously. Data races are a subclass of general races for which the conflicting memory accesses may occur simultaneously.
Data races. The bulk of the literature on race detection discusses methods for the detection of data races. Most data races are programming errors, typically due to omitted synchronization. Data races can potentially cause the atomicity of critical sections to fail.

Violations of atomicity. Race conditions that are not data races are common in parallel program executions. Some of these races are however programming errors related to incorrect use of synchronization. Common examples are accesses to a shared resource that occur in separate critical sections, although the programmer's intent was to have the accesses in the same atomicity scope. Such programming errors are commonly called violations of atomicity []. Taken literally, this terminology is perhaps too general, since data races, which are not implied by this term, can also cause the atomicity of a critical section to fail.

Detection accuracy. An ideal race detection method should have two properties: (1) A race condition is reported if the program or a program execution, whichever is checked, has some race condition. This property is called soundness. Notice that if there are several race conditions, soundness does not require that all of them are reported. (2) Every reported race condition is a genuine race condition in a feasible program execution. This property is called completeness [].

Detection methods. Static methods analyze the program code and do not require that the program executes. Sound and complete detection of feasible race conditions in a parallel program by static analysis is in general undecidable, due to a theoretical result by Ramalingam []. Dynamic methods analyze the event stream from a single program execution, and thus determine the occurrence of a race condition in one program execution. Such an analysis can be sound and complete. Naturally, dynamic methods cannot make general statements about the presence of a race condition in a program. However, the accuracy of a dynamic analysis can be more or less sensitive to the timing of threads. Ideally, a dynamic race detection algorithm has the single input, single execution (SISE) property [], which means that "A single execution instance is sufficient to determine the existence of an anomaly (data race) for a given input." In other words, the detection accuracy should be oblivious to the timing of the execution.
When designing a static or dynamic race detection method, trade-offs between performance, resource usage, and accuracy are made. In practice, it is not uncommon that useful tools are unsound, incomplete, or both with respect to the detection of feasible race conditions.
Data Race Detection
There are two principal approaches to detecting data races: happened-before-based (HB-based) and lockset-based methods.
HB-Based Data Race Detection
HB-based data race detection typically operates at run time, thus validating the occurrence of data races in a program execution.

Data race definition. A data race ⟨a, b⟩ is deemed to have occurred in a program execution if a and b are conflicting accesses to the same resource and the accesses are causally unordered. The causal order among run-time events in a concurrent system is thereby defined through the happened-before relation by Lamport []. This definition assesses the existence of a data race on one particular program execution and its HB-relation. Note that this is different from the definition of feasible and apparent data races by Netzer and Miller [], since their definition assesses the occurrence of a data race on a set of program executions. Thus, it is possible that an execution is classified as "data-race-free" according to the above (HB-based) definition although it is not data-race-free according to Netzer and Miller's definition []. An example of such a program execution, which is detailed in a later paragraph, is given in Fig. .

Happened-before relation. A pair of events ⟨a, b⟩ is ordered according to the happened-before relation (HB-relation) if a and b occur in the same thread and are ordered according to the sequential execution of that thread, or if a and b are synchronization events in different threads that are ordered due to the synchronization order. Overall, the happened-before relation is a transitive partial order. The notion of synchronization order depends on the underlying programming model and has been originally described for point-to-point communication in
[Figure: two executions, (a) and (b), of the same program with threads thread-1, thread-2, and thread-3; thread-1 performs x = 1 and thread-3 performs x = 3, and each thread additionally updates y inside lock(L)/unlock(L); the arrows mark intra-thread order, synchronization order, and happened-before order]
Race Detection Techniques. Fig. In execution (a), the assignments to variable x in thread-1 and thread-3 are ordered by the HB-relation. There is however a feasible data race, since these assignments may occur simultaneously, as, e.g., in execution (b)
distributed systems []. The synchronization order has also been defined for synchronization operations on multiprocessor architectures [, ], for fork-join synchronization [], and for critical sections [].

Algorithmic principles. HB-based race detectors perform three tasks: (1) track the HB-relation within each thread; (2) keep an access history, as a sequence of logical timestamps, for each shared resource; (3) validate that, for every resource, critical accesses are ordered by the HB-relation. Various HB-based race detection algorithms have been proposed; they differ in the precise methods and run-time organization used for these three tasks.

Accuracy. If the HB-relation is recorded precisely, HB-based data race detection on a specific program execution can be complete and also sound according to the HB-based definition of data races. However, HB-based data race detection is not sound with respect to data races according to Netzer and Miller's definition []: Fig. illustrates two executions of the same program, one where the assignments to x are ordered by the HB-relation (a), and one where the assignments occur simultaneously (b). An HB-based race detector applied to execution (a) would not report the data race. In other words, HB-based race detection does not have the SISE property.

Encodings and algorithms for computing the HB-relation. The HB-relation can be represented and
recorded precisely during a program execution with standard vector clocks [, ]. Race detectors that build on standard vector clocks are, e.g., the family of Djit detectors by Pozniansky and Schuster [, ]. The algorithm requires that the maximum number of threads is specified in advance. A theoretical result by Charron-Bost [] implies that the precise computation of vector clock values for n events in an execution with p processors requires O(np) time and space. Due to this result, the encoding of the HB-relation with vector clocks is considered to be impractical [].

Performance optimizations. There are two main sources of overhead in HB-based race detection: First, the tracking of logical time, usually by vector clocks or variations thereof. Second, access operations to shared memory and synchronization operations are instrumented to record access histories and verify orderings. Flanagan and Freund [] optimize vector clocks with an adaptive representation of timestamps as follows: The majority of memory accesses target thread-local, lock-protected, or shared-read data; for such "safe" access patterns a very efficient and compact timestamp representation is used. This representation is adaptively enhanced to a full vector clock when the access patterns deviate from the safe ones; then, access ordering can be validated according to the HB-relation.
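The vector clock machinery that these detectors build on can be sketched as follows (a simplified illustration with assumed names; actual detectors additionally maintain per-location access histories and compressed timestamp representations). Consistent with the remark above, the maximum number of threads must be fixed when a clock is created.

// Minimal vector clock sketch. Entry i counts the events of thread i that are
// known to have happened before the current point of the owning thread.
import java.util.Arrays;

final class VectorClock {
    private final int[] clock;

    VectorClock(int numThreads) { this.clock = new int[numThreads]; }

    // Local step of thread tid: increment its own component.
    void tick(int tid) { clock[tid]++; }

    // Synchronization edge (e.g., lock release/acquire): componentwise maximum.
    void joinWith(VectorClock other) {
        for (int i = 0; i < clock.length; i++) {
            clock[i] = Math.max(clock[i], other.clock[i]);
        }
    }

    // this happened-before other iff this <= other componentwise and this != other.
    boolean happenedBefore(VectorClock other) {
        boolean strictlyLess = false;
        for (int i = 0; i < clock.length; i++) {
            if (clock[i] > other.clock[i]) return false;
            if (clock[i] < other.clock[i]) strictlyLess = true;
        }
        return strictlyLess;
    }

    // Two access timestamps are concurrent (a potential race) if neither
    // happened before the other.
    static boolean concurrent(VectorClock a, VectorClock b) {
        return !a.happenedBefore(b) && !b.happenedBefore(a);
    }

    VectorClock copy() {
        VectorClock c = new VectorClock(clock.length);
        System.arraycopy(clock, 0, c.clock, 0, clock.length);
        return c;
    }

    @Override public String toString() { return Arrays.toString(clock); }
}

A detector would keep, for every shared location, copies of the timestamps of the last write and of recent reads, and report a race whenever a new conflicting access is concurrent with a recorded one.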
Variants of vector clocks with alternative encodings of the HB-relation have been developed to improve the efficiency of tracking logical time and of checking the ordering among pairs of accesses. Examples of such encodings are the English-Hebrew labeling (EH) by Nudler and Rudolph [], task recycling (TR) by Dinning and Schonberg [], and offset-span labeling (OS) by Mellor-Crummey []. These algorithms make trade-offs at different levels: For example, OS, unlike TR, avoids global communication when computing the timestamps; this makes the algorithm practical for distributed environments. However, OS is limited to nested fork-join synchronization patterns.

The implementation of encodings can be optimized further for efficiency, sometimes however at the cost of reducing the accuracy of the detection. Dinning and Schonberg [], e.g., limit the size of the access histories per variable. Some data races may not be reported due to this optimization, a case that seemed to occur infrequently in their experiments. A similar optimization is described by Choi and Min []: Their algorithm records only the most recent read and write timestamps for a shared variable; as a consequence, data races that occurred in an execution may not be reported. They argue however that this optimization preserves the soundness of the detection, i.e., if there is some data race according to the HB-definition, it is reported. An iterative debugging process based on deterministic replay will eventually identify all data races and thus lead to a program execution that is data-race-free (again according to the HB-definition). The significant contribution of Choi and Min [] is their systematic and provably correct debugging methodology for data races.

Compilers can reduce the overhead of race detection by pruning the instrumentation of accesses that are provably not involved in data races, such as accesses to thread-local data. Moreover, accesses that lead to redundant access events [, ] at run time may also be recognized by a compiler. Static analyses of concurrency properties support this process (see "Static Analysis"). Even if redundant accesses cannot be safely recognized by a compiler, techniques have been developed by Choi et al. [] to recognize and filter redundant access events at run time.
Another run-time technique to reduce the overhead of dynamic data race detection is to include only small fractions of the overall program execution (sampling) in the checking process []. Perhaps surprisingly, the degradation of accuracy due to an incomplete event trace is moderate and acceptable in light of the significant performance gains.

Race detection in distributed systems. Race detection in distributed systems refers to HB-based data race detection in distributed shared memory (DSM) systems. Perkovic and Keleher [], e.g., present an optimized DSM protocol where the tracking of HB-information is piggybacked on the coherence protocol. Another DSM race detection technique is presented by Richards and Larus [].
Lockset-Based Data Race Detection
Lockset-based data race detection is a technique tailored to programs that use critical sections as their primary synchronization model. Instead of validating the absence of data races directly, the idea of lockset-based data race detection is to validate that a program or program execution adheres to a certain programming policy, called a locking discipline [].

Locking discipline. A simple locking policy could, e.g., state that threads that access a common memory location must hold a mutual exclusion lock when performing the access. Compliance with this locking discipline implies that executions are data-race-free. The validation of the locking discipline is done with static or dynamic program analysis, or combinations thereof. The key ideas of lockset-based data race detection are due to Savage et al. [], although the concept of lock covers, which is related to locksets, is discussed earlier by Dinning and Schonberg [].

Algorithmic principles. Checking a locking discipline works as follows: Each thread tracks at run time the set of locks it currently holds. Conceptually, each variable in shared memory has a shadow location that holds a lockset. On the first access to a shared variable, the shadow memory is initialized with the lockset of the current thread. On subsequent accesses, the lockset in shadow memory is updated by intersecting it with the lockset of the accessing thread. If the intersection is
empty and the variable has been accessed by different threads, a potential data race is reported.

Accuracy. Lockset-based detection is sound. In fact, the validation of a locking discipline has the SISE property, i.e., all race conditions that occur in a certain execution are reported. The detection is however incomplete, since accesses that violate the locking discipline may be ordered by other means of synchronization. The principles of lockset-based race detection have been refined with two goals in mind: to increase the accuracy (reduce overreporting) and to improve the performance of dynamic checkers.

Increasing the accuracy. There are several common parallel programming idioms that are data-race-free but violate the simple locking discipline stated before. Examples are initialization without lock protection, read-sharing, and controlled data handoff. To accommodate these patterns, the simple locking policy can be extended as follows: An abstract state machine is associated with each shared memory location. The state machine is designed to identify safe programming idioms in the access stream and to avoid overreporting. For example, read and write accesses of the initializing thread can proceed safely, even without locking. If other threads read the variable and only read accesses follow, a read-sharing pattern is deemed to have occurred and no violation of the locking policy is reported, even if the accessing threads do not hold a common lock at the time of the accesses []. Subsequent work has picked up these ideas and refined the state model to capture more elaborate sharing patterns that are safe [, ]. The state model can introduce unsoundness, namely in cases where the timing of threads in an execution makes an access sequence look like a safe programming idiom when in fact an actual data race occurred. In practice, the benefits of reduced overreporting outweigh the drawback of potential unsoundness.

Hybrid data race detection. Some methods for dynamic data race detection [, , , ] combine the lockset algorithm [] with checks of Lamport's happened-before relation. Such a procedure mitigates the shortcomings of either approach and improves the accuracy of the detection.
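The basic lockset intersection described under "Algorithmic principles" above can be sketched as follows (simplified and with assumed names; production tools add the state machine refinements just discussed to reduce overreporting).

// Simplified lockset check (illustrative only; no state machine for
// initialization or read-sharing patterns).
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class LocksetChecker {
    // Shadow state per shared variable: candidate lockset and accessing threads.
    private static final class Shadow {
        Set<Object> candidateLocks;        // null until the first access
        final Set<Long> accessingThreads = new HashSet<>();
    }

    private final Map<String, Shadow> shadow = new HashMap<>();

    // Called on every instrumented access; 'locksHeld' is the set of locks the
    // accessing thread currently holds.
    synchronized void onAccess(String variable, long threadId, Set<Object> locksHeld) {
        Shadow s = shadow.computeIfAbsent(variable, v -> new Shadow());
        s.accessingThreads.add(threadId);
        if (s.candidateLocks == null) {
            s.candidateLocks = new HashSet<>(locksHeld);   // first access
        } else {
            s.candidateLocks.retainAll(locksHeld);         // intersect
        }
        if (s.candidateLocks.isEmpty() && s.accessingThreads.size() > 1) {
            System.err.println("potential data race on " + variable);
        }
    }
}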
Performance optimizations. Dynamic tools can be supported by a compiler that reduces instrumentation and thus the frequency of run-time checks. For example, accesses that are known at compile time not to participate in data races do not have to be instrumented [, , ]. Another strategy to reduce the run-time overhead is to track the lockset and state information in shadow memory not per fixed-size variable but at a larger granularity, e.g., at the granularity of programming-language objects [] or minipages [].

Application to non-blocking synchronization models. Although most work has studied programs with blocking synchronization (locks), the principles are also applicable to optimistic synchronization [, ].
Dynamic Methods
So far, data race detection has been described as a dynamic program analysis. Dynamic methods are also called trace-based, since the subject of the analysis is an execution trace.

Architecture. There are different styles for organizing the data collection and analysis phases:
● Post-mortem methods record a complete set of relevant run-time events in a persistent trace for offline analysis. This method was common in early race detection systems, e.g., by Allen and Padua []. Due to the possibly large size of the trace, this method is limited to short program runs.
● Online methods, also called on-the-fly detection, record events temporarily and perform the data race analysis entwined with the actual program execution. Much of the recorded information can be discarded as the analysis proceeds. Depending on the algorithm, this may, but does not necessarily, compromise the accuracy of the detection. Most dynamic data race tools choose this architecture, e.g., Mellor-Crummey [] for an HB-based analysis and Savage et al. [] for a method based on locksets. In distributed systems, online methods are also the preferred analysis architecture.

Implementation techniques. Dynamic race detection is an aspect that is crosscutting and orthogonal to the remaining functionality of a software system. Several
techniques have been used to incorporate race detection in a software system:
● Program instrumentation augments memory and synchronization accesses with checking code. Instrumentation can occur at compile time [] or at run time [].
● Race detection as part of a software-based protocol implementation in distributed shared memory systems [, , ].
● Hardware extensions for HB-based [, , , ] or lockset-based [] race detection.

Limitations. An inherent limitation of dynamic race detection is that it is based on information from a single program execution. This limitation is twofold. First, not all possible control paths may have been exercised on the given input. Second, possible race conditions may have remained hidden in the specific thread schedule of the execution trace, as, e.g., in the example in Fig. . This second limitation, called scheduling dependence, can be mitigated or entirely avoided by so-called predictive analyses [, , , ]. The key idea of predictive (dynamic) analyses is to consider not only the order of events recorded in a specific execution trace but also permutations thereof. This technique increases the coverage of the dynamic analysis of concurrent program executions, since it is capable of exposing thread interleavings (event orders) that have not been exercised.
Static Methods
Static program analysis does not require that the program is executed.

Pragmatic methods. A pragmatic approach to finding programming errors is to identify situations in the source code where the program deviates from common programming practice. Naturally, this approach lends itself to finding deviations from common programming idioms in concurrent programs that may lead to unintended race conditions. This approach is called pragmatic, since the analysis is neither sound nor complete with respect to identifying data races. In practice, pragmatic analysis methods have turned out to be very effective, with the additional benefit that in most cases the analyses do not require sophisticated data-flow or concurrency analysis and hence can be very efficient.
FindBugs by Ayewah et al. [] is a tool for pragmatic analyses of Java programs. Hovemeyer and Pugh [] describe numerous analyses for concurrency-related errors. One programming idiom for concurrent programs in Java is, e.g., that accesses to a shared mutable variable are consistently protected by synchronized blocks. Violations of this programming practice are described as a bug pattern called "inconsistent synchronization." Inconsistent synchronization occurs if a class contains mixed (synchronized and unsynchronized) accesses to a field variable and no more than one third of the accesses (writes weighed higher than reads) occur outside a synchronized block.

RacerX [] is a static data race and deadlock detection system targeted at checking large operating system codes. The tool builds a call graph and verifies a locking discipline along a calling-context-sensitive traversal of the code. The system does however not have a pointer analysis and approximates aliasing through variable types. Moreover, the system uses a heuristic to classify sections of code as sequential or concurrent. These approximations facilitate the analysis of very large codes (> K lines of code). As the analysis issues a significant number of spurious reports, a clever ranking scheme is used to prioritize the large number of reports according to their likelihood of being an actual bug.

Methods based on data-flow analysis. May-happen-in-parallel (MHP) analysis is the foundation of many compile-time analyses for concurrent programs. MHP analysis approximates the order of statements executed by different threads and computes the may-happen-in-parallel relation among statements. MHP analysis in combination with a compile-time model of program data can serve as the foundation of compile-time data race detection. Bristow [] used an inter-process precedence graph for determining anomalies in programs with post-wait synchronization. Taylor [] and Duesterwald and Soffa [] extend this work and define a model for parallel tasks in Ada programs with rendez-vous synchronization. The program representation in [] is modular and enables an efficient analysis of programs with procedures and recursion based on a data-flow framework. Masticola and Ryder [] generalize and improve the approach of []. Naumovich et al. [] compute the potential concurrency in Java programs at the
level of statements. The authors have shown that the precision of their data-flow algorithm is optimal for most of the small applications that have been evaluated. The approach requires that the number of real threads in the system is specified as input to the analysis. The combination of MHP information with a model of program data (heap shape and reference information) could be used to determine conflicting data accesses. This approach is discussed by Midkiff, Lee, Padua, and Sura [, ].

Static race detection for Java programs has been developed, e.g., by Choi et al. [] and Naik et al. []. Both systems are based on a whole-program analysis to determine an initial set of potentially conflicting object accesses. This set is pruned (refined) by several successive analysis steps, namely alias analysis, thread escape analysis, and lockset analysis. The Chord checker by Naik et al. [] is object context-sensitive [, ], and this feature is found to be essential, in combination with a precise, inclusion-based alias analysis, to achieve a high detection accuracy; object context sensitivity, however, incurs a significant scalability cost. Chord sacrifices soundness since it approximates lock identity through may-alias information when determining common lock protection.

Type-based methods. Type systems can model and express data protection and locking policies in data and method declarations. Compliance of data accesses and method invocations with the declared properties can be checked mostly statically. The main advantage of the type-based approach is its modularity, which, in contrast to a whole-program analysis, makes it well suited to treating incomplete and large programs. Type systems that can prove data-race-freedom have been proposed as extensions to existing programming languages by Bacon et al. [] and by Boyapati and Rinard []. Flanagan and Freund [, ] present a type system that is able to specify and check lock protection of individual variables. In combination with an annotation generator [], they applied the type checker to Java programs of up to KLOC. The annotation generator is able to recognize common locking patterns and further uses heuristics to classify certain accesses without lock protection as benign. The heuristics are effective in reducing the number of spurious warnings; some are however unsound, which has not been a problem for the benchmarks investigated in [].
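The "inconsistent synchronization" bug pattern mentioned above, which is also the kind of locking-discipline violation that a lock-protection type system would reject, can be sketched with a hypothetical class:

// Hypothetical class: most accesses to 'count' are synchronized, but one racy
// access slips through -- the mixed-access pattern that static checkers flag.
public class Counter {
    private int count = 0;            // intended to be guarded by 'this'

    public synchronized void increment() { count++; }      // guarded access
    public synchronized int getCount()   { return count; } // guarded access

    // Unsynchronized access: violates the locking discipline for 'count'.
    public boolean isPositiveUnsafe() { return count > 0; }
}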
Model checking. The principle of model checking is to explore every possible control-flow path and variable value assignment for undesired program behavior. Since this procedure is obviously intractable, models of the data and the program are explored instead. An additional source of complexity in parallel versus sequential programs is the timing of threads, resulting in a myriad of possible interleavings of actions from different threads. The main challenge of model checking for concurrency errors, such as data races, is hence to reduce the state space to be explored. One idea is to consider only those states and transitions of individual threads that operate on shared data and are thus visible to other threads. Another idea is to aggregate the possible interleavings, e.g., by modeling multiple threads and their transitions in one common state transition space. Model checking has been applied to the detection of access anomalies in concurrent programs, e.g., in [, , , , , ]. Stoller's model checker [] verifies adherence to a locking policy. Henzinger et al. [] describe a model checker that identifies conflicting accesses that are not ordered according to the HB-relation. Both works assume an underlying sequentially consistent execution platform. Model checking for executions on weakly ordered memory systems has been conceived as well [], though not particularly for the purpose of data race detection.
Detection of Determinacy and Atomicity Violations
The emphasis of this entry is on the detection of race conditions that are data races. Related to race conditions are also violations of determinacy and violations of atomicity. Techniques for their detection are discussed in the corresponding entries on determinacy and atomicity.
Complexity
Netzer and Miller [] characterize race conditions in terms of feasible program executions. A program execution is feasible if the control flow and shared data dependencies in the execution could actually occur at run time in accordance with the semantics of the program. The problem of detecting a race condition is at least as hard as determining if a feasible program execution exists where a pair of suspect statements occurs
concurrently. Netzer and Miller [] found that problem to be intractable even if shared data dependencies are ignored. Helmbold and McDowell [] confirm and refine this result by restricting programs to certain control-flow and synchronization models. For unrestricted programs, if data dependencies are not ignored, the problem of deciding whether two conflicting statements participate in a race is as hard as the halting problem []. In practice, most dynamic and also static data-flow-based race detection analyses have a complexity that is quasi-linear in the number of synchronization and shared resource accesses.
Related Entries
Determinacy
Formal Methods–Based Tools for Race, Deadlock, and Other Errors
Race Conditions
Bibliography . Abadi M, Flanagan CE, Freund SN () Types for safe locking: static race detection for Java. Trans Program Lang Syst (TOPLAS) ():– . Adve S, Hill M (June ) Weak ordering — A new definition. In: Proceedings of the annual international symposium on computer architecture (ISCA’), pp – . Adve S, Hill M, Miller B, Netzer R (May ) Detecting data races on weak memory systems. In: Proceedings of the annual international symposium on computer architecture (ISCA’), pp – . Allen TR, Padua DA (August ) Debugging fortran on a shared memory machine. In: Proceedings of the international conference on parallel processing, pp – . Ayewah N, Hovemeyer D, Morgenthaler JD, Penix J, Pugh W () Using static analysis to find bugs. IEEE Softw ():– . Bacon D, Strom R, Tarafdar A (October ) Guava: a dialect of Java without data races. In: Proceedings of the conference on object-oriented programming, systems, languages, and applications (OOPSLA’), pp – . Balasundaram V, Kennedy K () Compile-time detection of race conditions in a parallel program. In: Proceedings of the international conference on supercomputing (ISC’), pp – . Boyapati C, Lee R, Rinard M (November ) Ownership types for safe programming: preventing data races and deadlocks. In: Proceedings of the conference on object-oriented programming, systems, languages, and applications (OOPSLA’), pp – . Bristow G, Dreay C, Edwards B, Riddle W () Anomaly detection in concurrent programs. In: Proceedings of the international conference on software engineering (ICSE’), pp –
. Burckhardt S, Alur R, Martin MMK () CheckFence: checking consistency of concurrent data types on relaxed memory models. In: PLDI’: Proceedings of the ACM SIGPLAN conference on programming language design and implementation. ACM, New York, pp – . Charron-Bost B () Concerning the size of logical clocks in distributed systems. Inf Process Lett ():– . Chen F, Serbanuta TF, Rosu G () jpredictor: a predictive runtime analysis tool for java. In: ICSE’: Proceedings of the th international conference on software engineering. ACM, New York, pp – . Choi J-D, Min SL () Race frontier: reproducing data races in parallel program debugging. In: PPOPP’: Proceedings of the third ACM SIGPLAN symposium on principles and practice of parallel programming. ACM, New York, pp – . Choi J-D, Lee K, Loginov A, O’Callahan R, Sarkar V, Sridharan M (June ) Efficient and precise datarace detection for multithreaded object-oriented programs. In: Conference on programming language design and implementation (PLDI’), pp – . Dinning A, Schonberg E () An empirical comparison of monitoring algorithms for access anomaly detection. In: PPOPP’: Proceedings of the second ACM SIGPLAN symposium on principles & practice of parallel programming. ACM, New York, pp – . Dinning A, Schonberg E (December ) Detecting access anomalies in programs with critical sections. In: Proceedings of the ACM/ONR workshop on parallel and distributed debugging, pp – . Duesterwald E, Soffa M () Concurrency analysis in the presence of procedures using a data-flow framework. In: Proceedings of the symposium on testing, analysis, and verification (TAV), pp – . Engler D, Ashcraft K (October ) RacerX: Effective, static detection of race conditions and deadlocks. In: Proceedings of the symposium on operating systems principles (SOSP’), pp – . Fidge CJ () Timestamp in message passing systems that preserves partial ordering. In: Proceedings of the th Australian computing conference, pp – . Flanagan C, Freund SN (June ) Type-based race detection for Java. In: Proceedings of the conference on programming language design and implementation (PLDI’), pp – . Flanagan C, Freund SN (June ) Detecting race conditions in large programs. In: Proceedings of the workshop on program analysis for software tools and engineering (PASTE’), pp – . Flanagan C, Freund SN () FastTrack: Efficient and precise dynamic race detection. In: PLDI’: Proceedings of the ACM SIGPLAN conference on programming language design and implementation. ACM, New York, pp – . Flanagan C, Qadeer S (June ) A type and effect system for atomicity. In: Proceedings of the conference on programming language design and implementation (PLDI’), pp – . Flanagan C, Leino R, Lillibridge M, Nelson G, Saxe J, Stata R (June ) Extended static checking for Java. In: Proceedings of the
conference on programming language design and implementation (PLDI’), pp – Flanagan C, Freund SN, Yi J () Velodrome: a sound and complete dynamic atomicity checker for multithreaded programs. In: PLDI’: Proceedings of the ACM SIGPLAN conference on programming language design and implementation. ACM, New York, pp – Helmbold DP, McDowell CE (September ) A taxonomy of race detection algorithms. Technical Report UCSC-CRL-, University of California, Santa Cruz, Computer Research Laboratory Henzinger TA, Jhala R, Majumdar R () Race checking by context inference. In: PLDI’: Proceedings of the ACM SIGPLAN conference on programming language design and implementation. ACM, New York, pp – Hovemeyer D, Pugh W (July ) Finding concurrency bugs in java. In: Proceedings of the PODC workshop on concurrency and synchronization in Java programs Jannesari A, Bao K, Pankratius V, Tichy WF () Helgrind+: An efficient dynamic race detector. In: Proceedings of the rd international parallel & distributed processing symposium (IPDPS’). IEEE, Rome Kidd N, Reps T, Dolby J, Vaziri M () Finding concurrencyrelated bugs using random isolation. In: VMCAI’: Proceedings of the th international conference on verification, model checking, and abstract interpretation. Springer-Verlag, Heidelberg, pp – Lamport L (July ) Time, clock and the ordering of events in a distributed system. Commun ACM ():– Lhoták O, Hendren L (March ) Context-sensitive points-to analysis: is it worth it? In: Mycroft A, Zeller A (eds) International conference of compiler construction (CC’), vol of LNCS. Springer, Vienna, pp – Mattern F () Virtual time and global states of distributed systems. In: Proceedings of the Parallel and distributed algorithms conference. Elsevier Science, Amsterdam, pp – Marino D, Musuvathi M, Narayanasamy S () Literace: effective sampling for lightweight data-race detection. In: PLDI’: Proceedings of the ACM SIGPLAN conference on programming language design and implementation. ACM, New York, pp – Masticola S, Ryder B () Non-concurrency analysis. In: Proceedings of the symposium on principles and practice of parallel programming (PPoPP’), pp – Mellor-Crummey J (November ) On-the-y detection of data races for programs with nested fork-join parallelism. In: Proceedings of the supercomputer debugging workshop, pp – Mellor-Crummey J (May ) Compile-time support for efficient data race detection in shared-memory parallel programs. In: Proceedings of the workshop on parallel and distributed debugging, pp – Midkiff S, Lee J, Padua D (June ) A compiler for multiple memory models. In: Rec. Workshop compilers for parallel computers (CPC’)
. Milanova A, Rountev A, Ryder BG () Parameterized object sensitivity for points-to analysis for Java. ACM Trans Softw Eng Methodol ():– . Min SL, Choi J-D () An efficient cache-based access anomaly detection scheme. In: ASPLOS-IV: Proceedings of the th international conference on architectural support for programming languages and operating systems. ACM, New York, pp – . Musuvathi M, Qadeer S, Ball T, Basler G, Nainar PA, Neamtiu I () Finding and reproducing heisenbugs in concurrent programs. In: OSDI’: Proceedings of the th USENIX conference on operating systems design and implementation. USENIX Association, Berkeley, pp – . Muzahid A, Suárez D, Qi S, Torrellas J () Sigrace: signaturebased data race detection. In: ISCA’: Proceedings of the th annual international symposium on computer architecture. ACM, New York, pp – . Naik M, Aiken A, Whaley J (June ) Effective static race detection for Java. In: Proceedings of the conference on programming language design and implementation (PLDI’), pp – . Naumovich G, Avrunin G, Clarke L (September ) An efficient algorithm for computing MHP information for concurrent Java programs. In: Proceedings of the European software engineering conference and symposium on the foundations of software engineering, pp – . Netzer R, Miller B (August a) Detecting data races in parallel program executions. Technical report TR-, Department of Computer Science, University of Wisconsin, Madison . Netzer R, Miller B (January b) On the complexity of event ordering for shared-memory parallel program executions. Technical report TR , Computer Sciences Department, University of Wisconsin, Madison . Netzer R, Miller B (March ) What are race conditions? Some issues and formalizations. ACM Lett Program Lang Syst (): – . Netzer R, Brennan T, Damodaran-Kamal S () Debugging race conditions in message-passing programs. In: SPDT’: Proceedings of the SIGMETRICS symposium on parallel and distributed tools. ACM, New York, pp – . Nudler I, Rudolph L () Tools for the efficient development of efficient parallel programs. In: Proceedings of the st Israeli conference on computer system engineering . O’Callahan R, Choi J-D (June ) Hybrid dynamic data race detection. In: Symposium on principles and practice of parallel programming (PPoPP’), pp – . Perkovic D, Keleher PJ (October ) Online data-race detection via coherency guarantees. In: Proceedings of the nd symposium on operating systems design and implementation (OSDI’), pp – . Pozniansky E, Schuster A (June ) Efficient on-the-y data race detection in multi-threaded c ++ programs. In: Proceedings of the symposium on principles and practice of parallel programming (PPoPP’), pp – . Pozniansky E, Schuster A () Multirace: efficient on-the-y data race detection in multithreaded c + + programs: research articles. Concurrency Comput: Pract Exper ():–
R
R
Race Detectors for Cilk and Cilk++ Programs
. Prvulovic M, Torrellas J () Reenact: using thread-level speculation mechanisms to debug data races in multithreaded codes. SIGARCH Comput Archit News ():– . Rajwar R, Goodman JR () Speculative lock elision: enabling highly concurrent multithreaded execution. In MICRO : Proceedings of the th annual ACM/IEEE international symposium on microarchitecture. IEEE Computer Society, Washington, DC, pp – . Ramalingam G () Context-sensitive synchronizationsensitive analysis is undecidable. ACM Trans Program Lang Syst (TOPLAS) :– . Richards B, Larus JR () Protocol-based data-race detection. In SPDT’: Proceedings of the SIGMETRICS symposium on parallel and distributed tools. ACM, New York, pp – . Savage S, Burrows M, Nelson G, Sobalvarro P, Anderson T (October a) Eraser: a dynamic data race detector for multi-threaded programs. In: Proceedings of the symposium on operating systems principles (SOSP’), pp – . Savage S, Burrows M, Nelson G, Sobalvarro P, Anderson T (b) Eraser: a dynamic data race detector for multithreaded programs. ACM Transactions on computer systems (): – . Scheurich C, Dubois M (June ) Correct memory operation of cache-based multiprocessors. In: Proceedings of th annual symposium on computer architecture, Computer Architecture News, pp – . Sen K, Rosu G, Agha G () Runtime safety analysis of multithreaded programs. In: ESEC/FSE-: Proceedings of the th European software engineering conference held jointly with th ACM SIGSOFT international symposium on foundations of software engineering. ACM, New York, pp – . Shacham O, Sagiv M, Schuster A () Scaling model checking of dataraces using dynamic information. In: PPoPP’: Proceedings of the tenth ACM SIGPLAN symposium on principles and practice of parallel programming. ACM, New York, pp – . Stoller SD (October ) Model-checking multi-threaded distributed Java programs. Int J Softw Tools Technol Transfer (): – . Sura Z, Fang X, Wong C-L, Midkiff SP, Lee J, Padua DA (June ) Compiler techniques for high performance sequentially consistent Java programs. In: Proceedings of the symposium principles and practice of parallel programming (PPoPP’), pp – . Taylor RN (May ) A general purpose algorithm for analyzing concurrent programs. Commun ACM ():– . Visser W, Havelund K, Brat G, Park S () Model checking programs. In: ASE’: Proceedings of the th IEEE international conference on automated software engineering. IEEE Computer Society, Washington, p . von Praun C, Gross T (October ) Object race detection. In: Conference on object-oriented programming, systems, languages, and applications (OOPSLA’), pp –
. Wang L, Stoller SD () Accurate and efficient runtime detection of atomicity errors in concurrent programs. In: PPoPP’: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming. ACM, New York, pp – . Welc A, Jagannathan S, Hosking AL (June ) Transactional monitors for concurrent objects. In: Proceedings of the European conference on object-oriented programming (ECOOP’), pp – . Yu Y, Rodeheffer T, Chen W (October ) RaceTrack: Efficient detection of data race conditions via adaptive tracking. In: Proceedings of the symposium on operating systems principles (SOSP’), pp – . Yu Y, Rodeheffer T, Chen W () Racetrack: efficient detection of data race conditions via adaptive tracking. In SOSP’: Proceedings of the th ACM symposium on operating systems principles. ACM, New York, pp – . Zhou P, Teodorescu R, Zhou Y () Hard: hardware-assisted lockset-based race detection. In: HPCA’: Proceedings of the IEEE th international symposium on high performance computer architecture. IEEE Computer Society, Washington, DC, pp –
Race Detectors for Cilk and Cilk++ Programs
Jeremy T. Fineman, Carnegie Mellon University, Pittsburgh, PA, USA
Charles E. Leiserson, Massachusetts Institute of Technology, Cambridge, MA, USA
Synonyms Cilkscreen; Nondeterminator
Definition The Nondeterminator race detector takes as input an ostensibly deterministic Cilk program and an input data set and makes the following guarantee: it will either determine at least one location in the program that is subject to a determinacy race when the program is run on the data set, or else it will certify that the program always behaves the same on the data set, no matter how it is scheduled. The Cilkscreen race detector does much the same thing for Cilk++ programs. Both can also detect data races, but the guarantee is somewhat weaker.
Discussion
Introduction
Many Cilk programs are intended to be deterministic, in that a given program produces the same behavior no matter how it is scheduled. The program may behave nondeterministically, however, if a determinacy race occurs: two logically parallel instructions update the same location, where at least one of the two instructions writes the location. In this case, different runs of the program on the same input may produce different behaviors. Race bugs are notoriously hard to detect by normal debugging techniques, such as breakpointing, because they are not easily repeatable. This article describes the Nondeterminator and Cilkscreen race detectors, which are systems for detecting races in Cilk and Cilk++ programs, respectively. Determinacy races have been given many different names in the literature. For example, they are sometimes called access anomalies [], data races [], race conditions [], harmful shared-memory accesses [], or general races []. Emrath and Padua [] call a deterministic program internally deterministic if the program execution on the given input exhibits no determinacy race and externally deterministic if the program has determinacy races but its output is deterministic because of the commutative and associative operations performed on the shared locations. The Nondeterminator program checks whether a Cilk program is internally deterministic. Cilkscreen allows for some internal nondeterminism in a Cilk++ program, specifically, nondeterminism encapsulated by “reducer hyperobjects” []. To illustrate how a determinacy race can occur, consider the simple Cilk program shown in Fig. . The parallel control flow of this program can be viewed as the directed acyclic graph, or dag, illustrated in Fig. . The vertices of the dag represent parallel control constructs, and the edges represent strands: serial sequences of instructions with no intervening parallel control constructs. In Fig. , the strands of the program are labeled to correspond to code fragments from Fig. , and the subdags representing the two instances of foo() are shaded. In this program, both of the parallel instantiations of the procedure foo() update the shared variable x in the x = x + 1 statement. This statement actually causes the processor executing the strand to
int x;

cilk void foo() {
    x = x + 1;
    return;
}

cilk int main() {              /* F  */
    x = 0;                     /* e0 */
    spawn foo();               /* F1 */
                               /* e1 */
    spawn foo();               /* F2 */
                               /* e2 */
    sync;
    printf("x is %d\n", x);    /* e3 */
    return 0;
}
Race Detectors for Cilk and Cilk++ Programs. Fig. A simple Cilk program that contains a determinacy race. In the comments at the right, the Cilk strands that make up the procedure main() are labeled
[Figure: the parallel control-flow dag of the program, with the strands e0, e1, e2, e3 of main() (labeled F), spawn and sync nodes, and the two shaded subdags F1 and F2 for the instances of foo()]
Race Detectors for Cilk and Cilk++ Programs. Fig. The parallel control-flow dag of the program in Fig. . A spawn node of the dag represents a spawn construct, and a sync node represents a sync construct. The edges of the dag are labeled to correspond with code fragments from Fig.
perform a read from x, increment the value, and then write the value back into x. Since these operations are not atomic, both might update x at the same time. Figure shows how this determinacy race can cause x to take on different values if the strands comprising the two instantiations of foo() are scheduled simultaneously.
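The same lost-update interleaving can be reproduced outside of Cilk. Below is a minimal plain-C analog using POSIX threads (an illustrative assumption; the article's example is a Cilk program): the two threads play the roles of F1 and F2, and the final print plays the role of e3.

```c
#include <pthread.h>
#include <stdio.h>

int x;                      /* shared, unprotected */

void *foo(void *arg) {      /* analogous to the Cilk procedure foo() */
    x = x + 1;              /* read, increment, write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    x = 0;
    pthread_create(&t1, NULL, foo, NULL);   /* plays the role of F1 */
    pthread_create(&t2, NULL, foo, NULL);   /* plays the role of F2 */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x is %d\n", x); /* may print 2 or, under the race, 1 */
    return 0;
}
```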
The Nondeterminator The Nondeterminator [] determinacy-race detector takes as input a Cilk program and an input data set and either determines at least one location in the program that is subject to a determinacy race when the program is run on the data set, or else it certifies that the program always behaves the same when run on the data set. If a determinacy race exists, the Nondeterminator localizes the bug, providing variable name, file name, line number, and dynamic context (state of runtime stack, heap, etc.). The Nondeterminator is not a program verifier, because the Nondeterminator cannot certify that the program is race-free for all input data sets. Rather, it is a debugging tool. The Nondeterminator only checks a program on a particular input data set. What it verifies is that every possible scheduling of the program execution produces the same behavior. If the program relies on any random choices or runtime calls to hardware counters, etc., these values should also be viewed as part of the input data set. The Nondeterminator is a serial program that operates on-the-fly, meaning that it detects races as it simulates the execution of the program, rather than by logging and subsequent analysis. As it executes, it maintains various data structures for determining the existence of determinacy races. An “access history” maintains a subset of strands that access each particular memory location. An “SP-maintenance” data structure maintains the series-parallel (SP) relationships among strands. Specifically, the race detector must determine whether two strands (that access the same memory location) operate logically in parallel (i.e., whether it is possible to schedule both strands at the same time), or whether there is some serial relationship between the strands.
The Nondeterminator was implemented by modifying the ordinary Cilk compiler and runtime system. Each read and write in the user’s program is instrumented by the Nondeterminator’s compiler to perform determinacy-race checking at runtime. The Nondeterminator then takes advantage of the fact that any Cilk program can be executed as a C program. Specifically, if the Cilk keywords are deleted from a Cilk program, a C program results, called the Cilk program’s serialization (also called serial elision in the literature), whose semantics are a legal implementation of the semantics of the Cilk program []. The Nondeterminator executes the user’s program as the serialization would execute, but it performs race-checking actions when reads, writes, and parallel control statements occur. Since the Nondeterminator was implemented by modifying the Cilk compiler, it is unable to detect races that occur in precompiled library code. In contrast, the Cilk++ race detector, called Cilkscreen, uses binary instrumentation [, , ]. Cilkscreen dynamically replaces memory accesses in a Cilk++ program binary with instrumented memory accesses, and thus it can operate on executable binaries for which no source code is available. Cilkscreen employs much the same algorithmic technology as the Nondeterminator, however.
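As a concrete illustration of the serialization (serial elision) mentioned above, eliding the Cilk keywords turns the program into ordinary C. The macro trick below is only a sketch of that idea, not the Nondeterminator's actual mechanism:

```c
/* Hypothetical sketch: with these definitions the Cilk example compiles as
 * plain C and runs serially -- each spawn becomes an ordinary call and sync
 * becomes a no-op.  The result is the program's serialization. */
#define cilk             /* elided */
#define spawn            /* elided */
#define sync   ((void)0) /* elided */
```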
Detecting Races in Programs that Use Locks The original Nondeterminator [] is designed to detect determinacy races in a Cilk program. Typically a programmer may add locks to the program to protect against the externally nondeterministic interleaving given in Fig. , thus intending and sanctioning nondeterministic executions. Thus, it may not make sense to test the program for determinacy races, but some weaker form of race detection may be desired.
[Figure: two possible interleavings of F1 and F2. In one, F1 reads x = 0 and writes x = 1, then F2 reads x = 1 and writes x = 2, and e3 prints "x is 2". In the other, F1 and F2 both read x = 0 and both write x = 1, and e3 prints "x is 1".]
Race Detectors for Cilk and Cilk++ Programs. Fig. An illustration of a determinacy race in the code from Fig. . The value of the shared variable x read and printed by strand e3 can differ depending on how the instructions in the two instances F1 and F2 of the foo() procedure are scheduled
The Nondeterminator- [] detects data races, which occur when two logically parallel strands holding no locks in common update the same location, where at least one of the two instructions modifies the location. Although the Nondeterminator and Nondeterminator- test for different types of races, a significant chunk of the implementations and algorithms used (notably, the SP-maintenance algorithm) remains the same in both debugging tools. This article focuses on determinacy-race detection.
[Figure: the canonical series-parallel parse tree for a generic Cilk procedure, consisting of a spine of S-nodes with an alternating sequence of S-nodes and P-nodes in each sync block; see the caption below]
Series-Parallel Parse Trees The structure of a Cilk program can be interpreted as a series-parallel (SP) parse tree rather than a series-parallel dag. Figure shows a parse tree for the dag in Fig. . In the SP parse tree, each internal node is either an Snode, denoted by S, or a P-node, denoted by P, and each leaf is a strand of the dag. If two subtrees are children of the same S-node, then the strands in the left subtree logically precede the strands in the right subtree, and (the subcomputation represented by) the left subtree must execute before (that of) the right subtree. If two subtrees are children of the same P-node, then the strands in the left subtree operate logically in parallel with those in the right subtree, and no ordering holds between (the subcomputations represented by) the two subtrees. For two strands e and e′ , the notation e ∥ e′ means that e and e′ operate logically in parallel, and the notation e ≺ e′ means that e logically precedes e′ . A canonical SP parse tree for a Cilk dag, shown in Fig. , can be constructed as follows: First, build a parse tree recursively for each child of the root procedure. Each sync block of the root procedure contains a series of spawns with strands of intervening C code (some of which may be empty). Then, create a parse tree for the sync block alternately applying series and parallel composition to the child parse trees and the root strands. Finally, string the parse trees for the sync blocks together into a spine for the procedure by applying a sequence of series compositions to the sync blocks. Sync blocks are composed serially, because a sync statement is never passed until all previously spawned subprocedures have completed. The only ambiguities that might arise in the parse tree occur because of the associativity of series composition and the commutativity of parallel composition. If, as shown in Fig. , the alternating S-nodes and P-nodes in a sync block always place
Race Detectors for Cilk and Cilk++ Programs. Fig. The canonical series-parallel parse tree for a generic Cilk procedure. The notation F represents the SP parse tree of any subprocedure spawned by this procedure, and e represents any strand of the procedure. All nodes in the shaded areas belong to the procedure, and the nodes in each oval belong to the same sync block. A sequence of S-nodes forms the spine of the SP parse tree, composing all sync blocks in series. Each sync block contains an alternating sequence of S-nodes and P-nodes. Observe that the left child of an S-node in a sync block is always a strand, and that the left child of a P-node is always a subprocedure
strands and subprocedures on the left, and the series compositions of the sync blocks are applied in order from last to first, then the parse tree is unique. Figure shows such a canonical parse tree for the Cilk dag in Fig. . One convenient feature of the SP parse tree is that the logical relationship between strands can be determined by looking at the least common ancestor (lca) of the strands in the parse tree. For example, consider the parse tree in Fig. . The least common ancestor of strands e and e , denoted by lca(e , e ), is the root S-node, and hence e ≺ e . In contrast, lca(F1, F2) is a P-node, and hence F1 ∥ F2. To see that the dag and the parse tree represent the same control structure requires only observing that lca(ei , ej ) is an S-node in the parse tree if and only if there is a directed path from ei to ej in the corresponding dag. A proof of this fact can be found in [].
[Figure: the canonical SP parse tree for the dag of Fig.: the root S-node composes the sync block with e3; within the sync block, S- and P-nodes alternate so that e0, F1, e1, F2, and e2 are arranged as S(e0, P(F1, S(e1, P(F2, e2))))]
Race Detectors for Cilk and Cilk++ Programs. Fig. The canonical series-parallel parse tree for the Cilk dag in Fig.
An execution of a Cilk program can be interpreted as a walk or traversal of the corresponding SP parse tree. The order in which nodes are traversed depends on the scheduler. A partial execution must obey series-parallel relationships, namely, that the tree walk cannot enter the right subtree of an S-node until the left subtree has been fully executed. Both subtrees of a P-node, however, can be traversed in arbitrary order or in parallel. The canonical parse tree is such that an ordinary, left-to-right, depth-first tree walk visits strands in the same order as the program's serialization visits them.
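For illustration, here is a minimal C sketch of an explicit SP parse tree and the naive lca-based series/parallel query described above; the struct layout and names are assumptions, not the representation used by the Nondeterminator:

```c
#include <stddef.h>

enum sp_kind { S_NODE, P_NODE, STRAND };

struct sp_node {
    enum sp_kind kind;
    struct sp_node *parent;   /* NULL at the root of the parse tree */
    int depth;                /* distance from the root */
};

/* Climb from both nodes to their least common ancestor.  The two strands
 * are logically parallel iff the lca is a P-node; otherwise the earlier
 * one precedes the later one.  Cost is proportional to the tree height. */
int logically_parallel(const struct sp_node *x, const struct sp_node *y)
{
    while (x->depth > y->depth) x = x->parent;
    while (y->depth > x->depth) y = y->parent;
    while (x != y) { x = x->parent; y = y->parent; }
    return x->kind == P_NODE;
}
```

Because the height of the parse tree is not bounded, this direct query is expensive; the SP-bags and SP-order algorithms described later avoid the climb.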
Access History The Nondeterminator and Cilkscreen race detectors execute a Cilk program in serial, depth-first order while maintaining two data structures. The SP-maintenance data structure, described later in this article, is used to query the logical relationships among strands. The access history maintains information as to which strands have accessed which memory locations. At a high level, an access history maintains for each shared-memory location two sets of strands that have read from and written to the location. As the Nondeterminator executes, strands are added to and removed from the access history. Whenever an access of memory location v occurs, each of the strands in v’s access history are compared against the currently executing strand. If any of these strands operates logically in parallel with the currently executing strand, and one of the accesses is a write, then a race is reported. This query of logical relationships is determined by the SP-maintenance data
write a shared memory location ℓ by strand e:
    if reader[ℓ] ∥ e or writer[ℓ] ∥ e
        then a determinacy race exists
    writer[ℓ] ← e

read a shared memory location ℓ by strand e:
    if writer[ℓ] ∥ e
        then a determinacy race exists
    if reader[ℓ] ≺ e
        then reader[ℓ] ← e

Race Detectors for Cilk and Cilk++ Programs. Fig. Pseudocode describing the implementation of read and write operations for a determinacy-race detector. The access history is updated on these operations
structure described in section “Series-Parallel Maintenance.” The main goal when designing an access history is to reduce the number of strands stored; the larger the access history the more series-parallel queries need to be performed. For a serial determinacy race detector, an access history of size O() per memory location suffices. In particular, the access history associates with each memory location ℓ two values reader[ℓ] and writer[ℓ], each storing a single strand that has previously read from or written to ℓ, respectively. Specifically, writer[ℓ] stores a unique runtime ID of the strand that has most recently written to ℓ. Similarly, reader[ℓ] stores the ID of some previous reader of ℓ, although it need not be the most recent reader. The Nondeterminator updates the access history as new memory accesses occur. Each read and write operation of the original program is instrumented to update the access history and discover determinacy races. Figure gives pseudocode describing the instrumentation of read and write operations. A race occurs if a strand e writes a location ℓ and discovers that either the previous reader or the previous writer of ℓ operates logically in parallel with e. Similarly, a race occurs whenever e reads a location ℓ and discovers that the previous writer operates logically in parallel with e. Whenever a location ℓ is written, the access history writer[ℓ] is updated to be the current strand e. The read history reader[ℓ] is updated to e when a read occurs, but only if the previous reader operates logically in series with e.
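A minimal sketch of this access history in C. The shadow-table indexing, the sp_parallel() series-parallel query, and the strand-ID bookkeeping are all assumptions standing in for machinery the real tools provide:

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_LOCS  (1u << 20)   /* illustrative shadow-table size */
#define NO_STRAND 0            /* sentinel: no access recorded yet */

typedef uint64_t strand_id;

extern strand_id current_strand;                    /* assumed runtime state */
extern int  sp_parallel(strand_id a, strand_id b);  /* assumed SP query: do a and b run logically in parallel? */
extern void report_race(size_t loc);

static strand_id reader[NUM_LOCS];   /* one previous reader per location */
static strand_id writer[NUM_LOCS];   /* the most recent writer per location */

void check_write(size_t loc)         /* loc is a word-granularity shadow index */
{
    if ((reader[loc] != NO_STRAND && sp_parallel(reader[loc], current_strand)) ||
        (writer[loc] != NO_STRAND && sp_parallel(writer[loc], current_strand)))
        report_race(loc);
    writer[loc] = current_strand;    /* always record the most recent writer */
}

void check_read(size_t loc)
{
    if (writer[loc] != NO_STRAND && sp_parallel(writer[loc], current_strand))
        report_race(loc);
    /* replace the recorded reader only if it serially precedes this strand */
    if (reader[loc] == NO_STRAND || !sp_parallel(reader[loc], current_strand))
        reader[loc] = current_strand;
}
```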
The cost of maintaining the access history is O() plus the cost of O() series-parallel queries, for each memory access. A correctness proof of the access history can be found in []. The proof involves showing that an update (or lack of an update) to reader[ℓ] does not discard any important information. Specifically, consider three strands e1, e2, and e3 that occur in that order in the serial execution, where e1 is the current value of reader[ℓ], e2 is the currently executing strand reading ℓ, and e3 is some future strand. If e1 ≺ e2 and e1 ∥ e3, then e2 ∥ e3. Thus, a test for a conflict between e3 and e2 produces the same result as a test for a conflict between e3 and e1. On the other hand, if e1 ∥ e2 and e2 ∥ e3, then e1 ∥ e3, and so keeping e1 gives at least as much information as e2. Since updates to reader[ℓ] do not lose information, the access history properly detects a race whenever the current writer operates logically in parallel with any previous reader. The full proof considers the other two race cases separately (the current writer operates logically in parallel with a previous writer, or the current reader operates logically in parallel with a previous writer), showing that either a race is reported on the access or an earlier writer-writer race was reported.
Data-Race Detection In contrast to the Nondeterminator, which detects determinacy races, the Nondeterminator- [] and Cilkscreen can also detect data races. The difference between the Nondeterminator and these other two debugging programs lies in how they deal with the access history. The Nondeterminator- and Cilkscreen tools report data races for locking protocols in which every lock is acquired and released within a single strand and cannot be held across parallel control constructs. Strands may hold multiple locks at one time, however. These tools do not detect deadlocks, but only whether during a serial left-to-right depth-first execution, two logically parallel strands holding no locks in common access a shared variable, where at least one of the strands modifies the location. The lock set of an access (read or write) is the set of locks held by the strand performing the access when the access occurs. If the lock sets of two parallel accesses to the same location have an empty intersection, and at least one of the accesses is a write, then a data race exists. To simplify the race-detection algorithm, a small trick avoids the extra condition that “at least one of the
access(ℓ) in strand e with lock set H:
    for each ⟨e′, H′⟩ ∈ lockers[ℓ] do
        if e′ ∥ e and H′ ∩ H = ∅
            then a data race exists
    redundant ← false
    for each ⟨e′, H′⟩ ∈ lockers[ℓ] do
        if e′ ≺ e and H′ ⊇ H
            then lockers[ℓ] ← lockers[ℓ] − {⟨e′, H′⟩}
        if e′ ∥ e and H′ ⊆ H
            then redundant ← true
    if redundant = false
        then lockers[ℓ] ← lockers[ℓ] ∪ {⟨e, H⟩}
Race Detectors for Cilk and Cilk++ Programs. Fig. Pseudocode describing the implementation of read and write operations for a data-race detector. The access history is updated on these operations using the lock-set algorithm
accesses is a write.” The idea is to introduce a fake lock for read accesses called the r-lock, which is implicitly acquired immediately before a read and released immediately afterward. The r-lock behaves from the race detector’s point of view just like a normal lock, but during an actual computation, it is never actually acquired and released (since it does not actually exist). The use of r-lock allows the condition for a data race to be stated more succinctly: If the lock sets of two parallel accesses to the same location have an empty intersection, then a data race exists. By this condition, a data race (correctly) does not exist for two read accesses, since their lock sets both contain the r-lock. The access history for the algorithm for detecting data races records a set of readers and writers for each memory location and their lock sets. For a given location ℓ, the entry lockers[ℓ] stores a list of lockers: strands that access ℓ, each paired with the lock set that was held during the access. If ⟨e, H⟩ ∈ lockers[ℓ], then location ℓ is accessed by strand e while it holds the lock set H. Figure gives pseudocode for the data-race-detection algorithm employed by the Nondeterminator- and Cilkscreen. This algorithm for data-race detection does not provide as strong a guarantee as the algorithm for determinacy-race detection. Programming with locks introduces intentional nondeterminism which may not be exposed during a serial left-to-right depth-first execution. Thus, although the algorithm always finds a data race if one is exposed, some data races may not
be exposed. Nevertheless, for abelian programs, where critical sections protected by the same lock "commute" – produce the same effect on memory no matter in which order they are executed – a guarantee can be provided. For a computation generated by a deadlock-free abelian program running on a given input, the algorithm guarantees to find a data race if one exists, and otherwise guarantees that all executions produce the same final result. A proof of the correctness of this assertion can be found in []. The cost of querying the access history for data-race detection is much more than the O() cost for determinacy-race detection. There are at most n^k entries for each memory location, where n is the number of locks in the program and k ≤ n is the number of locks held simultaneously. Thus, the running time of the data-race detector may increase proportionally to n^k in the worst case. In practice, however, few locks are held simultaneously, and the running time tends to be only slightly worse than for determinacy-race detection.
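If the number of locks is small, a lock set can be represented as a bit mask, with the fake r-lock as a reserved bit, so the empty-intersection test from the pseudocode becomes a single AND. A sketch under that assumption (the per-location locker array, the sp_parallel() query, and all names are illustrative):

```c
#include <stdint.h>

#define MAX_LOCKERS 16                 /* illustrative per-location bound */

typedef uint64_t lockset_t;            /* one bit per lock, up to 63 real locks */
#define R_LOCK ((lockset_t)1 << 63)    /* fake lock implicitly held by every read */

typedef struct { uint64_t strand; lockset_t held; } locker_t;

struct lockers {                       /* lockers[l] for a single location l */
    locker_t entry[MAX_LOCKERS];
    int n;
};

extern int sp_parallel(uint64_t a, uint64_t b);   /* assumed SP query */

/* First loop of the access check: does an access by strand e holding lock
 * set h race with any recorded locker?  Two reads never race, because both
 * lock sets contain R_LOCK and so their intersection is nonempty. */
int races_with_recorded(const struct lockers *lk, uint64_t e, lockset_t h)
{
    for (int i = 0; i < lk->n; i++)
        if (sp_parallel(lk->entry[i].strand, e) && (lk->entry[i].held & h) == 0)
            return 1;
    return 0;
}
```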
Series-Parallel Maintenance The main structural component of the two Nondeterminator programs and Cilkscreen is their algorithm for maintaining series-parallel (SP) relationships among strands. These race detectors execute the program in a left-to-right depth-first order while maintaining the logical relationships between strands. As new strands are encountered, the SP-maintenance data structure is updated. Whenever the currently executing strand accesses a memory location, the Nondeterminator queries the SP-maintenance data structure to determine whether the current strand operates in series or in parallel with strands (those in the access history) that made earlier accesses to the same memory location. Figure compares the serial space and running times of several SP-maintenance algorithms. The SP-bags and SP-order algorithms, described in this section, are employed by various implementations of the Nondeterminator. The English-Hebrew [] and Offset-Span [] algorithms are used in different race detectors and displayed for comparison. The modified SP-bags entry reflects a different underlying data structure than the SP-bags entry, but the algorithm for both is the same. As the number of queries performed by a determinacy race detector can be as large as O(T) for a program that runs serially in T time, keeping the query time small is desirable, and hence SP-bags and SP-order are more appealing algorithms for serial race detectors. The situation is exacerbated in a data-race detector, where the number of queries can be larger than T.

Algorithm               Space per node   Time per strand creation   Time per query
English-Hebrew [32]     Θ(f)             Θ(1)                       Θ(f)
Offset-Span [24]        Θ(d)             Θ(1)                       Θ(d)
SP-bags [16]            Θ(1)             Θ(α(v, v))*                Θ(α(v, v))*
Modified SP-bags [17]   Θ(1)             Θ(1)*                      Θ(1)
SP-order [4]            Θ(1)             Θ(1)                       Θ(1)
f = number of forks/spawns in the program
d = maximum depth of nested parallelism
v = number of shared locations being monitored

Race Detectors for Cilk and Cilk++ Programs. Fig. Comparison of serial, SP-maintenance algorithms. An asterisk (*) indicates an amortized bound. The function α is Tarjan's functional inverse of Ackermann's function
SP-Bags Since the logical relationship between strands can be determined by looking at their least common ancestor in the SP parse tree, a straightforward approach for SP-maintenance would maintain this tree explicitly. Querying the relationship between strands can then be implemented naively by climbing the tree starting from each strand until converging at their least common ancestor. This approach yields expensive queries, as the worst-case cost is proportional to the height of the tree, which is not bounded. SP-bags [] is indeed based on finding the least common ancestor, but it uses a clever algorithm that yields far better queries. The algorithm itself is an adaptation of Tarjan's offline least-common-ancestors algorithm. The SP-bags algorithm maintains a collection of disjoint sets, employing a "disjoint sets" or "union-find" data structure that supports the following operations:
1. Make-Set(x) creates a new set whose only member is x.
2. Union(x, y) unites the sets containing x and y.
3. Find(x) returns a representative for the set containing x.
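A textbook realization of these three operations in C, with the union-by-rank and path-compression heuristics mentioned next (a generic sketch, not the tools' internal code):

```c
#define MAX_ELEMS 1024          /* illustrative bound on the number of elements */

static int parent[MAX_ELEMS];
static int rnk[MAX_ELEMS];

void make_set(int x)            /* Make-Set(x) */
{
    parent[x] = x;
    rnk[x] = 0;
}

int find(int x)                 /* Find(x), with path compression */
{
    if (parent[x] != x)
        parent[x] = find(parent[x]);
    return parent[x];
}

int union_sets(int x, int y)    /* Union(x, y), by rank; returns the new root */
{
    x = find(x);
    y = find(y);
    if (x == y) return x;
    if (rnk[x] < rnk[y]) { int t = x; x = y; y = t; }
    parent[y] = x;
    if (rnk[x] == rnk[y]) rnk[x]++;
    return x;
}
```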
A classical disjoint-set data structure with "union by rank" and "path compression" heuristics [, , ] supports m operations on n elements in O(mα(m, n)) time, where α is Tarjan's functional inverse of Ackermann's function [, ]. As the program executes, for each active procedure, the SP-bags algorithm maintains two bags (unordered sets) with the following contents at any given time:
● The S-bag SF of a procedure F contains the descendant procedures of F that logically precede the currently executing strand. (The descendant procedures of F include F itself.)
● The P-bag PF of a procedure F contains the completed descendant procedures of F that operate logically in parallel with the currently executing strand.
The S- and P-bags are represented using a disjoint sets data structure as described above. The SP-bags algorithm is given in Fig. . As the Cilk program executes in a serial, depth-first fashion, the SP-bags algorithm performs additional operations whenever one of the three following actions occurs: spawn, sync, return, resulting in updates to the S- and P-bags. Whenever a new procedure F (entering the left subtree of a P-node) is spawned, new bags are created. The bag SF is initially set to contain F, and PF is set to be empty. Whenever a procedure F′ returns to its parent procedure F (completing the walk of the left subtree of a P-node), the contents of SF′ are unioned
spawn a procedure F:
    SF ← Make-Set(F)
    PF ← ∅

return from a procedure F′ to parent F:
    PF ← Union(PF, SF′)
    SF′ ← ∅

sync in a procedure F:
    SF ← Union(SF, PF)
    PF ← ∅
Race Detectors for Cilk and Cilk++ Programs. Fig. The SP-bags algorithm. Whenever one of three actions occurs during the serial, depth-first execution of a Cilk parse tree, the operations in the figure are performed. These operations cause SP-bags to manipulate the disjoint sets data structure
into PF, since the descendants of F′ can execute in parallel with the remainder of the sync block in F. When a sync occurs (traversing a spine node in the parse tree) in F, the bag PF is emptied into SF, since all of F's executed descendants precede any future strands in F. As a concrete example, consider the program given in Fig. . The execution order of strands and procedures is e0, F1, e1, F2, e2, e3. Figure shows the state of the S- and P-bags during each step of the execution. As the S- and P-bags are only specified for active procedures, the number of bags changes during the execution. To determine whether a previously executed procedure (or strand) F is logically in series or in parallel with the currently executing strand, simply check whether F belongs to an S-bag by looking at the identity of Find-Set(F). Figure illustrates the correctness of SP-bags for the program in Fig. . For example, when executing F2, all previously executed instructions in F (namely, e0 and e1) serially precede F2, and F does indeed belong to an S-bag, SF. Moreover, the procedure F1 operates logically in parallel with F2, and F1 belongs to the P-bag PF. For a proof of correctness, which is omitted here, refer to []. Combining SP-bags with the access history yields an efficient determinacy race detector. Consider a Cilk program that executes in time T on one processor and references v shared memory locations. The SP-bags algorithm can be implemented to check this program for determinacy races in O(Tα(v, v)) time. A proof of this theorem, which is provided in [], is conceptually straightforward. The number of Make-Set, Union, and Find-Set operations is at most O(T), yielding a total running time of O(Tα(m, n)) for some values of m and n. It turns out that these values can be reduced to v when garbage collection is employed. The theoretical running time of SP-bags can be improved by replacing the underlying disjoint sets data structure. Since the Unions are structured nicely, Gabow and Tarjan's data structure [], which features O() amortized time per operation, can be employed. The SP-bags algorithm can be implemented to check a program for determinacy races in O(T) time, where the program executes in time T on one processor. A full discussion of this improvement appears in [].
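Putting the pieces together, here is a sketch of the SP-bags bookkeeping in C. It reuses the find/union_sets operations sketched earlier (declared extern below), represents procedure instances as small integer IDs, and tags each set representative with whether the set it represents is currently an S-bag or a P-bag; all names and the fixed bound are illustrative assumptions.

```c
#define MAX_PROCS 1024
enum bag { S_BAG, P_BAG };

/* disjoint-set operations as sketched above (assumed available) */
extern void make_set(int x);
extern int  find(int x);
extern int  union_sets(int x, int y);   /* returns the new representative */

static enum bag kind[MAX_PROCS];        /* consulted at set representatives only */
static int sbag[MAX_PROCS];             /* representative of S_F; nonempty while F is active */
static int pbag[MAX_PROCS];             /* representative of P_F, or -1 if empty */

void spbags_spawn(int f)                /* spawn of a procedure F */
{
    make_set(f);
    kind[f] = S_BAG;
    sbag[f] = f;                        /* S_F = { F } */
    pbag[f] = -1;                       /* P_F = empty */
}

void spbags_return(int child, int parent_proc)  /* F' returns to its parent F */
{
    int r = (pbag[parent_proc] < 0)
              ? find(sbag[child])
              : union_sets(pbag[parent_proc], sbag[child]);
    kind[r] = P_BAG;                    /* P_F = P_F union S_F' */
    pbag[parent_proc] = r;
    sbag[child] = -1;
}

void spbags_sync(int p)                 /* sync in procedure F */
{
    if (pbag[p] >= 0) {
        int r = union_sets(sbag[p], pbag[p]);
        kind[r] = S_BAG;                /* S_F = S_F union P_F */
        sbag[p] = r;
        pbag[p] = -1;
    }
}

/* The procedure that performed an earlier access runs logically in parallel
 * with the currently executing strand iff it now sits in a P-bag. */
int spbags_parallel(int earlier_proc)
{
    return kind[find(earlier_proc)] == P_BAG;
}
```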
Execution step    State of the bags
e0                SF = {F},  PF = ∅
F1                SF = {F},  PF = ∅,  SF1 = {F1},  PF1 = ∅
e1                SF = {F},  PF = {F1}
F2                SF = {F},  PF = {F1},  SF2 = {F2},  PF2 = ∅
e2                SF = {F},  PF = {F1, F2}
e3                SF = {F, F1, F2},  PF = ∅
Race Detectors for Cilk and Cilk++ Programs. Fig. The state of the SP-bags algorithm when run on the parse tree from Fig.
SP-Order SP-order uses two total orders to determine whether strands are logically parallel, an English order and a Hebrew order. In the English order, the nodes in the left subtree of a P-node precede those in the right subtree of the P-node. In the Hebrew order, the order is reversed: the nodes in the right subtree of a P-node precede those in the left. In both orders, the nodes in the left subtree of an S-node precede those in the right subtree of the S-node. Figure shows English and Hebrew orderings for the strands in the parse tree from Fig. . Notice that if x belongs in the left subtree of an S-node and y belongs to the right subtree of the same S-node, then E[x] < E[y] and H[x] < H[y]. In contrast, if x belongs to the left subtree of a P-node and y belongs to the right subtree of the same P-node, then E[x] < E[y] and H[x] > H[y]. The English and Hebrew orderings capture the SP relationships in the parse tree. Specifically, if one strand x precedes another strand y in both orders, then x ≺ y, since lca(x, y) is an S-node, and hence there is a directed path from x to y in the dag. If x precedes y in one order but x follows y in the other, then x ∥ y, since lca(x, y) is a P-node, and hence there is no directed path from one to the other in the dag. For example, in Fig. , the fact e precedes e can be determined by observing E[e ] < E[e ] and H[e ] < H[e ]. Similarly, it follows that F1 and F2 are logically parallel since E[F1] < E[F2] and H[F1] > H[F2]. The following lemma, proved in [], shows that this property always holds. Lemma Let E be an English ordering of strands of an SP-parse tree, and let H be a Hebrew ordering. Then, for any two strands x and y in the parse tree, x ≺ y if and only if E[x] < E[y] and H[x] < H[y]. Equivalently, x ∥ y if and only if E[x] < E[y] and H[x] > H[y], or if E[x] > E[y] and H[x] < H[y].
[Figure: the parse tree of Fig. with each strand/procedure labeled by its (English, Hebrew) rank pair: e0 (1, 1), F1 (2, 5), e1 (3, 2), F2 (4, 4), e2 (5, 3), e3 (6, 6)]
Race Detectors for Cilk and Cilk++ Programs. Fig. An English ordering E and a Hebrew ordering H for strands in the parse tree of Fig. . Under each strand/procedure e is an ordered pair (E[e], H[e]) giving its rank in each of the two orders. Equivalently, the English order is e0, F1, e1, F2, e2, e3 and the Hebrew order is e0, e1, e2, F2, F1, e3
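The lemma translates into a constant-time comparison of the two rank pairs. A small self-contained illustration in C, using the ranks from the figure above (names are illustrative):

```c
#include <stdio.h>

struct eh { int e, h; };    /* English and Hebrew ranks of a strand */

int precedes(struct eh x, struct eh y) { return x.e < y.e && x.h < y.h; }
int parallel(struct eh x, struct eh y) { return (x.e < y.e) != (x.h < y.h); }

int main(void)
{
    struct eh e0 = {1, 1}, F1 = {2, 5}, e1 = {3, 2}, F2 = {4, 4};
    printf("e0 precedes e1: %d\n", precedes(e0, e1));   /* 1: both ranks agree   */
    printf("F1 parallel F2: %d\n", parallel(F1, F2));   /* 1: the ranks disagree */
    return 0;
}
```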
Labeling a static SP parse tree with an English-Hebrew ordering is straightforward. To compute the English ordering, perform a depth-first traversal visiting left children of both P-nodes and S-nodes before visiting right children (an English walk). Assign label i to the ith strand visited. To compute the Hebrew ordering, perform a depth-first traversal visiting right children of P-nodes before visiting left children but left children of S-nodes before visiting right children (a Hebrew walk). Assign labels to strands as before, in the order visited. In race-detection applications, however, these orderings must be generated on-the-fly before the entire parse tree is known. If the parse tree unfolds according to an English walk, as in the Nondeterminator, then computing the English ordering is trivial. Unfortunately, computing the Hebrew ordering during an English walk is problematic. In a Hebrew ordering, the
label of a strand in the left subtree of a P-node depends on the number of strands in the right subtree. This number is unknown, while performing an English walk, until the right subtree has unfolded completely. Nudler and Rudolph [], who introduced English-Hebrew labeling for race detection, addressed this problem by using large static strand labels. In particular, the number of bits in a label in their scheme can grow linearly in the number of P-nodes in the SP parse tree. Although they gave a heuristic for reducing the size of labels, manipulating large labels is the performance bottleneck of their algorithm. The solution in SP-order is to employ order-maintenance data structures [, , , ] to maintain the English and Hebrew orders dynamically rather than using the static labels described above. An order-maintenance data structure is an abstract data type that supports the following operations:
● OM-Precedes(L, x, y): Return true if x precedes y in the ordering L. Both x and y must already exist in the ordering L.
● OM-Insert(L, x, y1, y2, . . . , yk): In the ordering L, insert new elements y1, y2, . . . , yk, in that order, immediately after the existing element x. Any previously existing elements that followed x now follow yk.
The OM-Precedes operation can be supported in O() worst-case time. The OM-Insert operation can be supported in O() worst-case time for each node inserted. In the efficient order-maintenance data structures, labels are assigned to each element of the data structure, and the relative ordering of two elements (OM-Precedes) is determined by comparing these labels. Keeping the labels small guarantees good query cost. The labels change dynamically as insertions occur. The SP-order data structure consists of two order-maintenance data structures to maintain English and Hebrew orderings. (In fact, the English ordering can be maintained implicitly during a left-to-right tree walk. For conceptual simplicity, however, both orderings are represented with the data structure here.) With the data structure chosen, the implementation of SP-order is remarkably simple. When first traversing an S-node S with left child x and right child y, perform OM-Insert(Eng, S, x, y) to insert x then y after S in the English ordering Eng. Also perform
OM-Insert(Heb, S, x, y) to insert these children in the same order after S in the Hebrew ordering Heb. When first traversing a P-node P with left child x and right child y, also perform OM-Insert(Eng, P, x, y) as before, but perform OM-Insert(Heb, P, y, x) to insert these children in the opposite order in the Hebrew ordering. To determine whether a previously executed strand e′ is logically in series or in parallel with the currently executing strand e, simply check whether OM-Precedes (Heb, e′ , e). (It is always the case that OM-Precedes (Eng, e′ , e), as the program executes in the order of an English walk.) If OM-Precedes(Heb, e′ , e), then e′ ≺ e. If not, then e′ ∥ e. While the preceding description of SP-order with respect to the parse tree fully describes the algorithm, it is somewhat opaque with respect to the effect on Cilk code and the Cilk compiler. Figure , which is analogous to Fig. for the SP-bags algorithm, describes the SP-order algorithm with respect to Cilk keywords. The main point of this figure is to show that, like SP-bags, incorporating SP-order into a Cilk program is fairly lightweight. In contrast to SP-bags, whose data structural elements are procedures, SP-order builds data structures over strands. As such, new objects or IDs are created whenever new strands are encountered. On a
spawn a child procedure F′ of F:
    create strand IDs for:
        e0: the first strand in F′
        ec: the continuation strand in F
    OM-Insert(Eng, current[F], e0, ec)
    OM-Insert(Heb, current[F], ec, e0)

return from a procedure F′ to parent F:
    nothing

following a spawn of procedure F′ or a sync in F:
    create strand ID for:
        eS: the strand following the next sync in F
    OM-Insert(Eng, current[F], eS)
    OM-Insert(Heb, current[F], eS)
Race Detectors for Cilk and Cilk++ Programs. Fig. The SP-order algorithm. Whenever a spawn, sync, or return occurs during the serial, depth-first execution of the Cilk parse tree, the operations in the figure are performed. These operations cause SP-order to manipulate the order maintenance data structures. The value current[F] denotes the currently executing strand in procedure F
spawn, strands are created for both the spawned procedure and the continuation strand within the parent procedure. These strands are then inserted into Eng and Heb in opposite orders, as they operate logically in parallel. Whenever beginning to execute a sync block in procedure F, either after spawn’ing F or sync’ing within F, a new strand eS is created to represent the start of the next sync block. This eS is then inserted after the currently executing strand in both Eng and Heb, as it operates logically after the currently executing strand. Unlike SP-bags, no data structural changes occur in SP-order when return’ing from a procedure. One interesting feature of SP-order is that elements (strands) are only added to the data structure. No elements are ever reordered once added. Unreferenced elements may be removed as part of garbage collection. The (incomplete) English and Hebrew orderings at any time are thus consistent with the a posteriori orderings, with not-yet-encountered strands removed. The correctness of SP-order, the proof of which is given in [], is arguably easier to convince oneself of than the proof of SP-bags.
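A sketch of these SP-order updates in C against an assumed order-maintenance interface. The om_t type, om_insert_after/om_precedes (stand-ins for OM-Insert/OM-Precedes), new_strand(), and the Eng/Heb globals are all illustrative assumptions:

```c
typedef struct om om_t;          /* opaque order-maintenance structure */
typedef unsigned strand_t;       /* runtime strand IDs */

/* assumed order-maintenance API, O(1) worst case per operation */
extern void     om_insert_after(om_t *L, strand_t x, strand_t y);  /* insert y right after x in L */
extern int      om_precedes(om_t *L, strand_t x, strand_t y);      /* does x precede y in L? */
extern strand_t new_strand(void);
extern om_t *Eng, *Heb;          /* the English and Hebrew orderings */

/* Spawn of a child procedure: e0 is the child's first strand, ec the
 * continuation strand in the parent; cur is the parent's current strand.
 * The two new strands go in opposite orders in the two orderings. */
void sporder_spawn(strand_t cur, strand_t *e0, strand_t *ec)
{
    *e0 = new_strand();
    *ec = new_strand();
    om_insert_after(Eng, cur, *e0);      /* English: ..., cur, e0, ec, ... */
    om_insert_after(Eng, *e0, *ec);
    om_insert_after(Heb, cur, *ec);      /* Hebrew:  ..., cur, ec, e0, ... */
    om_insert_after(Heb, *ec, *e0);
}

/* Beginning a new sync block in a procedure (right after it is spawned or
 * after a sync in it): eS is the strand that follows the next sync. */
strand_t sporder_new_sync_block(strand_t cur)
{
    strand_t eS = new_strand();
    om_insert_after(Eng, cur, eS);       /* same position in both orderings */
    om_insert_after(Heb, cur, eS);
    return eS;
}

/* Query: a previously executed strand e1 precedes the current strand e2 iff
 * it precedes it in the Hebrew ordering (the English ordering always agrees
 * with execution order); otherwise e1 and e2 are logically parallel. */
int sporder_precedes(strand_t e1, strand_t e2)
{
    return om_precedes(Heb, e1, e2);
}
```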
Making the Nondeterminator Run in Parallel The two Nondeterminators and Cilkscreen are serial race detectors. Even though these are debugging tools intended for parallel programs, they themselves run serially. This section discusses how to make these race detectors run in parallel. Both the access history and SP-maintenance algorithms must be augmented to permit a parallel algorithm. Moreover, contention due to concurrent updates to these data structures needs to be addressed.
A Parallel Access History The access history described in section “Access History” is appealing because it requires only a single reader and writer for a determinacy-race detector, thus keeping the cost of maintaining and querying against the access history low. This strategy, however, is inherently serial. If only a single reader and writer are recorded, a parallel race detector may not discover certain races in the program []. Fortunately, Mellor-Crummey [] shows that recording two readers and writers suffices to guarantee correctness in a parallel execution, which increases the number of SP-maintenance queries by
only a factor of 2. A full description of such a parallel access history and its correctness can be found in [, ]. The other challenge in parallelizing the access history is in performing concurrent updates to the data structure. Specifically, two parallel readers may attempt to update the same reader[ℓ] at the same time. (If the goal is to just report a single race in the program, updating writer[ℓ] in parallel is not as much of an issue, as discovering such a parallel update indicates a race on its own. If the goal is to report one race for each memory location, as in the Nondeterminator, then concurrent updates to writer[ℓ] also matter here.) Some extra machinery is required to guarantee the correct value is recorded when concurrent updates occur. As discussed by Fineman [], concurrent updates by P processors can be resolved in O(t lg P) worst-case time, where t is the serial cost of an access-history update: t = O() for a determinacy-race detector, and t = O(n^k) for a data-race detector.
Parallel SP-Maintenance The SP-bags algorithm is inherently serial, as the correctness relies on an execution order corresponding to an English walk of the parse tree. Correctness of SP-order, on the other hand, does not rely on any particular execution order. SP-order thus appears like a natural choice for a parallel algorithm. The main complication arises in performing concurrent updates to the underlying order-maintenance data structures of SP-order. The obvious approach for parallelizing SP-order directly requires locking the order-maintenance data structures whenever performing an update. As with the access history, a lock may cause all the other P − 1 processors to waste time waiting for the lock, thus causing a Θ(P) overhead per lock acquisition. For programs with short strands, a Θ(P) overhead per strand creation results in no improvement as compared to a serial execution. The SP-ordered-bags algorithm, previously called SP-hybrid [, ], uses a two-tiered algorithm with a global tier and a local tier to overcome the scalability problems with lock synchronization. The global tier uses a parallel SP-order algorithm with locks, and the local tier uses the serial SP-bags algorithm without locks. By bounding the number of updates to the global tier, the locking overhead is reduced.
SP-ordered-bags leverages the fact that the Cilk scheduler executes a program mostly in an English order. The only exception occurs on steals when the thief processor causes a violation in the English ordering. The thief then continues executing its local subtree serially in an English order. Thus, any serial SP-maintenance algorithm, like SP-bags, is viable within each of these local serially executing subtrees. (SP-bags is also desirable as it interacts well with the global tier; details are omitted here.) Moreover, each local SP-bags data structure is only updated by the single processor executing that subtree, and thus no locking is required. The local tier permits SP queries within a single local subcomputation. The global tier is responsible for arranging the local subcomputations to permit SP-queries among these subcomputations. SP-ordered-bags employs the parallel SP-order algorithm to arrange the subcomputations. Although the global tier is locked whenever an update occurs, updates occur only when the Cilk scheduler causes a steal. Since the Cilk scheduler causes at most O(PT∞) steals, where T∞ is the span of the program being executed, updates to the global tier add only O(PT∞) to the running time of the program. Combining the local and global tiers together into SP-ordered-bags yields a parallel SP-maintenance algorithm that runs in O(T1/P + PT∞) time on P processors, where T1 and T∞ are the work and span of the underlying program being tested. A detailed discussion of SP-ordered-bags, including correctness and performance proofs, can be found in [, ].
Related Entries
Cilk
Race Conditions
Race Detection Techniques
Bibliographic Notes and Further Reading Static race detectors [, , , , , ] analyze the text of the program to determine whether a race occurs. Static analysis tools may be able to determine whether a memory location can be involved in a race for any input. These tools are inherently conservative, however, sometimes reporting races that do not exist, since the static debuggers cannot fully understand the control-flow and synchronization semantics of a program. For example,
dynamic control-flow constructs (e.g., a fork statement) inside if statements or loops are particularly difficult to deal with. Mellor-Crummey [] proposes using static tools as a way of pruning the number of memory locations being monitored by a dynamic race detector. Dynamic race detectors execute the program given a particular input. Some dynamic race detectors perform a post-mortem analysis based on programexecution logs [, , , , –], analyzing a log of program-execution events after the program has finished running. On-the-fly race detectors, like the Nondeterminators and Cilkscreen, report races during the execution of the program. Both dynamic approaches are similar and use some form of SP-maintenance algorithm in conjunction with an access history. On-the-fly race detectors benefit from garbage collection, thereby reducing the total space used by the tool. Post-mortem tools, on the other hand, must keep exhaustive logs. Netzer and Miller [] provide a common terminology to unify previous work on dynamic race detection. A feasible data race is a race that can actually occur in an execution of the program. Netzer and Miller show that locating feasible data races in a general program is NP-hard. Instead, most race detectors, including the ones in this article, deal with the problem of discovering apparent data races, which is an approximation of the races that may actually occur. These race detectors typically ignore data dependencies that may make some apparent races infeasible, instead considering only explicit coordination or control-flow constructs (like forks and joins). As a result, these race detectors are conservative and report races that may not actually occur. Dinning and Schonberg’s “lock-covers” algorithm [] detects apparent races in programs that use locks. Cheng et al. [] generalize this algorithm and improve its running time with their All-Sets algorithm, which is the one used in the Nondeterminator- and Cilkscreen. Savage et al. [] give an on-the-fly race detector called Eraser that does not use an SP-maintenance algorithm, and hence would report races between strands that operate in series if applied to Cilk. Their Eraser tool works on programs that have static threads (i.e., no nested parallelism) and enforces a simple locking discipline. A shared variable must be protected by a particular lock on every access, or they report a race. The Brelly
algorithm [] employs SP maintenance while enforcing an "umbrella" locking discipline, which generalizes Eraser's locking discipline. Brelly was incorporated into the Nondeterminator-, but although it theoretically runs faster than the All-Sets algorithm in the worst case, most users preferred the All-Sets algorithm for the completeness of coverage, and the typical running time of All-Sets is not onerous. Nudler and Rudolph [] introduced the English-Hebrew labeling scheme for their SP-maintenance algorithm. Each strand is assigned two static labels, similar to the labeling described for SP-order. They do not, however, use a centralized data structure to reassign labels. Instead, label sizes grow proportionally to the maximum concurrency of the program. Mellor-Crummey [] proposed an "offset-span labeling" scheme, which has label lengths proportional to the maximum nesting depth of forks. Although it uses shorter label lengths than the English-Hebrew scheme, the size of offset-span labels is not bounded by a constant as it is in SP-order. Both of these approaches perform local decisions on strand creation to assign static labels. Although these approaches result in no locking or synchronization overhead for SP-maintenance, and are inherently parallel algorithms, the large labels can drastically increase the work of the race detector. Dinning and Schonberg's "task recycling" algorithm [] uses a centralized data structure to maintain series-parallel relationships. Each strand (block) is given a unique task identifier, which consists of a task and a version number. A task can be reassigned (recycled) to another strand during the program execution, which reduces the total amount of space used by the algorithm. Each strand is assigned a parent vector that contains the largest version number, for each task, of its ancestor strands. To query the relationship between an active strand e1 and a strand e2 recorded in the access history, task recycling simply compares the version number of e2's task against the version number stored in the appropriate slot in e1's parent vector, which is a constant-time operation. The cost of creating a new strand, however, can be proportional to the maximum logical concurrency. Dinning and Schonberg's algorithm also handles other coordination between strands, like barriers, where two parallel threads must reach a particular point before continuing.
Bibliography . Appelbe WF, McDowell CE () Anomaly reporting: a tool for debugging and developing parallel numerical algorithms. In: Proceedings of the st international conference on supercomputing systems. IEEE, pp – . Balasundaram V, Kennedy K () Compile-time detection of race conditions in a parallel program. In: Proceedings of the rd international conference on supercomputing, ACM Press, New York, pp – . Bender MA, Cole R, Demaine ED, Farach-Colton M, Zito J () Two simplified algorithms for maintaining order in a list. In: Proceedings of the European syposium on algorithms, pp – . Bender MA, Fineman JT, Gilbert S, Leiserson CE () On-thefly maintenance of series-parallel relationships in fork-join multithreaded programs. In: Proceedings of the sixteenth annual ACM symposium on parallel algorithms and architectures, Barcelona, Spain, June , pp – . Bruening D () Efficient, transparent, and comprehensive runtime code manipulation. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology . Callahan D, Sublok J () Static analysis of low-level synchronization. In: Proceedings of the ACM SIGPLAN and SIGOPS workshop on parallel and distributed debugging, ACM Press, New York, pp – . Cheng G-I, Feng M, Leiserson CE, Randall KH, Stark AF () Detecting data races in Cilk programs that use locks. In: Proceedings of the ACM symposium on parallel algorithms and architectures, June , pp – . Choi J-D, Miller BP, Netzer RHB () Techniques for debugging parallel programs with flowback analysis. ACM Trans Program Lang Syst ():– . Cormen TH, Leiserson CE, Rivest RL, Stein C () Introduction to algorithms, rd edn. MIT Press, Cambridge . Dietz PF () Maintaining order in a linked list. In: Proceedings of the ACM symposium on the theory of computing, May , pp – . Dietz PF, Sleator DD () Two algorithms for maintaining order in a list. In: Proceedings of the ACM symposium on the theory of computing, May , pp – . Dinning A, Schonberg E () An empirical comparison of monitoring algorithms for access anomaly detection. In: Proceedings of the ACM SIGPLAN symposium on principles and practice of parallel programming, pp – . Dinning A, Schonberg E () Detecting access anomalies in programs with critical sections. In: Proceedings of the ACM/ONR workshop on parallel and distributed debugging, May , ACM Press, pp – . Emrath PA, Ghosh S, Padua DA () Event synchronization analysis for debugging parallel programs. In: Proceedings of the ACM/IEEE conference on supercomputing, November , pp – . Emrath PA, Padua DA () Automatic detection of nondeterminacy in parallel programs. In: Proceedings of the workshop
on parallel and distributed debugging, Madison, Wisconsin, May , pp – Feng M, Leiserson CE () Efficient detection of determinacy races in Cilk programs. In: Proceedings of the ACM symposium on parallel algorithms and architectures, June , pp – Fineman JT () Provably good race detection that runs in parallel. Master’s thesis, Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science, August Frigo M, Halpern P, Leiserson CE, Lewin-Berlin S () Reducers and other cilk++ hyperobjects. In: Proceedings of the twentyfirst annual symposium on parallelism in algorithms and architectures, pp – Frigo M, Leiserson CE, Randall KH () The implementation of the Cilk- multithreaded language. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation, pp – Gabow HN, Tarjan RE () A linear-time algorithm for a special case of disjoint set union. J Comput System Sci (): – Helmbold DP, McDowell CE, Wang J-Z () Analyzing traces with anonymous synchronization. In: Proceedings of the international conference on parallel processing, August . pp II–II Steele GL Jr () Making asynchronous parallelism safe for the world. In: Proceedings of the seventeenth annual ACM symposium on principles of programming languages, ACM Press, pp – Luk C-K, Cohn R, Muth R, Patil H, Klauser A, Lowney G, Wallace S, Reddi VJ, Hazelwood K () Pin: building customized program analysis tools with dynamic instrumentation. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation, ACM Press, New York, pp – Mellor-Crummey J () On-the-fly detection of data races for programs with nested fork-join parallelism. In: Proceedings of supercomputing, pp – Mellor-Crummey J () Compile-time support for efficient data race detection in shared-memory parallel programs. In: Proceedings of the ACM/ONR workshop on parallel and distributed debugging, San Diego, California, May . ACM Press, pp – Miller BP, Choi J-D () A mechanism for efficient debugging of parallel programs. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation, Atlanta, Georgia, June , pp – Nethercote N, Seward J () Valgrind: A framework for heavyweight dynamic binary instrumentation. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementaion, ACM, San Diego, June , pp – Netzer RHB, Ghosh S () Efficient race condition detection for shared-memory programs with post/wait synchronization. In: Proceedings of the international conference on parallel processing, St. Charles, Illinois, August
R
. Netzer RHB, Miller BP () On the complexity of event ordering for shared-memory parallel program executions. In: Proceedings of the international conference on parallel processing, August . pp II:– . Netzer RHB, Miller BP () Improving the accuracy of data race detection. In: Proceedings of the third ACM SIGPLAN symposium on principles and practice of parallel programming, New York, NY, USA. ACM Press, pp – . Netzer RHB, Miller BP () What are race conditions? ACM Lett Program Lang Syst ():– . Nudler I, Rudolph L () Tools for the efficient development of efficient parallel programs. In: Proceedings of the first Israeli conference on computer systems engineering, May . Savage S, Burrows M, Nelson G, Sobalvarro P, Anderson T () Eraser: a dynamic race detector for multi-threaded programs. In: Proceedings of the sixteenth ACM symposium on operating systems principles (SOSP), ACM Press, New York, pp – . Tarjan RE () Efficiency of a good but not linear set union algorithm. J ACM ():– . Tarjan RE () Data structures and network algorithms. Society for Industrial and Applied Mathematics, Philadelphia . Taylor RN () A general-purpose algorithm for analyzing concurrent programs. Commun ACM ():– . Tsakalidis AK () Maintaining order in a generalized linked list. Acta Inform ():–
Race Hazard Race Conditions
Radix Sort Sorting
Rapid Elliptic Solvers Efstratios Gallopoulos University of Patras, Patras, Greece
Synonyms Fast Poisson solvers
Definition Direct numerical methods for the solution of linear systems obtained from the discretization of certain partial
differential equations, typically elliptic and separable, defined on rectangular domains in d dimensions, with sequential computational complexity O(N log N) or less, where N is the number of unknowns.
Discussion Mathematical models in many areas of science and engineering are often described, in part, by elliptic partial differential equations (PDEs), so their fast and reliable numerical solution becomes an essential task. For some frequently occurring elliptic PDEs, it is possible to develop solution methods, termed Rapid Elliptic Solvers (RES), for the linear systems obtained from their discretization that exploit the problem characteristics to achieve (almost linear) complexity O(N log N) or less (all logarithms are base ) for systems of N unknowns. RES are direct methods, in the sense that in the absence of roundoff they give an exact solution and require about the same storage as iterative methods. When well implemented, RES can solve the problems they are designed for faster than other direct or iterative methods [, ]. The downside is their limited applicability; since RES impose restrictions on the PDE, the domain of definition and their performance and ease of implementation may also depend on the boundary conditions and the size of the problem. Major historical milestones were the paper [], by Hyman, where Fourier analysis and marching were proposed to solve Poisson’s equation as a precursor of methods that were analyzed more than a decade later; and the paper by Bickley and McNamee [, Sect. ], where the inverse of the special block tridiagonal matrices occurring when solving Poisson’s equation and a first version of the Matrix Decomposition algorithm were described. Reference [] by Hockney with its description of the FACR() algorithm and the solution of tridiagonal systems with cyclic reduction (in collaboration with Golub) is widely considered to mark the beginning of the modern era of RES. That and the paper by Buzbee, Golub, and Nielson [], with its detailed analysis of the major RES were extremely influential in the development of the field. The evolution to methods of low sequential complexity was enabled by two key developments: the advent of the FFT and fast methods for the manipulation of Toeplitz matrices. Interest in the design and implementation of parallel algorithms for RES started in the early s, with the
papers of Buzbee [] and Sameh et al. [] and has been attracting the attention of computational scientists ever since. These parallel RES can solve the linear systems under consideration in O(log N) parallel operations on O(N) processors instead of the fastest but impractical algorithm for general linear systems that requires O(log N) parallel operations on O(N ) processors. Parallel implementations were discussed as early as for the Illiac IV (see Illiac IV) [] and subsequently for most important high-performance computing platforms, including vector processors, vector multiprocessors, shared memory symmetric multiprocessors, distributed memory multiprocessors, SIMD and MIMD processor arrays, clusters of heterogeneous processors, and in Grid environments. References for specific systems can be found for the Cray- [, ]; the ICL DAP []; Alliant FX/ [, ]; Caltech and Intel hypercubes [, , , ]; Thinking Machines CM- []; Denelcor HEP []; the University of Illinois Cedar machine [, ]; Cray X-MP []; Cray Y-MP []; Cray TE [, ]; Grid environments []; Intel multicore processors and processor clusters []; GPUs []. Regarding the latter, it is worth noting the extensive studies conducted at Yale on an early GPU, the FPS- []. Proposals for special-purpose hardware are in []. RES were included in the PELLPACK parallel problem-solving environment for elliptic PDEs []. More information can be found in the annotated bibliography provided in []. As will become clear, RES are an important class of structured matrix computations whose design and parallelization depends on special mathematical transformations not usually applicable when dealing with direct matrix computations. This is a topic with many applications and the subject of intensive investigation in computational mathematics.
Problem Formulation
RES are primarily applicable for solving numerically separable elliptic PDEs whose discretization (cf. [, , ]) leads to block Toeplitz tridiagonal linear systems of order N = mn,
$$AU = F, \qquad A = \mathrm{trid}_n[W, T, W], \quad W, T \in \mathbb{R}^{m\times m}, \qquad ()$$
where W and T are symmetric and simultaneously diagonalizable, and thus WT = TW. The symbol $\mathrm{trid}_n[C, A, B]$ denotes
block Toeplitz tridiagonal or Toeplitz tridiagonal matrices, where C, A, B are the (matrix or scalar) elements along the subdiagonals, diagonals, and superdiagonals, respectively, and n is their number. A simple but very common problem is Poisson's equation on a rectangular region. After suitable discretization, e.g., using an m × n rectangular grid, () becomes
$$\begin{pmatrix} T & -I & & & \\ -I & T & -I & & \\ & \ddots & \ddots & \ddots & \\ & & -I & T & -I \\ & & & -I & T \end{pmatrix}\begin{pmatrix} U_1 \\ U_2 \\ \vdots \\ U_{n-1} \\ U_n \end{pmatrix} = \begin{pmatrix} F_1 \\ F_2 \\ \vdots \\ F_{n-1} \\ F_n \end{pmatrix}, \qquad ()$$
where $T = \mathrm{trid}_m[-1, 4, -1]$ is Toeplitz, symmetric, and tridiagonal. Vectors $U_i = [U_{i,1}, \ldots, U_{i,m}]^\top \in \mathbb{R}^m$, for $i = 1, \ldots, n$, contain the unknowns at the ith grid row, and $F_i = [F_{i,1}, \ldots, F_{i,m}]^\top$ the values of the right-hand side F, including the boundary terms and scalings related to the discretization. This is the problem for which RES are eminently applicable and the one most discussed in the RES literature, hence the term “fast Poisson solvers.” The homogeneous case, f ≡ 0, is Laplace's equation, which lends itself to even more specialized fast methods (not discussed here). RES can also be designed to solve the general separable problem
$$-\big(a(x)u_{xx} + b(x)u_x\big) - \big(d(y)u_{yy} + e(y)u_y\big) + \big(c(x) + \tilde{c}(y)\big)u = f(x, y) \qquad ()$$
whose discrete form is the block tridiagonal system
$$\mathrm{trid}_n\big[-\gamma_j I,\; T + \alpha_j I,\; -\beta_j I\big]\, U = F \qquad ()$$
for scalars α j , β j , γj . This is no longer block Toeplitz, but the diagonal and off diagonal blocks have special structure. Once the equations become nonseparable and/or the domain irregular, RES are not directly applicable and other methods are preferred. All is not lost, however, because frequently RES could be used as building
blocks of more general solvers, e.g., in domain decomposition and preconditioning. Sequential RES are much faster than direct solvers such as sparse Cholesky elimination (see Sparse Direct Methods). This advantage carries over to parallel RES. Other methods that can be used but are addressed elsewhere in this volume are multigrid methods (see Algebraic Multigrid) and preconditioned conjugate gradients (see Iterative Methods) that are the basis for software PDE packages such as PETSc.
Mathematical Preliminaries and Notation
The following concepts from matrix analysis are essential to describe and analyze RES.
1. Kronecker products of matrices and their properties.
2. Chebyshev polynomials (first and second kind) $C_d$, $S_d$, and modified Chebyshev polynomials (second kind) $\tilde{C}_d$.
3. The analytic expression for the eigenvalues and eigenvectors of the symmetric tridiagonal Toeplitz matrix $T = \mathrm{trid}_m[-1, \alpha, -1]$: specifically
$$\lambda_j = \alpha - 2\cos\Big(\frac{\pi j}{m+1}\Big), \qquad q_j = \sqrt{\tfrac{2}{m+1}}\Big[\sin\Big(\frac{\pi j}{m+1}\Big), \ldots, \sin\Big(\frac{\pi j m}{m+1}\Big)\Big]^\top. \qquad ()$$
If $Q = [q_1, \ldots, q_m]$ then the product $Qy$ for any $y \in \mathbb{R}^m$ has elements $\sqrt{\tfrac{2}{m+1}}\sum_{j=1}^m y_j \sin\big(\frac{\pi i j}{m+1}\big)$ for $i = 1, \ldots, m$, and so is the discrete sine transform (DST) of y times a scaling factor. Also $Q^\top = Q$ and $Q^\top Q = I$. The DST can be computed in O(m log m) operations using FFT-type methods. Similar results for the eigenstructure also hold for slightly modified matrices that occur when the boundary conditions for the continuous problem are not strictly Dirichlet.
4. The fact (see [, , ]) that for any nonsingular $T \in \mathbb{R}^{m\times m}$, the matrix $A = \mathrm{trid}_n[-I, T, -I]$ is nonsingular and $A^{-1}$ can be written as a block matrix with general term
$$(A^{-1})_{ij} = \begin{cases} S_n^{-1}(T)\, S_{i-1}(T)\, S_{n-j}(T), & j \ge i,\\[2pt] S_n^{-1}(T)\, S_{j-1}(T)\, S_{n-i}(T), & i \ge j. \end{cases} \qquad ()$$
5. The vec operator and its inverse, $\mathrm{unvec}_n$. Also the vec-permutation matrix $\Pi_{m,n} \in \mathbb{R}^{mn\times mn}$, that is, the unique matrix such that $\mathrm{vec}(A) = \Pi_{m,n}\,\mathrm{vec}(A^\top)$, where $A \in \mathbb{R}^{m\times n}$. The matrix is orthogonal and $\Pi_{m,n}(I_n \otimes \tilde{T}_m) = (\tilde{T}_m \otimes I_n)\Pi_{m,n}$.
Chapter (see FFTW) documents very efficient algorithms for the single and multiple FFTs needed for kernels of type (). Parallel algorithms for the fast transforms required in the context of the RES have O(log m) complexity on m processors.
Algorithmic Infrastructure Most RES make use of the finite difference analogue of a key idea in the separation of variables method for PDEs, namely, that under certain conditions, some multidimensional problems can be solved by solving several one-dimensional problems []. In line with this, it turns out that RES for () make extensive use of and their performance greatly depends on two basic kernels: () Tridiagonal linear system solvers and () fast Fourier and trigonometric transforms. These kernels are also used extensively in collective form, that is for given tridiagonal T ∈ Rm×m and any Y ∈ Rm×s , solve TX = Y for s ≥ . In many cases, shifted or multiply shifted tridiagonal systems with one or multiple righthand sides must be solved. Also if the m × m matrix Q represents the discrete sine, cosine or Fourier transform of length m, then RES call for the fast computation of QY as well as the inverse transforms Q− Y. The best known parallel techniques for solving tridiagonal systems are recursive doubling, cyclic reduction, and parallel cyclic reduction, that all need O(log m) parallel arithmetic operations on O(m) processors, e.g., [, , ]. For more flexibility, it is possible to combine those or Gaussian elimination with divide-and-conquer methods. Fourier-based RES also require fast implementations of the discrete cosine transform and the discrete Fourier transform to handle Neumann and periodic boundary conditions. Many publications on RES (e.g., [, , , ]) include extensive discussion of these building blocks. It is common for the tridiagonal matrices to have additional properties that can be used for faster processing. They are usually symmetric positive definite, Toeplitz, and have the form tridm [−, τ, −], where τ ≥ . When τ > , they are diagonally dominant in which case it is possible to terminate early Gaussian elimination (as implemented by O’Donnell et al. in []) and cyclic reduction (as proposed by Hockney in []), at negligible error. ScaLAPACK includes efficient tridiagonal and narrow-banded solvers, but the Toeplitz structure is not exploited (see ScaLAPACK). A detailed description of the fast Fourier and other discrete transforms used in RES can be found in [].
Matrix Decomposition
MD refers to a large class of methods to solve systems with matrices such as (). As Buzbee noted in []: “It seldom happens that the application of L processors would yield an L-fold increase in efficiency relative to a single processor, but that is the case with the MD algorithm.” Sameh et al. in [] provided the first detailed study of parallel MD for Poisson's equation. MD can be succinctly described, as shown early on by Lynch et al. and Egerváry, using the compact Kronecker product representation of A (see [] and extensive discussion and references in []). Of greatest relevance here is the case that $A = I_n \otimes \tilde{T}_m + \tilde{T}_n \otimes I_m$, where any of $\tilde{T}_m, \tilde{T}_n$ is diagonalizable using matrices expressed by Fourier or trigonometric transforms. Denoting by $Q_x \in \mathbb{R}^{m\times m}$ (resp. $Q_y \in \mathbb{R}^{n\times n}$) the matrix of eigenvectors of $\tilde{T}_m$ (resp. $\tilde{T}_n$), then $AU = F$ can be rewritten as
$$(I_n \otimes Q_x^\top)\,(I_n \otimes \tilde{T}_m + \tilde{T}_n \otimes I_m)\,(I_n \otimes Q_x)\,(I_n \otimes Q_x^\top)\,U = (I_n \otimes Q_x^\top)\,F$$
and equivalently as
$$(I_n \otimes \tilde{\Lambda}_m + \tilde{T}_n \otimes I_m)\,(I_n \otimes Q_x^\top)\,U = (I_n \otimes Q_x^\top)\,F.$$
Applying the similarity transformation with the vec-permutation matrix $\Pi_{m,n}$, the system becomes
$$\underbrace{(\tilde{\Lambda}_m \otimes I_n + I_m \otimes \tilde{T}_n)}_{B}\big(\Pi_{m,n}(I_n \otimes Q_x^\top)\,U\big) = \Pi_{m,n}(I_n \otimes Q_x^\top)\,F.$$
Matrix B is block diagonal, so these are m independent tridiagonal systems of order n each. Thus
$$U = (I_n \otimes Q_x)\,\Pi_{m,n}^\top\, B^{-1}\,\Pi_{m,n}(I_n \otimes Q_x^\top)\,F. \qquad ()$$
When either of $\tilde{T}_m$ or $\tilde{T}_n$ is diagonalizable with trigonometric or Fourier transforms, as in the matrix of () for the Poisson equation, where they both are of the form $\mathrm{trid}[-1, 2, -1]$, and matrix () when the coefficients are constant in one direction, MD will also be called Fourier MD and can be applied at total cost O(N log N). For example, if $Q_x$ can be applied in O(m log m) operations, the overall cost becomes O(nm log m) for Fourier MD. From () emerge the three major computational
phases of Fourier MD for (). In phase (I), the term $(I_n \otimes Q_x^\top)F$ is computed, which amounts to the DST of the n vectors $F_i$, $i = 1, \ldots, n$, to compute $\hat{F}_i = Q_x^\top F_i$. In phase (II), m independent tridiagonal systems with coefficient matrices $B_j = \mathrm{trid}_n[-1, \lambda_j^{(m)}, -1]$ are solved, where $\lambda_j^{(m)}$ is the jth eigenvalue of T in (). The right-hand sides are the m columns of $[\hat{F}_1, \ldots, \hat{F}_n]^\top$. In terms of the vec constructs, these are the m contiguous, length-n subvectors of the mn-length vector $\Pi_{m,n}\,\mathrm{vec}[\hat{F}_1, \ldots, \hat{F}_n]$. The respective solutions $\hat{U}_j$, $j = 1, \ldots, m$, are stacked and reordered into $\bar{U} = \Pi_{m,n}^\top\,\mathrm{vec}[\hat{U}_1, \ldots, \hat{U}_m]$, so that phase (III) consists of independent application of a scaled, length-m DST to each column of $[\bar{U}_1, \ldots, \bar{U}_n] = \mathrm{unvec}_n\,\bar{U}$. Opportunities for parallelism abound, and if mn processors are available and each tridiagonal system is solved using a parallel algorithm of logarithmic complexity, then MD can be accomplished in O(log mn) parallel operations. There are several ways of organizing the computation. When there are n = m processors, phases (I) and (III) can be performed independently in m log m steps. This will leave the data in the middle phase distributed across the processors, so that the multiple tridiagonal systems will be solved with a parallel algorithm that would require significant data movement. Instead, one can transpose the output data from phase (I) and also at the end of phase (II), so that all data is readily accessible from any processor that needs it. This approach has the advantage that it builds on mature numerical software for uniprocessors. It requires, however, the use of efficient algorithms for matrix transposition. MD algorithms have been designed and implemented on vector and parallel architectures and are frequently used as the baseline in evaluating new Poisson solvers; cf. [, –, , , , , , , , ] and [] for more details. More applications of MD can be found in the survey [] by Bialecki et al.
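The three phases map directly onto array operations. The following is a minimal NumPy/SciPy sketch, assuming the model problem () with $T = \mathrm{trid}_m[-1, 4, -1]$ and Dirichlet boundary conditions; the function name poisson_md and the data layout (F stored as an m × n array whose ith column is $F_i$) are choices made here for illustration, not part of the original presentation.

    import numpy as np
    from scipy.fft import dst, idst
    from scipy.linalg import solve_banded

    def poisson_md(F):
        """Fourier MD for trid_n[-I, T, -I] U = F with T = trid_m[-1, 4, -1]."""
        m, n = F.shape
        # Phase (I): DST of the n right-hand-side columns (independent tasks)
        Fhat = dst(F, type=1, axis=0)
        # Phase (II): m independent tridiagonal solves of order n, one per
        # eigenvalue lambda_j = 4 - 2 cos(pi j / (m+1)) of T
        lam = 4.0 - 2.0 * np.cos(np.pi * np.arange(1, m + 1) / (m + 1))
        Uhat = np.empty_like(Fhat)
        for j in range(m):                      # each iteration is independent
            ab = np.zeros((3, n))
            ab[0, 1:] = -1.0                    # superdiagonal
            ab[1, :] = lam[j]                   # diagonal
            ab[2, :-1] = -1.0                   # subdiagonal
            Uhat[j, :] = solve_banded((1, 1), ab, Fhat[j, :])
        # Phase (III): inverse DST of each column recovers U
        return idst(Uhat, type=1, axis=0)

The forward/inverse DST pair absorbs the scaling of Q, so no explicit normalization is needed; for small m and n the result can be checked against a dense solve of the Kronecker-product form of A.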
Complete Fourier Transform
The idea of CFT was discussed by Hyman in [] and described in the context of the tensor product methods of Lynch et al. (cf. [, ]). For low cost, this requires that both $\tilde{T}_m$ and $\tilde{T}_n$ are diagonalizable with Fourier or trigonometric transforms, as is the case for (). Then the solution of () is
$$U = (Q_y \otimes Q_x)\,(I_n \otimes \tilde{\Lambda}_m + \tilde{\Lambda}_n \otimes I_m)^{-1}\,(Q_y^\top \otimes Q_x^\top)\,F,$$
where the matrix $I_n \otimes \tilde{\Lambda}_m + \tilde{\Lambda}_n \otimes I_m$ is diagonal. Multiplication by $Q_y^\top \otimes Q_x^\top$ amounts to performing n independent DSTs of length m and m independent DSTs of length n; similarly for $Q_y \otimes Q_x$. The middle phase is an element-by-element division by the diagonal of $I_n \otimes \tilde{\Lambda}_m + \tilde{\Lambda}_n \otimes I_m$ that contains the eigenvalues of A. The parallel computational cost is O(log m + log n) operations on O(mn) processors. Hockney and Jesshope [], Swarztrauber and Sweet [], among others, studied CFT for various parallel systems. In the latter, CFT was found to have lower computational and communication complexity than MD and BCR for systems with O(mn) processors connected in a hypercube. Specific parallel implementations were described by O'Donnell et al. in [] for the FPS- attached array processor, where CFT had lower performance than MD and FACR(), and by Cote in [] for Intel hypercubes; see also [].
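For the same model problem, CFT amounts to transforming in both directions and dividing by the eigenvalues of A. The sketch below makes the same assumptions as the MD sketch above and uses that, for the Poisson case, $A = I_n \otimes \tilde{T}_m + \tilde{T}_n \otimes I_m$ with $\tilde{T} = \mathrm{trid}[-1, 2, -1]$ (since $T = \tilde{T}_m + 2I$); the function name is again invented here.

    import numpy as np
    from scipy.fft import dst, idst

    def poisson_cft(F):
        """CFT: DSTs along both directions, element-by-element division, inverse DSTs."""
        m, n = F.shape
        lam = 2.0 - 2.0 * np.cos(np.pi * np.arange(1, m + 1) / (m + 1))  # eigenvalues of trid_m[-1,2,-1]
        mu = 2.0 - 2.0 * np.cos(np.pi * np.arange(1, n + 1) / (n + 1))   # eigenvalues of trid_n[-1,2,-1]
        Fhat = dst(dst(F, type=1, axis=0), type=1, axis=1)   # n + m independent DSTs
        Uhat = Fhat / (lam[:, None] + mu[None, :])           # middle phase: diagonal solve
        return idst(idst(Uhat, type=1, axis=1), type=1, axis=0)

Both sketches solve the same system, so they can also be cross-checked against each other.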
Block Cyclic Reduction
BCR is a solver for () that generalizes the (scalar) cyclic reduction of Hockney and Golub. It is also more general than Fourier MD because it does not require knowledge of the eigenstructure of T, nor does it use fast transforms. A detailed analysis is found in []. BCR was key to the design of FISHPAK, the influential numerical library by Swarztrauber, Sweet, and Adams. It is outlined next for $n = 2^k - 1$ blocks, though, as shown by Sweet, the method can be modified to handle any n. For steps $r = 1, \ldots, k-1$, adjacent blocks of equations are combined in groups of three to eliminate two blocks of unknowns; in the first step, for instance, every other block of unknowns is eliminated and a reduced system $A^{(1)} = \mathrm{trid}_{2^{k-1}-1}[-I,\, T^{(1)},\, -I]$, containing approximately half the blocks, remains. Setting $T^{(0)} = T$ and $T^{(r)} = (T^{(r-1)})^2 - 2I$, at the rth step the reduced system is $\mathrm{trid}_{2^{k-r}-1}[-I, T^{(r)}, -I]\, U^{(r)} = F^{(r)}$, where $F^{(r)} = \mathrm{vec}[F_{2^r \cdot 1}, \ldots, F_{2^r \cdot (2^{k-r}-1)}]$. These right-hand sides are computed by
$$F_{2^r j} = F_{2^{r-1}(2j-1)} + F_{2^{r-1}(2j+1)} + T^{(r-1)} F_{2^{r-1}\cdot 2j}. \qquad ()$$
Matrix $T^{(r)}$ can be written in terms of Chebyshev polynomials, $T^{(r)} = 2\,C_{2^r}(T/2) = \tilde{C}_{2^r}(T)$. The roots are known analytically,
$$\rho_i^{(r)} = 2\cos\Big(\frac{(2i-1)\pi}{2^{r+1}}\Big), \qquad i = 1, \ldots, 2^r,$$
therefore the product form of $T^{(r)}$ is also available.
After r = k − steps, only the system T (k) Uk− = Fk− remains. After this is solved, a back substitution phase recovers the remaining unknowns. Each reduction step requires k−r − independent matrix-vector multiplications with T (r−) . All multiplications with T (r−) are done using its product form so that the total cost is O(nm log n) operations without recourse to fast transforms. Even with unlimited parallelism, the systems composing T (r) must be solved in sequence, which creates a parallelization bottleneck. Also the algorithm as stated (sometimes called CORF []) is not numerically viable because the growing discrepancy in the magnitude of the terms in () leads to excessive roundoff. Both problems can be overcome. Stabilization is achieved by means of a clever modification due to Buneman and analyzed in detail in [], in which the unstable recurrence () is replaced with a coupled recurrence for a pair of new vectors, that can be computed stably, by exchanging tridiagonal matrix-vector multiplications with tridiagonal linear system solves. The number of sequential operations remains O(mn log n). The parallel performance bottleneck can be resolved using partial fractions, a versatile tool for parallel processing first used, ingeniously, by H.T. Kung in [] to enable the evaluation of arithmetic expressions such as xn , ∏ni= (x + ρ i ) and Chebyshev polynomials in few parallel divisions and O(log n) parallel additions. For scalars, these findings are primarily of theoretical interest since the expressions can also be computed in O(log n) multiplications. The idea becomes much more attractive for matrices (the problem of computing powers of a matrix is mentioned in [] but not pursued any further) as is the case of BCR in the CORF or the stable variants of Buneman. Specifically, the parallelization bottleneck that occurs when solving
$$T^{(r)} X_r = Y_r, \qquad \text{where } T^{(r)} = \prod_{i=1}^{2^r}\big(T - \rho_i^{(r)} I\big) \ \text{ and } \ X_r, Y_r \in \mathbb{R}^{m \times (2^{k-r}-1)},$$
(r)
(r)
−
(T (r) )− f = ∑ αi (T − ρ i I) f , i=
(r)
where the α i are the corresponding partial fraction coefficients that can be easily computed from the val˜ r (z) at z = ρ (r) . This was ues of the derivative of C i described by Sweet (see []) and also by Gallopoulos and Saad in []. Moreover, when BCR is applied for more general values of n or other boundary conditions that lead to more general matrix rational functions, partial fractions eliminate the need for operations with the numerator. Expression () can be evaluated independently for each right-hand side. Therefore, solving with coefficient matrix T (r) and k−r − right-hand sides for r = , . . . , k − can be accomplished by solving r independent tridiagonal systems for each right-hand side and then combining the partial solutions by multiplying k−r − matrices, each of size m × r with the vector of r partial fraction coefficients. The algorithm has parallel complexity O(log n log m). Even though this is somewhat inferior to parallel MD and CFT, BCR has wider applicability; cf. the survey of Bini and Meini in [] for many additional uses and parallelization. References [, , , ] discuss the application of partial fraction based parallel BCR and evaluate its performance on vector processors and multiprocessors, multicluster vector multiprocessors, and hypercubes. As shown by Calvetti et al. in [], the use of partial fractions above is numerically safe for the polynomials that occur when solving Poisson’s equation, but caution is required in the general case. The above techniques for parallelizing MD and BCR were implemented in CRAYFISHPAK, a package that contains most of the functionality of FISHPAK (results for the Cray Y-MP listed in []). Beyond BCR, partial fractions can also be used to parallelize the computation important matrix functions, e.g., the matrix exponential.
()
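The following sketch illustrates the partial-fraction evaluation of $(T^{(r)})^{-1} f$ as $2^r$ independent shifted tridiagonal solves, assuming the monic product form of $T^{(r)}$ given above and $T = \mathrm{trid}_m[-1, \tau, -1]$; the helper names are invented here, and, as noted in the text, the coefficients may be ill-conditioned away from the Poisson case.

    import numpy as np
    from scipy.linalg import solve_banded

    def shifted_tridiag_solve(tau, rho, f):
        """Solve (trid_m[-1, tau, -1] - rho*I) x = f."""
        m = len(f)
        ab = np.zeros((3, m))
        ab[0, 1:] = -1.0
        ab[1, :] = tau - rho
        ab[2, :-1] = -1.0
        return solve_banded((1, 1), ab, f)

    def bcr_pf_apply_inverse(tau, r, f):
        """Apply (T^(r))^{-1} via partial fractions; all 2^r solves are independent."""
        k = 2 ** r
        i = np.arange(1, k + 1)
        rho = 2.0 * np.cos((2 * i - 1) * np.pi / 2 ** (r + 1))   # roots of T^(r)
        # alpha_i = 1 / prod_{j != i}(rho_i - rho_j), since T^(r) is monic in T
        alpha = np.array([1.0 / np.prod(np.delete(rho[j] - rho, j)) for j in range(k)])
        x = np.zeros_like(f, dtype=float)
        for a, rh in zip(alpha, rho):            # conceptually parallel tasks
            x += a * shifted_tridiag_solve(tau, rh, f)
        return x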
FACR(l)
This method, proposed in [], is based on the fact that at any step of the reduction phase of BCR, the coefficient matrix is $A^{(r)} = \mathrm{trid}_{2^{k-r}-1}[-I, T^{(r)}, -I]$, and that because $T^{(r)} = 2\,C_{2^r}(T/2)$ has the same eigenvectors as T and eigenvalues $2\,C_{2^r}(\lambda_j^{(m)}/2)$, the reduced system can be solved using Fourier MD. FACR(l) (an acronym for Fourier Analysis Cyclic Reduction) is a hybrid algorithm consisting of the following phases: () Perform l steps of BCR to obtain $A^{(l)}$ and $F^{(l)}$. () Use MD to solve $A^{(l)} U^{(l)} = F^{(l)}$. () Use l steps of back substitution to obtain the remaining subvectors of U. Research of Hockney, Swarztrauber, and Temperton has shown that a value l ≈ log log m reduces the number of sequential operations down to O(mnl). Therefore, properly designed FACR is faster than MD and BCR. In practice, the best choice for l depends on the relative performance of the underlying kernels and other characteristics of the computer platform. The selection for vector and parallel architectures has been studied at length by Hockney (cf. [] and references therein) and Jesshope, and by Briggs and Turnbull []. A general conclusion is that l is typically very small (so that BCR can be applied in its simplest CORF form), especially for high levels of parallelism, and that it is worth determining it empirically. The parallel implementation of all steps can proceed using the techniques deployed for BCR and MD; in fact, FACR(l) can be viewed as an alternative to partial fractions for avoiding the parallelization bottleneck observed after a few steps of reduction in BCR. For example, one can monitor the number of systems that can be solved in parallel in BCR and, before the number of independent computations becomes unacceptably small, switch to MD. This approach was analyzed by Briggs and Turnbull in [] on a shared-memory multiprocessor. Corroborating the complexity estimates above, FACR is frequently found to be faster than MD and CFT; see, e.g., [] for results on an attached array processor.
Marching and Other Methods Marching methods are RES that can be used to solve Poisson’s equation as well as general separable PDEs. The idea was already mentioned in [] and by , as recounted by Hockney [], a marching algorithm by Lorenz was cautiously cited as the fastest performing RES for Poisson’s equation on uniprocessors, provided that it is “used sensibly.” The caution is due to numerical instability that can have deleterious effects on accuracy. Bank and Rose in [] proposed Generalized Marching (GM) methods that are stable with operation count O(n log nk ) when m = n, where k is determined by the sought accuracy, instead of the theoretical O(n ) for simple but unstable marching. The key idea in marching methods is that if one knows Un in (), then the remaining Un− , . . . , U could
be computed from this and the boundary values Un+ by the recurrence Uj− = Fj + Uj+ − TUj in O(mn) operations. Vector Un can also be computed in O(mn) operations in the course of block LU of a reordering of () by solving a linear system with a matrix rational function of Chebyshev polynomials like Sn . The process is not numerically viable for large n, thus in GM the domain is partitioned into k strips that are small enough so that sufficient accuracy is maintained when marching is used to compute the solution in each. This is a form of domain decomposition, whose merits for parallel processing are discussed in the next section. The interface values of U necessary to start the independent marches can be computed independently extending the process described above. Additional parallelism is available from using partial fractions. When the right-hand-side F of ( ) is sparse and one is interested in only few values of U, then the cost of computing such a solution with MD is reduced. Based on an idea of Banegas, Vassilevski and Kuznetsov designed variants of cyclic reduction that generate systems where a partial solution method is deployed to build the final solution. Parallel algorithms using these ideas were implemented by Petrova (on a workstation cluster) [] and by Rossi and Toivanen (algorithm PSCR on a Cray TE) who also showed a radix-q version of cyclic reduction that eliminates a factor of q ≥ blocks at a time. As with other RES (e.g., []), to solve the general separable system (), these methods require additional preprocessing to compute eigenvalues and eigenvectors that are not available analytically. The special form of the inverse blocks of ( ) in ( ) shows that both the matrix and its inverse are “data sparse,” a property that is used to design effective approximate algorithms to solve elliptic equations using an integral equation formulation; cf. [, ]. The explicit formula for the inverse and partial fractions can be combined to solve () for any rows of the grid in O(log N) parallel operations, as proposed in []. Marching can also be parallelized by diagonalizing T. Description of this and other methods are found in the monograph [] by Vajteršic.
Domain Decomposition Domain decomposition (DD) is a methodology for solving elliptic PDEs in which the solution is constructed
by partitioning and solving the problem on subdomains (when these do not overlap, DD is called substructuring) and synthesizing the solution by combining the partial solutions, usually iteratively, until a satisfactory solution is obtained. Whenever the equations and subdomains are suitable, RES can be used as building blocks for DD methods; the algorithm outlined by Buzbee et al. in [, Sect. ] is a case in point and confirms that RES were an early motivation behind the development of DD. In some cases, one can dispense of the iterative process altogether and compute the solution to the PDE at sequential cost O(N log N), thus extending RES to irregular domains. DD naturally lends itself for parallel processing (see Domain Decomposition), introducing another layer of parallelism in the RES described so far and thus providing greater flexibility in their mapping on parallel architectures, e.g., when processors are organized in loosely coupled clusters, for better load balancing on heterogeneous processors, etc. One such method based on Fourier MD was proposed in [] and can be viewed as a special case of the Spike banded solver method developed by Sameh and collaborators (cf. []) (Spike). A slightly different formulation of this DD solver is based on reordering the equations in () combined with block LU. This was investigated by Chan and Resasco and then at length with hypercube implementations (cf. [] and references therein) and by Chan and Fatoohi for vector multiprocessors in []. The interface values in domain decomposition can also be computed using the partial solution methodology, e.g., in combination with GM as suggested by Vassilevski and studied by Bencheva in the context of parallel processing (cf. []). DD was also used by Barberou to increase the performance of the partial solution method of Rossi and Toivanen on a computational Grid of high-performance systems [].
Extensions The above methods can be extended to solve Poisson’s equation in three dimensions. The corresponding matrix will be multilevel Toeplitz: It is block Toeplitz tridiagonal with each block being of the form (). Sameh in [] proposed a six phase parallel MD-like algorithm that combines independent FFTs in two dimensions and tridiagonal system solutions along the third dimension. Similar parallel RES for the Intel hypercube were
described in [, ] and in [] for the Cray TE. The partial solution method was also extended to threedimensional problems, e.g., [, cf. comments on solver PDCD]. The parallel RES methods above can be extended to handle other types of boundary conditions (Dirichlet, Neumann and periodic) and operators (e.g., biharmonic) and provide a blueprint for the design of parallel RES of higher order of accuracy, e.g., the FACR-like method FFT of Houstis and Papatheodorou (see [, ]).
Related Entries Algebraic Multigrid Domain Decomposition FFTW Illiac IV Preconditioners for Sparse Iterative Methods ScaLAPACK Sparse Direct Methods Spike
Bibliography . Bank RE, Rose DJ () Marching algorithms for elliptic boundary value problems. Part I: the constant coefficient case. Part II: the variable coefficient case. SIAM J Numer Anal ():– (Part I); – (Part II) . Bialecki B, Fairweather G, Karageorghis A () Matrix decomposition algorithms for elliptic boundary value problems: a survey. Numer Algorithms. http://www.springerlink.com/content/ gp/fulltext.pdf . Bickley WG, McNamee J () Matrix and other direct methods for the solution of systems of linear difference equations. Philos Trans R Soc Lond A: Math Phys Sci ():– . Bini DA, Meini B () The cyclic reduction algorithm: from Poisson equation to stochastic processes and beyond. Numer Algorithms ():– . Birkhoff G, Lynch RE () Numerical solution of elliptic problems. SIAM, Philadelphia . Börm S, Grasedyck L, Hackbusch W () Lecture notes /: hierarchical matrices. Technical report, Max Planck Institut fuer Mathematik in den Naturwissenschaften, Leipzig . Botta EFF et al () How fast the Laplace equation was solved in . Appl Numer Math ():– . Briggs WL, Turnbull T () Fast Poisson solvers for MIMD computers. Parallel Comput :– . Buzbee B, Golub G, Nielson C () On direct methods for solving Poisson’s equation. SIAM J Numer Anal ():– (see also the comments by Buzbee in Current Contents, , September )
. Buzbee BL () A fast Poisson solver amenable to parallel computation. IEEE Trans Comput C-():– . Calvetti D, Gallopoulos E, Reichel L () Incomplete partial fractions for parallel evaluation of rational matrix functions. J Comput Appl Math :– . Chan TF, Fatoohi R () Multitasking domain decomposition fast Poisson solvers on the Cray Y-MP. In: Proceedings of the fourth SIAM conference on parallel processing for scientific computing. SIAM, Philadelphia, pp – . Cote SJ () Solving partial differential equations on a MIMD hypercube: fast Poisson solvers and the alternating direction method. Technical report UIUCDCS-R--, University of Illinois at Urbana-Champaign, Urbana . Ericksen JH () Iterative and direct methods for solving Poisson’s equation and their adaptability to Illiac IV. Technical report UIUCDCS-R--, Department of Computer Science, University of Illinois at Urbana-Champaign . Ethridge F, Greengard L () A new fast-multipole accelerated Poisson solver in two dimensions. SIAM J Sci Comput (): – . Gallivan KA, Heath MT, Ng E, Ortega JM, Peyton BW, Plemmons RJ, Romine CH, Sameh AH, Voigt RG () Parallel algorithms for matrix computations. SIAM, Philadelphia . Gallopoulos E. An annotated bibliography for rapid elliptic solvers. http://scgroup.hpclab.ceid.upatras.gr/faculty/stratis/ Papers/myresbib . Gallopoulos E, Saad Y () Parallel block cyclic reduction algorithm for the fast solution of elliptic equations. Parallel Comput ():– . Gallopoulos E, Saad Y () Some fast elliptic solvers for parallel architectures and their complexities. Int J High Speed Comput ():– . Gallopoulos E, Sameh AH () Solving elliptic equations on the Cedar multiprocessor. In: Wright MH (ed) Aspects of computation on asynchronous parallel processors. North-Holland, Amsterdam, pp – . Giraud L () Parallel distributed FFT-based solvers for -D Poisson problems in meso-scale atmospheric simulations. Int J High Perform Comput Appl ():– . Hockney R () A fast direct solution of Poisson’s equation using Fourier analysis. J Assoc Comput Mach :– . Hockney R, Jesshope C () Parallel computers. Adam Hilger, Bristol . Hockney RW () Rapid elliptic solvers. In: Hunt B (ed) Numerical methods in applied fluid dynamics. Academic, London, pp – . Hockney RW () Characterizing computers and optimizing the FACR(l) Poisson solver on parallel unicomputers. IEEE Trans Comput C-():– . Houstis EN, Rice JR, Weerawarana S, Catlin AC, Papachiou P, Wang K-Y, Gaitatzes M () PELLPACK: a problem-solving environment for PDE-based applications on multicomputer platforms. ACM Trans Math Softw ():– . Hyman MA (–) Non-iterative numerical solution of boundary-value problems. Appl Sci Res B :–
. Intel Cluster Poisson Solver Library – Intel Software Network. http://software.intel.com/en-us/articles/intel-clusterpoisson-solver-library/ . Iserles A () Introduction to numerical methods for differential equations. Cambridge University Press, Cambridge . Johnsson L () Solving tridiagonal systems on ensemble architectures. SIAM J Sci Statist Comput :– . Jwo J-S, Lakshmivarahan S, Dhall SK, Lewis JM () Comparison of performance of three parallel versions of the block cyclic reduction algorithm for solving linear elliptic partial differential equations. Comput Math Appl (–):– . Knightley JR, Thompson CP () On the performance of some rapid elliptic solvers on a vector processor. SIAM J Sci Statist Comput ():– . Kung HT () New algorithms and lower bounds for the parallel evaluation of certain rational expressions and recurrences. J Assoc Comput Mach ():– . Lynch RE, Rice JR, Thomas DH () Tensor product analysis of partial differential equations. Bull Am Math Soc :– . McBryan OA () Connection machine application performance. Technical report CH-CS--, Department of Computer Science, University of Colorado, Boulder . McBryan OA, Van De Velde EF () Hypercube algorithms and implementations. SIAM J Sci Stat Comput (): s–s . Meurant G () A review on the inverse of symmetric tridiagonal and block tridiagonal matrices. SIAM J Matrix Anal Appl ():– . O’Donnell ST, Geiger P, Schultz MH () Solving the Poisson equation on the FPS-. Technical report YALE/DCS/TR, Yale University . Petrova S () Parallel implementation of fast elliptic solver. Parallel Comput ():– . Polizzi E, Sameh AH () SPIKE: A parallel environment for solving banded linear systems. Comput Fluids ():– . Resasco DC () Domain decomposition algorithms for elliptic partial differential equations. Ph.D. thesis, Yale University, Department of Computer Science . Rossi T, Toivanen J () A parallel fast direct solver for block tridiagonal systems with separable matrices of arbitrary dimension. SIAM J Sci Stat Comput ():– . Rossinelli D, Bergdorf M, Cottet G-H, Koumoutsakos P () GPU accelerated simulations of bluff body flows using vortex particle methods. J Comput Phys ():– . Sameh AH () A fast Poisson solver for multiprocessors. In: Birkhoff G, Schoenstadt A (eds) Elliptic problem solvers II. Academic, New York, pp – . Sameh AH, Chen SC, Kuck DJ () Parallel Poisson and biharmonic solvers. Computing :– . Swarztrauber PN () A direct method for the discrete solution of separable elliptic equations. SIAM J Numer Anal ():– . Swarztrauber PN, Sweet RA () Vector and parallel methods for the direct solution of Poisson’s equation. J Comput Appl Math :–
Reconfigurable Computer
. Sweet RA () A parallel and vector cyclic reduction algorithm. SIAM J Sci Statist Comput ():– . Sweet RA () Vectorization and parallelization of FISHPAK. In: Dongarra J, Kennedy K, Messina P, Sorensen DC, Voigt RG (eds) Proceedings of the fifth SIAM conference on parallel processing for scientific computing. SIAM, Philadelphia, pp – (see also http://www.cisl.ucar.edu/softlib/CRAYFISH.html) . Sweet RA, Briggs WL, Oliveira S, Porsche JL, Turnbull T () FFTs and three-dimensional Poisson solvers for hypercubes. Parallel Comput :– . Temperton C () Fast Fourier transforms and Poisson solvers on Cray-. In: Hockney RW, Jesshope CR (eds) Infotech state of the art report: supercomputers, vol . Infotech, Maidenhead, pp – . Tromeur-Dervout D, Toivanen J, Garbey M, Hess M, Resch MM, Barberou N, Rossi T () Efficient metacomputing of elliptic linear and non-linear problems. J Parallel Distrib Comput ():– . Vajteršic M () Algorithms for elliptic problems: efficient sequential and parallel solvers. Kluwer, Dordrecht . Van Loan C () Computational frameworks for the fast Fourier transform. SIAM, Philadelphia . Widlund O () A Lanczos method for a class of nonsymmetric systems of linear equations. SIAM J Numer Anal ():– . Zhang Y, Cohen J, Owens JD () Fast tridiagonal solvers on the GPU. In: Proceedings of the th ACM SIGPLAN PPoPP , Bangalore, India, pp –
Reconfigurable Computer Blue CHiP
Reconfigurable Computers Reconfigurable computers utilize reconfigurable logic hardware – either standalone, as a part of, or in combination with conventional microprocessors – for processing. Computing using reconfigurable logic hardware that can adapt to the computation can attain only some level of the performance and power efficiency gain of custom hardware but has the advantage of being retargetable for many different applications.
Bibliography . Hauck S, DeHon A () Reconfigurable computing: the theory and practice of FPGA-based computing. Morgan Kaufmann, Burlington
Reconstruction of Evolutionary Trees Phylogenetics
Reduce and Scan Marc Snir University of Illinois at Urbana-Champaign, Urbana, IL, USA
Synonyms Parallel Prefix Sums; Prefix; Prefix Reduction
Definition
Given n inputs $x_1, \ldots, x_n$ and an associative operation ⊗, a reduction algorithm computes the output $x_1 \otimes \cdots \otimes x_n$. A prefix or scan algorithm computes the n outputs $x_1,\; x_1 \otimes x_2,\; \ldots,\; x_1 \otimes \cdots \otimes x_n$.
Discussion The reduction problem is a subset of the prefix problem, but as we shall show, the two are closely related. The reduction and scan operators are available in APL []: Thus +/ computes the value and +\ computes the vector . The two primitives appears in other programming languages and libraries, such as MPI []. There are two interesting cases to consider: Random access: The inputs x , . . . , xn are stored in consecutive location in an array. Linked list: The inputs x , . . . , xn are stored in consecutive locations in a linked list; the links connect each element to its predecessor. We analyze these two cases using a circuit or PRAM model, and ignoring communication costs.
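For concreteness, here is what the two primitives compute, written as ordinary sequential Python (a sketch only; the parallel formulations are the subject of the rest of this entry):

    from functools import reduce
    from itertools import accumulate
    from operator import add

    x = [3, 1, 4, 1, 5]
    print(reduce(add, x))            # reduction: 3+1+4+1+5 = 14
    print(list(accumulate(x, add)))  # scan (prefix): [3, 4, 8, 9, 14]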
Random-Access Reduce and Scan The sequential circuit for a reduction computation is shown in Fig. ; the corresponding sequential code is shown in Algorithm . This circuit has W = n − operations and depth D = n − . It also computes a scan. One can take advantage of associativity in order to change the order of evaluation, so as to reduce the parallel computation time. An optimal parallel reduction
Reduce and Scan. Fig. Sequential prefix circuit
Reduce and Scan. Fig. Binary tree reduction
algorithm is achieved by using a balanced binary tree for the computation, as shown in Figs. and , for n = . The corresponding algorithm is shown in Algorithm .
Algorithm Sequential reduction and scan
for (i = 2; i ≤ n; i++)
    x_i = x_{i-1} ⊗ x_i;
Algorithm Binary tree reduction
for (i = 1; i ≤ k; i++)
    forall (j = 2^i; j ≤ 2^k; j += 2^i)
        x_j = x_{j - 2^{i-1}} ⊗ x_j;
The algorithm has W = n − 1 operations and depth D = ⌈log n⌉, which is optimal; it requires ⌊n/2⌋ processes. If the number p of processes is smaller, then the inputs are divided into segments of ⌈n/p⌉ consecutive inputs; a sequential reduction is applied to each group, followed by a parallel reduction of p values; we obtain a running time of ⌈n/p⌉ − 1 + log p. In particular, the reduction of n values can be computed in O(log n) time, using n/ log n processes.
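A small Python sketch of this blocked scheme follows (a sequential simulation of the parallel schedule: the per-segment reductions and the pairwise combining steps in each round are conceptually independent; the function name is invented here):

    def tree_reduce(x, op, p):
        """Reduce x with p 'processes': sequential reduction per segment,
        then a balanced binary tree over the p partial results."""
        n = len(x)
        assert 1 <= p <= n
        bounds = [i * n // p for i in range(p + 1)]   # p non-empty segments
        partial = []
        for i in range(p):                            # independent, one per process
            acc = x[bounds[i]]
            for v in x[bounds[i] + 1:bounds[i + 1]]:
                acc = op(acc, v)
            partial.append(acc)
        stride = 1
        while stride < p:                             # about log2(p) combining rounds
            for j in range(0, p - stride, 2 * stride):    # independent pairs per round
                partial[j] = op(partial[j], partial[j + stride])
            stride *= 2
        return partial[0]

    # e.g., tree_reduce(list(range(1, 101)), lambda a, b: a + b, p=8) returns 5050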
R
Reduce and Scan. Fig. Binary tree reduction
Reduce and Scan. Fig. Tree prefix computation
Reduce and Scan. Fig. Parallel prefix circuit
algorithm performs n−⌊lg n⌋− ≤ W ≤ n−⌊lg n⌋− operations and requires ⌊lg n⌋ − ≤ D ≤ ⌊lg n⌋ steps, with n/ processors. If we have p ≤ n/ processes, then one can compute the parallel prefix using the algorithm shown in
Algorithm : One computes prefix products sequentially for each sublist, combine the sublist products to compute a global parallel prefix, and then update sequentially elements in each sublist by the product of the previous sublists. The running time is
Reduce and Scan
Algorithm Parallel prefix algorithm
for (i = 1; i ≤ k; i++)
    forall (j = 2^i; j ≤ 2^k; j += 2^i)
        x_j = x_{j - 2^{i-1}} ⊗ x_j;
for (i = k − 1; i ≥ 1; i−−)
    forall (j = 2^i + 2^{i-1}; j ≤ 2^k; j += 2^i)
        x_j = x_{j - 2^{i-1}} ⊗ x_j;
Algorithm Parallel prefix with limited number of processes
// We assume that p divides n
forall (i = 0; i < p; i++)
    for (j = i × n/p + 2; j ≤ (i + 1) × n/p; j++)
        x_j = x_{j-1} ⊗ x_j;
parallel prefix(x_{n/p}, x_{2n/p}, . . . , x_n);
forall (i = 1; i < p; i++)
    for (j = i × n/p + 1; j ≤ (i + 1) × n/p − 1; j++)
        x_j = x_{i × n/p} ⊗ x_j;
D = n/p + lg p + O() and the number of operations is W = ( − /p)n + O(). It follows that the parallel prefix computation can be done in time O(log n) using n/ log n processes. The last circuit can be improved: Snir shows in [] that any parallel prefix circuit must satisfy W+D ≥ n− and that this bound can be achieved for any log n − ≤ D ≤ n − . Constructions for D in the range log n ≤ D ≤ log n − are given by Ladner and Fisher in []; they achieve D = ⌈log n⌉ + k and W = ( + {−k} )n − o(n). Thus, it is possible to have parallel prefix circuits of optimal depth and linear size. The upper and lower bounds hold for circuits that only use the operation ⊗. Bilardi and Preparata study prefix computations in a more general Boolean network model []. In this model, prefix can be computed in time T by a network of size S = Θ((n/T) lg(n/T)), in general, or size S = Θ(n/T) if the operation ⊗ satisfies some additional properties, for any T in a range [c lg n, n].
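The blocked prefix scheme described above can be sketched in Python as follows (a sequential simulation of the parallel schedule; the middle phase is written serially for brevity but would itself use the tree algorithm; the function name is invented here):

    def block_scan(x, op, p):
        """Inclusive prefix of x using p 'processes' in three phases."""
        n = len(x)
        out = list(x)
        bounds = [i * n // p for i in range(p + 1)]
        # Phase 1: each process scans its own segment (independent)
        for i in range(p):
            for j in range(bounds[i] + 1, bounds[i + 1]):
                out[j] = op(out[j - 1], out[j])
        # Phase 2: prefix of the p segment totals (a small problem)
        totals = [out[bounds[i + 1] - 1] for i in range(p)]
        for i in range(1, p):
            totals[i] = op(totals[i - 1], totals[i])
        # Phase 3: each segment is updated by the total of the preceding ones (independent)
        for i in range(1, p):
            for j in range(bounds[i], bounds[i + 1]):
                out[j] = op(totals[i - 1], out[j])
        return out

    # block_scan([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b, p=2)
    # returns [1, 3, 6, 10, 15, 21, 28, 36]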
Applications Parallel reduction is a frequent operation in parallel algorithms: It is required for parallel dot products, parallel matrix multiplications, counting, etc.
The parallel prefix algorithm has many applications, too. Blelloch shows in [] how to use parallel prefix for sorting (quicksort and radix sort) and merging, graph algorithms (minimum spanning trees, connected components, maximum flow, maximal independent set, biconnected components), computational geometry (convex hull, K-D tree building, closest pair in plane, line of sight), and matrix operations. These algorithms use parallel prefix with the operations + and max. Other usages are listed in []: addition in arbitrary precision arithmetic, polynomial evaluation, recurrence solution, solution of tridiagonal linear systems, lexical analysis, tree operations, etc. Greenberg et al. show how the use of parallel prefix can speed up discrete event simulations []. A few other examples are shown below. Enumeration and packing: One has n processes and wants to enumerate all processes that satisfy some condition. Process i sets xi to if it satisfies the condition, otherwise. A scan computation will enumerate the processes with a , giving each a distinct ordinal number. The same approach can be used to pack all nonzero entries of a vector into contiguous locations. Enumeration is used in radix-sort and quicksort to bin keys. Segmented scan: A segmented parallel prefix operation computes parallel prefix within segments of the full list, rather than on the entire list. Thus +\ [ | | ] yields [ | | ]. Let < S, ⊗ > be a semi-ring (S is a set closed under the associative binary operator ⊗). Define a new semiˆ > whose elements are S ∪ {∣s : s ∈ S} and the ring < S,⊗ ˆ is defined as follows: operation ⊗ ˆ = a ⊗ b if a, b ∈ S a⊗b ˆ = ∣c, where c = a ⊗ b; ∣a⊗b ˆ = ∣a⊗∣b ˆ = ∣b. a⊗∣b The new operation is associative; a parallel prefix computation with the new operation computes segmented parallel prefixes for the old operation. Carry-lookahead adder: Consider the addition of two binary numbers an , . . . , a and bn , . . . , b . Let ci be the carry bit at position i. Then the addition result is
R
R
Reduce and Scan
sn+ , . . . , , s , where sn+ = cn , and si = ai ⊕ bi ⊕ ci− , for i = , . . . , n. Define pi = ai ∨ bi (carry propagate) and si = ai ∧ bi (carry set). The carry bits for the adder are defined by the recursion c = ; ci = pi ci− ∨ si . This can be written, in matrix form, as ( (
c )=( )
ci pi s i ci− )=( ) )(
with addition being ∨ (or) and multiplication being ∧ (and). Thus, if M = (
p s ) , Mi = ( i i ) , for i = , . . . n,
we can compute the carry bits by computing the prefixes Mi ⋯M , for i = , . . . , n; any parallel prefix circuit can be used to build a carry-lookahead adder. Linear recurrences: Consider an order one linear recurrence, of the form xi = ai xi− + bi . The can be written as a i bi xi xi− )=( ), ( )( with the usual interpretation of addition and multiplication. This has the same formal structure as the previous recursion. The same approach extends to higher-order linear recurrences: A recurrence of the form xi = ∑kj= aij xi−j + bi can be solved in parallel by computing the parallel prefix of the product of (k + )×(k + ) matrices. Function composition: Given n functions f , . . . , fn consider the problem of computing f , f f , . . . , fn ⋯ f . Since function composition is associative, then a parallel prefix algorithm can be used. This is a generalization of the usual formulation for prefix: Define f : x → a , fi : x → x ⊗ ai , for i = , . . . , n then fi ⋯ f = a ⊗ ⋯ ⊗ ai .
In this formulation, one has a set F of functions that is closed under composition. This general formulation leads to a practical algorithm if functions in F have a representation such that function composition is easily computed in that representation. The linear recurrence example is such a case: The affine functions x → ax + b is represented by the matrix (
a b )
and function composition corresponds, in this representation, to matrix product. Another example is provided by deterministic finite state transducers: The sequence of outputs generated by the automaton can be computed in logarithmic time from the sequence of inputs. Finally, consider a sequence of accesses to a memory location that are either loads, stores, or fetch&adds. Then the values returned by the loads and fetch&adds can be computed in logarithmic time []. (Fetch&add(address, increment) is defined as the atomic execution of the following code: {temp = &address; &address += increment; return temp}; the operation increments a shared variable and returns its old value.)
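As an illustration of the function-composition view, the order-one linear recurrence can be solved by a prefix of 2 × 2 matrices. The sketch below forms the inclusive prefix products with a simple recursive-doubling schedule; the function name and the NumPy representation are choices made here for illustration.

    import numpy as np

    def linear_recurrence_scan(a, b, x0=0.0):
        """Solve x_i = a_i * x_{i-1} + b_i via prefix products of the
        matrices M_i = [[a_i, b_i], [0, 1]] (matrix product is associative,
        so any parallel prefix schedule applies)."""
        n = len(a)
        M = np.zeros((n, 2, 2))
        M[:, 0, 0] = a
        M[:, 0, 1] = b
        M[:, 1, 1] = 1.0
        P = M.copy()                      # P_i will become M_i @ ... @ M_1
        stride = 1
        while stride < n:                 # recursive doubling: one parallel step per round
            newP = P.copy()
            newP[stride:] = np.matmul(P[stride:], P[:-stride])
            P = newP
            stride *= 2
        v0 = np.array([x0, 1.0])
        return (P @ v0)[:, 0]             # x_i is the first component of P_i v0

    # linear_recurrence_scan([2.0, 3.0], [1.0, 1.0], x0=0.0) returns [1.0, 4.0]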
Linked List Reduce and Scan The previous algorithms are not easily applied in the case where elements are in a linked list, as the rank of the elements is not known (indeed, computing the ranks is a parallel prefix computation!): It is not obvious which elements should be involved in the computation at each stage. One approach is the recursive-doubling algorithm shown in Algorithm , The algorithm has ⌈lg n⌉ stages and uses n processes, one per node in the linked list; at stage i, each process computes the product of the (up
Algorithm Recursive doubling algorithm
for (i = 1; i ≤ ⌈log n⌉; i++)
    forall v in linked list
        if (v.next ≠ NULL) {
            v.x = (v.next → x) ⊗ v.x;
            v.next = v.next → next;
        }
to) i values at consecutive nodes in the list ending with the process’ node, and links that node to the node that is i positions down the list. This algorithm performs W = n⌈log n⌉ − ⌈log n⌉ + operations and requires
D = ⌈log n⌉ steps. The ⌈log n⌉ depth is optimal, but this algorithm is not work-efficient, as it uses Θ(n log n) operations, compared to Θ(n) for the sequential algorithm.
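A direct transcription of the recursive-doubling algorithm into Python, simulating the synchronous PRAM rounds (the representation of the list by index arrays is a choice made here):

    def recursive_doubling_scan(values, next_idx, op):
        """values[v] holds x for node v; next_idx[v] is the predecessor index
        (None at the head).  After ceil(log2 n) rounds, each node holds the
        product of all values from the head of the list up to itself."""
        n = len(values)
        x, nxt = list(values), list(next_idx)
        rounds = max(1, (n - 1).bit_length())        # ceil(log2 n)
        for _ in range(rounds):
            new_x, new_nxt = list(x), list(nxt)      # read old state, then write
            for v in range(n):                       # conceptually parallel
                if nxt[v] is not None:
                    new_x[v] = op(x[nxt[v]], x[v])
                    new_nxt[v] = nxt[nxt[v]]
            x, nxt = new_x, new_nxt
        return x

With next_idx = [None, 0, 1, ..., n-2], i.e., a list laid out in array order, this reduces to the ordinary prefix computation.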
Algorithm Generic linked list reduce algorithm
Reduce(list L) {
    if ((L.head == &v) && (v.next == NULL))
        return v.x;                  // list has only one element
    else {
        pick an independent subset V of active nodes from L;
        forall v ∈ V {
            // delete predecessor of v from the list
            ptr = v.next;
            v.x = (ptr → x) ⊗ v.x;
            v.next = ptr → next;
        }
    }
    barrier;
    Reduce(L);
}
Algorithm Generic linked list scan algorithm
Scan(list L) {
    if ((L.head == &v) && (v.next == NULL))
        return v.x;                  // list has only one element
    else {
        pick an independent subset V of active nodes from L;
        forall v ∈ V {
            // delete predecessor of v from the list
            ptr = v.next;
            v.x = (ptr → x) ⊗ v.x;
            v.next = ptr → next;
        }
    }
    barrier;
    Scan(L);
    barrier;
    forall v ∈ V {
        // reinsert predecessor of v in the list
        if (ptr → next ≠ NULL)
            ptr.x = (ptr → next → x) ⊗ ptr.x;
        v.next = ptr;
    }
    barrier;
}
A set of nodes of the linked list is independent if each node in the set has a predecessor outside the set. A linked list reduction algorithm that performs no superfluous products has the generic form shown in Algorithm . At each step of this parallel algorithm one replaces nonoverlapping pairs of adjacent nodes with one node that contains the product of these two nodes, until we are left with a single node. Given such reduction algorithm, we can design a parallel prefix algorithm, by “climbing back” the reduction tree, as shown in Algorithm . The reduction is illustrated, for a possible schedule, on the top of Fig. , while the unwinding of the recursion tree that computes the remaining prefix values is shown on the bottom of Fig. .
The problem, then, is to find a large independent subset and schedule the processes to handle the nodes in the subset at each round. One possible approach is to use randomization for symmetry breaking: At each round, active processes choose randomly, with probability /, whether to participate in this round; if a process decided to participate and its successor also decided to participate, it drops from the current round. The remaining processes form an independent set. On average, / of the processes that are active at the beginning of a round participate in the round. With overwhelming probability, the algorithm will require a logarithmic number of rounds. Deterministic “coin-tossing” algorithms have been proposed by various authors. Cole and Vishkin are
Reduce and Scan. Fig. Linked list parallel prefix
Reduce and Scan. Fig. Tree depth computation
showing in [] how to solve deterministically the linked list parallel prefix problem in time O(log n log∗ n) using n/ log n log∗ n processes.
Connection Machine MapReduce NESL Ultracomputer, NYU
Applications Linked list prefix computations can be used to solve a variety of graph problems, such as computing spanning forests, connected components, biconnected components, and minimum spanning trees []. A simple example is shown below. Tree depth computation: Given a tree, one wishes to compute for each node the distance of that node from the root. One builds an Euler tour of the tree, as shown in Fig. . Each tree node is replaced by three nodes in the tour, with value (moving down), (moving right), and − (moving up). One can easily see that a prefix sum on the tour will compute the depth of each node. A variant of this algorithm (with − replaced by ) will number the nodes in preorder [].
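A sketch of the tree-depth computation just described appears below. The Euler tour is built sequentially here for simplicity; in a parallel setting it would come from list ranking, and the prefix sum is the part computed with a parallel scan. The children-dictionary representation is assumed for illustration.

    def tree_depths(children, root=0):
        """Depths via an Euler tour and a prefix sum: the tour emits +1 when
        the walk enters a node and -1 when it leaves; the running sum just
        after entering a node equals depth + 1."""
        tour, node_at = [], []
        def visit(v):
            tour.append(1); node_at.append(v)
            for c in children.get(v, []):
                visit(c)
            tour.append(-1); node_at.append(None)
        visit(root)
        prefix, running = [], 0
        for t in tour:                      # this prefix sum is the parallelizable step
            running += t
            prefix.append(running)
        return {v: prefix[i] - 1 for i, v in enumerate(node_at) if v is not None}

    # tree_depths({0: [1, 2], 1: [3]}) returns {0: 0, 1: 1, 3: 2, 2: 1}

As noted above, the variant with −1 replaced by 0 numbers the nodes in preorder instead.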
Related Entries Array Languages Collective Communication
Bibliography . Bilardi G, Preparata FP () Size-time complexity of Boolean networks for prefix computations. J ACM :– . Blelloch GE () Scans as primitive parallel operations. IEEE Tram Comp :– . Chatterjee B, Blelloch GE, Zagha M () Scan primitives for vector computers. In: Proceedings of supercomputing conference. IEEE Computer Society Press, New York . Cole R, Vishkin U () Deterministic coin tossing with applications to optimal parallel list ranking. Inform Control :– . Greenberg AG, Lubachevsky BD, Mitrani I () Superfast parallel discrete event simulations. ACM Trans Model Comput Simul :– . Iverson KE () A Programming language. Wiley, New York . Kruskal CP, Rudolph L, Snir M () Efficient synchronization on multiprocessors with shared memory. ACM TOPLAS :– . Kruskal CP, Rudolph L, Snir M () Efficient parallel algorithms for graph problems. Algorithmica :– . Ladner RE, Fischer MJ () Parallel prefix computation. J Assoc Comput Mach :–
R
R
Relaxed Memory Consistency Models
. Message Passing Interface Forum () MPI: a message-passing interface standard. Int J Supercomput Appl High Perform Comput :– . Snir M () Depth-size trade-offs for parallel prefix computation. J Algorithms :– . Tarjan RE, Vishkin U () An efficient parallel biconnectivity algorithm. Siam J Comput :–
Relaxed Memory Consistency Models Memory Models
Reliable Networks Networks, Fault-Tolerant
Rendezvous Synchronization
Reordering Michael Heath University of Illinois at Urbana-Champaign, Urbana, IL, USA
Definition
Reordering for parallelism is a technique for enabling or enhancing concurrent processing of a list of items by determining an ordering that removes or reduces serial dependencies among the items.
Discussion Ordered lists are ubiquitous and seem entirely natural in many aspects of daily life. But the specific order in which items are listed may be irrelevant, or even
counterproductive, to the purpose of the list. In making a shopping list, for example, we usually list items in the order we happen to think of them, but rarely will such an order coincide with the most efficient path for locating the items in a store. Even for a structure that may have no natural linear ordering, such as a graph, we nevertheless typically number its nodes in some order, which may or may not facilitate efficient processing. In general, finding an optimal ordering is often a difficult combinatorial problem, as famously exemplified by the traveling salesperson problem. Similarly, in programming when we write a for loop, we specify an ordering that may or may not have any significance for the successive tasks to be done inside the loop, yet the semantics of most programming languages usually require the compiler to schedule the tasks in the specified sequential order, even if the tasks are totally independent and could validly be done in any order. This constraint often goes unnoticed in serial computation, but it can seriously inhibit parallel processing of sequential loops. If each successive task depends on the result of the immediately preceding one, then the given sequential order must be honored in order for the program to execute correctly. But if the tasks have no such interdependence, for example, in a loop initializing all the elements of an array to zero, then the element assignments could be done in any order, or simultaneously in parallel. Unfortunately, conventional programming languages offer no mechanism for indicating when tasks are independent, which has motivated the development of compiler techniques for discovering and exploiting loop-based parallelism, as well as compiler directives, such as those provided by OpenMP, that enable programmers to specify parallel loops. It has also motivated the development of higher-level languages in which structures such as arrays are first-class objects that can be referenced as a whole without having to enumerate array elements sequentially. Although smart compilers, compiler directives, and higher-level objects can help identify and preserve potential parallelism, it may still be difficult to determine how best to exploit the parallelism, especially if the ordering affects the total amount of work required. In such cases, the necessary reordering may be considerably deeper than it is reasonable to expect any compiler or automated system to identify or exploit.
These issues are well illustrated by computational linear algebra. When we write a vector in coordinate notation, we implicitly specify an ordering of the coordinate basis for the vector space that is generally arbitrary. Similarly, when we write a system of n linear equations in n unknowns, Ax = b, where A is an n × n matrix, b is a given n-vector, and x is an n-vector to be determined, the ordering of the rows of A is arbitrary, as each equation in the system must be satisfied individually, irrespective of the order in which they happen to be listed. Consequently, the rows of the matrix A can be permuted arbitrarily without affecting the solution x. The column ordering of A is also arbitrary in the sense that it corresponds to the arbitrary ordering chosen for the entries of x, so that permuting the columns of A affects the solution only in that it correspondingly permutes the entries of x. To summarize, the solution to the system Ax = b is given by x = Qy, where y is the solution to the permuted system PAQy = Pb, and P and Q are any permutation matrices, so in this sense the original and permuted systems are mathematically equivalent, although the computational resources required to solve them may be quite different. This freedom in choosing the ordering can be exploited to enhance numerical stability (e.g., partial or complete pivoting) or computational efficiency, or a combination of these.

The order in which the entries of a matrix are accessed can have a dramatic effect on the performance of matrix algorithms, in part because of the memory hierarchy (registers, multiple levels of cache, main memory, etc.) typical of modern microprocessors. A dense matrix is usually stored in a two-dimensional array, whose layout in memory (typically row-wise or column-wise) strongly affects the cache behavior of programs that access the matrix entries. Many matrix algorithms, such as matrix multiplication and matrix factorization, systematically access the entries of the matrix using nested loops whose indices can validly be arranged in any order, so the order can be chosen to make optimal use of data residing in the most rapidly accessible level of the memory hierarchy. In addition, the order in which matrix rows or columns (or both, in the case of a two-dimensional decomposition) are assigned to processors may strongly affect parallel efficiency. Assignment by contiguous blocks, for example, tends to reduce communication but inhibits concurrency and can yield a poor load balance, whereas assigning rows or columns to processors in a cyclic manner (like dealing cards) has the opposite effects. Optimizing the tradeoff between these extremes may suggest a hybrid block-cyclic assignment, with smaller blocks of tunable size assigned cyclically.

These issues are further complicated for matrices that are sparse, meaning that the entries are mostly zeros, so that it is advantageous to employ data structures that store only the nonzero entries (along with information indicating their locations within the matrix) and algorithms that operate on only nonzero entries. For a sparse matrix, the ordering of the rows and columns strongly affects the total amount of work and storage required to compute a factorization of the matrix. Sparsity can also introduce additional parallelism into the factorization process, beyond the substantial parallelism already available in the dense case.

To illustrate the effects of ordering for sparse systems, we focus on systems for which the sparse matrix A is symmetric and positive definite, so that we will not have the additional complication of ordering to preserve numerical stability as well. In this case, for a direct solution we compute the Cholesky factorization A = LL^T, where L is lower triangular, so that the solution x to the system Ax = b can be computed by solving the lower triangular system Ly = b for y by forward substitution and then the upper triangular system L^T x = y for x by back substitution. To see how the ordering of the sparse matrix affects the work, storage, and parallelism, we need to take a closer look at the Cholesky algorithm, shown in Fig. , in which only the lower triangle of the input matrix A is accessed, and is overwritten by the Cholesky factor L. The outer loop processes successive columns of the matrix. The first inner loop, labeled cdiv(k), scales column k by dividing it by the square root of its diagonal entry, while the second inner loop, labeled cmod(j, k), modifies each subsequent column j by subtracting from it a scalar multiple, ajk, of column k.

The first important observation to make about the Cholesky algorithm is that a given cmod(j, k) operation may introduce new nonzero entries, called fill, into column j in locations that were previously zero. Such fill entries require additional storage and work to process, so we would like to limit the amount of fill, which is dramatically affected by the ordering of the matrix, as
for k = 1 to n
    akk = √akk
    for i = k + 1 to n            {cdiv(k)}
        aik = aik / akk
    end
    for j = k + 1 to n            {cmod(j,k)}
        for i = j to n
            aij = aij − aik ajk
        end
    end
end

Reordering. Fig. Cholesky factorization algorithm
Reordering. Fig. “Arrow” matrix example illustrating dependence of fill on ordering. Nonzero entries indicated by ×
illustrated for an extreme case by the "arrow" matrix whose nonzero pattern is shown for a small example in Fig. with two different orderings. The ordering on the left results in a dense Cholesky factor (complete fill), whereas the ordering on the right results in no new nonzero entries (no fill). In general, we would like to choose an ordering for the matrix A that minimizes fill in the Cholesky factor L, but this problem is known to be NP-complete, so instead we use heuristic ordering strategies such as minimum degree or nested dissection, which effectively limit fill at reasonable cost.

A second important observation is that the cdiv operation that completes the computation of a given column of L cannot take place until the modifications (cmods) by all previous columns have been done. Thus, the successive cdiv operations appear to require serial execution, and this is indeed the case for a dense matrix. But a given cmod(j, k) operation has no effect, and hence need not be done, if ajk = 0, so a sparse matrix may enable additional parallelism that is not available in the dense case. Once again, however, this potential benefit depends critically on the ordering chosen for the rows and columns of the sparse matrix.
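To make the role of the test ajk = 0 concrete, the sketch below (an illustrative rendering of the cdiv/cmod structure, using a plain dense array for the lower triangle rather than a genuine sparse data structure) skips every cmod(j, k) whose multiplier is zero; in a sparse factorization it is exactly these skipped updates, and the disjoint column dependencies they imply, that create the extra parallelism:

#include <math.h>

/* Column-oriented Cholesky on the lower triangle of an n-by-n matrix stored
 * row-major in a (a[i*n + j]).  After column k is completed by cdiv(k), each
 * later column j is updated by cmod(j,k) only if a_jk is nonzero at that
 * point; columns whose pending multipliers are all zero can be processed
 * independently. */
void cholesky_columns(double *a, int n)
{
    for (int k = 0; k < n; k++) {
        /* cdiv(k): scale column k by the square root of its diagonal entry */
        a[k*n + k] = sqrt(a[k*n + k]);
        for (int i = k + 1; i < n; i++)
            a[i*n + k] /= a[k*n + k];

        /* cmod(j,k): modify subsequent columns, skipping zero multipliers */
        for (int j = k + 1; j < n; j++) {
            if (a[j*n + k] == 0.0)
                continue;               /* column j does not depend on column k */
            for (int i = j; i < n; i++)
                a[i*n + j] -= a[i*n + k] * a[j*n + k];
        }
    }
}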
Even if the cdiv operations must be performed serially, however, multiple cmod operations can still be performed in parallel, and this is the principal source of parallelism for a dense matrix. The interdependence among the columns of the Cholesky factor L is precisely characterized by the elimination tree, which has one node per column of A, with the parent of node j being the row index of the first subdiagonal nonzero in column j of L. In particular, each column of L depends only on its descendants in the elimination tree. For a dense matrix, this potential source of additional parallelism is of no help, as the elimination tree is simply a linear chain. But for a sparse matrix, more advantageous tree structures are possible, depending on the ordering chosen.

A small example is shown in Fig. , in which a two-dimensional mesh G(A), whose edges correspond to nonzero entries of matrix A, is ordered row-wise, resulting in a banded matrix A and Cholesky factor L. With this ordering, the elimination tree T(A) is a linear chain, implying sequential execution of the corresponding cdiv operations, as each column of L depends on all preceding columns. An alternative ordering is shown in Fig. , in which the same two-dimensional mesh is ordered by nested dissection, where the graph is recursively split into pieces with the nodes in the separators at each level numbered last. In this example, numbering the "middle" nodes last, that is, , , and , leaves two disjoint pieces that are in turn split by their "middle" nodes, numbered and , leaving four isolated nodes, numbered , , , and . The absence of edges between the separated pieces of the graph induces blocks of zeros in the matrix A that are preserved in the Cholesky factor L, which accordingly suffers fewer fill entries than with the banded ordering, and hence requires less storage and work to compute. Equally important, the elimination tree now has a hierarchical structure, with multiple leaf nodes corresponding to cdiv operations that can be executed simultaneously in parallel at each level of the tree up to the final separator.

The lessons from this small example apply much more broadly: with an appropriate ordering, sparsity can enable additional parallelism that is unavailable for dense matrices. A divide-and-conquer approach, exemplified by nested dissection, is an excellent way to identify and exploit such parallelism, and it typically also reduces the overall
Reordering. Fig. Two-dimensional mesh G(A) ordered row-wise, with corresponding matrix A, Cholesky factor L, and elimination tree T(A). Nonzero entries indicated by × and fill entries by +
Reordering. Fig. Two-dimensional mesh G(A) ordered by nested dissection, with corresponding matrix A, Cholesky factor L, and elimination tree T(A). Nonzero entries indicated by × and fill entries by +
work and storage required for sparse factorization by reducing fill.

In addition to direct methods based on matrix factorization, reordering also plays a major role in parallel implementation of iterative methods for solving sparse linear systems. For example, reordering by nested dissection can be used to reduce communication in parallel sparse matrix-vector multiplication, which is the primary computational kernel in Krylov subspace iterative methods such as conjugate gradients. Another example is in parallel implementation of the Gauss-Seidel or Successive Overrelaxation (SOR) iterative methods. These methods repeatedly sweep through the successive rows of the matrix, updating each corresponding component of the solution vector x by solving for it based on the most recent values of all the other components. This dependence of each component update on the immediately preceding one seems to require strictly serial processing, as illustrated for a small example in Fig. , where a two-dimensional mesh is numbered in a natural row-wise ordering, with the corresponding matrix shown, and indeed only one row can be processed at a time.

By reordering the mesh and corresponding matrix using a red-black ordering, in which the nodes of the mesh alternate colors in a checkerboard pattern as shown in Fig. , with the red nodes numbered before the black ones, solution components having the same color no longer depend directly on each other, as can be seen both in the graph and by the form of the reordered matrix. Thus, all of the components corresponding to red nodes can be updated simultaneously in parallel, and then all of the black nodes can similarly be done in parallel, so the algorithm proceeds in alternating sweeps between the two colors with ample parallelism. For an arbitrary sparse matrix, the same idea still works, but more colors may be required to color its graph, and hence the resulting parallel implementation will have as many successive parallel phases as the number of colors. One cautionary note, however, is that the convergence
Reordering. Fig. Two-dimensional mesh G(A) ordered row-wise, with corresponding matrix A. Nonzero entries indicated by ×
Reordering. Fig. Two-dimensional mesh G(A) with red-black ordering and corresponding reordered matrix A. Nonzero entries indicated by ×
rate of the underlying iterative method may depend on the particular ordering, so the gain in performance per iteration may be offset somewhat if an increased number of iterations is required to meet the same convergence tolerance.

In this brief article we have barely scratched the surface of the many ways in which reordering can be used to enable or enhance parallelism. We focused mainly on computational linear algebra, but additional examples abound in many other areas, such as the use of multicolor reordering to enhance parallelism in solving ordinary differential equations using waveform relaxation methods or in solving partial differential equations using domain decomposition methods. Yet another example is in the parallel implementation of the Fast Fourier Transform (FFT), where reordering can be used to optimize the tradeoff between latency and bandwidth for a given parallel architecture.
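Returning to the red-black ordering discussed above, the following sketch (an illustrative example assuming a standard 5-point stencil for the Laplace operator on an (n+2) × (n+2) grid array with fixed boundary values; it is not code from the original article) shows how one Gauss-Seidel sweep decomposes into two fully parallel phases, one per color:

/* One red-black Gauss-Seidel sweep for -Laplacian(u) = f discretized with the
 * 5-point stencil on an n-by-n interior grid.  u and f are (n+2) x (n+2)
 * row-major arrays; the outer ring of u holds boundary values and h2 is the
 * squared mesh spacing.  Points of one color depend only on points of the
 * other color, so each phase below is fully parallel. */
void red_black_sweep(double *u, const double *f, int n, double h2)
{
    for (int color = 0; color < 2; color++) {
        #pragma omp parallel for
        for (int i = 1; i <= n; i++) {
            for (int j = 1 + (i + color) % 2; j <= n; j += 2) {
                u[i*(n+2) + j] = 0.25 * (u[(i-1)*(n+2) + j] + u[(i+1)*(n+2) + j]
                                       + u[i*(n+2) + j-1]  + u[i*(n+2) + j+1]
                                       + h2 * f[i*(n+2) + j]);
            }
        }
    }
}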
Related Entries
Dense Linear System Solvers
Graph Partitioning
Layout, Array
Loop Nest Parallelization
Linear Algebra, Numerical
Loops, Parallel
Scheduling Algorithms
Sparse Direct Methods
Task Graph Scheduling
Bibliographic Notes and Further Reading
For a broad overview of parallelism in computational linear algebra, see [–]. For a general discussion of sparse factorization methods, see [, ]. For specific discussion of reordering for parallel sparse elimination, see [–].

Bibliography
. Davis TA () Direct methods for sparse linear systems. SIAM, Philadelphia
. Demmel JW, Heath MT, van der Vorst HA () Parallel numerical linear algebra. Acta Numerica :–
. Dongarra JJ, Duff IS, Sorenson DC, van der Vorst HA () Numerical linear algebra for high-performance computers. SIAM, Philadelphia
. Gallivan KA, Heath MT, Ng E, Ortega JM, Peyton BW, Plemmons RJ, Romine CH, Sameh AH, Voigt RG () Parallel algorithms for matrix computations. SIAM, Philadelphia
. George A, Liu JW-H () Computer solution of large sparse positive definite systems. Prentice Hall, Englewood Cliffs
. Liu JW-H () Computational models and task scheduling for parallel sparse Cholesky factorization. Parallel Comput :–
. Liu JW-H () Reordering sparse matrices for parallel elimination. Parallel Comput :–
. Liu JW-H () The role of elimination trees in sparse factorization. SIAM J Matrix Anal Appl :–

Resource Affinity Scheduling
See Affinity Scheduling

Resource Management for Parallel Computers
See Operating System Strategies

Rewriting Logic
See Maude

Ring
See Networks, Direct

Roadrunner Project, Los Alamos
Don Grice, Andrew B. White, Jr.
IBM Corporation, Poughkeepsie, NY, USA
Los Alamos National Laboratory, Los Alamos, NM, USA

Synonyms
Petaflop barrier

Definition
The Los Alamos Roadrunner Project was a joint Los Alamos and IBM project that developed the first supercomputer to break the Petaflop barrier on a Linpack run for the Top list.
Discussion

Introduction
In , Los Alamos and IBM began to explore the potential of a new class of power-efficient high-performance computing systems based on a radical view of the future of supercomputing. The resulting LANL Roadrunner project had three primary roles: (1) an advanced architecture which presaged the primary components of next-generation supercomputing systems; (2) a uniquely capable software and hardware environment for investigation and solution of science and engineering problems of importance to the US Department of Energy (DOE) (Section . describes some of the key LANL applications that were studied); and (3) achievement of the first sustained petaflops.

A significant point in the thinking surrounding petascale, and now exascale, computing is that the challenge is not just "solving the same problem faster" but rather "solving a better problem" by harnessing the leap in computing performance; that is, the key to better predictive simulations in national nuclear security, in climate and energy simulations, and in basic science is increasing the fidelity of our computational science by using better models, by increasing model resolution, and by embedding uncertainty quantification within the solution itself.

The computing capability of the supercomputer at the top of the Top list continues to grow at a staggering rate. In , Roadrunner broke the sustained petaflops boundary on the list for the first time. This represented a performance improvement of over a factor of , in just years. As a result of the availability of such large machines, and the accompanying price and performance improvements for smaller machines, HPC solutions are now moving into markets that were previously considered beyond the reach of computing technology. More comprehensive drug simulations, car crash simulations, and climate models are advances one might have easily predicted, but the impacts of HPC on movie making, games, and even remote surgery are things that might not have been so easy to see on the horizon. Today even financial institutions are using real-time modeling to guide investment choices. The growth of HPC opportunities in markets that did not involve traditional HPC users is what is driving the business excitement around the improvements in computing capability.

However, these opportunities do not come without their own technical challenges. The total amount of power and cooling required to drive some large systems, the programming complexity as hundreds of thousands of threads need to be harnessed into a single solution, and overall system reliability, availability, and serviceability (RAS) are the biggest hurdles. It was important to address the fundamental technology challenges facing further advances in application performance.

Designed and developed for the US Department of Energy (DOE) and Los Alamos National Laboratory (LANL), the DOE/LANL Roadrunner project was a joint venture between LANL and IBM to address computing challenges with a range of technical solutions that are opening up new opportunities for business, scientific, and social advances. The Roadrunner project is named after the state bird of New Mexico, where the Roadrunner system has been installed at LANL.

To address the power and performance issues, a fundamental design decision was made to use a hybrid programming approach that combines standard x86-architecture processors and Cell BE-based accelerators. The Cell BE accelerators provide a much denser and more power-efficient peak performance than the standard processors, but introduce a different dimension in programming complexity because of the hybrid architecture and the necessity to explicitly manage memory on the Cell processor. When the project first started, this approach was considered radical, but now these attributes are widely accepted as necessary components of next-generation systems leading to exascale. The Roadrunner program was targeted at getting a head start on the software issues involved in programming these architectures.
Roadrunner Hardware Architecture
As shown in Fig. and described in the Sect. "Management for Scalability," the Roadrunner system
Roadrunner Project, Los Alamos. Fig. Overall Roadrunner structure. Roadrunner is a hybrid petascale system delivered in 2008, with 6,528 dual-core Opterons (47 TF) and 12,240 Cell eDP chips (1.2 PF). The system comprises 17 connected-unit (CU) clusters, 3,264 nodes in total; each CU holds 180 compute nodes with Cells and 12 I/O nodes behind a 288-port IB 4x DDR switch, with 12 links per CU to each of eight second-stage 288-port IB 4x DDR switches
is a cluster of over , triblade nodes (described in detail below). The nodes are organized in smaller clusters called connected units (CUs). Each CU is composed of computational triblade nodes, I/O nodes, and a service node. Each CU has its own -port Voltaire InfiniBand (IB) switch []. These CUs are then connected to each other with a standard second-level switch topology in which each of the first-level switches has connections to each of the second-level switches. There are 17 of these CUs in the overall system. The triblade node architecture and implementation described in the Sect. "Triblade Architecture" were chosen in order to use existing hardware building blocks and to capitalize on the RAS characteristics of the IBM BladeCenter [] infrastructure. Ease of construction and service were critical requirements because of the short schedule and the large number of parts in the final system. Since the compute nodes make up the bulk of the system hardware, they were the focus of the hardware and software design efforts.
Management for Scalability
The management configuration of the Roadrunner system was designed for scalability from the beginning. The primary concern was the ability to use and control individual CUs without impacting the operation of the other CUs. That was the fundamental reason for having a first-level IB switch dedicated to each CU. One can reboot the nodes in a CU, and even replace nodes in a CU, without impacting the operation of the other CUs or applications running in those CUs. It is possible to isolate the first-level switches from the second-level switches with software commands so that other CUs do not route data to a CU that is under debug or repair. The I/O nodes are dedicated to use by nodes within the CU, which allows file system traffic to stay within a CU and not utilize any second-level switch bandwidth. This also allows file system connectivity and performance measurements to be done in isolation, including rebooting the CU, at the CU level. Actually, there are two sets of six I/O nodes for each CU, each of which serves half of the compute nodes, thus providing further performance isolation and greater performance consistency for applications.
Triblade Architecture
As described in the Sect. "Heterogeneous Cores and Hybrid Architecture," a fundamental innovation in the
Roadrunner design was the use of heterogeneous core types to maximize the energy efficiency of the system. Having cores designed for specific purposes allows for energy efficiency by eliminating circuits in the cores that will not be used. The three core types in the Roadrunner system are all programmable, not hardwired algorithmic cores such as a cryptography engine, but the circuits are optimized for the type of code that will run on each core type.

There are two computational chips in the system: the PowerXCell i .GHz chip, which is packaged on the QS [] blade, and the AMD Opteron dual-core .GHz chip, which is packaged on the LS [] blade. The PowerXCell i has eight specialized SPEs (Synergistic Processing Elements) that run the bulk of the application floating-point operations but do not need the circuits nor the state to run the OS. It also has a PowerPC core that is neither as space- nor as energy-efficient as the SPEs, but it does run the OS on the QS and controls the program flow for the SPEs. The AMD chip runs the MPI stack and controls the communications between the triblades in the system. It also controls the overall application flow and does application work for sections of the code that are not floating-point intensive. The LS blade has its own address space, separate from the QS, which introduces another level of hybridization into the node structure.

The triblades, as shown in Fig. , are the compute nodes that contain these computational chips and are the heart of the system. They consist of three computational blades, one LS and two QSs, and a fourth expansion blade that handles the connectivity among the three computational blades as well as the connection to the CU IB switch. The LS is a dual-socket, dual-core AMD-based blade, so there are four AMD cores per triblade. The QS is a dual-socket PowerXCell i (Cell BE chip) blade, so there are four Cell BE chips in the triblade to pair with the four AMD cores. The software model most commonly used is one MPI (message passing interface) task per AMD core-Cell BE chip pair. There is GB of main storage on the LS and a total of GB of main storage on the pair of QSs. Having matched storage of GB on each AMD core and each Cell BE chip simplifies designing and coding the applications that share data structures on each processing unit in the pair, since there can be duplicate copies kept on each side and only updates to the structures need to be communicated
Roadrunner Project, Los Alamos. Fig. Triblade architecture
Roadrunner Project, Los Alamos. Fig. System configuration
between the process running on the AMD core and the process running on the partner Cell BE chip. Since the triblade structure presents dual address spaces (one on the LS and one on the QS) to each of the MPI tasks, as shown in Fig. , having bandwidth- and latency-efficient methods for transferring data between these address spaces is critical to the scaling success of the applications.

A significant architectural feature of the triblade is the set of four dedicated peripheral component interconnect express (PCIe)-x links that connect the four AMD cores on the LS to their partner Cell BE chips on the QSs. All four links can transmit data in both directions simultaneously for moving Cell BE data structures up to the Opteron processors and Opteron structures down to the Cell BE chips. A common application goal, not surprisingly, was to maximize the percentage of the work done by the Cell BE synergistic processing elements (SPEs) []. When the SPEs need to communicate boundary information with their logical neighbor SPEs in the problem space that are located in physically different triblades, they need to send the data up to their partner AMD core, which in turn will send it, usually using MPI, to the AMD core that is the partner of the actual target SPE. The receiving AMD will then send the data down to the actual target SPEs. Having the four independent PCIe links allows all four Cell BE chips to send the data to and from their partner AMD cores at the same time.

Figure shows a logical picture of two triblades connected together with their IB link, to show the different address spaces that exist and need to be managed by the application data flow. What is shown is one of the four pairs of address spaces on each of the triblades. Each AMD core and each Cell BE chip has its own logical address space. Since each AMD core is paired with its own Cell BE chip, there are four pairs of logical address spaces on a triblade. Address space is the AMD address space for Core in the AMD chip on the left of triblade . That core is partnered with Cell BE- on QS-, which owns address space . There are corresponding address spaces and on triblade . There is a set of library functions that can be called from the application on the AMD or the Cell BE to transfer data between address spaces and , and and . OpenMPI is used to transfer data between the AMD address spaces and using typical MPI and IB methods.

The difference between the standard MPI model and the triblade MPI model is that in the triblade model, an MPI task controls both a standard Opteron core and the heterogeneous Cell BE chip. This is similar in some ways to having multiple threads running under a single MPI task, except that the address spaces need to be explicitly controlled and the cores are heterogeneous. The ability to control the data flow both between the hybrid address spaces and within the Cell BE address space allows the application to optimize the utilization of the memory bandwidths in the system. As chip densities increase, but circuit and pin frequencies do not, utilizing pin bandwidths (both memory bandwidth and communication bandwidth) efficiently will become increasingly important.

The Cell BE chip local stores, which act like software-controlled cache structures, allow for some very efficient computational models. As an example, the DGEMM (double-precision general matrix multiply) routine that ran in the hybrid Linpack application [] was over % efficient, and was the heart of the computational efficiency that led to the sustained petaflops achievement. The same software cache model applies to the separate address spaces that exist in the AMD and Cell BE chips within the MPI task.
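The relay pattern just described, with an SPE result moving up to its partner Opteron core, across InfiniBand via MPI, and back down to the target Cell BE chip, can be sketched roughly as follows. The pull_from_accel and push_to_accel helpers are hypothetical placeholders standing in for the actual Roadrunner host-accelerator transfer library (not reproduced here); only the MPI call is standard.

#include <mpi.h>
#include <stddef.h>

/* Hypothetical host<->accelerator transfer helpers (names invented purely
 * for illustration): they move a buffer between the AMD (host) address space
 * and the partner Cell BE address space over the dedicated PCIe link. */
void pull_from_accel(void *host_buf, size_t bytes);      /* Cell BE -> AMD core */
void push_to_accel(const void *host_buf, size_t bytes);  /* AMD core -> Cell BE */

/* Exchange one boundary slab with the MPI rank `neighbor`.  Each MPI task
 * drives one AMD core-Cell BE pair, so the slab is staged through the host
 * address space on both sides. */
void exchange_boundary(double *slab, int count, int neighbor)
{
    pull_from_accel(slab, (size_t)count * sizeof(double));   /* step 1: up from the Cell */

    MPI_Sendrecv_replace(slab, count, MPI_DOUBLE,             /* step 2: across InfiniBand */
                         neighbor, 0, neighbor, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    push_to_accel(slab, (size_t)count * sizeof(double));      /* step 3: down to the target Cell */
}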
Technology Drivers
The ongoing reduction in the rate of processor frequency improvement has led to three fundamental issues in achieving dramatically better application performance: (1) software complexity, caused by the growing number of cores needed for ever greater performance; (2) energy requirements, driven by the increased number of cores and chips needed for performance improvements; and (3) RAS requirements, driven by the ever increasing number of circuits and components in the larger systems. All three of these issues were important features of the Roadrunner project's overall design and implementation.
Software Complexity
For the last several decades, lithographic technology improvements guided by Dennard's CMOS scaling theory [] in accordance with Moore's law [] have not only provided circuit density improvements, but the shrinkage in the vertical dimension also provided an accompanying frequency increase. This meant that for applications to scale in performance, one did not need to modify the software to utilize additional cores or threads, since system performance improved simply because of the frequency scaling. The elimination of future frequency scaling puts the burden of application performance improvement on the application software and programming environment, and requires a doubling of the number of cores applied to a problem in order to achieve a doubling of the peak performance available to the application. Since sustained application performance for strong scaling, and to a lesser extent weak scaling, is likely to be less than linear in the number of cores available, more than double the number of cores will usually be required to achieve double the sustained application performance. At a nominal GHz, effectively a million functional units are required for petascale sustained performance, and the number of cores will continue to grow as greater sustained performance is desired.

Finding ways to cope with this fundamental issue, in a way that is tenable for both algorithm development and code maintenance, while still achieving reasonable fractions of nominal peak performance for the sustained performance on real (as opposed to test or benchmark) applications, was a major goal of the project. Since application scaling going to petascale and beyond was such a major focus, considerable time and energy were spent on making sure that real applications would scale on the machine. While high-level language constructs and advances in and of themselves were not a major focus of the project, rethinking the applications themselves was. The LANL application and performance modeling teams, as well as the IBM application and performance modeling teams, analyzed how the construction of the base algorithm flow and the interaction with the projected hardware would work.

The IBM Linpack team modified Linpack to operate in the hybrid environment, and modeling was done to optimize the structure of the code around the anticipated performance characteristics of the system. The workhorse of the sustained petaflops performance was the DGEMM algorithm on the Cell BE chip, since it provided over % of the computation required by the overall solution of the Linpack problem. The AMD core provided the overall control flow for the application, performed the OpenMPI communication, and did some of the algorithmic steps that could be overlapped with the Cell BE DGEMM calculations. The final structure of the solution was both hybrid (between the x86 and Cell BE units) and heterogeneous on the Cell BE itself [].

Linpack was only one of the codes that were studied in great detail, and had the machine only been useful for a Linpack run, and not for critical science and DOE codes, it would not have been built. A full year was spent analyzing and testing pieces of several applications prior to making the decision to go forward with the actual machine manufacture and construction. Los Alamos National Laboratory ported several applications to the Roadrunner architecture. These applications represent a wide variety of workload types. Among the applications ported are []:
● VPIC – full relativistic, charge-conserving, D explicit particle-in-cell code.
● SPaSM – Scalable Parallel Short-range Molecular Dynamics code, originally developed for the Thinking Machines Connection Machine model CM-.
● Milagro – Parallel, multidimensional, object-oriented code for thermal x-ray transport via implicit Monte Carlo methods on a variety of meshes.
● SweepD – Simplified -group D Cartesian discrete ordinates kernel representative of a neutron transport code.
The different applications employed multiple methods to exploit the capabilities of Roadrunner. Milagro used a host-centric accelerator offload model in which the majority of the application was unchanged (except for the control code, which was modified due to the new hybrid structure of the triblade) and selected computationally intensive portions were modified to move their most performance-critical sections to the accelerator (Cell BE). VPIC employed an accelerator-centric approach in which the majority of the program logic was migrated to the accelerator, and the host systems and the InfiniBand interconnect served primarily to provide accelerator-to-accelerator message communications. Early performance results on the applications showed speed-ups over an unaccelerated cluster in the range of – times, depending on how much work could be moved to the Cell. Between KLOC (thousand lines of code) and KLOC needed to be modified to achieve these speed-ups. Some of the applications had obvious compute kernels that could be transferred to the Cell BE, and those applications required fewer changes to the code than the cases where the compute parts were more distributed or where the control code itself had to be restructured, as was the case with Milagro. But, even in the case of Milagro, it was a small percentage of the code that was changed.

Understanding how to construct applications and lay out data structures to achieve reasonable sustained scaled performance will be important to future machine designs, as well as provide input and examples for the translation of high-level language constructs into operational code. One example of a hardware feature in the base architecture of the QS is the ability for software to control the Local Store memory traffic [, ]. Since memory bandwidth will be an important commodity to optimize around in future processor designs, understanding how best to do data pipelining from an application point of view is important base data to have. Hardware-controlled memory pre-fetching (cache) is a good feature for processors to have in general, but optimal efficiency will require explicit application tuning of bandwidth utilization. Automating the software control based on algorithms, hints, or language constructs is something that can be added later to take some of the burden off of the software developers, but for now, understanding how best to use features like these is more important.
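The software-controlled local store naturally supports a double-buffered pipeline in which the next block of data streams in while the current block is being processed. The sketch below uses invented dma_get/dma_wait names rather than the real Cell SDK primitives, purely to show the shape of the pipelining technique; it is an illustration of the general idea, not Roadrunner code.

/* Hypothetical asynchronous DMA interface (names invented for illustration):
 * dma_get starts a tagged transfer from main memory into the local store,
 * and dma_wait blocks until the transfer with that tag has completed. */
void dma_get(void *local, const double *remote, unsigned bytes, int tag);
void dma_wait(int tag);

void process_block(double *block, unsigned count);   /* compute kernel (placeholder) */

#define BLOCK 4096   /* doubles per block, sized to fit in the local store */

void stream_blocks(const double *remote, unsigned nblocks)
{
    static double buf[2][BLOCK];

    dma_get(buf[0], remote, BLOCK * sizeof(double), 0);      /* prefetch the first block */
    for (unsigned b = 0; b < nblocks; b++) {
        int cur = (int)(b & 1), nxt = cur ^ 1;
        if (b + 1 < nblocks)                                  /* start the next transfer early */
            dma_get(buf[nxt], remote + (b + 1) * BLOCK,
                    BLOCK * sizeof(double), nxt);
        dma_wait(cur);                                        /* current block is now resident */
        process_block(buf[cur], BLOCK);                       /* compute while the next block streams in */
    }
}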
Energy
As was stated earlier, energy consumption and cooling of IT facilities are an increasingly critical problem. The business-as-usual technology projections in indicate that by large-scale systems will be more expensive to operate than to acquire []. Due to the increase in power requirements as ultra-scale computing systems grow in computing capability, it is often the case that new facilities need to be built to house the anticipated supercomputing machines. Another innovation in the Roadrunner project was minimizing the overall power required by using heterogeneous core types that optimize the power utilization, not by lowering the thread performance (frequency) but by eliminating unnecessary circuits in the SPEs themselves. As was discussed previously, this presents a different set of issues to the programmer. The Roadrunner system, using heterogeneous core types, uses fewer overall threads for the same performance than a homogeneous core machine using the same amount of power, but the core types need separately compiled code and more explicit thought about where each piece of the algorithm should run.

Heterogeneous Cores and Hybrid Architecture
Using heterogeneous core types on the PowerXCell i was a way of dealing with computational density and power efficiency. Using a hybrid node architecture that included a combination of the QSs for computational efficiency and the LS for the general-purpose computations and communications control was a way of overcoming the limitations of the PowerPC core on the Cell BE chip. One can easily imagine future supercomputers based on SoC designs that do not require the additional level of heterogeneity utilized in Roadrunner. The PowerXCell i has two core types: a general-purpose embedded PowerPC core (the power processing element, PPE), which handles all of the operating system and hybrid communication tasks, and eight specialized compute cores (the synergistic processing elements, SPEs). The SPEs take up significantly less chip area than the PowerPC core since they only need the facilities and instruction set to handle the intense computational tasks. This adds to the software complexity of the application algorithms because tasks must be assigned appropriately to the cores, and additional memory space management is required, but the results are well worth it []. Software tools such as OpenMP and OpenCL are emerging that will help hide the heterogeneity from the application user, but the complexity will still exist in the system itself.

The Roadrunner solution was not only the first to break the sustained petaflops Linpack barrier, and therefore first on the Top list in June , but it was also third on the June Green list. The first and second places on that Green list went to systems based only on QS blades, so the energy efficiency of the heterogeneous core approach is obvious. It is a trend that will continue into the future as the industry works to figure out the best utilization of the increasing number of circuits available on chips while also improving the computational efficiency of those circuits at the application level.
Reliability, Availability, and Serviceability
As machines grow in size and complexity, and applications scale to use increasing amounts of hardware for long run times, the reliability and availability characteristics become increasingly important. For an application to make forward progress, it needs to either checkpoint its progress often enough or have a mechanism for migrating processes running on failing parts of the system onto other working hardware. The Roadrunner system chose to use the checkpoint/restart approach, since it seemed the most generally practical for a machine of its size and for the types of applications that it would be running. High availability (HA) through failover system design, at the node or even cluster level, was not a cost-effective approach for the technologies that are employed, and checkpointing could be tolerated. Of course, there are certainly parts of the system, like the memory DIMMs (dual in-line memory modules), that have error-correcting facilities as well as memory-sparing capabilities, and these increase the inherent overall node availability. But making the nodes themselves redundant or failover-capable outside of the checkpoint/restart process was not an objective.

Three technology developments whose impact on service was critical to the overall success, and in particular to the ability to build the entire system in a short time, are IBM BladeCenter technology, the Roadrunner triblade mechanical design, and the integrated optical cable that was used for the IB subsystem [].
The IBM BladeCenter was designed from the outset to maximize serviceability and ease of upgradability. The modular nature of the BladeCenter chassis, with the ability to simply slide new blades into an existing and pretested infrastructure that has system management and power management built in, and with I/O connections that can stay in place as the blades are slid in and out, provides a very robust base for building large-scale systems. This was proven in practice when the petaflop barrier was broken only four days after the triblades for the final CU were delivered to the lab from manufacturing. They were plugged in, verified to be good, integrated with the other CUs, and then the entire system was turned over to the applications group in less than h.
Conclusion
Roadrunner has achieved each and every one of its original goals, and more. First and foremost, fully capable simulations from fluid mechanics to radiation transport, from molecular to cosmological scale, are running routinely and efficiently on this system. The development of these new capabilities provides a roadmap for developing exascale applications over the next decade. While there will always be challenges to getting to the next orders of magnitude in computing performance, the Roadrunner project collaboration between LANL and IBM was the first to recognize that a new paradigm was necessary for high-performance computing and to address the energy efficiency and computational performance of a new generation of these systems. There was concentrated effort on the software algorithm structures needed to implement heterogeneous and hybrid computing models at this scale. It has shown that Open Source components can be used to advantage when the technical challenges attract interested and motivated participants. The use of a modular building-block approach to hardware design, including the BladeCenter management and control infrastructure, provided a very stable base for doing the software experimentation. With some of the breakthroughs that this program produced, the HPC community is now well positioned to solve, or help solve, the scaling issues that will produce major breakthroughs in business, science, energy, and social arenas.
The HPC community needs to be thinking “Ultra Scale” computing in order to harness the emerging technological advances and head into the exascale arena with the momentum that IBM and LANL built crossing the petaflops boundary.
Bibliography
. Los Alamos National Laboratory, LANL Roadrunner. http://www.lanl.gov/roadrunner/
. Voltaire Inc. Infiniband Products. http://www.voltaire.com/Products/Infiniband
. xCAT: Extreme Cloud Administration Tool. http://xcat.sourceforge.net/
. IBM Corporation, IBM BladeCenter LS. http://www-.ibm.com/systems/bladecenter/hardware/servers/ls/index.html
. Vogt J-S, Land R, Boettiger H, Krnjajic Z, Baier H () IBM BladeCenter QS: design, performance and utilization in hybrid computing systems. IBM J Res Dev () Paper :–
. Gschwind M () Integrated execution: a programming model for accelerators. IBM J Res Dev () Paper :–
. Kistler M, Gunnels J, Brokenshire D, Benton B () Programming the Linpack benchmark for Roadrunner. IBM J Res Dev () Paper :–
. IBM Corporation, IBM BladeCenter. http://www-.ibm.com/systems/bladecenter/
. EMCORE Corporation, Infiniband Optics Cable. http://www.emcore.com/fiber_optics/emcoreconnects
. Moore G () Cramming more components onto integrated circuits. Electronics ():–
. Los Alamos National Laboratory, LANL Applications. http://lanl.gov/roadrunner/rrseminars.shtml
. Humphreys J, Scaramella J () The impact of power and cooling on data center infrastructure. IDC Document #, May. http://www--ibm.com/systems/resources/systems_z_pdf_IDC_ImpactofPowerandCooling.pdf
. Green List. http://www.green.org/
. Dennard R, Gaensslen F, Rideout V, Bassous E, LeBlanc A () Design of ion-implanted MOSFETs with very small physical dimensions. IEEE J Solid State Circuits SC-():–
. IBM Corporation, IBM Deep Computing. http://www-.ibm.com/systems/deepcomputing/
Router Architecture
See Switch Architecture
Router-Based Networks
See Networks, Direct; Networks, Multistage
Routing (Including Deadlock Avoidance)
Pedro López
Universidad Politécnica de Valencia, Valencia, Spain
Synonyms
Forwarding
Definition
The routing method used in an interconnection network defines the path followed by packets or messages sent between two nodes: the source or origin node, which generates the packet, and the destination node, which consumes the packet. This method is often referred to as the routing algorithm.
Discussion

Introduction
Processing nodes in a parallel computer communicate by means of an interconnection network. The interconnection network is composed of switches interconnected by point-to-point links. The topology is the representation of how these switches are connected. Direct or indirect regular topologies are often used. In direct topologies, every network switch has an associated processing node. In indirect topologies, only some switches have processing nodes attached to them. The k-ary n-cube (which includes tori and meshes) and the k-ary n-tree (an implementation of fat-trees) are the most frequently used direct and indirect topologies, respectively. In the absence of a direct connection among all network nodes, a routing algorithm is required to select the path packets must follow to connect every pair of nodes.
Taxonomy of Routing Algorithms
Routing algorithms can be classified according to several criteria. The first one considers the place where decisions are made. In source routing, the full path is computed at the source node, before injecting the packet into the network. The path is stored as a list of switch output ports in the packet header, and intermediate switches are configured according to this information. On the contrary, in distributed routing, each switch computes the next link that will be used by the packet while the packet travels across the network. By repeating this process at each switch, the packet reaches its destination. The packet header only contains the destination node identifier. The main advantage of source routing is that switches are simpler: they only have to select the output port for a packet according to the information stored in its header. However, since the packet header itself must be transmitted through the network, it consumes network bandwidth. On the other hand, distributed routing has been used in most hardware routers for efficiency reasons, since packet headers are more compact. It also allows more flexibility, as it may dynamically change the path followed by packets according to network conditions or in case of faults.

Routing algorithms can also be classified as deterministic or adaptive. In deterministic routing, the path followed by packets exclusively depends on their source and destination nodes. In other words, for a given source-destination pair, they always choose the same path. On the contrary, in adaptive routing, the path followed by packets also depends on network status, such as node or link occupancy or other network load information. Adaptive routing increases routing flexibility, which allows for a better traffic balance on the network, thus improving network performance. Deterministic routing algorithms are simpler to implement than adaptive ones. On the other hand, adaptive routing introduces the problem of out-of-order delivery of packets, as two consecutive packets sent from the same source to the same destination may follow different paths. An easy way of increasing the number of routing options is virtual channel multiplexing []. In this case, each physical link is decomposed into several virtual channels, each one with its own buffer, which can be separately reserved. The physical link is shared among all the virtual channels using time-division multiplexing. Adaptive routing algorithms may provide all the feasible routing options to forward the packet towards its destination (fully adaptive routing) or only a subset (partially adaptive routing). If adaptive routing is used, the selection function chooses one of the feasible routing options according to some criteria. For instance, in the case of distributed adaptive routing, the selection function may select the link with the lowest buffer occupation or with the lowest number of busy virtual channels.

A routing algorithm is minimal if the paths it provides are included in the set of shortest or minimal paths from source to destination. In other words, with minimal routing, every link crossed by a packet reduces its distance to the destination. On the contrary, a non-minimal routing algorithm allows packets to use paths longer than the shortest ones. Although these routing algorithms are useful to circumvent congested or faulty areas, they use more resources than strictly needed, which may negatively impact performance. Additionally, paths having an unbounded number of allowed non-minimal hops might result in packets never reaching their destination, a situation that is referred to as livelock.

Routing algorithms can be implemented in several ways. Some routers use routing tables with a number of entries equal to the number of destinations. With source routing, these tables are associated with each source node, and each entry contains the whole path towards the corresponding destination. With distributed routing, these tables (known as forwarding tables) are associated with each network switch, and contain the next link a packet destined to the corresponding entry must follow. With a single entry per destination, only deterministic routing is allowed. The main advantage of table-based routing is that any topology and any routing algorithm can be used in the network. However, routing based on tables suffers from a lack of scalability. The size of the table grows linearly with network size (O(N) storage requirements), and, most important, the time required to access the table also depends on network size. An alternative to tables is to place specialized hardware at each node that implements a logic circuit and computes the output port to be used as a function of the current and destination nodes and the status of the output ports. The implementation is very efficient in terms of both area and speed, but the algorithm is specific to the topology and to the routing strategy used on that topology. This is the approach used for the fixed regular topologies and routing algorithms used in large parallel computers. There are hybrid implementations, such as the Flexible Interval Routing (FIR) [] approach, which defines the routing algorithm by programming a set of registers associated with each output port. This strategy is able to implement the most commonly used deterministic and adaptive routing algorithms in the most widely used regular topologies, with O(log(N)) storage requirements.
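The table-based and algorithmic approaches just contrasted can be sketched side by side in generic C (an illustration of the general idea, not any particular router's implementation): the forwarding table is a single array lookup indexed by the destination, whereas the algorithmic router, here dimension-order routing in a 2D mesh, computes the output port from the current and destination coordinates with a fixed amount of logic and no per-destination storage.

/* Table-based distributed routing: one entry per destination, O(N) storage
 * per switch, but any topology and routing algorithm can be encoded. */
int route_table(const int *forwarding_table, int destination)
{
    return forwarding_table[destination];
}

/* Algorithmic distributed routing: dimension-order (XY) routing in a 2D mesh
 * needs only the current and destination coordinates. */
enum { PORT_XPLUS, PORT_XMINUS, PORT_YPLUS, PORT_YMINUS, PORT_LOCAL };

int route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x != cur_x)                       /* exhaust dimension X first */
        return dst_x > cur_x ? PORT_XPLUS : PORT_XMINUS;
    if (dst_y != cur_y)                       /* then dimension Y */
        return dst_y > cur_y ? PORT_YPLUS : PORT_YMINUS;
    return PORT_LOCAL;                        /* arrived: deliver to the local node */
}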
Deadlock Handling
A deadlock in an interconnection network occurs when some packets cannot advance toward their destination because all of them are waiting on one another to release resources (buffers or links). In other words, the involved packets request and hold resources in a cyclic fashion. There are three strategies for deadlock handling: deadlock prevention, deadlock avoidance, and deadlock recovery. Deadlock prevention incurs a significant overhead and is only used in circuit switching. It consists of reserving all the required resources before starting transmission. In deadlock avoidance, resources are requested as packets advance, but routing is restricted in such a way that there are no cyclic dependencies between channels. Another approach consists of allowing the existence of cyclic dependencies between channels while providing some escape paths to avoid deadlock, therefore increasing routing flexibility. Deadlock recovery strategies allow the use of unrestricted fully adaptive routing, potentially outperforming deadlock avoidance techniques. However, these strategies require a deadlock detection mechanism [] and a recovery mechanism [, ] that is able to resolve deadlocks. If a deadlock is detected, the recovery mechanism is triggered, resolving the deadlock. Progressive recovery techniques deallocate buffer resources from other "normal" packets and reassign them to deadlocked packets for quick delivery. Disha [] and software-based recovery [] are examples of this approach. Regressive techniques deallocate resources from deadlocked packets by killing them and later re-injecting them at the original source router (i.e., abort-and-retry).
In deadlock avoidance, the routing algorithm restricts the paths allowed by packets to only those that keep the global network state deadlock-free. Dally [] proposed the necessary and sufficient condition for a deterministic routing algorithm to be deadlock-free. Based on this condition, he defined a channel dependency graph (CDG) and established a total order among channels. Routing is restricted to visit channels in order, to eliminate cycles in the CDG. The CDG associated with a routing algorithm on a given topology is a directed graph, where the vertices are the network links and the edges join those links with direct dependencies between them, i.e., those adjacent links that could be consecutively used by the routing algorithm. The routing algorithm is deadlock-free if and only if the CDG is acyclic. To design deadlock-free routing algorithms based on this idea while keeping network connectivity, it may be required that physical channels are split into several virtual channels, and the ordering is performed among the set of virtual channels. Although this idea was initially proposed for deterministic routing, it has also been applied to adaptive routing. For instance, in the turn model [], a packet produces a turn when it changes direction, usually changing from one dimension to another in a mesh. Turns are combined into cycles. Prohibiting just enough turns to break all the cycles prevents deadlock.

Although imposing an acyclic CDG avoids deadlock, it results in poor network performance, as routing flexibility is strongly constrained. Duato [] demonstrates that a routing algorithm can have cycles in the CDG while remaining deadlock-free. This allows the design of deadlock-free routing algorithms without significant constraints imposed on routing flexibility. The key idea that supports deadlock freedom despite the existence of cyclic dependencies is the concept of the escape path. Packets can be routed without restrictions, provided that there exists a subset of network links, the escape paths, without cyclic dependencies, which allows every packet to escape from a potential cycle. An important characteristic of the escape paths is that it must be possible to deliver packets from any point in the network to their corresponding destination using the escape paths alone. However, this does not imply that a packet that used an escape path at some router must continue using escape paths until delivered. Indeed, packets are free to leave the escape paths at any point in the network and continue using full routing flexibility until delivered.

These ideas can be stated more formally as follows. A routing subfunction R1 associated with the routing function R supplies a subset of the links provided by R (the set of escape channels), therefore restricting its routing flexibility, while still being able to deliver packets from any valid location in the network to any destination. The concept of an indirect dependency between two escape channels is defined as follows. A packet may reserve a set of adjacent channels ci, ci+1, ..., ck−1, ck, where only ci and ck are provided by R1, but ci+1, ..., ck−1 are not. The extended CDG for R1 has as vertices the escape channels, and as edges both the direct and indirect dependences between escape channels. The theory states that a routing function R is deadlock-free if there exists a routing subfunction R1 without cycles in its extended CDG. A simple way to design adaptive routing algorithms based on the theory is to depart from a well-known deterministic deadlock-free routing function, adding virtual channels in a regular way. The new additional channels can be used for fully adaptive routing. As mentioned above, when a packet uses an escape channel at a given node, it can freely use any of the available channels supplied by the routing function at the next node. Duato also developed necessary and sufficient conditions for deadlock-free adaptive routing under different switching techniques, and for different sets of input data for the routing function. All of those theoretical extensions are based on the principle described above.
R Routing in k-ary n-cubes A k-ary n-cube is a n-dimensional network, with k nodes in each dimension, connected in a ring-fashion. Every node is connected to n neighbors. The number of nodes is N = kn . This topology is often referred to as a n-dimensional torus, and it has been used in some of the largest parallel computers, such as the Cray TE or the IBM BlueGene (both of them use a D-torus). The ndimensional mesh is a particular case where the nodes of each dimension are connected as a linear array (i.e., there are not wraparound connections). The symmetry and regularity of the k-ary n-cube simplify network implementation and packet routing as the movement of a packet along a given network dimension does not
modify the number of remaining hops in any other dimension toward its destination. In a mesh, dimension-order routing (DOR) avoids deadlocks (and livelocks) by allowing only the use of minimal paths that cross the network dimensions in some total order. That is, the routing algorithm does not allow the use of links of a given dimension until no other links are needed by the packet in all of the preceding dimensions to reach its destination. In a 2D mesh, this is easy to achieve by crossing dimensions in XY order. This simple routing algorithm is not deadlock-free in a torus. The wraparound links create a cycle among the nodes of each dimension. This is easily solved by multiplexing each physical channel into two virtual channels (H, for high, and L, for low), and restricting routing in such a way that H channels can only be used if the coordinate of the destination node in the corresponding dimension is higher than the current one, and vice versa. Consider a 4-node unidirectional ring (a 4-ary 1-cube) with nodes ni, i = 0, 1, 2, 3, and two channels connecting each pair of adjacent nodes. Let cHi and cLi, i = 0, 1, 2, 3, be the output channels of node ni. Figure shows the CDG corresponding to the routing algorithm presented above. As there are no cycles in the channel dependency graph, the routing algorithm is deadlock-free. An adaptive routing algorithm for tori can be easily defined by merely adding a new virtual channel to each physical channel (the A, or adaptive, channel). This new channel can be freely used, even to cross network dimensions in any order, thus providing fully adaptive routing on the k-ary n-cube.
Routing (Including Deadlock Avoidance). Fig. Channel dependency graph for a 4-ary 1-cube
In this routing algorithm, the H and L channels belong to the set provided by the routing subfunction. Figure shows the extended CDG for this routing algorithm in a 4-node unidirectional ring, which is acyclic, and thus the routing algorithm is deadlock-free. This routing algorithm will provide several routing options, depending on the relative locations of the node that currently stores the packet and the destination node. For instance, in a 2D torus, if both X and Y channels remain to be crossed, two adaptive virtual channels (one in each dimension) and one escape channel (the H or the L channel in the X dimension) will be provided. The selection function is in charge of making the final choice, selecting one of these links while also considering network status. A simple selection function may be based on assigning static priorities to virtual channels, returning the first free channel according to this order. Priorities could be assigned to give more preference to adaptive channels. More sophisticated selection functions could also be used, such as selecting a channel of the least-multiplexed physical channel to minimize the negative effects of virtual channel multiplexing.
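The per-dimension restriction just described can be written as a small routine. The sketch below (an illustrative encoding, not taken from any specific router) returns the candidate virtual channels for one ring dimension of the torus and implements a simple priority-based selection function of the kind mentioned above.

enum { VC_NONE = 0, VC_H = 1, VC_L = 2, VC_A = 4 };  /* bit mask of candidate VCs */

/* Candidate virtual channels on the physical channel of one torus dimension. */
int candidate_vcs(int cur, int dst) {
    if (cur == dst) return VC_NONE;       /* no hop needed in this dimension */
    int vcs = VC_A;                       /* adaptive channel: always allowed */
    if (dst > cur) vcs |= VC_H;           /* escape channel: high */
    else           vcs |= VC_L;           /* escape channel: low  */
    return vcs;
}

/* A simple selection function: prefer the adaptive channel if it is free. */
int select_vc(int vcs, int a_free, int h_free, int l_free) {
    if ((vcs & VC_A) && a_free) return VC_A;
    if ((vcs & VC_H) && h_free) return VC_H;
    if ((vcs & VC_L) && l_free) return VC_L;
    return VC_NONE;                       /* block until a candidate becomes free */
}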
Routing in k-ary n-trees
Clusters of PCs are being considered as a cost-effective alternative to small- and medium-scale parallel computing systems. These machines use any of the commercial high-performance switch-based point-to-point interconnects. Multistage interconnection networks (MINs) are the most usual choice. In particular, fat-trees have risen in popularity in the past few years (for instance, Myrinet, InfiniBand, Quadrics). A fat-tree is based on a complete tree that gets thicker near the root. Processing nodes are located at the leaves. However, assuming constant link bandwidth, the number of ports of the switches increases as we approach the root, which makes the physical implementation unfeasible. For this reason, some alternative implementations have been proposed in order to use switches with fixed arity. In k-ary n-trees, bandwidth is increased as we approach the root by replicating switches. A k-ary n-tree (see Fig. ) has n stages, with k (arity) links connecting each switch to the previous or to the next stage. A k-ary n-tree is able to connect N = k^n processing nodes using n·k^(n−1) switches.
Routing (Including Deadlock Avoidance). Fig. Extended channel dependency graph for a 4-ary 1-cube (both direct and indirect dependencies are shown)
Routing (Including Deadlock Avoidance). Fig. Load-balanced deterministic routing in a 2-ary 3-tree (each ascending link is labeled with the destinations it can forward to)
In a k-ary n-tree, minimal routing from a source to a destination can be accomplished by sending packets upwards to one of the nearest common ancestors of the source and destination nodes and then, from there, downwards to destination. When crossing stages in the upward direction, several paths are possible, thus providing adaptive routing. In fact, each switch can select
any of its up output ports. Once a nearest common ancestor has been reached, the packet is turned around and sent downwards to its destination and just a single path is available. A deterministic routing algorithm for k-ary n-trees can be obtained by reducing the multiple ascending paths in a fat-tree to a single one for each source-
destination pair. The path reduction should be done trying to balance network link utilization. That is, all the links of a given stage should be used by a similar number of paths. A simple idea is to shuffle, at each switch, consecutive destinations in the ascending phase []. In other words, consecutive destinations in a given switch are distributed among the different ascending links, reaching different switches in the next stage. Figure shows the destination node distribution in the ascending and descending links of a 2-ary 3-tree following this idea. In the figure, each ascending link has been labeled with the destinations it can forward to. As can be seen, packets destined to the same node reach the same switch at the last stage, independently of their source node. Each switch of the last stage receives packets addressed only to two destinations, and packets destined to each one are forwarded through a different descending link (as there are two descending links per switch). Therefore, this mechanism efficiently distributes the traffic destined to different nodes and balances load across the links.
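One concrete way to implement this destination shuffling, shown here only as an illustrative sketch and not necessarily the exact scheme of the cited proposal, is to select the ascending link at stage s from the s-th base-k digit of the destination address; since the up path then depends only on the destination, all packets addressed to the same node meet at the same last-stage switch and follow a single descending path from there.

/* Illustrative ascending-port choice in a k-ary n-tree (leaf switches at stage 0).
 * Consecutive destinations differ in their low-order digits, so they are spread
 * across the k ascending links of each switch. */
int up_port(int dst, int stage, int k) {
    for (int s = 0; s < stage; s++)
        dst /= k;            /* discard digits already consumed at lower stages */
    return dst % k;          /* digit "stage" selects one of the k up links */
}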
Routing in Irregular Networks: Agnostic Routing
As stated above, direct topologies and multistage networks are often used when performance is the primary concern. However, in the presence of some switch or link failures, a regular network will become an irregular one. In fact, most of the commercially available interconnects for clusters support irregular topologies. An alternative for tolerating faults in regular networks without requiring additional logic is the use of generic or topology-agnostic routing algorithms. These routing algorithms can be applied to any topology and provide a valid and deadlock-free path for every source–destination pair of nodes in the network (if they are physically connected). Therefore, in the presence of any possible combination of faults, a topology-agnostic routing algorithm provides a valid solution for routing. An extensive number of topology-agnostic routing strategies have been proposed. They differ in their goals and approaches (e.g., obtaining minimal paths, fast computation time) and the resources they need (typically virtual channels). The oldest topology-agnostic routing algorithm is up*/down* [], which is also the
most popular topology-agnostic routing scheme used in cluster networks. Routing is based on an assignment of direction labels (up or down) to the links in the network by building a breadth-first spanning tree (BFS). Cyclic channel dependencies are avoided by prohibiting messages from traversing a link in the up direction after having traversed one in the down direction. Although simple, this approach unbalances the network because a high percentage of traffic is forced to cross the root node. Derived from up*/down*, some improvements were proposed, such as assigning directions to links by computing a depth-first spanning tree (DFS), the Flexible Routing scheme, which introduced unidirectional routing restrictions to break cycles in each direction at different positions, or the Segment-based Routing algorithm, which splits a topology into subnets, and subnets into segments. This allows placing bidirectional turn restrictions locally within a segment, which results in a larger degree of freedom. Most of these routing algorithms do not guarantee that all packets will be routed through a minimal path. This increases packet latency and leads to an inefficient use of network resources, affecting the overall network performance. The Minimal Adaptive routing algorithm, which is based on Duato's theory, adds a new virtual channel that provides minimal routing, keeping the paths provided by up*/down* as escape paths. The In-Transit Buffers mechanism [] also provides minimal routing for all the packets. To avoid deadlocks, packets are ejected from the network, temporarily stored in the network interface card at some intermediate hosts, and later re-injected into the network. However, this mechanism requires some support at NICs and at least one host to be attached to every switch in the network, which cannot always be guaranteed. Other approaches make use of virtual channels to route all the packets through minimal paths while still guaranteeing deadlock freedom. For instance, Layered Shortest Path (LASH) guarantees deadlock freedom by dividing the physical network into a set of virtual networks using separate virtual channels. Minimal paths between every source–destination pair of hosts are spread onto these layers, such that each layer becomes deadlock-free. In Transition Oriented Routing (TOR), unlike the LASH routing, up*/down* is used as a baseline routing algorithm in order to decide when to change to a new virtual
network (when a forbidden down→up transition appears). To remove cyclic channel dependencies, virtual networks are crossed in increasing order, thus guaranteeing deadlock freedom. However, due to the large number of routing restrictions imposed by up*/down*, providing minimal paths among every pair of hosts may require a high number of virtual channels.
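At each hop, the up*/down* rule amounts to a one-line check, sketched below; the link directions are the labels assigned by the spanning-tree computation at configuration time, and the names used here are illustrative.

#include <stdbool.h>

typedef enum { UP, DOWN } Dir;

// up*/down* legality: zero or more "up" hops followed by zero or more "down" hops.
bool turn_allowed(Dir prev, Dir next) {
    return !(prev == DOWN && next == UP);   // the only forbidden transition
}

// A path is legal under up*/down* if every consecutive pair of links passes the check.
bool path_is_legal(const Dir *links, int nhops) {
    for (int h = 1; h < nhops; h++)
        if (!turn_allowed(links[h - 1], links[h])) return false;
    return true;
}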
Related Entries
Clusters
Deadlocks
Flow Control
Infiniband
Interconnection Networks
Myrinet
Network of Workstations
Networks, Direct
Networks, Fault-Tolerant
Networks, Multistage
Switch Architecture
Switching Techniques
Bibliographic Notes and Further Reading
Dally's [] and Duato's [] books are excellent textbooks on interconnection networks and contain abundant material about routing. The sources of the theories about deadlock-free routing can be found in [] and []. Although virtual channels were introduced in [], they were formally proposed in [] as "virtual channel flow control." The turn model, which allows adaptive routing in tori without virtual channels, was proposed in []. The selection function has some impact on network performance. Several selection functions have been proposed; for instance, [] proposed a set of possible selection functions in the context of fat-trees. Deadlock-recovery strategies are based on the assumption that deadlocks are rare. This assumption was evaluated in [], where the frequency of deadlocks was measured. Disha and software-based recovery, two progressive deadlock-recovery mechanisms, were proposed in [] and [], respectively. To make deadlock recovery feasible, an efficient deadlock detection mechanism that matches the low deadlock occurrence must be used, such as the FCD deadlock-detection mechanism proposed in []. FIR [] is an alternative implementation to algorithmic and table-based routing that could be used in commercial switches. The load-balanced deterministic routing algorithm described for fat-trees can be found in []. Up*/down* routing was initially proposed for Autonet []. The ITB mechanism [] was proposed to improve performance of source routing in the context of Myrinet networks.
Bibliography . Anjan KV, Pinkston TM () DISHA: a deadlock recovery scheme for fully adaptive routing. In: Proceedings of the th international parallel processing symposium, Santa Barbara, CA, pp – . Dally WJ () Virtual-channel flow control. IEEE Trans Parallel Distributed Syst ():– . Duato J () A new theory of deadlock-free adaptive routing in wormhole networks. IEEE Trans Parallel Distributed Syst ():– . Dally WJ, Seitz CL () Deadlock-free message routing in multiprocessor interconnection Networks. IEEE Trans Comput C-():– . Dally WJ, Towles B () Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco, CA . Duato J, Yalamanchili S, Ni LM () Interconnection networks: an engineering approach. Morgan Kaufmann, San Francisco, CA . Flich J, López P, Malumbres MP, Duato J () Boosting the performance of myrinet networks. IEEE Trans Parallel Distributed Syst ():– . Gómez C, Gilabert F, Gómez ME, López P, Duato J () Deterministic vs. adaptive routing in fat-trees. In: Proceedings of the international parallel and distributed processing symposium, IEEE Computer Society Press, Long Beach, CA . Gilabert F, Gómez ME, López P, Duato J () On the influence of the selection function on the performance of fat-trees. Lecture Notes in Computer Science , Springer, Berlin . Glass CJ, Ni LM () The turn model for adaptive routing. In: Proceedings of the th international symposium on computer architecture, ACM, New York, pp – . Gómez ME, López P, Duato J () FIR: an efficient routing strategy for tori and meshes. J Parallel Distributed ComputElsevier ():– . Martínez JM, López P, Duato J () A cost-effective approach to deadlock handling in wormhole networks. IEEE Trans Parallel Distributed Syst ():– . Martínez JM, López P, Duato J () FCD: flow control based distributed deadlock detection mechanism for true fully adaptive routing in wormhole networks. IEEE Trans Parallel Distributed Syst ():– . Schroeder MD et al () Autonet: a high-speed, selfconfiguring local area network using point-to-point links. Technical report SRC research report , DEC, April
. Warnakulasuriya S, Pinkston TM () Characterization of deadlocks in interconnection networks. In: Proceedings of the th international parallel processing symposium, IEEE Computer Society, Washington, DC, pp –

R-Stream Compiler
R-Stream Compiler Benoit Meister, Nicolas Vasilache, David Wohlford, Muthu Manikandan Baskaran, Allen Leung, Richard Lethin Reservoir Labs, Inc., New York, NY, USA
Definition
R-Stream (R-Stream is a registered trademark of Reservoir Labs, Inc.) is a source-to-source, auto-parallelizing compiler developed by Reservoir Labs, Inc. R-Stream compiles programs in the domains of high-performance scientific (HPC) and high-performance embedded computing (HPEC), where loop nests, dense matrices, and arrays are the common idioms. It generates mapped programs, i.e., optimized programs for parallel execution on the target architecture. R-Stream targets modern, heterogeneous, multi-core architectures, including multiprocessors with caches, systems with accelerators, and distributed memory architectures that require explicit memory management and data movement. The compiler accepts the C language as input. Depending on the target platform, the compiled output program can be in C with the appropriate target APIs or annotations for parallel execution, other data parallel languages such as CUDA for GPUs, or dataflow assembly for FPGA targets.
History
DARPA funded R-Stream development in the Polymorphous Computer Architecture (PCA) research program, starting in . The PCA program developed new programmable computer architectures for approaching the energy efficiency of application-specific fixed-function processors, particularly those for advanced signal processing in radars. These new PCA chips included RAW [], TRIPS [], Smart Memories [], and Monarch []; these chips innovated architecture features for energy efficiency, such
as manycores, heterogeneity, mixed execution models, and explicit memory communications management. DARPA wanted R-Stream for programming the architectural space spanned by these chips from a single high-level and target-independent source. Since the PCA architectures uniformly presented a very high ratio of on-chip computational rates to off-chip bandwidth, R-Stream research prioritized transformations that increased arithmetic intensity (the ratio of computation to communication) in concert with parallelization, and creating mappings that executed the program in a streaming manner – hence the name of the compiler, R-Stream (the "R" stands for Reservoir). As the R-Stream compiler team refined the mapping algorithms, the conception of what the high-level, machine-independent programming language should be also evolved. Streaming as an execution model had been well established for many decades, as the dominant execution paradigm for programming digital signal processors (DSPs): software-pipelining of computation with direct memory access (DMA). The PCA research program investigated whether streaming should be considered as the programming model. Initially, two streaming-based research programming languages, Stanford Brook [] and MIT StreamIt [], were supported in the PCA program. These languages provide means for expressing computations as aggregate operations (Brook calls them kernels; StreamIt calls them filters) on linear streams of data. However, as more understanding of the compilation process was acquired, the R-Stream compiler team started to move away from streaming as being the right expression form for algorithms, for several reasons:
1. The expressiveness of the streaming languages was limited. They were biased toward one-dimensional streams, but radar signal processing involves multidimensional data "cubes" processed along various dimensions [].
2. Transformations for compiling streaming languages are isomorphic to common loop transformations – e.g., parallelization, tiling, loop fusion, index set splitting, array expansion/contraction, and so forth.
3. While streaming notations were found to sometimes be a good shorthand for expression of a particular mapping of an algorithm to a particular architecture, that particular mapping would not run well
on another architecture without significant transformation. The notation and idioms of the streaming languages actually frustrate translation, because a compiler would have to undo them to extract semantics for re-mapping. Consequently, the R-Stream team moved away from streaming as a programming paradigm to instead supporting input programs (mainly loops) written in the C language: C can more succinctly express the multidimensional radar algorithms (compared to the streaming languages); the basic theory of loop optimization for C was well understood; and C can be written in a textbook manner with the semantics plain. However, the compilation challenges coming from the architectural features in the PCA chips were open problems. To attack these problems, the R-Stream team adopted and extended the powerful polyhedral framework for loop optimization. PCA features are now in commercial computing chips. Programming such chips is complex, requiring more than just parallelization. Thus, the R-Stream compiler is being used by programmers and researchers to map C to chips such as Cell, GPGPUs, Tilera, and many-core SMPs. It is also being extended in research on compilation to ExaScale supercomputers. R-Stream is available commercially and is also licensed in source form to research collaborators.
Architecture
Design Principles
R-Stream performs high-level automatic parallelization. The high-level mapping tasks that R-Stream performs include parallelism extraction, locality improvement, processor assignment, managing the data layout, and generating explicit data movements. R-Stream takes as input sequential programs written in C, with the kernels to be optimized marked with a simple pragma, but with all mapping decisions for the target automated. Low-level or machine-level optimizations, such as instruction selection, instruction scheduling, and register allocation, are left to an external low-level compiler that takes the mapped source produced by R-Stream. R-Stream currently performs mostly static mapping for all its target architectures. Resource and scheduling decisions are computed statically at compile time, with very few dynamic decisions made at runtime. For all architectures, the mapping is also "bare-metal," with only a small dedicated runtime layer that is specialized to each architecture.
Using R-Stream
Programmers write succinct loop nests in sequential ANSI C. The loop nest can be imperfect but must fit within an extended static control program form. That is, the loops have affine parametric extents and static integer strides. Multidimensional array references must be affine functions of loop index variables and parameters. Certain dynamic control features are allowed, such as conditionals that are affine functions of index variables and parameters. Programs that are not directly within the affine static control program form can be handled by wrapping the code in an abstraction layer and providing a compliant image function in C that models the effects of the wrapped function. The programmer indicates the region to be mapped with a single, simple pragma. Again, R-Stream does not require the programmer to indicate how this mapping should proceed: not through any particular idiomatic correspondence of the loop nest to particular hardware features, nor through detailed specifications encoded in pragmas. R-Stream automatically determines the mapping, based on the target machine, and emits transformed code. For example, a loop nest expressed in ANSI C can be rendered into optimized CUDA [].
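As an illustration of the input form, the hypothetical kernel below fits the extended static control form (affine parametric loop bounds, static strides, affine subscripts); the pragma is the one that appears in the example figure later in this entry.

#pragma rstream map
void matmul(int N, double A[N][N], double B[N][N], double C[N][N]) {
    /* Affine bounds (0..N-1), unit strides, affine subscripts: raisable by the compiler. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}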
R-Stream is built on a set of independent and reusable optimizations on two intermediate representations (IRs). Unlike many source-to-source transformation tools, R-Stream shuns ad hoc pattern-matching-based transformations in favor of mathematically sound transformations, in particular dependence- and semantics-based transformations in the polyhedral model. The two intermediate representations are designed to support this view. The first intermediate representation, called Sprig, is used in the scalar optimizer. It is based on static single assignment (SSA) form [], similar to Click's graph-based IR [], augmented with operators from the value dependence graph [] for expressing state dependencies. The second intermediate representation is the generalized dependence graph (GDG), the representation used in the polyhedral mapper. Conversions from the SSA representation to the GDG representation are performed by the raising phase,
and the converse by the lowering phase. These components are shown as different parts of the R-Stream compiler flow in Fig. .

R-Stream Compiler. Fig. R-Stream's architecture: serial C code is parsed into the SSA-based scalar form and optimized; raising produces the generalized dependence graph (polyhedral representation) used by the polyhedral mapper together with a machine model; the resulting mappings are lowered, scalar-optimized again, and emitted by code generators targeting OpenMP, Tilera, STI Cell, ClearSpeed, and CUDA
Scalar Optimizer The main task of the scalar optimizer is to simplify and normalize the input source so that subsequent analysis of the program in the raising phase can proceed. The scalar optimizer also runs after mapping and lowering, to remove redundancies and simplify expressions that result from the polyhedral scanning phases. The scalar optimizer provides traditional optimizations.
Syntax Recovery
The output of R-Stream has to be processed by low-level compilers corresponding to the target architectures. These low-level compilers often have trouble compiling automatically generated programs, even if the code is semantically legal, because they pattern-match optimizations to the code idioms that humans typically write. For example, many low-level compilers cannot perform loop optimizations on loops that are expressed using gotos and labels. Thus, to ensure that the output code can be effectively optimized by the low-level compiler, R-Stream's code generation back-end performs syntax recovery to convert a program in the scalar representation into idioms that look human-like. The engineering challenge is that R-Stream is designed to represent and transform programs in semantics-based IRs, far from the original syntax and language. Source reconstruction is more complex than simple unparsing. R-Stream uses an algorithm based on early coalescing of ϕ-functions to exit SSA, extensions to Cifuentes [, ] to detect high-level statement forms, and then localized pattern matching to generate syntax. Sprig also represents types in terms of the input source language, to allow for faithful reproduction of those types at output.
Machine Model R-Stream supports Cell, Tilera, GPUs, ClearSpeed, symmetric multiprocessors, and FPGAs as targets. These are rendered and detailed in machine models (MM), expressed in an XML-based description language. The target description is an entity graph connecting physical components that implicitly encodes the system capabilities and execution models that can be exploited by the mapper.
The nodes in the entity graph are processors, memories, data links, and command links. Processor entities are computation engines, including scalar processors, SIMD processors, and multiprocessors. These are structured as a multidimensional grid geometry. Memory entities are data-caches, instruction-caches, combined data and I-caches, and main and scratchpad memories. There are two types of edges in the graph. Data-link edges stand for explicit communication protocols that require software control, such as DMA. Command-link edges stand for instructions that one processor can send to another entity, such as thread spawning or DMA initiation. The final ingredient in an MM is the morph. A morph is a parallel computer, or a part of one, on which the mapping algorithms are focused. The term morph derives from the polymorphous aspect of the DARPA PCA program for efficient chips that could be reconfigured; a morph would describe one among many configurations of a polymorphous architecture. R-Stream also uses the term "morph" to describe the single configuration for a non-reconfigurable portion of a target. Each morph in an MM contains a host processor and a set of processing elements (PEs), plus its entity graph. The host processor and the PEs are allowed to be identical, e.g., in an SMP environment. Associated with the host and PEs are a topology and a list of sub-morphs. A machine model is subdivided into a set of morphs. This mechanism allows a mapping problem for a complex machine to be decomposed into a set of smaller problems, each concentrating on one morph of the machine. In hierarchical mapping, the set of morphs forms a tree, with sub-morphs representing submapping problems after the high-level mapping problems have been completed.
The Generalized Dependence Graph The GDG is the IR used by the mapper. All phases of the mapper take a set of GDGs as an input and produce a set of modified GDGs as output. Thus, mapping proceeds strictly in this IR. Initially, the input to the mapper is a single GDG that represents the part of the program that the programmer has indicated should be mapped. Mapping phases annotate and rewrite this graph, essentially choosing a
projection of the GDG into the target machine space (across processors) and time (schedule), and then finding a good description of that projection in terms of the target machine execution model. That is, the mapper finds a schedule of execution of computation, memory use, and communication. As the mapping process proceeds, the GDG may be split into multiple GDGs, for submapping problems to portions of a hierarchical or heterogeneous machine. The GDG (defined below) uses polyhedra to model the computations, dependences, and memory references of the part of the program being mapped. The polyhedral description provides compactness and precision for a broad set of useful loop nest structures, and enables tractable analytical formulations of concepts like "dataflow analysis" and "coarse-grained parallelization." A GDG is a multigraph, where the nodes represent statements (called polyops), and the edges represent the dependences between the statements. The basic information attached to each polyop node includes:
● Its iteration domain, represented as a polyhedral set
● A list of array references, each represented as an affine function from the iteration space to the data space of the corresponding array
● A predicate function indicating under what conditions the statement should be executed
Data-dependent conditionals are supported through the predicates. Attached to each edge of the GDG is its dependence polyhedron, a polyhedral set that encodes the dependence between the iterations of the edge's nodes. The polyhedral sets used in the GDG are unions of intersections of integer lattices with parametric polyhedra, called Z-Domains. Z-Domains are the essence of the mathematical language of polyhedral compilation. The use of unions allows for describing a broad range of "shapes" of programs, while the integer lattices provide precision through complex mappings. Using parametric polyhedra also provides a richness of description; it enables parametrically specified families of programs to be optimized at once, and it enables describing loop nests and array references that depend on iteration variables from enclosing loop nests. R-Stream includes a ground-up implementation of a library for manipulating Z-Domains that provides performance
and functionality that we could not obtain had we used Loechner’s Polylib [].
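As an illustrative picture of what the GDG records (not R-Stream's internal syntax), consider the small loop nest below; the comments spell out the polyhedral objects attached to its single statement.

void triangular_update(int N, double A[N][N], const double x[N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j <= i; j++)
            /* Statement S:
             *   iteration domain  D(S) = { (i, j) | 0 <= i < N, 0 <= j <= i }, a polyhedron parametric in N
             *   write reference   A[i][j], with affine access function (i, j) -> (i, j)
             *   read references   A[i][j] and x[j], the latter with access function (i, j) -> (j)
             *   predicate         true (no data-dependent guard)
             */
            A[i][j] = 0.5 * (A[i][j] + x[j]);
}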
Mapping Flow
The basic mapping flow in R-Stream is through the following mapper phases:
● Perform raising to convert mappable regions into the polyhedral form.
● Perform dependence analysis and/or dependence removal optimizations such as array expansion.
● Perform affine scheduling to extract parallelism.
● Perform task formation (a generalization of tiling) and processor placement. Also perform thread partitioning to split the mapped region among the host processor and the co-processing elements.
● Perform memory promotion to improve locality. If necessary, also perform data layout transformations and insert communications.
● Perform lowering to convert the mapped program back into the scalar IR, Sprig.
These phases are mixed and matched and can be recursively called for different target architectures. For example, for SMP targets, it is not necessary to promote memory and insert explicit communications, since such machines have caches. (However, in some cases explicit copies also help on cache-based machines by compacting the footprint of the data accessed by a task; we call such efforts "virtual scratchpad.") For a target machine with a host processor and multiple GPUs, R-Stream will first map coarsely across GPUs (inserting thread controls and explicit communications) and then recursively perform mapping for parts of the code that will be executed within the GPUs.
Raising
The raising phase is responsible for converting loop nests inside mappable regions into the polyhedral form. R-Stream's raising process is as follows:
1. Mappable regions are identified by the programmer via pragmas. Mappable regions can also be expanded by user-directed inlining pragmas.
2. Perform loop detection and if/then/else region detection.
3. Perform if-conversion [] to convert data-dependent predicates into predicated statements.
4. Perform induction variable detection using the algorithm of Pop []. The algorithm has been extended to detect inductions based on address arithmetic. Affine loop bounds and affine address expressions can be converted into the iteration domains and access functions in the GDG.
5. Within each basic block, partition the operators into maximal subsets of operators connected via value flow such that the partitions can be sequentially ordered. These maximal subsets are made into individual polyops.
6. Finally, convert the loop iteration domains and access functions into constraints form in the GDG.
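A small illustration of the if-conversion step: the data-dependent branch in the first version is turned into a guarded statement in the second, so that the condition becomes the statement's predicate in the GDG instead of breaking the static control structure. The code is a hand-written sketch, not compiler output.

/* Before raising: data-dependent control flow inside an affine loop. */
void clip_before(int n, double x[n], const double t[n]) {
    for (int i = 0; i < n; i++)
        if (t[i] > 0.0)
            x[i] = x[i] / t[i];
}

/* After if-conversion (conceptually): the loop remains a static control loop,
 * and the predicate p(i) = (t[i] > 0.0) is carried by the statement. */
void clip_after(int n, double x[n], const double t[n]) {
    for (int i = 0; i < n; i++) {
        int p = (t[i] > 0.0);            /* predicate */
        x[i] = p ? x[i] / t[i] : x[i];   /* guarded update */
    }
}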
Dependence Analysis and Array Expansion
R-Stream provides a more advanced covering dependence analysis from Vasilache [] to remove transitively implied dependencies, which improves upon the algorithms from Feautrier []. R-Stream provides a novel corrective array expansion based on Vasilache's violated dependence analysis [, ]. Corrective array expansion essentially fuses array expansion phases into scheduling phases. A "violated" dependence is a relationship between a source and a target instruction of the program that exhibit a memory-based dependence in the original program that is not fulfilled under a chosen schedule. A violation arises when the target statement is scheduled before the source statement in the transformed program. Such a violation can be corrected either by recomputing a new schedule or by changing the memory locations read and written. This allows for better schedules with as small a memory footprint as possible by performing the following steps:
1. Provide a subset of the original dependencies in the program to the affine scheduler phase so that it determines an aggressive schedule with maximum parallelism.
2. When the schedule turns out to be incorrect with regard to the whole set of dependencies, perform a correction of the resulting schedule by means of renaming and array expansion.
3. Perform lazy corrections targeting only the very causes of the semantics violations in the aggressively scheduled program.
Unlike previous array expansion algorithms in the literature by Feautrier [], Barthou et al. [], and Offner et al. [], which tend to expand all arrays fully and render the program in dynamic single assignment form, the above algorithm only expands arrays by need – i.e., only if the parallelism exposed by the scheduling algorithm requires the array to be expanded.
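The effect of need-based expansion can be seen on a hand-written example (scalar expansion, the simplest case of array expansion): the temporary t carries a storage-based dependence that serializes the outer loop, and expanding t along i, and nothing else, is enough to make a parallel schedule of the i loop legal.

/* Original: the scalar t is reused across iterations of i, creating a
 * memory-based dependence that forces sequential execution of the i loop. */
void scale_rows_seq(int n, int m, double A[n][m], const double d[n]) {
    double t;
    for (int i = 0; i < n; i++) {
        t = 1.0 / d[i];
        for (int j = 0; j < m; j++)
            A[i][j] *= t;
    }
}

/* After expanding t along i: iterations of i no longer share storage and
 * can run in parallel; no other array in the program needs to be expanded. */
void scale_rows_expanded(int n, int m, double A[n][m], const double d[n]) {
    double t[n];                         /* t expanded along the i dimension */
    for (int i = 0; i < n; i++) {        /* now a parallel loop */
        t[i] = 1.0 / d[i];
        for (int j = 0; j < m; j++)
            A[i][j] *= t[i];
    }
}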
Affine Scheduling
Under the exact affine representation of dependences in the GDG, it was known that useful scheduling properties of programs can be optimized, such as maximal fine-grained parallelism using Feautrier's algorithm [], maximal coarse-grained parallelism using Lim and Lam's algorithm [], or maximal parallelism given a (maximal) fusion/distribution structure using Bondhugula et al.'s algorithm []. R-Stream improved on these algorithms by introducing the concept of affine fusion, and this enables a single seamless formulation of the joint optimization of cost functions representing trade-offs between the amount of parallelism and the amount of locality at various depths in the loop nest hierarchy. Scheduling algorithms in R-Stream search an optimization space that is either constructed on a depth-by-depth basis as solutions are found, or based on the convex space of all legal multidimensional schedules as had been illustrated by Vasilache []. R-Stream allows direct optimization of the tradeoff function using Integer Linear Programming (ILP) solvers as well as iterative exploration of the search space. Redundant solutions in the search space are implicitly pruned out by the combined tradeoff function as they exhibit the same overall cost. In practice, a solution to the optimization problem represents a whole class of equivalent scheduling functions with the same cost. Building upon the multidimensional affine fusion formulation, R-Stream also introduced the concept of joint optimization of parallelism, locality, and amount of contiguous memory accesses. This additional metric is targeted at memory hierarchies where accessing a contiguous set of memory references is crucial to obtaining high performance. Such hardware features include hardware and software prefetchers, explicit vector loads and stores into and from registers, and coalescing hardware in GPUs. R-Stream defines the cost of contiguous memory accesses up to an affine data layout transformation.
To support hierarchical/heterogeneous architectures, the scheduling algorithm can be used hierarchically. Different levels of the hardware hierarchy are optimized using very different cost functions. At the outermost levels of the hierarchy, weight is put on obtaining coarse-grained parallelism with minimal communications that keeps as many compute cores busy as possible. At the innermost levels, after communications have been introduced, the focus is set on fine-grained parallelism, contiguity, and alignment to enable hardware features such as vectorization. Tiling also separates mapping concerns between different levels of the hierarchy.
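A worked illustration of what an affine schedule expresses (a textbook example, not R-Stream output): for the single-statement nest below, the schedule θ(i, j) = (j, i) is legal and exposes an outer parallel loop.

/* Original loop nest: the only dependence is (i-1, j) -> (i, j), carried by i. */
void sweep_orig(int N, int M, double A[N][M], const double B[N][M]) {
    for (int i = 1; i < N; i++)
        for (int j = 0; j < M; j++)
            A[i][j] = A[i - 1][j] + B[i][j];
}

/* Under theta(i, j) = (j, i), the source iteration (i-1, j) receives timestamp
 * (j, i-1), lexicographically before (j, i), so the schedule respects the
 * dependence. The outermost schedule dimension (j) carries no dependence and
 * is therefore parallel: the kind of coarse-grained parallelism the scheduler
 * trades off against locality and contiguity. */
void sweep_scheduled(int N, int M, double A[N][M], const double B[N][M]) {
    for (int j = 0; j < M; j++)          /* parallel: no loop-carried dependence */
        for (int i = 1; i < N; i++)
            A[i][j] = A[i - 1][j] + B[i][j];
}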
Task Formation and Placement
The streaming execution model that R-Stream targets is as follows:
● A bulk set of input data Di is loaded into a memory M that is close to a processing element P.
● A set of operations T, called a "task," works on Di.
● A live-out data set Do produced by T is unloaded from M to make room for the next task's input data.
The role of task formation is to partition the program's operations into such tasks in a way that minimizes the program execution time. Since computing is much faster than transferring data, one of the goals of task formation is to obtain a high computation-per-communication ratio for the tasks. The memory M may be a local or scratchpad memory, which holds a fixed number of bytes. This imposes a capacity constraint: the working data set of a task should not overflow M's capacity. Finally, the considered processing element P could be composite, in the sense that it may itself be formed of several processing elements. Task formation also ensures that enough parallelism is provided to such processing elements. R-Stream's task formation algorithm improves on prior algorithms for tiling by Ancourt et al. [], Xue [], and Ahmed et al. []. The algorithm is enabled by using Ehrhart polynomial techniques developed by Clauss et al. [], Verdoolaege et al. [], and Meister et al. []. The Ehrhart polynomials are used to quickly count the volume (number of integer points) in Z-Domains that model the data footprint implied by prospective iteration tilings. The R-Stream task formation algorithm directly tiles imperfect loop nests independently. The prior work requires perfect loop nests or
expands imperfect loop nests into a perfect form, forcing all sections of code, even vastly unrelated ones, into the same tiling structure. The R-Stream task formation algorithm uses evaluators and manipulators, which can be associated with task groups, loops, or the whole function. Evaluators are a function of the elements of the entity they are associated with. They can be turned into a constraint (implicitly: the evaluator must be nonnegative) or an objective function. Manipulators enforce constraints on the entity they are associated with. For instance, the tile size of a loop can be made to be a multiple of a certain size, or a power of two, and a set of loops can be set to have the same tile size. Evaluators and manipulators can be combined (affine combination, sequence, conditional sequence) to produce a custom tiling problem. The task formation algorithm addresses whether the task can be executed in parallel or sequentially; it uses heuristics that walk the set of operations of the GDG and group the operations based on fusion decisions made by the scheduler, data locality considerations, and the possibility of executing the operations on the same processor. The task formation algorithm can also detect a class of reductions and mark such loops. It then uses associativity of the scanned operations to relax polyhedral dependence constraints when computing the loop's tiling constraints.
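A back-of-the-envelope version of the capacity constraint, for a tiled matrix product; this is illustrative only, since R-Stream evaluates such footprints symbolically with Ehrhart polynomials rather than with a fixed closed-form formula.

#include <stdbool.h>
#include <stddef.h>

/* Data footprint (in bytes) of one Ti x Tj x Tk task of C += A * B in doubles:
 * a Ti x Tk slice of A, a Tk x Tj slice of B, and a Ti x Tj slice of C. */
static size_t task_footprint(size_t Ti, size_t Tj, size_t Tk) {
    return sizeof(double) * (Ti * Tk + Tk * Tj + Ti * Tj);
}

/* Capacity constraint: the task's working set must fit in the local memory M. */
bool tile_fits(size_t Ti, size_t Tj, size_t Tk, size_t capacity_bytes) {
    return task_footprint(Ti, Tj, Tk) <= capacity_bytes;
}

/* Example: 64 x 64 x 64 tiles need 3 * 64 * 64 * 8 = 98,304 bytes, which fits
 * in a 128 KB scratchpad but not in a 64 KB one. */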
Memory Promotion
R-Stream performs memory promotion for architectures with fast but limited scratch-pad memories and on distributed memory architectures. When data are migrated from one memory to another, the compiler also performs a reorganization of the data layout to improve storage utilization or locality of reference, or to enable other optimizations. Such data layout reorganization often comes "for free," especially if it can be overlapped with computation via hardware support, such as DMA. This technique is a generalization of Schreiber et al. [].
Communication Generation and DMA Optimizations
When the data set is promoted from a source memory to a target memory, R-Stream generates data transfers from the source array to the target array and back. Those transfers are represented as an element-wise copy loop
nest with additional information that identifies which innermost loops transfer the data required for one task. When there is explicit bulk communication hardware between the source and target memories, R-Stream concatenates element-wise transfer operations into strided transfer (e.g., DMA) commands.
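The shape of the generated transfers can be sketched by hand as follows (not actual R-Stream output): an element-wise copy loop stages one task's tile of a row-major array into a local buffer, and because the innermost loop walks contiguous addresses, its iterations can be concatenated into one bulk, strided (DMA-style) transfer per row.

#include <string.h>

/* Element-wise copy of a Ti x Tj tile of the row-major array A (row length LDA)
 * into a dense local (scratchpad) buffer; (i0, j0) is the tile origin. */
void promote_tile(int Ti, int Tj, int i0, int j0, int LDA,
                  const double *A, double *local /* Ti*Tj elements */) {
    for (int i = 0; i < Ti; i++)
        for (int j = 0; j < Tj; j++)                 /* contiguous in memory */
            local[i * Tj + j] = A[(i0 + i) * LDA + (j0 + j)];
}

/* The same transfer with the inner loop concatenated into one bulk copy per row;
 * on hardware with DMA engines, each memcpy below corresponds to a single strided
 * transfer command instead of Tj element-wise operations. */
void promote_tile_bulk(int Ti, int Tj, int i0, int j0, int LDA,
                       const double *A, double *local) {
    for (int i = 0; i < Ti; i++)
        memcpy(&local[i * Tj], &A[(i0 + i) * LDA + j0], Tj * sizeof(double));
}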
Target-Specific Optimizations: GPU
While R-Stream can target many different architectures, the GPU target is an illustrative case. GPUs are manycore architectures that have multiple levels of parallelism, with outer coarse-grained MIMD parallelism and inner finer-grained SIMT (Single Instruction Multiple Threads) parallelism. Furthermore, they have a complex memory hierarchy with memories of varied latencies and varied characteristics. To achieve high performance on such architectures, it is very important to exploit all levels of parallelism and utilize appropriate memories in an efficient manner. There are three important aspects to be addressed by an automatically parallelizing high-level compiler for GPUs, namely:
● To identify and utilize outer coarse-grained and inner fine-grained parallelism
● To make GPU DRAM ("global memory") accesses contiguous and aligned in the parallel loop
● To make proper utilization of the on-chip memories, namely "shared memory" and "registers"
R-Stream exploits the affine scheduling algorithm mentioned earlier to have, whenever possible, coarse-grained parallelism at the outer level and fine-grained parallelism at the inner level with contiguously aligned memory accesses. The affine scheduler tries to find the right mix of parallelism, data locality, and data contiguity. As a result, not all global memory array accesses can be made contiguous by the schedule. In such cases, and also in cases where global memory accesses are used multiple times, R-Stream generates data transfers from global DRAM to shared memory (which resides on chip and is shared by all threads in a multiprocessor) as an element-wise copy loop nest such that the global memory accesses are contiguous in the transfer. The global memory accesses are then appropriately replaced by shared memory accesses. As a result, R-Stream achieves data contiguity in global memory and also improves data reuse across threads.
R-Stream Compiler. Fig. R-Stream's input: Gauss-Seidel nine-point stencil (a sequential ANSI C loop nest over t, i, and j in void gaussseidel2D_9points(real_t (*A)[N], int pT, int pN), marked with #pragma rstream map)
R-Stream Compiler. Fig. Excerpt of R-Stream's CUDA output: Gauss-Seidel nine-point stencil (unrolled loops whose bounds are affine expressions of blockIdx and threadIdx, explicit copies between local versions of the array staged in CUDA shared memory, and a __syncthreads() barrier)
Example
Figure shows input code to R-Stream. The input code is ANSI C, for a simple nine-point Gauss-Seidel stencil operating on a 2D array. The input style that can be raised is succinct, and the user does not give directives describing the mapping, only the indication that the compiler should map the function. The compiler determines the mapping structure automatically. The total output code for this important kernel must be hundreds of lines long to achieve performance in CUDA and on GPUs; a small excerpt of the output from R-Stream in CUDA from this example is in Fig. . One can see that the compiler has formed parallelism that is implicit in the CUDA thread and block structure. The chosen schedule balances parallelism, locality, and in particular contiguity, to enable coalescing of loads. Also illustrated is the creation of local copies of the array and the generation of explicit copies into the CUDA scratchpad "shared memories."

Related Entries
Parallelization, Automatic

Bibliographic Notes and Further Reading
The final report on R-Stream research for DARPA is Lethin et al. []. More information on R-Stream is available from various subsequent workshop papers by Leung et al. [], Meister et al. [], and Bastoul et al. []. Several US and international patent applications by this encyclopedia entry's authors are pending and will publish with even more detail on the optimizations in R-Stream. Feautrier is credited with defining the polyhedral model in the late 1980s and early 1990s [, , ]. Darte et al. provided a detailed comparison of polyhedral techniques to classical techniques, in terms of the benefits of the exact dependence relations, in []. Much theoretical work accumulated on the polyhedral model in the decades after its invention, but it took the work of Quilleré, Rajopadhye, and Wilde to unlock the technique for application by showing effective code generation algorithms []; these were later improved by Bastoul [] and Vasilache [] (the latter thesis also providing a good bibliography on the polyhedral model).

Bibliography
. nVidia CUDA Compute Unified Device Architecture Programming Guide (Version .), June . Ahmed N, Mateev N, Pingali K () Tiling imperfectly-nested loop nests. In: Supercomputing ’: proceedings of the ACM/IEEE conference on supercomputing (CDROM), IEEE Computer Society, Washington, DC, pp – . Allen JR, Kennedy K, Porterfield C, Warren J () Conversion of control dependence to data dependence. In: Proceedings of the th ACM SIGACT-SIGPLAN symposium on principles of programming languages, New York, pp – . Ancourt C, Irigoin F () Scanning polyhedra with DO loops. In: Proceedings of the rd ACM SIGPLAN symposium on principles and practice of parallel programming, Williamsburg, VA, pp –, Apr . Barthou D, Cohen A, Collard JF () Maximal static expansion. In: Proceedings of the th ACM SIGPLAN-SIGACT symposium on principles of programming languages, New York, pp – . Bastoul C () Efficient code generation for automatic parallelization and optimization. In: Proceedings of the international symposium on parallel and distributed computing, Ljubjana, pp –, Oct . Bastoul C, Vasilache N, Leung A, Meister B, Wohlford D, Lethin R () Extended static control programs as a programming model for accelerators: a case study: targetting Clearspeed CSX with the R-Stream compiler. In: First workshop on Programming Models for Emerging Architectures (PMEA) . Bondhugula U, Hartono A, Ramanujan J, Sadayappan P () A practical automatic polyhedral parallelizer and locality optimizer. In: ACM SIGPLAN Programming Languages Design and Implementation (PLDI ’), Tucson, Arizona, June . Buck I () Brook v. specification. Technical report, Stanford University, Oct . Cifuentes C () A structuring algorithm for decompilation. In: Proceedings of the XIX Conferencia Latinoamericana de Informatica, Buenos Aires, Argentina, pp – . Cifuentes C () Structuring decompiled graphs. Technical Report FIT-TR- –, Department of Computer Science, University of Tasmania, Australia, . Also in Proceedings of the th international conference on compiler construction, , pp – . Clauss P, Loechner V () Parametric analysis of polyhedral iteration spaces. In: IEEE international conference on application specific array processors, ASAP’. IEEE Computer Society, Los Alamitos, Calif, Aug . Click C, Paleczny M () A simple graph-based intermediate representation. ACM SIGPLAN Notices, San Francisco, CA . Cytron R, Ferrante J, Rosen BK, Zadeck FK () Efficiently computing static single assignment form and the control dependence graph. ACM Trans Program Lang Syst ():– . Darte A, Schreiber R, Villard G () Lattice-based memory allocation. IEEE Trans Comput ():– . Feautrier P () Array expansion. In: Proceedings of the nd international conference on supercomputing, St. Malo, France
. Feautrier P () Parametric integer programming. RAIRORecherche Opérationnelle, ():– . Feautrier P () Dataflow analysis of array and scalar references. Int J Parallel Prog ():– . Feautrier P () Some efficient solutions to the affine scheduling problem. Part I. One-dimensional time. Int J Parallel Prog ():– . Feautrier P () Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. Int J Parallel Prog ():– . StreamIt Group () Streamit language specification, version .. Technical report, Massachusetts Institue of Technology, Oct . Lethin R, Leung A, Meister B, Szilagyi P, Vasilache N, Wohlford D () Final report on the R-Stream . compiler DARPA/AFRL Contract # F--C-, DTIC AFRL-RI-RS-TR--. Technical report, Reservoir Labs, Inc., May . Leung A, Meister B, Vasilache N, Baskaran M, Wohlford D, Bastoul C, Lethin R () A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction. In: Third Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-, Mar . Lim AW, Lam MS () Maximizing parallelism and minimizing synchronization with affine transforms. In: Proceedings of the th annual ACM SIGPLAN-SIGACT symposium on principles of programming languages, Paris, France, pp – . Loechner V () Polylib: a library for manipulating parametrized polyhedra. Technical report, University of Louis Pasteur, Strasbourg, France, Mar . Mai K, Paaske T, Jayasena N, Ho R, Dally W, Horowitz M () Smart memories: a modular reconfigurable architecture. In: Proceedings of the international symposium on Comuter architecture, pp –, June . Meister B, Leung A, Vasilache N, Wohlford D, Bastoul C, Lethin R () Productivity via automatic code generation for PGAS platforms with the R-Stream compiler. In: Workshop on asynchrony in the PGAS programming model, June . Meister B, Verdoolaege S () Polynomial approximations in the polytope model: bringing the power of quasi-polynomials to the masses. In: ODES-: th workshop on optimizations for DSP and embedded systems, Apr . Offner C, Knobe K () Weak dynamic single assignment form. Technical Report HPL--, HP Labs . Pop S, Cohen A, Silber G () Induction variable analysis with delayed abstractions. In: Proceedings of the international conference on high performance embedded architectures and compilers, Barcelona, Spain . Quilleré F, Rajopadhye S, Wilde D () Generation of efficient nested loops from polyhedra. Int J Parallel Prog (): – . Rettberg RD, Crowther WR, Carvey PP, Tomlinson RS () The Monarch parallel processor hardware design. Computer :– . Richards MA () Fundamentals of radar signal processing. McGraw-Hill, New York
. Sankaralingam K, Nagarajan R, Gratz P, Desikan R, Gulati D, Hanson H, Kim C, Liu H, Ranganathan N, Sethumadhavan S, Sharif S, Shivakumar P, Yoder W, McDonald R, Keckler SW, Burger DC () The distributed microarchitecture of the TRIPS prototype processor. In: th international symposium on microarchitecture (MICRO), Los Alamitos, Calif, Dec . Schreiber R, Cronquist DC () Near-optimal allocation of local memory arrays. Technical Report HPL--, HewlettPackard Laboratories, Feb . Taylor MB, Kim J, Miller J, Wentzlaff D, Ghodrat F, Greenwald B, Hoffmann H, Johnson P, Lee JW, Lee W, Ma A, Saraf A, Seneski M, Shnidman N, Strumpen V, Frank M, Amarasinghe S, Agarwal A () The raw microprocessor: a computational fabric for software circuits and general purpose programs. Micro, Mar . Vasilache N, Bastoul C, Cohen A, Girbal S () Violated dependence analysis. In: Proceedings of the th international conference on supercomputing (ICS’), Cairns, Queensland, Australia. ACM, New York, NY, USA, pp – . Vasilache N, Cohen A, Pouchet LN () Automatic correction of loop transformations. In: th international conference on parallel architecture and compilation techniques (PACT’), IEEE Computer Society Press, Brasov, Romania, pp –, Sept . Vasilache NT () Scalable program optimization techniques in the polyhedral model. PhD thesis, Université Paris Sud XI, Orsay, Sept . Verdoolaege S, Seghir R, Beyls K, Loechner V, Bruynooghe M () Analytical computation of Ehrhart polynomials: enabling more compiler analyses and optimizations. In: Proceedings of the international conference on compilers, architecture, and synthesis for embedded systems, ACM Press, New York, pp – . Weise D, Crew R, Ernst M, Steensgaard B () Value dependence graph: representation without taxation. In: ACM symposium on principles of programming languages, New York, pp – . Xue J () On tiling as a loop transformation. Parallel Process Lett ():–
Run Time Parallelization
Joel H. Saltz, Emory University, Atlanta, GA, USA
Raja Das, IBM Corporation, Armonk, NY, USA
Synonyms
Parallelization

Irregular array accesses arise in many scientific applications including sparse matrix solvers, unstructured mesh partial differential equation (PDE) solvers,
and particle methods. Traditional compilation techniques require that indices to data arrays be symbolically analyzable at compile time. A common characteristic of irregular applications is the use of indirect indexing to represent relationships among array elements. This means that data arrays are indexed through values of other arrays, called indirection arrays. Figure depicts a simple example of a loop with indirection arrays. The use of indirection arrays prevents compilers from identifying array data access patterns. Inability to characterize array access patterns symbolically can prevent compilers from generating efficient code for irregular applications. The inspector/executor strategy involves using compilers to generate code to examine and analyze data references during program execution. The results of this execution-time analysis may be used (1) to determine which off-processor data needs to be fetched and where the data will be stored once it is received and (2) to reorder and coordinate execution of loop iterations in problems with irregular loop-carried dependencies, as seen in the example in Fig. . The initial examination and analysis phase is called the inspector, while the phase that uses results of analysis to optimize program execution is called the executor. Irregular applications can be divided into two subclasses: static and adaptive. Static irregular applications are those in which each object in the system interacts with a predetermined fixed set of objects. The indirection arrays, which capture the object interactions, do not change during the course of the computation (e.g., unstructured mesh PDE solver). In adaptive irregular applications, each object interacts with an evolving list of objects (e.g., molecular dynamics codes).

for i = 1 to n do
   x(ia(i)) += y(ib(i))
end do

Run Time Parallelization. Fig. Example of a loop involving indirection arrays

for i = 1 to n do
   y(i) += y(ia(i)) + y(ib(i)) + y(ic(i))
end do

Run Time Parallelization. Fig. Example of loop-carried dependencies determined by a subscript array
the indirection array, which is the object interaction list, to slowly change over the life of the computation. Figure shows an example of an adaptive code. In this figure, ia, ib, and ic are the indirection arrays. Arrays x and y are the data arrays containing properties associated with the objects of the system. The inspector/executor strategy can be used to parallelize static irregular applications targeted for distributed parallel machines and can achieve high performance. Additional optimizations are required to achieve high performance when adaptive irregular applications are parallelized using the inspector/executor approach. The example adaptive code in Fig. shows that the indirection array ic does not change its value on every iteration of the outer loop. It typically has to be regenerated only every few iterations of the outer loop, providing an opportunity for optimization. Inspector/executor strategies have also been adopted to optimize I/O performance when applications need to carry out large numbers of small non-contiguous I/O requests. The inspector/executor strategy was initially articulated in the late s and early s, as described in Mirchandaney [], Saltz [], Koelbel [], Krothapalli [], and Walker []. This article addresses the use of inspector/executor methods to optimize the retrieval and management of off-processor data and the parallelization of loops with loop-carried dependencies created by indirection arrays. Use of inspector/executor methods and related compiler frameworks to optimize I/O has also been described.
Inspector/Executor Methods and Distributed Memory
The inspector/executor strategy works as follows: During program execution, the inspector examines the data references made by a processor and determines (1) which off-processor data elements need to be obtained and (2) where the data will be stored once it is received. The executor uses the information from the inspector to gather off-processor data, carry out the actual computations, and then scatter results back to their home processors after the computational phase is completed. A central strategy of inspector/executor optimization is to identify and exploit situations that permit reuse of information obtained by inspectors. As discussed below, it is often possible to relocate preprocessing outside a set of loops or to carry out interprocedural analysis and
for t = 1, step                        //outer step loop
  for i = 1, n                         //inner loop
    x(ia(i)) = x(ia(i)) + y(ib(i))
  endfor                               //end inner loop
  if (required) then
    regenerate ic(:)
  endif
  for i = 1, n                         //inner loop
    x(ic(i)) = x(ic(i)) + y(ic(i))
  endfor                               //end inner loop
endfor                                 //end outer step loop

Run Time Parallelization. Fig. Example loop from adaptive irregular applications
for i = 1, n do
  x(ia(ib(i))) = ...
end do

for i = 1, n do
  if (ic(i)) then
    x(ia(i)) = ...
  end if
end do

for i = 1, n do
  do j = ia(i), ia(i + 1)
    x(ia(j)) = ...
  end do
end do

Run Time Parallelization. Fig. Examples of loops with complex access functions
transformation so that results from an inspector can be reused many times. A variety of program analysis methods and transformations have been developed to generate programs that can make use of inspector/executor strategies. In loops with simple irregular access patterns such as those in Fig. , a single inspector/executor pair can be generated by the straightforward method described in []. Many application codes contain access patterns with more complex access functions such as those depicted in Fig. . Subscripted subscripts and subscripted guards can make indexing of one distributed array dependent on values in another. To handle this kind of situation, an inspector must itself be split into an inspector-executor pair as described in []. A dataflow framework was developed to help place executor communication calls, determine when it is safe to combine communication statements, move them into less frequently executed code regions, or avoid them altogether in favor of reusing data that is already buffered locally.
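As a concrete illustration of this strategy, the following self-contained C sketch separates the inspector and executor phases for a loop that reads a block-distributed array y through an indirection array ia. The block distribution, the array sizes, and the simulation of the gather by a plain copy are illustrative assumptions; a real runtime system would build a communication schedule and perform the gather with message passing.

#include <stdio.h>

#define N      16             /* global problem size              */
#define NPROCS 4              /* number of (simulated) processors */
#define BLK    (N / NPROCS)

/* Owner and local offset of a global index under a block distribution. */
static int owner(int g) { return g / BLK; }
static int local(int g) { return g % BLK; }

int main(void)
{
    int    me = 1;            /* pretend we are processor 1                  */
    int    ia[N];             /* indirection array (replicated in this demo) */
    double y[N];              /* data array, conceptually block-distributed  */
    double x[BLK];            /* block of x owned by processor `me`          */

    for (int g = 0; g < N; g++) { ia[g] = (g * 7) % N; y[g] = g; }
    for (int l = 0; l < BLK; l++) x[l] = 0.0;

    /* ---- inspector: scan the references y(ia(i)) made by my block of
           iterations and record the off-processor ones exactly once ---- */
    int fetch[N], slot[N], nfetch = 0;
    for (int g = 0; g < N; g++) slot[g] = -1;
    for (int i = me * BLK; i < (me + 1) * BLK; i++) {
        int g = ia[i];
        if (owner(g) != me && slot[g] < 0) {   /* off-processor, not yet seen */
            slot[g] = nfetch;                  /* assign a ghost-buffer slot  */
            fetch[nfetch++] = g;
        }
    }

    /* ---- communication phase: gather the off-processor elements into a
           ghost buffer (modeled here by copying from the replicated array) */
    double ghost[N];
    for (int k = 0; k < nfetch; k++) ghost[k] = y[fetch[k]];

    /* ---- executor: run my iterations, reading local data directly and
           off-processor data from the ghost buffer filled by the gather ---- */
    for (int i = me * BLK; i < (me + 1) * BLK; i++) {
        int g = ia[i];
        double val = (owner(g) == me) ? y[g] : ghost[slot[g]];
        x[local(i)] += val;
    }

    printf("processor %d gathered %d off-processor elements\n", me, nfetch);
    return 0;
}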
Communication Schedules
A communication schedule is used to fetch off-processor elements into a local buffer before the computation phase and to scatter computed data back to home processors. Communication schedules determine the volume of communication and the number of communication startups. A variety of approaches to schedule generation have been taken. In the CHAOS system [], a hash table data structure is used to support a two-phase schedule generation process. The index analysis phase examines data access patterns to determine which references are off-processor, remove duplicate off-processor references, assign local buffers for off-processor references, and translate global indices to local indices. The results of index analysis are managed in a hash table data structure. The schedule generation phase then produces communication schedules based on the hash table information. The CHAOS hash table contains a bit array used to identify which indirection arrays entered each element into the hash table. This approach facilitates efficient management and coordination of analysis information derived from programs with multiple indirection arrays and provides support for building incremental and merged communication schedules. In some cases it can be advantageous to employ communication schedules that carry out bounding box
or entire-array bulk read or write operations rather than using schedules that only communicate accessed array elements. These approaches differ in the costs associated with both the inspector and the executor phases. The bounding-box approach requires only that the inspector compute a bounding box containing all the necessary elements, and no inspector at all is required when the entire array is retrieved. A compiler and runtime system using performance models to choose between element-wise gather, bounding-box bulk read, and entire-array retrieval were implemented by the Titanium group, as discussed in []. A variety of techniques have been incorporated into CHAOS to efficiently support applications such as particle codes, which manifest access patterns that change from iteration to iteration. The crucial observation is that in many such codes, the communication-intensive loops are actually carrying out a generalized reduction in which it is not necessary to control the order in which elements are stored. Hwang [] demonstrates that this order-independence can be used to dramatically reduce schedule generation overhead.
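The following sketch contrasts, under a deliberately simplistic cost measure, the three options mentioned above for fetching the elements one processor needs from a single remote owner; it is meant only to illustrate the trade-off, not to reproduce the performance models used by the Titanium work cited above.

#include <stdio.h>

/* Given the distinct remote indices a processor needs from one owner,
   compare (1) element-wise gather, (2) bounding-box transfer, and
   (3) transfer of the owner's entire block. */
static void compare_schedules(const int *need, int nneed, int remote_block_size)
{
    if (nneed == 0) { printf("nothing to fetch\n"); return; }

    int lo = need[0], hi = need[0];
    for (int k = 1; k < nneed; k++) {
        if (need[k] < lo) lo = need[k];
        if (need[k] > hi) hi = need[k];
    }

    int elementwise = nneed;              /* one element per distinct reference   */
    int bounding    = hi - lo + 1;        /* contiguous span covering all of them */
    int whole       = remote_block_size;  /* the owner's entire block             */

    printf("element-wise: %d  bounding-box: %d  entire block: %d\n",
           elementwise, bounding, whole);
    /* A runtime system could pick the cheapest option while also weighing the
       inspector cost: the bounding box needs only a min/max scan, and the
       entire-block transfer needs no inspector at all. */
}

int main(void)
{
    int need[] = { 3, 9, 10, 42, 47 };
    compare_schedules(need, 5, 128);
    return 0;
}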
Runtime Parallelization
A variety of runtime parallelization techniques have been developed for indirectly indexed loops with data dependencies among iterations, as shown in Figs. and . Most of this work has targeted shared-memory architectures due to the fine-grained concurrency that typically results when such codes are parallelized. One of the earliest techniques for runtime parallelization of loops involves the use of a key field associated with each indirectly indexed array element to order accesses. The algorithm repeatedly sweeps over all loop iterations in alternating analysis and computation phases. An iteration is allowed to proceed only if all accesses to array elements y(ia(i)) and y(ib(i)) by iterations j < i have completed. These sweeps have the effect of partitioning the computation into wavefronts. Key fields are used to ensure that dependences are taken into account. This is discussed further in []. This strategy has been extended by allowing concurrent reads to the same entry, as Midkiff explains, and through the development of a systematic dependence analysis and program transformation framework for singly nested loops. Intra-procedural and interprocedural analysis techniques are presented in an article by Lin [] to analyze the common and
for i = 1 to n do
  y(ia(i)) = ...
  ...
  ... = y(ib(i))
end do

Run Time Parallelization. Fig. Subscript array-determined loop-carried dependences with output dependence
important cases of irregular single-indexed accesses and simple indirect array accesses. An inspector/executor approach, applicable to loops without output dependencies, can be carried out through generation of an inspector that identifies wavefronts of concurrently executable loop iterations, followed by an executor that transforms the original loop into two loops: a sequential outer loop over wavefronts and an inner loop over all loop indexes assigned to each wavefront, as described in Saltz []. This is analogous to loop skewing, as described by Wolfe [], except that the identification of parallelism is carried out during program execution. A doacross loop construct can also be used as an executor. The wavefronts generated by the inspector are used to sort the indices assigned to each processor into ascending wavefront order. Full/empty bit or busy-waiting synchronization enforces true dependencies; computation does not proceed until the array element required for the computation is available. A doacross executor construct was introduced in [], and this concept has been greatly refined through a variety of schemes that support operation-level synchronization in the work of Chen and Singh [, ]. A very fine-grained data flow inspector/executor scheme that also employed operation-level synchronization was proposed and tested on the CM-, as described in Chong's work []. Many situations occur in which runtime information might demonstrate (1) that all loop iterations are independent and could execute in parallel or (2) that cross-iteration dependences are reductions. A kind of inspector/executor technique has been developed that involves (1) carrying out compiler transformations that assume parallelism or reduction parallelism, (2) running the generated speculative loop, and then (3)
invoking tests to assess the correctness of the speculative parallel execution, as discussed in Rauchwerger []. The compilation and runtime support framework generates tests and recovery code that need to be invoked if the tests demonstrate that the speculation was unsuccessful.
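To make the wavefront-based inspector described above concrete, the sketch below assigns wavefront numbers to the iterations of a loop of the form y(ia(i)) = ...; ... = y(ib(i)), conservatively accounting for flow, anti, and output dependences carried through the subscript arrays. The data values are invented, and a real inspector would additionally map the wavefronts onto processors.

#include <stdio.h>

#define N 8   /* loop iterations              */
#define M 8   /* size of the indexed array y  */

int main(void)
{
    /* Loop being analyzed:  y(ia(i)) = ...;  ... = y(ib(i))  */
    int ia[N] = { 0, 1, 1, 2, 3, 3, 4, 0 };
    int ib[N] = { 5, 0, 2, 1, 1, 6, 3, 4 };

    int last_write[M], last_read[M], wf[N];
    for (int e = 0; e < M; e++) { last_write[e] = 0; last_read[e] = 0; }

    for (int i = 0; i < N; i++) {
        int w = ia[i];   /* element written by iteration i */
        int r = ib[i];   /* element read by iteration i    */

        /* Iteration i must follow the wavefront of the last writer of the
           element it reads (flow), the last writer of the element it writes
           (output), and the last reader of the element it writes (anti). */
        int level = last_write[r];
        if (last_write[w] > level) level = last_write[w];
        if (last_read[w]  > level) level = last_read[w];
        wf[i] = level + 1;

        if (wf[i] > last_write[w]) last_write[w] = wf[i];
        if (wf[i] > last_read[r])  last_read[r]  = wf[i];
        printf("iteration %d -> wavefront %d\n", i, wf[i]);
    }
    /* All iterations with the same wavefront number are independent; the
       executor then loops over the wavefronts sequentially. */
    return 0;
}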
Structural Abstraction
Over the course of a decade Baden explored three successive runtime systems to treat data-dependent communication patterns and workload distributions: LPAR, LPAR-X, and KeLP [, ]. The libraries were used in various applications including structured adaptive mesh refinement for first-principles simulation of real materials, genetic algorithms, cluster identification for spin models in statistical mechanics, turbulent flow, and cell microphysiology. This effort contributed the notion of structural abstraction, which supports user-level geometric meta-data that admit the data-dependent decompositions used in irregular applications. These also enable communication to be expressed in a high-level geometric form which may be manipulated at runtime to optimize communication. KeLP was later extended to handle more general notions of sparse sets of data that logically communicate within rectangular subspaces of Z^d, even if the data does not exist as rectangular sets of data, e.g., particles, as noted by Baden. A multi-tier variant of KeLP, KeLP, subsequently appeared to treat hierarchical parallelism in the then-emerging SMP cluster technologies, with support to mask communication delays [].
Loop Transformation
The compiler parallelizes the loop shown in Fig. by generating the corresponding inspector/executor pair. Assume that all arrays are aligned and distributed in blocks among the processors, and that the iterations of the i-loop are likewise block-partitioned. The resulting computation mapping is equivalent to that produced by the "owner computes" rule, a heuristic that maps the computation of an assignment statement to the processor that owns the left-hand-side reference. Data array y is indexed using array ia, causing a single level of indirection. The compiler generates a single executable: a copy of this executable runs on each processor. The program running on each processor determines what processor
for i = 1 to n do
  x(i) = y(ia(i)) + z(i)
end do

Run Time Parallelization. Fig. Simple irregular loop
it is running on, and uses that information to figure out where its data and iteration fit in the global computation. Let my_elements represent the number of iterations assigned to a processor and also the number of elements of data arrays x, y, z and indirection array ia assigned to the processor. The code generated by the compiler and executed by each processor is shown in Fig. .
Compiler Implementations
Inspector/executor schemes have been implemented in a variety of compilers. Initial inspector/executor implementations in the early s were carried out in the ARF [], [] and Kali [] compilers, as discussed by Saltz and Koelbel, respectively. Inspector/executor schemes were subsequently implemented in Fortran D; Fortran D [, ]; the Polaris compiler discussed by Lin []; the SUPERB [] compiler; the Vienna Fortran compiler []; the Titanium Java compiler []; compilation systems designed to support efficient implementation of OpenMP on platforms that support MPI, discussed by Basumallik [], and on platforms that support Global Arrays, discussed by Liu []; and in HPF compilers, described in examples by Benkner and Merlin [, ]. Many of these compiler frameworks support user-specified data and iteration partitioning methods. A framework that supports compiler-based analysis of runtime data and iteration reordering transformations is described in []. Speculative parallelization techniques have been proposed to effectively parallelize loops with complex data access patterns. Various software [, , ] and hardware [, ] techniques have been proposed. The hardware techniques involve increased hardware complexity but can catch cross-iteration data dependencies immediately with minimal overhead. On the other hand, full software solutions do not require new hardware but in certain cases can lead to significant performance penalties. The first known work on speculative parallelization techniques to parallelize loops
//inspector code
for i = 1, my_elements
  index_y(i) = ia(i)
endfor
sched_ia = generate_schedule(index_y)      //call the inspector schedule generation codes
call gather(y(begin_buffer), y, sched_ia)

//executor code
for i = 1, my_elements
  x(i) = y(index_y(i)) + z(i)
endfor

Run Time Parallelization. Fig. Transformed irregular loop
with complex data access patterns was proposed by Rauchwerger et al. []. Instead of breaking loops into inspector/executor computations, they perform a speculative execution of the loop as a doall with a runtime check (the LRPD test) to assess whether there are any cross-iteration dependencies. The scheme supports back-tracking and serial re-execution of the loop if the runtime check fails. The biggest advantage of this technique is that the access pattern of the data arrays does not have to be analyzed separately, and the runtime tests are performed during the actual computation of the loop. They used this technique on loops from the PERFECT Benchmarks that cannot be parallelized statically and showed that in many cases it leads to code that performs better than the inspector/executor strategy. The scheme can be used for automatic parallelization of complex loops, but its main limitation is that dependency violations are detected only after the computation of the loop, and a severe penalty has to be paid in terms of rolling back the computation and executing it serially. Gupta et al. [] proposed a number of new runtime tests that improve the techniques presented in [], thus significantly reducing the penalties associated with misspeculation. Dang et al. [] proposed a technique that transforms a partially parallel loop into a sequence of fully parallel loops. It is a recursive technique in which each step executes all remaining iterations of the irregular loop in parallel. After the execution, the LRPD test is applied to detect dependencies. The iterations that were correct are committed, and the recursive process starts again with the remaining iterations. The limitation of the technique is that the loop iterations have to be statically block-scheduled in increasing order
of iterations between the processors, possibly causing work imbalance.
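The sketch below conveys the flavor of such a runtime test in a strongly simplified form: shadow arrays record which iteration wrote or read each element during the speculative doall, and a post-pass declares the speculation failed if a cross-iteration dependence is detected. The actual LRPD test additionally recognizes privatizable and reduction variables and is engineered to run in parallel with low overhead; the subscript data here are invented.

#include <stdio.h>

#define N 8   /* loop iterations                  */
#define M 8   /* elements of the array under test */

int main(void)
{
    int ia[N] = { 0, 1, 2, 3, 4, 5, 6, 7 };   /* write subscripts */
    int ib[N] = { 2, 3, 0, 1, 6, 7, 4, 5 };   /* read  subscripts */

    int writer[M], multi_writer[M], first_reader[M], other_reader[M];
    for (int e = 0; e < M; e++) {
        writer[e] = -1; multi_writer[e] = 0;
        first_reader[e] = -1; other_reader[e] = 0;
    }

    /* Speculative doall: each iteration marks shadow arrays for the elements
       it writes (ia) and reads (ib); the marking is shown sequentially here. */
    for (int i = 0; i < N; i++) {
        int w = ia[i], r = ib[i];
        if (writer[w] != -1 && writer[w] != i) multi_writer[w] = 1;
        writer[w] = i;
        if (first_reader[r] == -1) first_reader[r] = i;
        else if (first_reader[r] != i) other_reader[r] = 1;
    }

    /* Post-pass: the speculation was valid only if no element was written by
       two different iterations and no element was read by an iteration other
       than the one that wrote it. */
    int failed = 0;
    for (int e = 0; e < M; e++) {
        if (multi_writer[e]) failed = 1;
        if (writer[e] != -1 && first_reader[e] != -1 &&
            (first_reader[e] != writer[e] || other_reader[e])) failed = 1;
    }
    printf(failed ? "cross-iteration dependence found: re-execute serially\n"
                  : "loop executed as a fully parallel doall\n");
    return 0;
}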
Bibliography . Baden SB, Fink SJ () A programming methodology for dualtier multicomputers. IEEE Trans Softw Eng ():– . Baden SB, Kohn SR () Portable parallel programming of numerical problems under the LPAR system. J Parallel Distrib Comput ():– . Basumallik A, Eigenmann R () Optimizing irregular sharedmemory applications for distributed-memory systems. In: Proceedings of the eleventh ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP’), New York, – Mar . ACM, New York, pp – . Benkner S, Mehrotra P, Van Rosendale J, Zima H () Highlevel management of communication schedules in HPF-like languages. In: Proceedings of the th international conference on supercomputing (ICS’), Melbourne. ACM, New York, pp – . Brezany P, Gerndt M, Sipkova V, Zima HP () SUPERB support for irregular scientific computations. In: Proceedings of the scalable high performance computing conference (SHPCC-), Williamsburg, – Apr , pp – . Chen DK, Torrellas J, Yew PC () An efficient algorithm for the run-time parallelization of DOACROSS loops. In: Proceedings of the ACM/IEEE conference on supercomputing, Washington, DC, pp – . Chong FT, Sharma SD, Brewer EA, Saltz JH () Multiprocessor runtime support for fine-grained, irregular DAGs. Parallel Process Lett :– . Cintra M, Martinez J, Torrellas J () Architectural support for scalable speculative parallelization in shared-memory multiprocessors. In: Proceedings of th annual international symposium on computer architecture, Vancouver, pp – . Dang F, Yu H, Rauchwerger L () The R-LRPD test: speculative parallelization of partially parallel loops. In: Proceedings of the international parallel and distributed processing symposium (IPDPS), Ft. Lauderdale
. Das R, Saltz J, von Hanxleden R () Slicing analysis and indirect accesses to distributed arrays. Lecture notes in computer science, vol . Springer, Berlin/Heidelberg . Gupta M, Nim R () Techniques for speculative run-time parallelization of loops. In: Proceedings of the ACM/IEEE conference on supercomputing (SC’), pp – . Hwang YS, Moon B, Sharma SD, Ponnusamy R, Das R, Saltz JH () Runtime and language support for compiling adaptive irregular programs. Softw Pract Exp ():– . Koelbel C, Mehrotra P, Van Rosendale J () Supporting shared data structures on distributed memory machines. In: Symposium on principles and practice of parallel programming. ACM, New York, pp – . Krothapalli VP, Sadayappan P () Dynamic scheduling of DOACROSS loops for multiprocessors. In: International conference on databases, parallel architectures and their applications (PARBASE-), Miami, pp – . Lin Y, Padua DA () Compiler analysis of irregular memory accesses. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation. Vancouver, pp – . Liu Z, Huang L, Chapman BM, Weng TH () Efficient implementation of OpenMP for clusters with implicit data distribution. WOMPAT, Houston, pp – . Lusk EL, Overbeek RA () A minimalist approach to portable, parallel programming. In: Jamieson L, Gannon D, Douglass R (eds) The characteristics of parallel algorithms. MIT Press, Cambridge, MA, pp – . Merlin JH, Baden SB, Fink S, Chapman BM () Multiple data parallelism with HPF and KeLP. Future Gener Comput Syst ():– . Midkiff SP, Padua DA () Compiler algorithms for synchronization. IEEE Trans Comput ():– . Mirchandaney R, Saltz JH, Smith RM, Nicol DM, Crowley K () Principles of runtime support for parallel processors. In: Proceedings of the second international conference on supercomputing (ICS’), St. Malo, pp – . Ponnusamy R, Hwang YS, Das R, Saltz JH, Choudhary A, Fox G () Supporting irregular distributions using data-parallel languages. Parallel Distrib Technol Syst Appl ():– . Prvulovic M, Garzaran MJ, Rauchwerger L, Torrellas J () Removing architectural bottlenecks to the scalability of speculative parallelization. In: Proceedings, th annual international symposium on Computer architecture, pp –
. Rauchwerger L, Padua D () The LRPD test: speculative runtime parallelization of loops with privatization and reduction parallelization. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation (PLDI’), La Jolla, – June . ACM, New York, pp – . Saltz JH, Berryman H, Wu J () Multiprocessors and run-time compilation. Concurr Pract Exp ():– . Saltz JH, Mirchandaney R, Crowley K () Run-time parallelization and scheduling of loops. IEEE Trans Comput ():– . Singh DE, Martín MJ, Rivera FF () Increasing the parallelism of irregular loops with dependences. In: Euro-Par, Klagenfurt, pp – . Strout M, Carter L, Ferrante J () Compile-time composition of run-time data and iteration reordering. Program Lang Des Implement ():– . Su J, Yelick K () Automatic support for irregular computations in a high-level language. In: Proceedings, th IEEE International Parallel and distributed processing symposium, Atlanta, pp b, – Apr . Ujaldon M, Zapata EL, Chapman BM, Zima HP () ViennaFortran/HPF extensions for sparse and irregular problems and their compilation. IEEE Trans Parallel Distrib Syst ():– . von Hanxleden R, Kennedy K, Koelbel C, Das R, Saltz J () Compiler analysis for irregular problems in Fortran D. In: Proceedings of the fifth international workshop on languages and compilers for parallel computing. Springer, London, pp – . Wu J, Das R, Saltz J, Berryman H, Hiranandani S () Distributed memory compiler design for sparse problems. IEEE Trans Comput ():– . Walker D () The implementation of a three-dimensional PIC Code on a hypercube concurrent processor. In: Conference on hypercubes, concurrent computers, and application, Pasadena . Wolfe M () More iteration space tiling. In: Proceedings of the ACM/IEEE conference on supercomputing (Supercomputing’), pp –
Runtime System
Operating System Strategies

S

Scalability Metrics
Single System Image

Scalable Coherent Interface (SCI)
SCI (Scalable Coherent Interface)
ScaLAPACK
Jack Dongarra, Piotr Luszczek
University of Tennessee, Knoxville, TN, USA
Definition
ScaLAPACK is a library of high-performance linear algebra routines for distributed-memory message-passing MIMD computers and networks of workstations supporting PVM [] and/or MPI [, ]. It is a continuation of the LAPACK [] project, which designed and produced analogous software for workstations, vector supercomputers, and shared-memory parallel computers.
Discussion
Both the LAPACK and ScaLAPACK libraries contain routines for solving systems of linear equations, least squares problems, and eigenvalue problems. The goals of both projects are efficiency (to run as fast as possible), scalability (as the problem size and number of processors grow), reliability (including error bounds), portability (across all important parallel machines), flexibility (so users can construct new routines from well-designed parts), and ease of use (by making the interface to LAPACK and ScaLAPACK look as similar as possible). Many of these goals, particularly
portability, are aided by developing and promoting standards, especially for low-level communication and computation routines. These goals have been successfully attained, limiting most machine dependencies to two standard libraries called the BLAS, or Basic Linear Algebra Subprograms [–], and the BLACS, or Basic Linear Algebra Communication Subprograms [, ]. LAPACK will run on any machine where the BLAS are available, and ScaLAPACK will run on any machine where both the BLAS and the BLACS are available. The library is written in Fortran (with the exception of a few symmetric eigenproblem auxiliary routines written in C to exploit IEEE arithmetic) in a Single Program Multiple Data (SPMD) style using explicit message passing for interprocessor communication. The name ScaLAPACK is an acronym for Scalable Linear Algebra PACKage, or Scalable LAPACK. ScaLAPACK can solve systems of linear equations, linear least squares problems, eigenvalue problems, and singular value problems. ScaLAPACK can also handle many associated computations such as matrix factorizations or estimating condition numbers. Like LAPACK, the ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy. The fundamental building blocks of the ScaLAPACK library are distributed-memory versions of the Level 1, Level 2, and Level 3 BLAS, called the Parallel BLAS or PBLAS [, ], and a set of Basic Linear Algebra Communication Subprograms (BLACS) [, ] for communication tasks that arise frequently in parallel linear algebra computations. In the ScaLAPACK routines, the majority of interprocessor communication occurs within the PBLAS, so the source code of the top software layer of ScaLAPACK looks similar to that of LAPACK. ScaLAPACK contains driver routines for solving standard types of problems, computational routines to perform a distinct computational task, and auxiliary routines to perform a certain subtask or common
low-level computation. Each driver routine typically calls a sequence of computational routines. Taken as a whole, the computational routines can perform a wider range of tasks than are covered by the driver routines. Many of the auxiliary routines may be of use to numerical analysts or software developers, so the Fortran source for these routines has been documented with the same level of detail used for the ScaLAPACK computational routines and driver routines. Dense and band matrices are provided for, but not general sparse matrices. Similar functionality is provided for real and complex matrices. However, not all the facilities of LAPACK are covered by ScaLAPACK yet. ScaLAPACK is designed to give high efficiency on MIMD distributed-memory concurrent supercomputers, such as the older Intel Paragon, IBM SP series, and Cray T series, as well as newer ones such as the IBM Blue Gene series and Cray XT series of supercomputers. In addition, the software is designed so that it can be used with clusters of workstations through a networked environment and with a heterogeneous computing environment via PVM or MPI. Indeed, ScaLAPACK can run on any machine that supports either PVM or MPI. The ScaLAPACK strategy for combining efficiency with portability is to construct the software so that as much as possible of the computation is performed by
ScaLAPACK. Fig. ScaLAPACK’s software hierarchy
calls to the Parallel Basic Linear Algebra Subprograms (PBLAS). The PBLAS [, ] perform global computation by relying on the Basic Linear Algebra Subprograms (BLAS) [–] for local computation and the Basic Linear Algebra Communication Subprograms (BLACS) [, ] for communication. The efficiency of ScaLAPACK software depends on the use of block-partitioned algorithms and on efficient implementations of the BLAS and the BLACS being provided by computer vendors (and others) for their machines. Thus, the BLAS and the BLACS form a low-level interface between ScaLAPACK software and different machine architectures. Above this level, all of the ScaLAPACK software is portable. The BLAS, PBLAS, and the BLACS are not, strictly speaking, part of ScaLAPACK. C code for the PBLAS is included in the ScaLAPACK distribution. Since the performance of the package depends upon the BLAS and the BLACS being implemented efficiently, they have not been included with the ScaLAPACK distribution; a machine-specific implementation of the BLAS and the BLACS should be used. If a machine-optimized version of the BLAS is not available, a Fortran reference implementation of the BLAS is available from Netlib []. This code constitutes the "model implementation" [, ]. The model implementation of the BLAS is not expected to perform as well as a specially tuned implementation on most high-performance computers – on
some machines it may give much worse performance – but it allows users to run ScaLAPACK codes on machines that do not offer any other implementation of the BLAS. If a vendor-optimized version of the BLACS is not available for a specific architecture, efficiently ported versions of the BLACS are available on Netlib. Currently, the BLACS have been efficiently ported to machine-specific message-passing libraries such as the IBM (MPL) and Intel (NX) message-passing libraries, as well as to more generic interfaces such as PVM and MPI. The BLACS overhead has been shown to be negligible []. Refer to the blacs directory on Netlib for more details: http://www.netlib.org/blacs/index.html. Figure describes the ScaLAPACK software hierarchy. The components below the line, labeled local addressing, are called on a single processor, with arguments stored on single processors only. The components above the line, labeled global addressing, are synchronous parallel routines, whose arguments include matrices and vectors distributed across multiple processors. LAPACK and ScaLAPACK are freely available software packages provided on the World Wide Web on Netlib. They can be, and have been, included in commercial packages. The authors ask only that proper credit be given to them, which is very much like the modified BSD license.
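ScaLAPACK lays matrices out in a two-dimensional block-cyclic fashion over a logical process grid. The sketch below shows the index arithmetic of such a mapping in one dimension, applied independently to rows and columns; it illustrates the layout only and is not ScaLAPACK's actual descriptor machinery.

#include <stdio.h>

/* One-dimensional block-cyclic mapping: global index g, block size nb,
   p processes. Applied independently to the row and column dimensions,
   this yields the two-dimensional distribution used by ScaLAPACK. */
static void map_index(int g, int nb, int p, int *proc, int *loc)
{
    int block = g / nb;            /* which block the index falls into */
    *proc = block % p;             /* blocks are dealt out cyclically  */
    *loc  = (block / p) * nb       /* preceding local blocks           */
          + g % nb;                /* offset within the block          */
}

int main(void)
{
    int nb = 4, p = 3;
    for (int g = 0; g < 16; g++) {
        int proc, loc;
        map_index(g, nb, p, &proc, &loc);
        printf("global %2d -> process %d, local %2d\n", g, proc, loc);
    }
    return 0;
}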
Related Entries
LAPACK
Linear Algebra, Numerical

Bibliography
. Geist A, Beguelin A, Dongarra J, Jiang W, Manchek R, Sunderam V () Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA . MPI Forum, MPI: A message passing interface standard, International Journal of Supercomputer Applications and High Performance Computing, (), pp –. Special issue on MPI. Also available electronically, the URL is ftp://www.netlib.org/mpi/mpi-report.ps . Snir M, Otto SW, Huss-Lederman S, Walker DW, Dongarra JJ () MPI: The Complete Reference, MIT Press, Cambridge, MA
. Anderson E, Bai Z, Bischof C, Blackford LS, Demmel JW, Dongarra J, Du Croz J, Greenbaum A, Hammarling S, McKenney A, Sorensen D, LAPACK Users' Guide, SIAM . Hanson R, Krogh F, Lawson CA () A proposal for standard linear algebra subprograms, ACM SIGNUM Newsl . Lawson CL, Hanson RJ, Kincaid D, Krogh FT () Basic Linear Algebra Subprograms for Fortran Usage, ACM Trans Math Soft :– . Dongarra JJ, Du Croz J, Hammarling Richard S, Hanson J (March ) An Extended Set of FORTRAN Basic Linear Algebra Subroutines, ACM Trans Math Soft ():– . Dongarra JJ, Du Croz J, Duff IS, Hammarling S (March ) A Set of Level Basic Linear Algebra Subprograms, ACM Trans Math Soft ():– . Dongarra J, van de Geijn R () Two dimensional basic linear algebra communication subprograms, Computer Science Dept. Technical Report CS-–, University of Tennessee, Knoxville, TN (Also LAPACK Working Note #) . Dongarra J, Whaley RC () A user's guide to the BLACS v., Computer Science Dept. Technical Report CS-–, University of Tennessee, Knoxville, TN (Also LAPACK Working Note #) . Choi J, Dongarra J, Ostrouchov S, Petitet A, Walker D, Whaley RC (May ) A proposal for a set of parallel basic linear algebra subprograms, Computer Science Dept. Technical Report CS-–, University of Tennessee, Knoxville, TN (Also LAPACK Working Note #) . Petitet A () Algorithmic Redistribution Methods for Block Cyclic Decompositions, PhD thesis, University of Tennessee, Knoxville, TN . Dongarra JJ, Grosse E (July ) Distribution of Mathematical Software via Electronic Mail, Communications of the ACM (): – . Dongarra JJ, du Croz J, Duff IS, Hammarling S () Algorithm : A set of Level Basic Linear Algebra Subprograms, ACM Trans Math Soft :– . Dongarra JJ, DU Croz J, Hammarling S, Hanson RJ () Algorithm : An extended set of FORTRAN Basic Linear Algebra Subroutines, ACM Trans Math Soft :–

Scalasca
Felix Wolf
Aachen University, Aachen, Germany
Synonyms
The predecessor of Scalasca, from which Scalasca evolved, is known by the name of KOJAK.
Definition
Scalasca is an open-source software tool that supports the performance optimization of parallel programs by
measuring and analyzing their runtime behavior. The analysis identifies potential performance bottlenecks – in particular those concerning communication and synchronization – and offers guidance in exploring their causes. Scalasca targets mainly scientific and engineering applications based on the programming interfaces MPI and OpenMP, including hybrid applications based on a combination of the two. The tool has been specifically designed for use on large-scale systems including IBM Blue Gene and Cray XT, but is also well suited for small- and medium-scale HPC platforms.
Discussion

Introduction
Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is expanding from generation to generation. As a consequence, supercomputing applications are required to harness much higher degrees of parallelism in order to satisfy their enormous demand for computing power. However, with today's leadership systems featuring more than a hundred thousand cores, writing efficient codes that exploit all the available parallelism becomes increasingly difficult. Performance optimization is therefore expected to become an even more essential software-process activity, critical for the success of many simulation projects. The situation is exacerbated by the fact that the growing number of cores imposes scalability demands not only on applications but also on the software tools needed for their development. Making applications run efficiently on larger scales is often thwarted by excessive communication and synchronization overheads. Especially during simulations of irregular and dynamic domains, these overheads are often enlarged by wait states that appear in the wake of load or communication imbalance when processes fail to reach synchronization points simultaneously. Even small delays of single processes may spread wait states across the entire machine, and their accumulated duration can constitute a substantial fraction of the overall resource consumption. In particular, when trying to scale communication-intensive applications to large processor counts, such wait states can result in substantial performance degradation.
To address these challenges, Scalasca has been designed as a diagnostic tool to support application optimization on highly scalable systems. Although also covering single-node performance via hardware-counter measurements, Scalasca mainly targets communication and synchronization issues, whose understanding is critical for scaling applications to performance levels in the petaflops range. A distinctive feature of Scalasca is its ability to identify wait states that occur, for example, as a result of unevenly distributed workloads.

Functionality
To evaluate the behavior of parallel programs, Scalasca takes performance measurements at runtime to be analyzed postmortem (i.e., after program termination). The user of Scalasca can choose between two different analysis modes:
● Performance overview on the call-path level via profiling (called runtime summarization in Scalasca terminology)
● In-depth study of application behavior via event tracing

In profiling mode, Scalasca generates aggregate performance metrics for individual function call paths, which are useful for identifying the most resource-intensive parts of the program and assessing process-local performance via hardware-counter analysis. In tracing mode, Scalasca goes one step further and records individual performance-relevant events, allowing the automatic identification of call paths that exhibit wait states. This core feature is the reason why Scalasca is classified as an automatic tool. As an alternative, the resulting traces can be visualized in a traditional time-line browser such as VAMPIR to study the detailed interactions among different processes or threads. While providing more behavioral detail, traces also consume significantly more storage space and therefore have to be generated with care. Figure shows the basic analysis workflow supported by Scalasca. Before any performance data can be collected, the target application must be instrumented, that is, probes must be inserted into the code which carry out the measurements. This can happen at different levels, including source code, object code, or library.
(Figure components: source modules; instrumenter/compiler/linker; instrumented executable with measurement library; instrumented target application; profile report; local event traces; parallel wait-state search; wait-state report; report manipulation; optimized measurement configuration; report explorer; trace browser.)
Scalasca. Fig. Schematic overview of the performance data flow in Scalasca. Gray rectangles denote programs and white rectangles with the upper right corner turned down denote files. Stacked symbols denote multiple instances of programs or files, in most cases running or being processed in parallel. Hatched boxes represent optional third-party components
Before running the instrumented executable on the parallel machine, the user can choose between generating a profile or an event trace. When tracing is enabled, each process generates a trace file containing records for its process-local events. To prevent traces from becoming too large or inaccurate as a result of measurement intrusion, it is generally recommended to optimize the instrumentation based on a previously generated profile report. After program termination, Scalasca loads the trace files into main memory and analyzes them in parallel using as many cores as have been used for the target application itself. During the analysis, Scalasca searches for wait states, classifies detected instances by category, and quantifies their significance. The result is a wait-state report similar in structure to the profile report but enriched with higher-level communication and synchronization inefficiency metrics. Both profile and wait-state reports contain performance metrics for every combination of function call path and process/thread and can be interactively examined in the provided analysis report explorer (Fig. ) along the dimensions of performance metric, call tree, and system. In addition, reports can be combined or manipulated to allow comparisons or aggregations, or to focus the analysis on specific extracts of a report. For example, the difference between two reports can be calculated to assess the effectiveness of an optimization, or a new report can be generated after eliminating uninteresting phases (e.g., initialization).
Instrumentation
Preparation of a target application executable for measurement and analysis requires it to be instrumented to notify the measurement library, which is linked to the executable, of performance-relevant execution events whenever they occur at runtime. On all systems, a mix of manual and automatic instrumentation mechanisms is offered. Instrumentation configuration and processing of source files are achieved by prefixing selected compilation commands and the final link command with the Scalasca instrumenter, without requiring other changes to optimization levels or the build process, as in the following example for the file foo.c:

> scalasca -instrument mpicc -c foo.c
Scalasca follows a direct instrumentation approach. In contrast to interrupt-based sampling, which takes periodic measurements whenever a timer expires, Scalasca takes measurements when the control flow reaches certain points in the code. These points mark performance-relevant events, such as entering/leaving a function or sending/receiving a message. Although instrumentation points have to be chosen with care to minimize intrusion, direct instrumentation offers advantages for the global analysis of communication and synchronization operations. In addition to pure direct instrumentation, future versions of Scalasca will combine direct instrumentation with sampling in
Scalasca. Fig. Interactive performance-report exploration for the Zeus MP/ code [] with Scalasca. The left pane lists different performance metrics arranged in a specialization hierarchy. The selected metric is a frequently occurring wait state called Late Sender, during which a process waits for a message that has not yet been sent (Fig. a). The number and the color of the icon to the left of the label indicate the percentage of execution time lost due to this wait state, in this case .%. In the middle pane, the user can see which call paths are most affected. For example, .% of the .% is caused by the call path zeusmp() → transprt() → ct() → MPI_Waitall(). The right pane shows the distribution of waiting times for the selected combination of wait state and call path across the virtual process topology. The display indicates that most wait states occur at the outer rim of a spherical region in the center of the three-dimensional Cartesian topology
profiling mode to better control runtime dilation, while still supporting global communication analyses [].
Measurement
Measurements are collected and analyzed under the control of a workflow manager that determines how the application should be run, and then configures measurement and analysis accordingly. When tracing is requested, it automatically configures and executes the parallel trace analyzer with the same number of processes as used for measurement. The following examples demonstrate how to request measurements of the MPI application bar executed with 65,536 ranks, once in profiling and once in tracing mode (distinguished by the use of the "-t" option):

> scalasca -analyze mpiexec -np 65536 bar <arglist>
> scalasca -analyze -t mpiexec -np 65536 bar <arglist>
Call-Path Profiling
Scalasca can efficiently calculate many execution performance metrics by accumulating statistics during measurement, avoiding the cost of storing them with events for later analysis. For example, elapsed times and hardware-counter metrics for source regions
Scalasca
(e.g., routines or loops) can be immediately determined and the differences accumulated. Whereas trace storage requirements increase in proportion to the number of events (dependent on the measurement duration), summarized statistics for a call-path profile per thread have a fixed storage requirement (dependent on the number of threads and executed call paths). In addition to call-path visit counts, execution times, and optional hardware counter metrics, Scalasca profiles include various MPI statistics, such as the numbers of synchronization, communication and file I/O operations along with the associated number of bytes transferred. Each metric is broken down into collective versus point-to-point/individual, sends/writes versus receives/reads, and so on. Call-path execution times separate MPI message-passing and OpenMP multithreading costs from purely local computation, and break them down further into initialization/finalization, synchronization, communication, file I/O and thread management overheads (as appropriate). For measurements using OpenMP, additional thread idle time and limited parallelism metrics are derived, assuming a dedicated core for each thread. Scalasca provides accumulated metric values for every combination of call path and thread. A call path is defined as the list of code regions entered but not yet left on the way to the currently active one, typically starting from the main function, such as in the call chain main() → foo() → bar(). Which regions actually appear on a call path depends on which regions have been instrumented. When execution is complete, all locally executed call paths are combined into a global dynamic call tree for interactive exploration (as shown in the middle of Fig. , although the screen shot actually visualizes a wait-state report).
Wait-State Analysis
In message-passing applications, processes often require access to data provided by remote processes, making the progress of a receiving process dependent upon the progress of a sending process. If a rendezvous protocol is used, this relationship also applies in the opposite direction. Collective synchronization is similar in that its completion requires each participating process to have reached a certain point. As a consequence, a significant fraction of the time spent in communication and synchronization routines can often be attributed to wait
states that occur when processes fail to reach implicit or explicit synchronization points in a timely manner. Scalasca provides a diagnostic method that allows the localization of wait states by automatically searching event traces for characteristic patterns. Figure shows several examples of wait states that can occur in message-passing programs. The first one is the Late Sender pattern (Fig. a), where a receiver is blocked while waiting for a message to arrive. That is, the receive operation is entered by the destination process before the corresponding send operation has been entered by the source process. The waiting time lost as a consequence is the time difference between entering the send and the receive operations. Conversely, the Late Receiver pattern (Fig. b) describes a sender that is likely to be blocked while waiting for the receiver when a rendezvous protocol is used. This can happen for several reasons. Either the MPI implementation is working in synchronous mode by default, or the size of the message to be sent exceeds the available MPI-internal buffer space and the operation is blocked until the data is transferred to the receiver. The Late Sender / Wrong Order pattern (Fig. c) describes a receiver waiting for a message, although an earlier message is ready to be received by the same destination process (i.e., messages received in the wrong order). Finally, the Wait at N×N pattern (Fig. d) quantifies the waiting time due to the inherent synchronization in n-to-n operations, such as MPI_Allreduce(). A full list of the wait-state types supported by Scalasca including explanatory diagrams can be found online in the Scalasca documentation [].
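For illustration, the following self-contained MPI program (not part of the Scalasca distribution) produces a textbook Late Sender: rank 0 computes before sending, while rank 1 enters its receive immediately and has to wait.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        sleep(2);                              /* imbalanced computation */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double t = MPI_Wtime();
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 waited about %.1f s for the message (Late Sender)\n",
               MPI_Wtime() - t);
    }
    MPI_Finalize();
    return 0;
}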
Parallel Wait-State Search
To accomplish the search in a scalable way, Scalasca exploits both distributed memory and parallel processing capabilities available on the target system. After the target application has terminated and the trace data have been flushed to disk, the trace analyzer is launched with one analysis process per (target) application process and loads the entire trace data into its distributed memory address space. Future versions of Scalasca may exploit persistent memory segments available on systems such as Blue Gene/P to pass the trace data to the analysis stage without involving any file I/O. While traversing the traces in parallel, the analyzer performs a replay of the application's original communication. During the replay, the analyzer
(Figure panels, each showing processes over time: (a) Late Sender, (b) Late Receiver, (c) Late Sender / Wrong Order, (d) Wait at N × N.)
Scalasca. Fig. Four examples of wait states (a–d) detected by Scalasca. The combination of MPI functions used in each of these examples represents just one possible case. For instance, the blocking receive operation in wait state (a) can be replaced by a non-blocking receive followed by a wait operation. In this case, the waiting time would occur during the wait operation
identifies wait states in communication and synchronization operations by measuring temporal differences between local and remote events after their time stamps have been exchanged using an operation of similar type. Detected wait-state instances are classified and quantified according to their significance for every call path and system resource involved. Since trace processing capabilities (i.e., processors and memory) grow proportionally with the number of application processes, good scalability can be achieved even at previously intractable scales. Recent scalability improvements allowed Scalasca to complete trace analyses of runs with up to , cores on a -rack IBM Blue Gene/P system []. A modified form of the replay-based trace analysis scheme is also applied to detect wait states occurring in MPI- RMA operations. In this case, RMA communication is used to exchange the required information between processes. Finally, Scalasca also
provides the ability to process traces from hybrid MPI/OpenMP and pure OpenMP applications. However, the parallel wait-state search does not yet recognize OpenMP-specific wait states, such as barrier waiting time or lock contention, previously supported by its predecessor.
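A minimal sketch of this measurement idea is shown below: during the replay, the analysis process acting for the sender re-sends the recorded enter time stamp of the send event, and the process acting for the receiver compares it with the recorded enter time stamp of the matching receive. The function name and the hard-coded time stamps are invented for illustration.

#include <mpi.h>
#include <stdio.h>

/* Conceptual replay of one point-to-point message during trace analysis.
   send_enter / recv_enter stand for the enter time stamps recorded in the
   traces; in a real analyzer each rank knows only its own value. */
static double late_sender_time(int rank, double send_enter, double recv_enter,
                               int sender, int receiver)
{
    double waiting = 0.0;
    if (rank == sender) {
        /* replay the message, carrying the recorded enter time as payload */
        MPI_Send(&send_enter, 1, MPI_DOUBLE, receiver, 0, MPI_COMM_WORLD);
    } else if (rank == receiver) {
        double remote_enter;
        MPI_Recv(&remote_enter, 1, MPI_DOUBLE, sender, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (remote_enter > recv_enter)   /* the receive was entered first */
            waiting = remote_enter - recv_enter;
    }
    return waiting;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pretend the traces recorded these enter times for one message. */
    double w = late_sender_time(rank, 12.5, 10.0, /*sender=*/0, /*receiver=*/1);
    if (rank == 1)
        printf("Late Sender waiting time: %.1f\n", w);

    MPI_Finalize();
    return 0;
}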
Wait-State Search on Clusters without Global Clock
To allow accurate trace analyses on systems without globally synchronized clocks, Scalasca can synchronize inaccurate time stamps postmortem. Linear interpolation based on clock offset measurements during program initialization and finalization already accounts for differences in offset and drift, assuming that the drift of an individual processor is not time dependent. This step is mandatory on all systems without a global clock, such as Cray XT and most PC or compute blade clusters. However, inaccuracies and drifts
varying over time can still cause violations of the logical event order that are harmful to the accuracy of the analysis. For this reason, Scalasca compensates for such violations by shifting communication events in time as much as needed to restore the logical event order while trying to preserve the length of intervals between local events. This logical synchronization is currently optional and should be performed if the trace analysis reports (too many) violations of the logical event order.
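The interpolation step can be sketched as follows: given the offset of a local clock relative to a designated master clock, measured once at initialization and once at finalization, every local time stamp is shifted by a linearly interpolated correction. The code is purely illustrative and not taken from Scalasca.

#include <stdio.h>

/* Offsets of the local clock relative to a master clock, measured at
   initialization (t_init) and finalization (t_fin) of the program. */
typedef struct {
    double t_init, offset_init;
    double t_fin,  offset_fin;
} ClockModel;

/* Correct a local time stamp by linear interpolation between the two
   offset measurements, which accounts for constant drift. */
static double correct(const ClockModel *m, double t)
{
    double slope = (m->offset_fin - m->offset_init) / (m->t_fin - m->t_init);
    return t + m->offset_init + slope * (t - m->t_init);
}

int main(void)
{
    ClockModel m = { 0.0, 0.004, 100.0, 0.010 };  /* 4 ms offset, slow drift */
    for (double t = 0.0; t <= 100.0; t += 25.0)
        printf("local %7.3f -> corrected %7.3f\n", t, correct(&m, t));
    return 0;
}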
Future Directions
Future enhancements of Scalasca will aim at further improving both its functionality and its scalability. In addition to supporting the more advanced features of OpenMP such as nested parallelism and tasking as an immediate priority, Scalasca is expected to evolve toward emerging programming models and architectures including partitioned global address space (PGAS) languages and heterogeneous systems. Moreover, optimized data management and analysis workflows including in-memory trace analysis will allow Scalasca to master even larger processor configurations than it does today. A recent example in this direction is the substantial reduction of the trace-file creation overhead that was achieved by mapping large numbers of logical process-local trace files onto a small number of
physical files [], a feature that will become available in future releases of Scalasca. In addition to keeping up with the rapid new developments in parallel hardware and software, research is also undertaken to expand the general understanding of parallel performance in simulation codes. The examples below summarize two ongoing projects aimed at increasing the expressive power of the analyses supported by Scalasca. The description reflects the status of March .
Time-Series Call-Path Profiling
As scientific parallel applications simulate the temporal evolution of a system, their progress occurs via discrete points in time. Accordingly, the core of such an application is typically a loop that advances the simulated time step by step. However, the performance behavior may vary between individual iterations, for example, due to periodically reoccurring extra activities or when the state of the computation adjusts to new conditions in so-called adaptive codes. To study such time-dependent behavior, Scalasca can distinguish individual iterations in profiles and event traces. Figure shows the distribution of point-to-point messages across the iteration-process space in the MPI-based PEPC [] particle simulation code.

Scalasca. Fig. Gradual development of a communication imbalance along , timesteps of PEPC on , processors. (a) Minimum (bottom), median (middle), and maximum (top) number of point-to-point messages sent or received by a process in an iteration. (b) Number of messages sent by each process in each iteration. The higher the number of messages, the darker the color on the value map

Obviously, during later stages of the simulation the majority of messages are sent by only a small collection of processes with time-dependent
constituency, inducing wait states on other processes (not shown). However, even generating call-path profiles (as opposed to traces) separately for thousands of iterations to identify the call paths responsible may exceed the available buffer space – especially when the call tree is large and more than one metric is collected. For this reason, a runtime approach for the semantic compression of a series of call-path profiles based on incrementally clustering single-iteration profiles was developed that scales in terms of the number of iterations without sacrificing important performance details []. This method, which will be integrated in future versions of Scalasca, offers low runtime overhead by using only a condensed version of the profile data when calculating distances and accounts for process-dependent variations by making all clustering decisions locally.
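A toy version of such incremental clustering might look as follows: each new per-iteration metric vector is merged into the nearest existing cluster or opens a new one when the distance exceeds a threshold. The distance measure, threshold, and merging rule are placeholders and do not reflect Scalasca's actual algorithm.

#include <stdio.h>
#include <math.h>

#define NMETRICS 3   /* metrics recorded per iteration (e.g., time, msgs, bytes) */
#define MAXCLUST 8

typedef struct {
    double center[NMETRICS];   /* running mean of the member profiles */
    int    members;
} Cluster;

static double distance(const double *a, const double *b)
{
    double d = 0.0;
    for (int m = 0; m < NMETRICS; m++) d += (a[m] - b[m]) * (a[m] - b[m]);
    return sqrt(d);
}

/* Assign one iteration profile to a cluster, creating a new one if needed. */
static int classify(Cluster *c, int *nclust, const double *profile, double threshold)
{
    int best = -1;
    double best_d = threshold;
    for (int k = 0; k < *nclust; k++) {
        double d = distance(c[k].center, profile);
        if (d < best_d) { best_d = d; best = k; }
    }
    if (best < 0 && *nclust < MAXCLUST) {      /* open a new cluster */
        best = (*nclust)++;
        for (int m = 0; m < NMETRICS; m++) c[best].center[m] = profile[m];
        c[best].members = 1;
        return best;
    }
    if (best < 0) best = 0;                    /* table full: fall back to cluster 0 */
    c[best].members++;                         /* merge: update the running mean */
    for (int m = 0; m < NMETRICS; m++)
        c[best].center[m] += (profile[m] - c[best].center[m]) / c[best].members;
    return best;
}

int main(void)
{
    Cluster clusters[MAXCLUST];
    int nclust = 0;
    double iterations[5][NMETRICS] = {
        { 1.0, 10, 100 }, { 1.1, 11, 105 }, { 5.0, 80, 900 },
        { 1.05, 10, 102 }, { 5.2, 82, 910 }
    };
    for (int i = 0; i < 5; i++)
        printf("iteration %d -> cluster %d\n", i,
               classify(clusters, &nclust, iterations[i], 20.0));
    return 0;
}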
Identifying the Root Causes of Wait-State Formation
In general, the temporal or spatial distance between cause and symptom of a performance problem constitutes a major difficulty in deriving helpful conclusions from performance data. Merely knowing the locations of wait states in the program is often insufficient to understand the reason for their occurrence. Building on earlier work by Meira, Jr. et al. [], the replay-based wait-state analysis was extended in such a way that it attributes the waiting times to their root causes [], as exemplified in Fig. . Typically, these root causes are intervals during which a process performs some additional activity not performed by its peers, for example as a result of insufficiently balancing the load. However, excess workload identified as the root cause of wait states usually cannot simply be removed. To achieve a better balance, optimization hypotheses drawn from such an analysis typically propose the redistribution of the excess load to other processes instead. Unfortunately, redistributing workloads in complex message-passing applications can have surprising side effects that may compromise the expected reduction of waiting times. Given that balancing the load statically or even introducing a dynamic load-balancing scheme constitute major code changes, such procedures should ideally be performed only if the prospective performance gain is likely to materialize. Other recent
Scalasca. Fig. In the lorentz_d() subroutine of the Zeus MP/ code, several processes primarily arranged on a small hollow sphere within the virtual process topology are responsible for wait states arranged on the enclosing hollow sphere shown earlier in Fig. . Since the inner region of the topology carries more load than the outer region, processes at the rim of the inner region delay those farther outside. The darkness of the color indicates the amount of waiting time induced by a process but materializing farther away. In this way, it becomes possible to study the root causes of wait-state formation
work [] therefore concentrated on determining the savings we can realistically hope for when redistributing a given delay – before altering the application itself. Since the effects of such changes are hard to quantify analytically, they are simulated in a scalable manner via a parallel real-time replay of event traces after they have been modified to reflect the redistributed load.
Related Entries
Intel Thread Profiler
MPI (Message Passing Interface)
OpenMP
OpenMP Profiling with OmpP
Performance Analysis Tools
Periscope
PMPI Tools
Synchronization
TAU
Vampir
Bibliographic Notes and Further Reading
Scalasca is available for download, including documentation, under the New BSD license at www.scalasca.org. Scalasca emerged from the KOJAK project, which was started in at the Jülich Supercomputing Centre in Germany to study the automatic evaluation of parallel performance data, and in particular, the automatic detection of wait states in event traces of parallel applications. The wait-state analysis first concentrated on MPI [] and later on OpenMP and hybrid codes [], motivating the definition of the POMP profiling interface [] for OpenMP, which is still used today even beyond Scalasca by OpenMP-enabled profilers such as ompP (OpenMP Profiling with OmpP) and TAU. A comprehensive description of the initial trace-analysis toolset resulting from this effort, which was publicly released for the first time in under the name KOJAK, is given in []. During the following years, KOJAK's wait-state search was optimized for speed, refined to exploit virtual process topologies, and extended to support MPI- RMA communication. In addition to the detection of wait states occurring during a single run, KOJAK also introduced a framework for comparing the analysis results of different runs [], for example, to judge the effectiveness of optimization measures. An extensive snapshot of this more advanced version of KOJAK including further literature references is presented in []. However, KOJAK still analyzed the traces sequentially after the process-local trace data had been merged into a single global trace file, an undesired scalability limitation in view of the dramatically rising number of cores employed on modern parallel architectures. In , after the acquisition of a major grant from the Helmholtz Association of German Research Centres, the Scalasca project was started in Jülich as the successor to KOJAK with the objective of improving
the scalability of the trace analysis by parallelizing the search for wait states. A detailed discussion of the parallel replay underlying the parallel search can be found in []. Variations of the scalable replay mechanism were applied to correct event time stamps taken on clusters without a global clock [], to simulate the effects of optimizations such as balancing the load of a function more evenly across the processes of a program [], and to identify wait states in MPI- RMA communication in a scalable manner []. Moreover, the parallel trace analysis was also demonstrated to run on computational grids consisting of multiple geographically dispersed clusters that are used as a single coherent system []. Finally, a very recent replay-based method attributes the costs of wait states in terms of resource waste to their original cause []. Since the enormous data volume sometimes makes trace analysis challenging, runtime summarization capabilities were added to Scalasca both as a simple means to obtain a performance overview and as a basis to optimally configure the measurement for later trace generation. Scalasca integrates both measurement options in a unified tool architecture, whose details are described in []. Recently, a semantic compression algorithm was developed that will allow Scalasca to take time-series profiles in a space-efficient manner even if the target application performs large numbers of timesteps []. Major application studies with Scalasca include a survey of using it on leadership systems [], a comprehensive analysis of how the performance of the SPEC MPI benchmarks evolves as their execution progresses [], and the investigation of a gradually developing communication imbalance in the PEPC particle simulation []. Finally, a recent study of the SweepD benchmark demonstrated performance measurements and analyses with up to , processes []. From until , KOJAK and later Scalasca were jointly developed together with the Innovative Computing Laboratory at the University of Tennessee. During their lifetime, the two projects received funding from the Helmholtz Association of German Research Centres, the US Department of Energy, the US Department of Defense, the US National Science Foundation, the German Science Foundation, the German Federal Ministry of Education and Research, and the European Union. Today, Scalasca is a joint project between Jülich
and the German Research School for Simulation Sciences in nearby Aachen. The following individuals have contributed to Scalasca and its predecessor: Erika Ábrahám, Daniel Becker, Nikhil Bhatia, David Böhme, Jack Dongarra, Dominic Eschweiler, Sebastian Flott, Wolfgang Frings, Karl Fürlinger, Christoph Geile, Markus Geimer, Marc-André Hermanns, Michael Knobloch, David Krings, Guido Kruschwitz, André Kühnal, Björn Kuhlmann, John Linford, Daniel Lorenz, Bernd Mohr, Shirley Moore, Ronal Muresano, Jan Mußler, Andreas Nett, Christian Rössel, Matthias Pfeifer, Peter Philippen, Farzona Pulatova, Divya Sankaranarayanan, Pavel Saviankou, Marc Schlütter, Christian Siebert, Fengguang Song, Alexandre Strube, Zoltán Szebenyi, Felix Voigtländer, Felix Wolf, and Brian Wylie.
Bibliography
Becker D, Rabenseifner R, Wolf F, Linford J () Scalable timestamp synchronization for event traces of message-passing applications. Parallel Comput ():–
Becker D, Wolf F, Frings W, Geimer M, Wylie BJN, Mohr B () Automatic trace-based performance analysis of metacomputing applications. In: Proceedings of the international parallel and distributed processing symposium (IPDPS), Long Beach, CA, USA. IEEE Computer Society, Washington, DC
Böhme D, Geimer M, Wolf F, Arnold L () Identifying the root causes of wait states in large-scale parallel applications. In: Proceedings of the th international conference on parallel processing (ICPP), San Diego, CA, USA. IEEE Computer Society, Washington, DC, pp –
Frings W, Wolf F, Petkov V () Scalable massively parallel I/O to task-local files. In: Proceedings of the ACM/IEEE conference on supercomputing (SC), Portland, OR, USA, Nov
Geimer M, Wolf F, Wylie BJN, Ábrahám E, Becker D, Mohr B () The Scalasca performance toolset architecture. Concurr Comput Pract Exper ():–
Geimer M, Wolf F, Wylie BJN, Mohr B () A scalable tool architecture for diagnosing wait states in massively-parallel applications. Parallel Comput ():–
Gibbon P, Frings W, Dominiczak S, Mohr B () Performance analysis and visualization of the n-body tree code PEPC on massively parallel computers. In: Proceedings of the conference on parallel computing (ParCo), Málaga, Spain, Sept (NIC series), vol . John von Neumann-Institut für Computing, Jülich, pp –
Hayes JC, Norman ML, Fiedler RA, Bordner JO, Li PS, Clark SE, ud-Doula A, Mac Low M-M () Simulating radiating and magnetized flows in multiple dimensions with ZEUS-MP. Astrophys J Suppl ():–
Hermanns M-A, Geimer M, Mohr B, Wolf F () Scalable detection of MPI- remote memory access inefficiency patterns. In: Proceedings of the th European PVM/MPI users' group meeting (EuroPVM/MPI), Espoo, Finland. Lecture notes in computer science, vol . Springer, Berlin, pp –
Hermanns M-A, Geimer M, Wolf F, Wylie BJN () Verifying causality between distant performance phenomena in large-scale MPI applications. In: Proceedings of the th Euromicro international conference on parallel, distributed, and network-based processing (PDP), Weimar, Germany. IEEE Computer Society, Washington, DC, pp –
Jülich Supercomputing Centre and German Research School for Simulation Sciences. Scalasca parallel performance analysis toolset documentation (performance properties). http://www.scalasca.org/download/documentation/
Meira W Jr, LeBlanc TJ, Poulos A () Waiting time analysis and performance visualization in Carnival. In: Proceedings of the SIGMETRICS symposium on parallel and distributed tools (SPDT'), Philadelphia, PA, USA. ACM
Mohr B, Malony A, Shende S, Wolf F () Design and prototype of a performance tool interface for OpenMP. J Supercomput ():–
Song F, Wolf F, Bhatia N, Dongarra J, Moore S () An algebra for cross-experiment performance analysis. In: Proceedings of the international conference on parallel processing (ICPP), Montreal, Canada. IEEE Computer Society, Washington, DC, pp –
Szebenyi Z, Gamblin T, Schulz M, de Supinski BR, Wolf F, Wylie BJN () Reconciling sampling and direct instrumentation for unintrusive call-path profiling of MPI programs. In: Proceedings of the international parallel and distributed processing symposium (IPDPS), Anchorage, AK, USA. IEEE Computer Society, Washington, DC
Szebenyi Z, Wolf F, Wylie BJN () Space-efficient time-series call-path profiling of parallel applications. In: Proceedings of the ACM/IEEE conference on supercomputing (SC), Portland, OR, USA, Nov
Szebenyi Z, Wylie BJN, Wolf F () SCALASCA parallel performance analyses of SPEC MPI applications. In: Proceedings of the st SPEC international performance evaluation workshop (SIPEW), Darmstadt, Germany. Lecture notes in computer science, vol . Springer, Berlin, pp –
Szebenyi Z, Wylie BJN, Wolf F () Scalasca parallel performance analyses of PEPC. In: Proceedings of the workshop on productivity and performance (PROPER) in conjunction with Euro-Par, Las Palmas de Gran Canaria, Spain, August . Lecture notes in computer science, vol . Springer, Berlin, pp –
Wolf F () Automatic performance analysis on parallel computers with SMP nodes. PhD thesis, RWTH Aachen, Forschungszentrum Jülich. ISBN ---
Wolf F, Mohr B () Specifying performance properties of parallel applications using compound events. Parallel Distrib Comput Pract ():–
Wolf F, Mohr B () Automatic performance analysis of hybrid MPI/OpenMP applications. J Syst Archit (–):–
Wolf F, Mohr B, Dongarra J, Moore S () Automatic analysis of inefficiency patterns in parallel applications. Concurr Comput Pract Exper ():–
Wylie BJN, Geimer M, Mohr B, Böhme D, Szebenyi Z, Wolf F () Large-scale performance analysis of SweepD with the Scalasca toolset. Parallel Process Lett ():–
Wylie BJN, Geimer M, Wolf F () Performance measurement and analysis of large-scale parallel applications on leadership computing systems. Sci Program (–):–
Scaled Speedup
Gustafson's Law
Scan for Distributed Memory, Message-Passing Systems
Jesper Larsson Träff
University of Vienna, Vienna, Austria
Synonyms
All prefix sums; Prefix; Parallel prefix sums; Prefix reduction

Definition
Among a group of p consecutively numbered processing elements (nodes), each node has a data item x_i, for i = 0, . . . , p − 1. An associative, binary operator ⊕ over the data items is given. The nodes in parallel compute all p (or p − 1) prefix sums over the input items: node i computes the inclusive prefix sum ⊕_{j=0}^{i} x_j (or, for i > 0, the exclusive prefix sum ⊕_{j=0}^{i−1} x_j).

Discussion
For a general discussion, see the entry on "Reduce and Scan." For distributed memory, message-passing parallel systems, the scan operation plays an important role for load-balancing and data-redistribution operations. In this setting, each node possesses a local data item x_i, typically an n-element vector, and the given associative, binary operator ⊕ is typically an element-wise operation. The required prefix sums could be computed by arranging the input items in a linear sequence and "scanning" through this sequence in order, carrying along the corresponding inclusive (or exclusive) prefix sum. In a distributed memory setting, this intuition would be captured by arranging the nodes in a linear array, but much better algorithms usually exist. The inclusive scan operation is illustrated in Fig. . The scan operation is included as a collective operation in many interfaces and libraries for distributed memory systems, for instance MPI [], where both an inclusive (MPI_Scan) and an exclusive (MPI_Exscan) general scan operation is defined. These operations work on vectors of consecutive or non-consecutive elements and arbitrary (user-defined) associative, possibly also commutative operators.

Algorithms
On distributed memory parallel architectures, parallel algorithms typically use tree-like patterns to compute the prefix sums in a logarithmic number of parallel communication rounds. Four such algorithms are described in the following. Algorithms also exist for hypercube and mesh- or torus-connected communication architectures [, , ]. The p nodes are consecutively numbered from 0 to p − 1. Each node i has an input data item x_i. Nodes must communicate explicitly by send and receive operations. For convenience, a fully connected model in which each node can communicate with all other nodes at the same cost is assumed. Only one communication operation per node can take place at a time. The data items x_i are assumed to all have the same size n = |x_i|. For scalar items, n = 1, while for vector items n is the number of elements. The communication cost of transferring a data item of size n is assumed to be O(n) (linear).
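Before turning to the individual algorithms, the collective interface mentioned in the discussion can be made concrete. The following minimal C sketch (one double per node, summation standing in for ⊕) shows a typical invocation of the inclusive and exclusive MPI scan collectives; the input values are arbitrary and chosen only for illustration.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double x = rank + 1.0;          /* local input item x_i */
    double incl = 0.0, excl = 0.0;

    /* Inclusive scan: node i receives x_0 + ... + x_i. */
    MPI_Scan(&x, &incl, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Exclusive scan: node i (i > 0) receives x_0 + ... + x_{i-1};
       the result on node 0 is left undefined by the MPI standard. */
    MPI_Exscan(&x, &excl, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: inclusive %.1f, exclusive %.1f\n", rank, incl, excl);

    MPI_Finalize();
    return 0;
}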
Linear Array
The simplest scan algorithm organizes the nodes in a linear array. For i > 0, node i waits for a partial sum
Scan for Distributed Memory, Message-Passing Systems. Fig. The inclusive scan operation
⊕_{j=0}^{i−1} x_j from node i − 1, adds its own item x_i to arrive at the partial sum ⊕_{j=0}^{i} x_j, and sends this to node i + 1, unless i = p − 1. This algorithm is strictly serial and takes p − 1 communication rounds for the last node to finish its prefix computation, for a total cost of O(pn). This is uninteresting in itself, but the algorithm can be pipelined (see below) to yield a time to complete of O(p + n). When n is large compared to p, this easily implementable algorithm is often the fastest, due to its simplicity.
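A minimal C rendering of this linear-array algorithm with MPI point-to-point operations (scalar items, addition standing in for ⊕; the pipelined, blocked variant is not shown):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int i, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &i);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double x = i + 1.0;   /* local item x_i     */
    double sum = x;       /* running prefix sum */

    /* Node i > 0 waits for the prefix sum of nodes 0..i-1 ... */
    if (i > 0) {
        double partial;
        MPI_Recv(&partial, 1, MPI_DOUBLE, i - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum = partial + x;
    }
    /* ... and forwards its own prefix sum, unless it is the last node. */
    if (i < p - 1)
        MPI_Send(&sum, 1, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD);

    printf("node %d: inclusive prefix sum %.1f\n", i, sum);

    MPI_Finalize();
    return 0;
}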
Binary Tree
The binary tree algorithm arranges the nodes in a balanced binary tree T with in-order numbering. This numbering has the property that the nodes in the subtree T(i) rooted at node i have consecutive numbers in the interval [s, . . . , i, . . . , e], where s and e denote the first (start) and last (end) node in the subtree T(i), respectively. The algorithm consists of two phases. It suffices to describe the actions of a node in the tree with parent, right and left children. In the up-phase, node i first receives the partial result ⊕_{j=s}^{i−1} x_j from its left child and adds its own item x_i to get the partial result ⊕_{j=s}^{i} x_j. This value is stored for the down-phase. Node i then receives the partial result ⊕_{j=i+1}^{e} x_j from its right child and computes the partial result ⊕_{j=s}^{e} x_j. Node i sends this value upward in the tree without keeping it. In the down-phase, node i receives the partial result ⊕_{j=0}^{s−1} x_j from its parent. This is first sent down to the left child and then added to the stored partial result ⊕_{j=s}^{i} x_j to form the final result ⊕_{j=0}^{i} x_j for node i. This final result is sent down to the right child. With the obvious modifications, the general description also covers nodes that need not participate in all of these communications: Leaves have no children. Some nodes may only have a leftmost child. Nodes on the path between root and leftmost leaf do not receive data from their parent in the down-phase. Nodes on the path between rightmost child and root do not send data to their parent in the up-phase. Since the depth of T is logarithmic in the number of nodes, and data of size n are sent up and down the tree, the time to complete the scan with the binary tree algorithm is O(n log p). This is unattractive for large n. If the nodes can only either send or receive in a communication operation but otherwise work simultaneously, the number of communication operations per node is at most , and the number of communication rounds is ⌈log p⌉.
Binomial Tree
The binomial tree algorithm likewise consists of an up-phase and a down-phase, each of which takes k rounds, where k = ⌊log p⌋. In round j, for j = 0, . . . , k − 1, of the up-phase, each node i satisfying i ∧ (2^(j+1) − 1) ≡ 2^(j+1) − 1 (where ∧ denotes "bitwise and") receives a partial result from node i − 2^j (provided 0 ≤ i − 2^j). After sending to node i, node i − 2^j is inactive for the remainder of the up-phase. The receiving nodes add the partial results, and after round j have a partial result of the form ⊕_{ℓ=i−2^(j+1)+1}^{i} x_ℓ. The down-phase counts rounds downward from k to 1. Node i with i ∧ (2^j − 1) ≡ 2^j − 1 sends its partial result to node i + 2^(j−1) (provided i + 2^(j−1) < p), which can now compute its final result ⊕_{ℓ=0}^{i+2^(j−1)} x_ℓ. The communication pattern is illustrated in Fig. . The number of communication rounds is ⌊log p⌋, and the time to complete is O(n log p). Each node is either sending or receiving data in each round, with no possibility for overlapping of sending and receiving due to the computation of partial results.
Simultaneous Trees
The number of communication rounds can be improved by a factor of two by the simultaneous binomial tree algorithm. Starting from round k = 0, in round k node i sends its computed partial result ⊕_{j=i−2^k+1}^{i} x_j to node i + 2^k (provided i + 2^k < p) and receives a likewise computed partial result from node i − 2^k (provided i − 2^k ≥ 0). Before the next round, the previous and received partial results are added. It is easy to see that after round k, node i's partial result is ⊕_{j=max(0, i−2^(k+1)+1)}^{i} x_j. Node i terminates when both i − 2^k < 0 (nothing to receive) and i + 2^k ≥ p (nothing to send). Provided that the nodes can simultaneously send and receive an item in the same communication operation, this happens after ⌈log p⌉ communication rounds, for a total time of O(n log p). This is again not attractive for large n. This algorithm dates back at least to [] and is illustrated in Fig. .

Scan for Distributed Memory, Message-Passing Systems. Fig. The communication pattern for the binomial tree algorithm for p =

Scan for Distributed Memory, Message-Passing Systems. Fig. The communication patterns for the simultaneous binomial tree algorithm for p =

For large n the running time of O(n log p) is not attractive due to the logarithmic factor. For the binary tree algorithm pipelining can be employed, by which the running time can be improved to O(n + log p). Neither the binomial nor the simultaneous binomial tree algorithms admit pipelining. A pipelined implementation breaks each data item into a certain number N of blocks, each of roughly equal size n/N. As soon as a block has been processed and sent down the pipeline, the next block can be processed. The time of such an implementation is proportional to the depth of the pipeline plus the number of blocks times the time to process one block. From this an optimal block size can be determined, which gives the running time claimed for the binary tree algorithm. For very large vectors relative to the number of nodes, n ≫ p, a linear pipeline seems to be the best practical choice due to very small constants. Pipelining can only be applied if the binary operator can work independently on the blocks. This is trivially the case if the items are vectors and the operator works element-wise on these.
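A compact C rendering of the simultaneous binomial tree algorithm (scalar items, addition standing in for ⊕; MPI_Sendrecv is used because a node may send and receive in the same round, and MPI_PROC_NULL turns out-of-range partners into no-ops):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int i, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &i);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double x = i + 1.0;   /* local item x_i                               */
    double sum = x;       /* partial result, grows towards the prefix sum */

    for (int dist = 1; dist < p; dist *= 2) {   /* dist = 2^k in round k */
        double send = sum, recv = 0.0;
        int to   = (i + dist <  p) ? i + dist : MPI_PROC_NULL;
        int from = (i - dist >= 0) ? i - dist : MPI_PROC_NULL;

        /* Send the current partial result "up" and receive one from
           "below" in the same round.                                  */
        MPI_Sendrecv(&send, 1, MPI_DOUBLE, to,   0,
                     &recv, 1, MPI_DOUBLE, from, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (from != MPI_PROC_NULL)
            sum = recv + sum;   /* received part covers smaller indices */
    }

    printf("node %d: inclusive prefix sum %.1f\n", i, sum);
    MPI_Finalize();
    return 0;
}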
Related Entries
Collective Communication
Message Passing Interface (MPI)
Reduce and Scan

Bibliographic Notes and Further Reading
The parallel prefix problem was early recognized as a fundamental parallel processing primitive, and its history goes back at least as far as the early s. The simultaneous binomial tree algorithm is from [], but may have been discovered earlier as well. The binary tree algorithm easily admits pipelining but has the drawback that leaf nodes only use either their send or their receive capability, and that simultaneous send-receive communication can be only partially exploited. In [] it was shown how to overcome these limitations, yielding theoretically almost optimal scan (as well as reduce and broadcast) algorithms with (two) binary trees. Algorithms for distributed systems in the LogP performance model were presented in []. An algorithm for multiport message-passing systems was given in []. For results on meshes and hypercubes see, for instance [, , ].

Bibliography
Akl SG () Parallel computation: models and methods. Prentice-Hall, Upper Saddle River
Daniel Hillis W, Steele GL Jr () Data parallel algorithms. Commun ACM ():–
Lin Y-C, Yeh C-S () Efficient parallel prefix algorithms on multiport message-passing systems. Inform Process Lett :–
Mayr EW, Plaxton CG () Pipelined parallel prefix computations, and sorting on a pipelined hypercube. J Parallel Distrib Comput :–
MPI Forum () MPI: A message-passing interface standard. Version .. www.mpiforum.org. Accessed Sept
Sanders P, Speck J, Träff JL () Two-tree algorithms for full bandwidth broadcast, reduction and scan. Parallel Comput :–
Santos EE () Optimal and efficient algorithms for summing and prefix summing on parallel machines. J Parallel Distrib Comput ():–
Ziavras SG, Mukherjee A () Data broadcasting and reduction, prefix computation, and sorting on reduced hypercube parallel computer. Parallel Comput ():–
Scan, Reduce and
Reduce and Scan
Scatter
Collective Communication
Scheduling
Job Scheduling
Task Graph Scheduling
Scheduling Algorithms
Scheduling Algorithms
Patrice Quinton
ENS Cachan Bretagne, Bruz, France
Synonyms
Execution ordering
Definition
Scheduling algorithms aim at defining when operations of a program are to be executed. Such an ordering, called a schedule, has to make sure that the dependences between the operations are met. Scheduling for parallelism consists in looking for a schedule that allows a program to be efficiently executed on a parallel architecture. This efficiency may be evaluated in terms of total execution time, utilization of the processors, power consumption, or any combination of such criteria. Scheduling is often combined with mapping, which consists in assigning an operation to a resource of an architecture.
Discussion
Introduction
Scheduling is an essential design step to execute an algorithm on a parallel architecture. Usually, the algorithm is specified by means of a program from which dependences between statements or operations (also called tasks) can be isolated: an operation A depends
on another operation B if the evaluation of B must precede that of A for the result of the algorithm to be correct. Assuming the dependences of an algorithm are already known, scheduling has to find a partial order, called a schedule, which says when each statement can be executed. A schedule may be resource-constrained, if the ordering must be such that some resource – for example, processor unit, memory, etc. – is available for the execution of the algorithm. Scheduling algorithms are therefore classified according to the precision of the dependence analysis, the kind of resources needed, the optimization criteria considered, the specification of the algorithm, the granularity of the tasks, and the kind of parallel architecture that is considered for execution. Scheduling can also happen at run-time – dynamic scheduling – or at compile-time – static scheduling. In the following, basic notions related to task graphs are first recalled, then scheduling algorithms for loops are presented. The presentation separates mono-dimensional and higher-dimensional loops. Task graph scheduling, a topic that is covered in another entry of this encyclopedia, is not treated here.
Task Graphs and Scheduling
Task graphs and their scheduling are at the basis of many parallel computations and have been the subject of much research. Task graphs are a means of representing dependences between computations of a program: vertices V of the graph are elementary computations, called tasks, and oriented edges E represent dependences between computations. A task graph is therefore an acyclic, directed multigraph. Assume that each task T has an estimated integer value d(T) representing its execution duration. A schedule for a task graph is a function σ from the vertex set V of the graph to the set of positive integers, such that σ(u) + d(u) ≤ σ(v) whenever v depends on u, where d(u) represents the duration associated with the evaluation of task u. Since it is acyclic, a task graph always admits a schedule. Scheduling a task graph may be constrained by the number of processors p available to run the tasks. It is assumed that each task T, when ready to be evaluated, is allocated to a free processor, and occupies this processor
during its duration d(T), after which the processor is released and becomes idle. When p is unlimited, we say that the schedule is resource free; otherwise, it is resource constrained. The total computation time for executing a task graph on p processors is also called its makespan. If the number of processors is unlimited, that is, p = +∞, it can be shown that finding the makespan amounts to a simple traversal of the task graph. When the number of processors is bounded, that is, p < +∞, it can be shown that the problem becomes NP-complete. It is therefore required to rely on heuristics. In this case, a simple greedy heuristic is efficient: try to schedule at a given time as many tasks as possible on processors that are available at this time. A list schedule is a schedule such that no processor is deliberately kept idle; at each time step t, if a processor P is available, and if some task T is free, that is to say, all its predecessors have been executed, then T is scheduled to be executed on processor P. It can be shown that a list schedule allows a solution within % of the optimal makespan to be found. List scheduling plays a very important rôle in scheduling theory and is therefore worth mentioning. A generic list scheduling algorithm looks as follows (a code sketch is given after the list):
1. Initialization
   a. Assign to all free tasks a priority level (the choice of the priority function depends on the problem at hand).
   b. Set the time step t to 0.
2. While there remain tasks to execute:
   a. If the execution of a task terminates at time t, suppress this task from the predecessor list of all its successors. Add those tasks whose predecessor list has become empty to the free tasks, and assign them a priority.
   b. If there are q available processors and r free tasks, execute the min(q, r) tasks of highest priority.
   c. Increase the time step t by one.
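The following compact C sketch implements the generic list-scheduling loop above. The task graph is given as an edge list with integer durations; all names and values are made up for illustration, and the priority used here is simply the task duration (critical-path length would be a more common choice).

#include <stdio.h>

#define NTASKS 6
#define NPROCS 2

/* Small illustrative DAG: edge (i, j) means task j depends on task i. */
static const int dur[NTASKS] = { 2, 3, 1, 4, 2, 3 };
static const int edges[][2]  = { {0,2}, {0,3}, {1,3}, {2,4}, {3,5}, {4,5} };
static const int nedges = sizeof edges / sizeof edges[0];

int main(void)
{
    int npred[NTASKS] = {0};            /* unfinished predecessors per task        */
    int start[NTASKS], done = 0;        /* start times; number of finished tasks   */
    int ptask[NPROCS], pfree[NPROCS];   /* task running on proc; time it frees     */

    for (int e = 0; e < nedges; e++) npred[edges[e][1]]++;
    for (int i = 0; i < NTASKS; i++) start[i] = -1;
    for (int q = 0; q < NPROCS; q++) { ptask[q] = -1; pfree[q] = 0; }

    for (int t = 0; done < NTASKS; t++) {
        /* A task terminating at time t releases its processor and its successors. */
        for (int q = 0; q < NPROCS; q++)
            if (ptask[q] >= 0 && pfree[q] == t) {
                for (int e = 0; e < nedges; e++)
                    if (edges[e][0] == ptask[q]) npred[edges[e][1]]--;
                ptask[q] = -1;
                done++;
            }
        /* Greedily fill every idle processor with a free task of highest priority. */
        for (int q = 0; q < NPROCS; q++) {
            if (ptask[q] >= 0) continue;
            int best = -1;
            for (int i = 0; i < NTASKS; i++)
                if (start[i] < 0 && npred[i] == 0 &&
                    (best < 0 || dur[i] > dur[best]))
                    best = i;
            if (best >= 0) {
                start[best] = t;
                ptask[q] = best;
                pfree[q] = t + dur[best];
                printf("t=%d: task %d starts on processor %d\n", t, best, q);
            }
        }
    }
    return 0;
}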
Scheduling One-Dimensional Loops
A particular field of interest for scheduling is how to efficiently execute a set of computations that is re-executed a large number of times, also called cyclic scheduling.
Such a problem is encountered when scheduling a one-dimensional loop. Consider, for example, the following loop:

do k = , step , to N −
  A : a(k) = c(k − )
  B : b(k) = a(k − ) × b(k − )
  C : c(k) = b(k) +
enddo

This loop could be modeled as a task graph where tasks would be A(k), B(k), and C(k), k = , ..., N − . But if N is large, this graph becomes difficult to handle, hence the idea of finding a generic way to schedule the operations by following the structure of the loop. Scheduling such a loop consists in finding a time instant for each instruction to be executed, while respecting the dependences between these instructions. Call operation the execution of an instruction for a given value of index k, and denote A(k), B(k), and C(k) these operations. Notice that A(k) depends on the value c(k − ), which is computed by operation C(k − ) during the previous iteration of the loop. Similarly, B(k) depends on B(k − ) and A(k − ); finally, C(k) depends on B(k), an operation of the same loop iteration. To model this situation, the notion of reduced dependence graph, which gathers all the information needed to schedule the loop, is introduced: its vertices are operation names (e.g., A, B, C), and its edges represent dependences between operations. To each vertex v, associate an integer value d(v) that represents its execution duration, and to each edge, associate an integer weight w that is 0 if the dependence lies within an iteration, and the difference between the iteration numbers otherwise. Fig. represents the reduced dependence graph of the above loop. With this representation, a schedule becomes a function σ from V × N to N, where N denotes the set of integers; σ(v, k) gives the time at which instruction v of iteration k can be executed. Of course, a schedule must meet the dependence constraints, which can be expressed in the following way: for every edge e = (u, v) of the dependence graph, it is required that σ(v, k + w(e)) ≥ σ(u, k) + d(u).
Scheduling Algorithms. Fig. Reduced dependence graph of the loop
Scheduling Algorithms. Fig. Reduced dependence graph without nonzero edges. This graph is necessarily acyclic, and a simple traversal provides a schedule
For this kind of problem, the performance of a schedule is defined as the average cycle time λ, defined as

λ = lim_{N→∞} max{σ(v, k) + d(v) | v ∈ V, 0 ≤ k < N} / N.

In order to define a generic schedule for such a loop, it is interesting to consider schedules of the form σ(v, k) = λk + c(v), where λ is a positive integer and c(v) is a constant integer associated with instruction v. The constant λ is called the initiation interval of the loop, and it represents the number of cycles between the beginnings of two successive iterations. Finding a schedule is quite easy: by restricting the reduced dependence graph to the intra-iteration dependences (i.e., dependences whose value w is 0), one obtains an acyclic graph that can be scheduled using a list scheduling algorithm. Let then λ be the makespan of such a schedule, and let ls(v) be the time assigned to vertex v by the list scheduling; then a schedule of the form σ(v, k) = λk + ls(v) is a solution to the problem. Fig. shows, for example, the task graph obtained when removing the nonzero inter-iteration dependences. One can see that the makespan of this graph is , and therefore a schedule of the form σ(v, k) = k + ls(v) is a solution. Figure shows a possible execution, on two processors, with an initiation interval of . However, one can do better in general, as shall be seen. Again, the cyclic scheduling problem can be stated with or without resources. Assuming, for the sake of simplicity, that instructions are carried out by a set of p identical processing units, the number of available processors constitutes a resource limit. First, one can give a lower bound for the initiation interval. Let C be a cycle of the reduced dependence graph, and let d(C) be its duration (the sum of the durations of its instructions) and w(C) be its iteration weight, i.e., the sum of the iteration weights of its edges. Then, λ ≥ d(C)/w(C). The intuition behind this bound is that the durations of the tasks of the cycle have to be spread over the number of iterations spanned by this cycle, and as cycles have all the same duration λ, one cannot do better than this ratio. Applying this bound to this example, it can be seen that λ ≥ (cycle A → B → C → A) and λ ≥ (self dependence of B). Can the value λ = be reached? The answer depends again on the resource constraints. When an unlimited number of processors is available, one can show that the lower bound can be reached using a polynomial-time algorithm (essentially, the Bellman-Ford longest-path algorithm). For this example, this would lead to the schedule σ(B, k) = k, σ(C, k) = k +, and σ(A, k) = k +, which meets the constraints and allows the execution of the loop on processors (see Fig. ). Solving the problem for limited resources is NP-complete. Another lower bound is related to the number of processors available. If p processors are available, then pλ ≥ ∑_{v∈V} d(v). Intuitively, the total execution time available during one iteration on all processors (pλ) cannot be less than the total execution time needed to evaluate all tasks of an iteration. Several heuristics have been proposed in the literature to approach this bound: loop compaction, loop shifting, modulo scheduling, etc. All these methods rely upon list scheduling, as presented in the Sect. "Task Graphs and Scheduling". Notice, finally, that there exist guaranteed heuristics for solving this problem. One such heuristic combines loop compaction, loop shifting, and retiming to obtain such a result.
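The role of the longest-path computation can be made concrete. With a schedule of the form σ(v, k) = λk + c(v), every dependence edge e = (u, v) translates into the constraint c(v) − c(u) ≥ d(u) − λ·w(e); a Bellman-Ford-style relaxation then either finds offsets c(v) or detects a positive cycle, meaning that the candidate λ is too small. The C sketch below decides feasibility of a given initiation interval on a small reduced dependence graph whose durations and weights are made-up values used purely for illustration.

#include <stdio.h>
#include <string.h>

#define NV 3   /* instructions A, B, C              */
#define NE 4   /* edges of the reduced graph        */

/* Illustrative graph (made-up durations and iteration weights). */
static const int d[NV]   = { 2, 3, 4 };            /* d(A), d(B), d(C)        */
static const int src[NE] = { 2, 0, 1, 1 };         /* C->A, A->B, B->B, B->C  */
static const int dst[NE] = { 0, 1, 1, 2 };
static const int w[NE]   = { 1, 2, 1, 0 };

/* Return 1 and fill c[] if sigma(v, k) = lambda*k + c[v] is feasible. */
static int feasible(int lambda, int c[NV])
{
    memset(c, 0, NV * sizeof(int));
    for (int round = 0; round < NV; round++)       /* relaxation passes */
        for (int e = 0; e < NE; e++) {
            int need = c[src[e]] + d[src[e]] - lambda * w[e];
            if (c[dst[e]] < need) c[dst[e]] = need;
        }
    for (int e = 0; e < NE; e++)                   /* still relaxable => positive cycle */
        if (c[dst[e]] < c[src[e]] + d[src[e]] - lambda * w[e]) return 0;
    return 1;
}

int main(void)
{
    int c[NV];
    for (int lambda = 1; lambda <= 10; lambda++)
        if (feasible(lambda, c)) {
            printf("smallest feasible initiation interval: %d\n", lambda);
            printf("offsets: c(A)=%d c(B)=%d c(C)=%d\n", c[0], c[1], c[2]);
            break;
        }
    return 0;
}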
Scheduling Multidimensional Loops
Scheduling unidimensional loops is based on the idea of avoiding unrolling the loop and therefore, of keeping
Scheduling Algorithms. Fig. Execution of the loop with an initiation interval of . The corresponding schedule is σ(A, k) = k + , σ(B, k) = k, and σ(C, k) = k +
Scheduling Algorithms. Fig. Execution of the loop with an optimal initiation interval of . The corresponding schedule is σ(A, k) = k + , σ(B, k) = k, and σ(C, k) = k +
the model compact and tractable, independently of the loop bound N. To this end, the schedule is chosen to be a linear expression of the loop index. The same idea can be pushed further: instead of unrolling a multiple-index loop, one tries to express the schedule of an operation as a linear (or affine) function of the loop indexes. This technique can be applied to imperative loops, but it interacts deeply with the detection of dependences in the loop, making the problem more difficult. Another way of looking at this problem is to consider algorithms expressed as recurrence equations, where parallelism is directly exposed. Consider the following recurrences for the matrix multiplication algorithm C = AB, where A, B, and C are square matrices of size N:

1 ≤ i, j, k ≤ N → C(i, j, k) = C(i, j, k − 1) + a(i, k) × b(k, j)   ()
1 ≤ i, j ≤ N → c(i, j) = C(i, j, N)   ()
Each one of these equations is a single assignment, and it expresses a recurrence whose indexes belong to the set defined by the left-hand inequality. The first recurrence () defines how C is computed, where k represents the iteration index; the second recurrence () says that the value c(i, j) is obtained as the result of the evaluation of C(i, j, k) for k = N. An easy way to schedule these computations for parallelism is to find a schedule t that is an affine function of all the indexes of these calculations. Assume that C has a schedule of the form t(i, j, k) = λ_1 i + λ_2 j + λ_3 k + α; assuming also that the evaluation of an equation takes at least one cycle, one must have, from Eq. (): t(i, j, k) > t(i, j, k − 1), ∀i, j, k s.t. 1 ≤ i, j, k ≤ N, or, once simplified, λ_3 > 0. This defines a set of possible values for t, with simplest form t(i, j, k) = k (coefficient α can be chosen arbitrarily, equal to 0). Once a schedule is found, it is possible to allocate the computations of a recurrence equation on a parallel architecture. A simple way of doing it is to use a linear allocation function: for the matrix multiplication example, one could take i and j as the coordinates of the processor, which would result in a parallel matrix multiplication executed on a square grid-connected array of size N. But many other allocations are possible. This simple example contains all the ingredients of loop parallelization techniques, as will be seen now. First, notice that this kind of recurrence is particular: it is said to be uniform, as left-hand side operations depend uniformly on right-hand side expressions through a dependence of the form (0, 0, 1) (such a dependence is a simple translation of the left-hand side index). In general, one can consider affine recurrences
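Under the schedule t(i, j, k) = k, the k index plays the role of time and all (i, j) instances of a given step are independent. A minimal C sketch of this execution order, using an OpenMP pragma to express the parallelism over i and j (array sizes and input values are arbitrary, chosen only for illustration):

#include <stdio.h>

#define N 256

static double a[N][N], b[N][N], C[N][N];

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (i == j);   /* identity matrix, so C should equal b */
            b[i][j] = i + j;
            C[i][j] = 0.0;        /* C(i, j, 0) = 0                       */
        }

    /* Schedule t(i, j, k) = k: k is the sequential time step; for a
       fixed k, all (i, j) instances are independent and run in parallel
       (conceptually mapped onto an N x N grid of processors).           */
    for (int k = 0; k < N; k++) {
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] += a[i][k] * b[k][j];  /* C(i,j,k) = C(i,j,k-1) + a(i,k)*b(k,j) */
    }

    printf("C[1][2] = %f\n", C[1][2]);   /* expected: b[1][2] = 3 */
    return 0;
}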
where any right-hand side index is an affine function of the left-hand side index. Second, the set where the indexes must lie is an integral convex polyhedron, whose bounds may be parameterized by some symbolic values (here N, for example). This is not surprising: most of the loops that are of interest for parallelism share the property that array indexes are almost always affine, and that the index space of a loop can be modeled with a polyhedron. Scheduling affine recurrences in general can also be done by extending the method. Assume that a recurrence has the form ∀z ∈ P, V(z) = f(U(a(z)), ...), where U and V are variables of a program, z is an n-dimensional integer vector, P is a polyhedron, and a(z) is an affine mapping of z. The difference with the previous case is that the number of dependences may now be infinite: depending on the function a and the size of P, one may have a large, potentially infinite, number of dependences between instances of V(z) and U(a(z)), for which the schedule t must meet t(V(z)) > t(U(a(z))). However, properties of polyhedra may be used. A given polyhedron P may be represented equivalently by a set of inequalities, or by its set of generators, that is to say, its vertices, rays, and lines. Generators are in finite number, and it turns out that it is necessary and sufficient to check the property of t on vertices and rays, and this leads to a finite set of linear inequalities involving the coefficients of the function t. This is the basis of the so-called vertex method for scheduling recurrences. Another, equivalent method, called the Farkas method, involves directly the inequalities instead of the generator system of the polyhedron. In general, such methods lead to high-complexity algorithms, but in practice they are tractable: a typical simple algorithm leads to a few hundred linear constraints. Without going into the details, these methods can be extended to cover interesting problems. First of all, it is not necessary to assume that all variables of the recurrences have the same schedule. One may instead suppose that variable V is scheduled with a function t_V(i, j, ...) = λ_{V,1} i + λ_{V,2} j + ... + α_V without making the problem more difficult. Secondly, one may also suppose that the schedule has several dimensions: if operation
V(z) is scheduled at a time vector (t_{V,1}(z), t_{V,2}(z), . . .), then time is understood with respect to the lexicographic ordering. This allows modeling situations where more than one dimension of the index space is evaluated sequentially on a single processor, thus allowing architectures of various dimensions to be found. Very little research has yet explored scheduling loops under resource constraints, as was the case for cyclic scheduling. A final remark: cyclic scheduling is a particular case of multidimensional scheduling, where the function has the form σ(v, k) = λk + α_v. Indeed, the constant α_v describes the relative time of execution of operation v(k) during an iteration. In a similar way, for higher-level recurrences, α_V plays the same rôle and allows one to refine the pipeline structure of one processor.
From Loops to Recurrences
The previous section (Sect. Scheduling Multidimensional Loops) has shown how a set of recurrence equations can be scheduled using linear programming techniques. Here it is sketched how loops expressed in some imperative language can be translated into a single-assignment form that is equivalent to recurrence equations. Consider the following example:

for i := 0 to 2n do
  {S1}: c[i] := 0.;
for i := 0 to n do
  for j := 0 to n do
    {S2}: c[i+j] := c[i+j] + a[i]*b[j];

which computes the product of two polynomials of size n + 1, where the coefficients of the input polynomials are in arrays a and b and the result in c. This program contains two loops, each one with one statement, labeled S1 and S2, respectively. Each statement represents as many elementary operations as there are values of the loop indexes surrounding it. Statement S1 corresponds to 2n + 1 operations, denoted by (S1, (i)), and statement S2 corresponds to (n + 1)² operations, denoted (S2, (i, j)). The left-hand side element of statement S2, c[i+j], writes several times into the same memory cell, and its right-hand side makes use of the content of the same
memory cell. In order to rewrite these loops as recurrence equations, one needs to carry out a data flow analysis to find out which statement made the last modification to c[i+j]. Since both statements modify c, it may be either S2 itself or S1. Consider first the case when S2 is the source of the operation that modifies c[i+j] for operation (S2, (i, j)). Then this operation is (S2, (i′, j′)) where i′ + j′ = i + j, and the indexes i′ and j′ must satisfy the limit constraints of the loops, that is, 0 ≤ i′ ≤ n and 0 ≤ j′ ≤ n. There is another condition, also called the sequential predicate: (S2, (i′, j′)) must be the operation that was the last to modify cell c[i+j] before (S2, (i, j)) uses it. But (i′, j′) must be smaller than (i, j) in lexicographic order, and thus (i′, j′) is the lexicographic maximum of the candidate indexes smaller than (i, j). All these conditions being affine inequalities, it turns out that finding the source index can be done using parameterized linear programming, resulting in a so-called quasi-affine selection tree (quast) whose conditions on indices are also linear. One would thus find out that (i′, j′) = if i > 0 ∧ j < n then (i − 1, j + 1) else ⊥, where ⊥ represents the situation when the value was defined by a statement that precedes the loops. Now if S1 is the source of the operation, this operation is (S1, i′) where i′ = i + j, and index i′ must lie inside the bounds, thus 0 ≤ i′ ≤ 2n. Moreover, this operation must be executed before, which is obviously always true. Thus, it can be seen that (i′) = (i + j). Combining both expressions, one finds that the source of (i, j) is given by if i > 0 ∧ j < n then (S2, i − 1, j + 1) else (S1, i + j). Once the source of each operation is found, rewriting the program as a system of recurrence equations is easy. For each statement S, a variable S(i, j, ...) with as many indexes as surrounding loops is introduced, and the right-hand side of its equation is rewritten by replacing expressions by their source translation. An extra equation can be provided to express the value of the output variables of the program, here c for example.

0 ≤ i ≤ 2n → S1(i) = 0.   ()
0 ≤ i ≤ n ∧ 0 ≤ j ≤ n → S2(i, j) = if i > 0 ∧ j < n then S2(i − 1, j + 1) + a(i) ∗ b(j) else S1(i + j) + a(i) ∗ b(j)   ()
0 ≤ i ≤ 2n → c(i) = if i ≤ n then S2(i, 0) else S2(n, i − n)   ()
The exact data flow analysis is a means of expressing the implicit dependences that result from the encoding of the algorithm as loops. Although other methods exist to schedule loops without single assignment, this method is deeply related to scheduling techniques.
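For illustration, the single-assignment system above can be transcribed directly into C: the arrays S1 and S2 hold one value per operation, and the copy-out of c follows the last write to each cell as reconstructed above. The degree and the input values below are arbitrary, and the code is a sketch, not part of the original formulation.

#include <stdio.h>

#define N 3   /* polynomial degree n (small, arbitrary value) */

int main(void)
{
    double a[N + 1], b[N + 1];
    double S1[2 * N + 1];        /* one instance per operation (S1, i)      */
    double S2[N + 1][N + 1];     /* one instance per operation (S2, (i, j)) */
    double c[2 * N + 1];

    for (int i = 0; i <= N; i++) { a[i] = i + 1; b[i] = 1; }   /* arbitrary inputs */

    /* 0 <= i <= 2n : S1(i) = 0 */
    for (int i = 0; i <= 2 * N; i++)
        S1[i] = 0.0;

    /* 0 <= i, j <= n :
       S2(i, j) = (i > 0 && j < n ? S2(i-1, j+1) : S1(i+j)) + a(i)*b(j) */
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            S2[i][j] = (i > 0 && j < N ? S2[i - 1][j + 1] : S1[i + j]) + a[i] * b[j];

    /* 0 <= i <= 2n : copy-out of the last value written to c[i] */
    for (int i = 0; i <= 2 * N; i++)
        c[i] = (i <= N) ? S2[i][0] : S2[N][i - N];

    for (int i = 0; i <= 2 * N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}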
Scheduling Today and Future Directions
Loop scheduling is a central process of autoparallelization techniques, either implicitly, when loops are rewritten after dependence analysis and then compiled, or explicitly, when the scheduling function is calculated and explicitly used during the parallelization process. Compilers more and more often incorporate sophisticated dependence-analysis tools to produce efficient code for target parallel architectures. Mono-dimensional loop scheduling is used both to massage loops for instruction-level parallelism and to design special-purpose hardware for embedded calculations. Finally, multi-dimensional scheduling is included in some autoparallelizers to produce code for parallel processors. The need to produce efficient compilers is a limitation to the use of complex scheduling methods, as it has been shown that most practical problems are NP-complete. Progress in the knowledge of scheduling algorithms has made it more and more likely for methods previously rejected because of their complexity to become common ingredients of parallelization tools. Moreover, the design of parallel programs tolerates a slightly longer production time than compiling everyday programs does, in particular because many applications of parallelism target embedded systems. Thus, the spread of parallel computers – for example, GPUs or multi-cores – will certainly make these techniques useful.
Bibliographic Notes and Further Reading
Darte, Robert, and Vivien [] wrote a reference book on scheduling for parallelism; it provides an in-depth coverage of the techniques presented here, in particular task scheduling and cyclic scheduling. It also provides a survey and comparison of various methods for dependence analysis. This entry borrows a large part of its material from this book.
El-Rewini, Lewis, and Ali give in [] a survey on scheduling in general and task scheduling in particular. In [], Benoit and Robert give complexity results for scheduling problems that are related to recent parallel platforms. Lam [] presents software pipelining, one of the most famous applications of cyclic scheduling to loops. Theoretical results on cyclic scheduling can be found in []. Karp, Miller, and Winograd [] were among the first authors to consider scheduling for parallel execution, in a seminal paper on recurrence equations. Lamport [] published one of the most famous papers on loop parallelization. Loop dependence analysis and loop transformations for parallelization are extensively presented by Zima and Chapman in []. Mauras, Quinton, Rajopadhye, and Saouter presented in [] the vertex method to schedule affine recurrences, whereas Feautrier ([] and []) describes the Farkas method for scheduling affine loop nests. Data flow analysis is presented in great detail by Feautrier in [].
Bibliography
Benoit A, Robert Y (Oct ) Complexity results for throughput and latency optimization of replicated and data-parallel workflows. Algorithmica ():–
Darte A, Robert Y, Vivien F () Scheduling and automatic parallelization. Birkhäuser, Boston
El-Rewini H, Lewis TG, Ali HH () Task scheduling in parallel and distributed systems. Prentice Hall, Englewood Cliffs, New Jersey
Feautrier P (February ) Dataflow analysis of array and scalar references. Int J Parallel Program ():–
Feautrier P (Oct ) Some efficient solutions to the affine scheduling problem, Part I: one-dimensional time. Int J Parallel Program ():–
Feautrier P (December ) Some efficient solutions to the affine scheduling problem, Part II: multidimensional time. Int J Parallel Program ():–
Hanen C, Munier A () Cyclic scheduling on parallel processors: an overview. In: Chrétienne P, Coffman EG Jr, Lenstra K, Liu Z (eds) Scheduling theory and its applications. John Wiley & Sons
Karp RM, Miller RE, Winograd S (July ) The organization of computations for uniform recurrence equations. J Assoc Comput Mach ():–
Lam MS () Software pipelining: an effective scheduling technique for VLIW machines. In: SIGPLAN' conference on programming language design and implementation, Atlanta, GA. ACM Press, pp –
Lamport L (Feb ) The parallel execution of DO loops. Commun ACM ():–
Mauras C, Quinton P, Rajopadhye S, Saouter Y (September ) Scheduling affine parameterized recurrences by means of variable dependent timing functions. In: Kung SY, Schwartzlander EE, Fortes JAB, Przytula KW (eds) Application Specific Array Processors, Princeton University, IEEE Computer Society Press, pp –
Zima H, Chapman B () Supercompilers for parallel and vector computers. ACM Press, New York

SCI (Scalable Coherent Interface)
Hermann Hellwagner
Klagenfurt University, Klagenfurt, Austria

Synonyms
International standard ISO/IEC :(E) and IEEE Std , edition

Definition
Scalable Coherent Interface (SCI) is the specification (standardized by ISO/IEC and the IEEE) of a high-speed, flexible, scalable, point-to-point-based interconnect technology that was implemented in various ways to couple multiple processing nodes. SCI supports both the message-passing and shared-memory communication models, the latter in either the cache-coherent or the non-coherent variant. SCI can be deployed as a system area network for compute clusters, as a memory interconnect for large-scale, cache-coherent, distributed-shared-memory multiprocessors, or as an I/O subsystem interconnect.
Discussion
Introduction
SCI originated in an effort of bus experts in the late s to define a very high performance computer bus ("Superbus") that would support a significant degree of multiprocessing, i.e., a large number of processors. It was soon realized that backplane bus technology would not be able to meet this requirement, despite advanced concepts like split transactions and sophisticated implementations of the latest bus standards and products. The committee thus abandoned the bus-oriented view
and developed novel, distributed solutions to overcome the shared-resource and signaling problems of buses, while retaining the overall goal of defining an interconnect that offers the convenient services well known from centralized buses [, ]. The resulting specification of SCI, approved initially in [] and finally in the late s [], thus describes hardware and protocols that provide processors with the shared-memory view of buses. SCI also specifies related transactions to read, write, and lock memory locations without software protocol involvement, as well as to transmit messages and interrupts. Hardware protocols to keep processor caches coherent are defined as an implementation option. The SCI interconnect, the memory system, and the associated protocols are fully distributed and scalable: an SCI network is based on point-to-point links only, implements a distributed shared memory (DSM) in hardware, and avoids serialization in almost any respect.
Goals of SCI's Development
Several ambitious goals guided the specification process of SCI [, , ].
High performance. The primary objective of SCI was to deliver high communication performance to parallel or distributed applications. This comprises:
– High sustained throughput;
– Low latency;
– Low CPU overhead for communication operations.
The performance goals set forth were in the range of Gbit/s link speeds and latencies in the low microseconds range in loosely-coupled systems, even less in tightly-coupled multiprocessors.
Scalability. SCI was devised to address scalability in many respects, among them []:
– Scalability of performance (aggregate bandwidth) as the number of nodes grows;
– Scalability of interconnect distance, from centimeters to tens of meters, depending on the media and physical layer implementation, but based on the same logical layer protocols;
– Scalability of the memory system, in particular of the cache coherence protocols, without a practical limit on the number of processors or memory modules that could be handled;
– Technological scalability, i.e., use of the same mechanisms in large-scale and small-scale as well as tightly-coupled and loosely-coupled systems, and the ability to readily make use of advances in technology, e.g., high-speed links;
– Economic scalability, i.e., use of the same mechanisms and components in low-end, high-volume and high-end, low-volume systems, opening the chance to leverage the economies of scale of mass production of SCI hardware;
– No short-term practical limits to the addressing capability, i.e., an addressing scheme for the DSM wide enough to support a large number of nodes and large node memories.
Coherent memory system. Caches are crucial for modern microprocessors to reduce the average access time to data. This specifically holds for a DSM system with NUMA (non-uniform memory access) characteristics, where remote accesses can be roughly an order of magnitude more expensive than local ones. To support a convenient programming model as known, e.g., from symmetric multiprocessors (SMPs), the caches should be kept coherent in hardware.
Interface characteristics. The SCI specification was intended to describe a standard interface to an interconnect to enable multiple devices from multiple vendors to be attached and to interoperate. In other words, SCI should serve as an "open distributed bus" connecting components like processors, memory modules, and intelligent I/O devices in a high-speed system.

Main Concepts of SCI
In the following, the main concepts and features of SCI will be summarized and the achievements, as represented by several implementations of SCI networks, will be assessed.
Point-to-point links. An SCI interconnect is defined to be built only from unidirectional, point-to-point links between participating nodes. These links can be used for concurrent data transfers, in contrast to the one-at-a-time communication characteristics of buses. The number of links grows as nodes are added to the system, increasing the aggregate bandwidth of the network. The links can be made fast, and their performance can scale with improvements in the underlying technology. They can be implemented in a bit-parallel manner
(for small distances) or in a bit-serial fashion (for larger distances), with the same logical layer protocols. Most implementations use parallel links over distances of a few centimeters or meters.
Sophisticated signaling technology. The data transfer rates and lengths of shared buses are inherently limited due to signal propagation delays and signaling problems on the transmission lines. The unidirectional, point-to-point SCI links avoid such signaling problems. Since there is only a single transmitter and a single receiver rather than multiple devices, the signaling speed can be increased significantly. High speeds are also fostered by low-voltage differential signals. Furthermore, SCI strictly avoids back-propagating signals, even reverse flow control on the links, in favor of high signaling speeds and scalability. Flow control information becomes part of the normal data stream in the reverse direction, leading to the requirement that an SCI node must have at least one outgoing link and one incoming link. Already in the mid s, SCI link implementation speeds reached Mbyte/s in system area networks (distances of a few meters, -bit parallel links) and Gbyte/s in closely-coupled, cache-coherent shared-memory multiprocessors; transfer rates of Gbyte/s
have also been demonstrated over a distance of about m, using parallel fiber-optic links; see [].
Nodes. SCI was designed to connect up to k nodes. A node can be a complete workstation or server machine, a processor and its associated cache only, a memory module, I/O controllers and devices, or bridges to other buses or interconnects, as illustrated exemplarily in Fig. . Each node is required to have a standard interface to attach to the SCI network, as described in [, , ]. In most SCI systems implemented so far, nodes are complete machines, often even multiprocessors.
Topology independence. In principle, SCI networks with complex topologies could be built; investigations into this area are described in several chapters of []. However, the standard anticipates simple topologies to be used. For small systems, for instance, the preferred topology is a small ring (a so-called ringlet); for larger systems, topologies like a single switch connecting multiple ringlets, rings of rings, or multidimensional tori are feasible; see Fig. . Most SCI systems implemented used single rings, a switch, multiple rings, or two-dimensional tori.
Fixed addressing scheme. SCI uses the -bit fixed addressing scheme defined by the Control and Status
SCI (Scalable Coherent Interface). Fig. Simple SCI network topologies
Register (CSR) Architecture standard (IEEE Std ) []. The -bit SCI address is divided into two fixed parts: the most significant address bits specify the node ID (node address) so that an SCI network can comprise up to k nodes; the remaining bits are used for addressing within the nodes, in compliance with the CSR Architecture standard. Hardware-based distributed shared memory (DSM). The SCI addressing scheme spans a global, -bit address space; in other words, a physically addressed, distributed shared memory system. The distribution of the memory is transparent to software and even to processors, i.e., the memory is logically shared. A memory access by a processor is mediated to the target memory module by the SCI hardware. The major advantage of this feature is that internode communication can be effected by simple load and store operations by the processor, without invocation of a software protocol stack. The instructions accessing remote memory can be issued at user level; the operating system need not be involved in communication. This results in very low latencies for SCI communication, typically in the low microseconds range. A major implementation challenge, however, is how to integrate the SCI network (and, thus, access to the system-wide SCI DSM) with the memory architecture of a standard single-processor workstation or a multiprocessor node. The common solutions, attaching SCI to the I/O bus or to the memory bus, will be outlined below. Bus-like services. To complete the hardware DSM, SCI defines transactions to read, write, and lock memory locations, functionality well-known from computer buses. In addition, message passing and global time synchronization are supported, both as defined by the CSR Architecture; interrupts can be delivered remotely as well. Broadcast functionality is also defined. Transactions can be tagged with four different priorities. In order to avoid starvation of low-priority nodes, fair protocols for bandwidth allocation and queue allocation have been developed. Bandwidth allocation is similar in effect to bus arbitration in that it assigns transfer bandwidth (if scarce) to nodes willing to send. Queue allocation apportions space in the input queues of heavily loaded, shared nodes, e.g., memory modules or switch ports, which are targeted by many nodes
simultaneously. Since the services have to be implemented in a fully distributed fashion, the underlying protocols are rather complex.
Split transactions. Like multiprocessor buses, SCI strictly splits transactions into request and response phases. This is a vital feature to avoid scalability impediments; it makes signaling speed independent of the distance a transaction has to travel and avoids monopolizing network links. Transactions therefore have to be self-contained and are sent as packets, containing a transaction ID, addresses, commands, status, and data as needed. A consequence is that multiple transactions can be outstanding per node. Transactions can thus be pumped into the network at a high rate, using the interconnect in a pipeline fashion.
Optional cache coherence. SCI defines distributed cache coherence protocols, based on a distributed-directory approach, a multiple readers – single writer sharing regime, and write invalidation. The memory coherence model is purposely left open to the implementer. The standard provides optimizations for common situations such as pair-wise sharing that improve the performance of frequent coherence operations. The cache coherence protocols are designed to be implemented in hardware; however, they are highly sophisticated and complex. The complexity stems from a large number of states of coherent memory and cache blocks, correspondingly complex state transitions, and the advanced algorithms that ensure atomic modifications of the distributed-directory information (e.g., insertions, deletions, invalidations). The greatest complication arises from the integration of the SCI coherence protocols with the snooping protocols typically employed on the nodes' memory buses. An implementation is highly challenging and incurs some risks and potentially high costs. Not surprisingly, only a few companies have done implementations, as described below. The cache coherence protocols are provided as options only. A compliant SCI implementation need not cover coherence; an SCI network cannot even participate in coherence actions when it is attached to the I/O bus, as is the case in compute clusters. Yet, a common misconception was that cache coherence was required functionality at the core of SCI. This
misunderstanding has clearly hindered SCI’s proliferation for non-coherent uses, e.g., as a system area network. Reliability in hardware. In order to enable high-speed transmission, error detection is done in hardware, based on a -bit CRC code which protects each SCI packet. Transactions and hardware protocols are provided that allow a sender to detect failure due to packet corruption, and allow a receiver to notify the sender of its inability to accept packets (due to a full input queue) or to ask the sender to re-send the packet. Since this happens on a per-packet basis, SCI does not automatically guarantee in-order delivery of packets. This may have a considerable impact on software which would rely on a guaranteed packet sequence. An example is a message passing library delivering data into a remote buffer using a series of remote write transactions and finally updating the tail pointer of the buffer. Some SCI hardware provides functionality to enforce a certain memory access order, e.g., via memory barriers []. Various time-outs are provided to detect lost packets or transmission errors. Hardware retry mechanisms or software recovery protocols may be implemented based on transmission-error detection and isolation mechanisms; these are however not part of the standard. As a consequence, SCI implementations differed widely in the way errors are dealt with. The protocols are designed to be robust, i.e., they should, e.g., survive the failure of a node with outstanding transactions. Among other mechanisms, error containment and logging procedures, ringlet maintenance functions and a packet time-out scheme are specified. Robustness is particularly important for the cache coherence protocols which are designed to behave correctly even if a node fails amidst the modification of a distributed-directory entry. Layered specification. The SCI specification is structured into three layers. At the physical layer, three initial physical link models are defined: a parallel electrical link operating at Gbyte/s over short distances (meters); a serial electrical link that operates at Gbit/s over intermediate distances (tens of meters); and a serial optical link that operates at Gbit/s over long distances (kilometers).
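Returning to the in-order-delivery caveat noted under reliability above: a message-passing layer of the kind described there typically writes the payload into the remote buffer first, forces completion of those remote writes, and only then publishes the new tail index. The sketch below is a minimal illustration under assumptions; store_barrier() is a hypothetical placeholder for whatever flush/ordering primitive a given SCI adapter and driver expose, and the ring layout is invented for the example.

    #include <stdint.h>
    #include <string.h>

    #define RING_SLOTS 64
    #define SLOT_BYTES 256

    struct ring {                      /* lives in an exported DSM segment        */
        uint8_t  slot[RING_SLOTS][SLOT_BYTES];
        volatile uint32_t tail;        /* written by the producer only            */
    };

    extern void store_barrier(void);   /* hypothetical: drain and order remote stores */

    void ring_send(volatile struct ring *r, uint32_t tail,
                   const void *msg, size_t len) {
        memcpy((void *)r->slot[tail % RING_SLOTS], msg, len); /* remote writes    */
        store_barrier();               /* payload must arrive before the tail     */
        r->tail = tail + 1;            /* consumer polls this word on its side    */
    }

Without the barrier, the tail update could overtake the payload packets, exactly the reordering hazard described above.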
Although the definitions include the electrical, mechanical, and thermal characteristics of SCI modules, connectors, and cables, the specifications were not adhered to in implementations. SCI systems typically used vendor-specific physical layer implementations, incompatible to others. It is fair to say, therefore, that SCI has not become the open, distributed interconnect system that the designers had envisaged to create. The logical layer specifies transaction types and protocols, packet types and formats, packet encodings, the standard node interface structure, bandwidth and queue allocation protocols, error processing, addressing and initialization issues, and SCI-specific CSRs. The cache coherence layer provides concepts and hardware protocols that allow processors to cache remote memory blocks while still maintaining coherence among multiple copies of the memory contents. Since an SCI network no longer has a central resource (i.e., a memory bus) that can be snooped by all attached processors to effect coherence actions, distributeddirectory-based solutions to the cache coherence problem had to be devised. At the core of the cache coherence protocols are the distributed sharing lists shown in Fig. . Each shared block of memory has an associated distributed, doublylinked sharing list of processors that hold a copy of the block in their local caches. The memory controller and the participating processors cooperatively and concurrently create and update a block’s sharing list, depending on the operations on the data. C code. A remarkable feature of the SCI standard is that major portions are provided in terms of a “formal” specification, namely, C code. Text and figures are considered explanatory only, the definitive specification are the C listings. Exceptions are packet formats and the physical layer specifications. The major reasons for this approach are that C code is (largely) unambiguous and not easily misunderstood and that the specification becomes executable, as a simulation. In fact, much of the specification was validated by extensive simulations before release.
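The distributed sharing lists described above can be pictured with a pair of record types. This is only a schematic rendering of the tag fields named in the sharing-list figure later in this entry (forward/backward pointers holding node IDs, a cache state, and a memory state); the widths and state names are illustrative assumptions, not the encoding defined by the standard.

    #include <stdint.h>

    /* Schematic coherence tags for one shared memory block. */
    typedef struct {
        uint16_t fwd_id;      /* node ID of the sharing-list head (or "no sharers") */
        uint8_t  mem_state;   /* home memory state, e.g., FRESH or GONE (assumed)   */
    } mem_tag_t;

    typedef struct {
        uint16_t fwd_id;      /* next sharer toward the list tail                   */
        uint16_t back_id;     /* previous sharer, or the memory controller          */
        uint16_t mem_id;      /* node ID of the block's home memory                 */
        uint8_t  cache_state; /* e.g., ONLY_DIRTY, HEAD_FRESH, MID_VALID (assumed)  */
    } cache_tag_t;

Insertion at the head, deletion from the middle, and invalidation of the whole list are then pointer manipulations on these distributed records, performed cooperatively by the home memory controller and the caching nodes.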
Implementations and Applications of SCI SCI was originally conceived as a shared-memory interconnect, but SCI’s flexibility and performance potential
SCI (Scalable Coherent Interface). Fig. Sharing list and coherence tags of SCI cache coherence protocols (64K nodes and 64-byte memory/cache blocks assumed; the coherence tags amount to roughly 4–7% storage overhead per block)
also for other applications was soon realized and leveraged by industry. In the following, a few of these “classical” applications of SCI are introduced and examples of commercial systems that exploited SCI technology are provided.
System Area Network for Clusters In compute clusters, an SCI system area network can provide high-performance communication capabilities. In this application, the SCI interconnect is attached to the I/O bus of the nodes (e.g., PCI) by a peripheral adapter card, very similar to a LAN; see Fig. . In contrast to “standard” LAN technology and most other system area networks, though, the SCI cluster network, by virtue of the common SCI address space and associated transactions, provides hardware-based, physical distributed shared memory. Figure shows a high-level view of the DSM. An SCI cluster thus is more tightly coupled than a LAN-based cluster, exhibiting the characteristics of a NUMA parallel machine. The SCI adapter cards, together with the SCI driver software, establish the DSM as depicted in Fig. . (This description pertains to the solutions taken by the pioneering vendor, Dolphin Interconnect Solutions, for their SBus-SCI and PCI-SCI adapter cards [, ].) A node that is willing to share memory with other nodes (e.g., A), creates shared memory segments in its physical memory and exports them to the SCI network (i.e., SCI address space). Other nodes (e.g., B) import these DSM segments into their I/O address space. Using on-board address translation tables (ATTs), the SCI
adapters maintain the mappings between their local I/O addresses and the global SCI addresses. Processes on the nodes (e.g., i and j) may further map DSM segments into their virtual address spaces. The latter mappings are conventionally maintained by the processors' MMUs. Once the mappings have been set up, inter-node communication may be performed by the participating processes at user level, by simple CPU load and store operations into DSM segments mapped from remote memories. The SCI adapters translate I/O bus transactions that result from such memory accesses into SCI transactions, and vice versa, and perform them on behalf of the requesting processor. Thus, remote memory accesses are transparent to the requesting processes and do not need intervention by the operating system. In other words, no protocol stack is involved in remote memory accesses, resulting in low communication latencies. The prevailing commercial implementations of such SCI cluster networks were the SBus-SCI and PCI-SCI adapter cards (and associated switches) offered by Dolphin Interconnect Solutions []. These cluster products were used by other companies to build turnkey cluster platforms; examples were Sun Microsystems, offering high-performance, high-availability server clusters, and Scali Computers and Siemens providing clusters for parallel computing based on MPI; see []. Similar SCI adapters are still offered at the time of writing and are being used in embedded systems, e.g., medical devices such as CT scanners [].
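A minimal sketch of the resulting programming model follows. The function map_remote_segment() is hypothetical, standing in for the driver call that imports an exported DSM segment and returns a CPU-addressable pointer; the real Dolphin driver interface differs and is not reproduced here.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical: map a DSM segment that node `node` exported under `segment_id`. */
    extern volatile uint32_t *map_remote_segment(int node, int segment_id, size_t len);

    void publish_samples(const uint32_t *samples, size_t n) {
        volatile uint32_t *remote =
            map_remote_segment(/*node=*/1, /*segment_id=*/7, n * sizeof *remote);
        for (size_t i = 0; i < n; i++)
            remote[i] = samples[i];   /* plain user-level stores; the adapter turns
                                         them into SCI write transactions */
    }

Once the mapping exists, no system call or protocol stack is involved in the transfer itself, which is the source of the low latencies mentioned above.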
SCI (Scalable Coherent Interface). Fig. SCI cluster model
SCI (Scalable Coherent Interface). Fig. Address spaces and address translations in SCI clusters
In addition, there were research implementations of adapter cards, among them one developed at CERN as a prototype to investigate SCI’s feasibility and performance for demanding data acquisition systems in highenergy physics, i.e., in one of the LHC experiments; see []. The description of an SCI cluster interconnect given above does not address many of the low-level problems and functionality that the implementation has to cover. For instance, issues like the translation of -bit node addresses to -bit SCI addresses and vice versa, the choice of the shared segment size, error detection and handling, high-performance data transfers between node hardware and SCI adapter, and the design and implementation of low-level software (SCI drivers) as well as message-passing software (e.g., MPI) represent a spectrum of research and development problems. Several contributions in [] describe these problems and corresponding solutions in some detail. An important property of such SCI cluster interconnect adapters is worth pointing out here. Since an SCI cluster adapter attaches to the I/O bus of a node,
it cannot directly observe, and participate in, the traffic on the memory bus of the node. This therefore precludes caching and coherence maintenance of memory regions mapped to the SCI address space. In other words, remote memory contents are basically treated as non-cacheable and are always accessed remotely. Standard SCI cluster interconnect hardware does not implement cache coherence capabilities therefore. This property raises a performance concern: remote accesses (round-trip operations such as reads) must be used judiciously since they are still an order of magnitude more expensive than local memory accesses (NUMA characteristics). The basic approach to deal with the latter problem is to avoid remote operations that are inherently round-trip, i.e., reads, as far as possible. Rather, remote writes are used which are typically buffered by the SCI adapter and therefore, from the point of view of the processor issuing the write, experience latencies in the range of local accesses, several times faster than remote read operations. Again in [], several chapters describe how considerations like this influence the design and
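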
implementation of efficient message-passing libraries on top of SCI. Finally, several contributions in [] deal with how to overcome the limitations of non-coherent SCI cluster hardware, in particular for the implementation of shared-memory and shared-object programming models. Techniques from software DSM systems, e.g., replication and software coherence maintenance, are applied to provide a more convenient abstraction, e.g., a common virtual address space spanning all the nodes in the SCI cluster.
Memory Interconnect for Cache-Coherent Multiprocessors
The use of SCI as a cache-coherent memory interconnect allows nodes to be even more tightly coupled than in a non-coherent cluster. This application requires SCI to be attached to the memory bus of a node, as shown in Fig. . At this attachment point, SCI can participate in and “export,” if necessary, the memory and cache coherence traffic on the bus and make the node's memory visible and accessible to other nodes. The nodes' memory address ranges (and the address mappings of processes) can be laid out to span a global (virtual) address space, giving processes transparent and coherent access to memory anywhere in the system. Typically, this approach is adopted to connect multiple bus-based commodity SMPs to form a large-scale, cache-coherent (CC) shared-memory system, often termed a CC-NUMA machine. There were three notable examples of SCI-based CC-NUMA machines: the HP/Convex Exemplar series [], the Sequent NUMA-Q multiprocessor [], and the Data General AViiON scalable servers (see []). The latter two systems comprised bus-based SMP nodes with Intel processors, while the Exemplar used HP PA-RISC processors and a non-blocking crossbar switch within the nodes. The inter-node memory interconnects were proprietary implementations of the SCI standard, with specific adaptations and optimizations incorporated to ease implementation and integration with the node architecture and to foster overall performance. The major challenge in building a CC-NUMA machine is to bridge the cache coherence mechanisms on the intra-node interconnect (e.g., the SMP bus running a snooping protocol) and the inter-node network (SCI). Since it is well documented, the Sequent NUMA-Q machine is used as a case study in [] to illustrate the essential issues in building such a bridge and making the protocols interact correctly.
SCI (Scalable Coherent Interface). Fig. SCI-based CC-NUMA multiprocessor model
I/O Subsystem Interconnect
SCI can be used to connect one or more I/O subsystems to a computing system in novel ways. The shared SCI address space can include the I/O nodes, which are then enabled to directly transfer data between the peripheral devices (in most cases, disks) and the compute nodes' memories using DMA; software need not be involved in the actual transfer. Remote peripheral devices in a cluster, for instance, can thus become accessible like local devices, resulting in an I/O model similar to that of SMPs; remote interrupt capability can also be provided via SCI. High bandwidth and low latency, in addition to the direct remote memory access capability, make SCI an interesting candidate for an I/O network. There were two commercial implementations of SCI-based I/O system networks. One was the GigaRing channel from SGI/Cray []; the other was the external I/O subsystem interconnect of the Siemens RM Enterprise Servers, based on Dolphin's cluster technology; see again [].
Concluding Remarks
SCI addresses difficult interconnect problems and specifies innovative distributed structures and protocols for a scalable DSM architecture. The specification covers a wide spectrum of bus, network, and memory architecture problems, ranging from signaling considerations up to distributed directory-based cache coherence mechanisms. In fact, this wide scope of SCI has raised criticism that the standard is actually “several standards in one” and difficult to understand and work with. This lack of a clear profile and the wide applicability of SCI have probably contributed to its relatively modest acceptance in industry. Furthermore, the SCI protocols are complex (albeit well devised) and not easily implemented in silicon. Thus, implementations are quite complex and therefore expensive. In an attempt to reduce complexity and optimize speed, many of the implementers adopted the concepts and protocols which they regarded as appropriate for their application, and left out or changed other features according to their needs. This use of SCI led to a number of proprietary, incompatible implementations. As a consequence, the goal of economic scalability has not been satisfactorily achieved in general. Further, SCI has clearly also missed the goal of evolving into an “open distributed bus” that multiple devices from different vendors could attach to and interoperate. However, in terms of the technical objectives, predominantly high performance and scalability, SCI has well achieved its ambitious goals. The vendors that adopted and implemented SCI (in various flavors) offered innovative high-throughput, low-latency interconnect products or full-scale cache-coherent shared-memory multiprocessing systems, as described above. SCI had noticeable influence on further developments, e.g., interconnect standard projects like SyncLink, a high-speed memory interface, and Serial Express (SerialPlus), an extension to the SerialBus (IEEE ) interconnect. SCI also played a role in the debate on “future I/O systems” in the late s, which then led to the InfiniBand™ Architecture. The latest developments on SCI are explored in more detail in the final chapter of [].
Related Entries
Buses and Crossbars
Cache Coherence
Clusters
Distributed-Memory Multiprocessor
InfiniBand
Myrinet
NOW
Nonuniform Memory Access (NUMA) Machines
Quadrics
Shared-Memory Multiprocessors
Bibliographic Notes and Further Reading SCI was standardized and deployed in the s. David B. Gustavson, one of the main designers, provides firsthand information on SCI’s origins, development, and concepts [, , ]; even the standard document is stimulating material [, ]. Gustavson also maintained a Website that actively traced and documented the activities around SCI and related standards []. In the late s, the interest in SCI largely disappeared. However, [] was published as a comprehensive (and somehow concluding) summary of the technology, research and development projects, and achievements around SCI. For many projects and products, the corresponding chapters in [] are the only documentation still available in the literature; in particular, information on products has disappeared from the Web meanwhile, but is still available to some extent in this book for reference. This entry of the encyclopedia is a condensed version of the introductory chapter of [].
Bibliography . Brewer T, Astfalk G () The evolution of the HP/Convex Exemplar. In: Proceedings of COMPCON, San Jose . Dolphin Interconnect Solutions () http://www.dolphinics.com. Accessed October . Gustavson DB, () The Scalable Coherent Interface and related standards projects. IEEE Micro ():–
. Gustavson DB () The many dimensions of scalability. In: Proceedings of the COMPCON Spring, San Francisco . Gustavson DB Li Q () The Scalable Coherent Interface (SCI). IEEE Comm Mag :– . Hellwagner H, Reinefeld A (eds) () SCI: Scalable Coherent Interface. Architecture and software for high-performance compute clusters. LNCS vol , Springer, Berlin . IEEE Std - () IEEE standard Control and Status Register (CSR) Architecture for microcomputer buses . IEEE Std - () IEEE standard for Scalable Coherent Interface (SCI) . International Standard ISO/IEC :(E) and IEEE Std , Edition () Information technology - Scalable Coherent Interface (SCI). ISO/IEC and IEEE . Lovett T, Clapp R () STiNG: a CC-NUMA computer system for the commercial marketplace. In: Proceedings of the rd Internationl Symposium on Computer Architecture (ISCA), May , Philadelphia . SCIzzL: the local area memory port, local area multiprocessor, Scalable Coherent Interface and Serial Express users, developers, and manufacturers association. http://www.scizzl.com. Accessed October . Scott S () The GigaRing channel. IEEE Micro ():–
Semantic Independence
Martin Fränzle, Christian Lengauer
Carl von Ossietzky Universität, Oldenburg, Germany
University of Passau, Passau, Germany
Definition Two program fragments are semantically independent if their computations are mutually irrelevant. A consequence for execution is that they can be run on separate processors in parallel, without any synchronization or data exchange. This entry describes a relational characterization of semantic independence for imperative programs.
Discussion Introduction The most crucial analysis for the discovery of parallelism in a sequential, imperative program is the dependence analysis. This analysis tells which program operations must definitely be executed in the order prescribed by the source code (see the entries on dependences and dependence analysis). The conclusion is that all others can be executed in either order or in parallel.
The converse problem is that of discovering independence. Program fragments recognized as independent may be executed in either order or in parallel; all others must be executed in the order prescribed by the source code. The search for dependences is at the basis of an automatic program parallelizer (see the entries on autoparallelization and on the polyhedron model). The determination of the independence of two program fragments is a software engineering technique that can help to map a specific application program to a specific parallel or distributed architecture, or it can tell something specific about the information flow in the overall program. While the search for dependences or independences has to be based on some form of analysis of the program text, the criteria underlying such an analysis (which is necessarily incomplete in any Turing-complete programming language) should be semantic. The reason is that the syntactic form of the source program will make certain (in)dependences easy to recognize and will occlude others. It is therefore much wiser to define independence in a model which does not entail such an arbitrary bias. An appropriate model for the discovery of independence between fragments of an imperative program is the relation between its input states and its output states.
Central Idea and Program Example
Program examples adhere to the following notational conventions:
● The value sets of variables are specified by means of a declaration of the form var x, y, ... : {0, ..., n}, for n ≥ 1 (that is, they are finite).
● If the value set is {0, 1}, Boolean operators are used with their usual meaning, where the number 0 is identified with false and 1 with true.
● Boolean operators bind more tightly than arithmetic ones.
The initial paper on semantic independence [] contains a number of illustrative examples. The most complex one, Example , shall serve here as a guide:
var x, y, z : {0, 1};
α1 : (x, y) := (x ∧ z, y ∧ ¬z)
α2 : (x, y) := (x ∧ ¬z, y ∧ z)
The question is whether the effect of these two distinct multiple assignments can be achieved by two independent operations which could be executed mutually in parallel. At first glance, it does not seem possible, since x and y are subject to update and are shared by both statements. However, this is due to the particular syntactic form of the program. One should not judge independence by looking at the program text but by looking at a semantic model. One suitable model is a graph of program state transitions. The graph for the program above is depicted in Fig. . The nodes are program states. A program state consists of the three binary values of variables x, y, and z. State transitions are modelled by arrows between states. Solid arrows depict transitions of statement α1, dashed arrows transitions of statement α2. The question to be answered in order to determine independence on any platform is: is there a transformation of the program's state space which leads to variables that can be accessed independently? It is not sufficient to require updates to be independent but allow reading to be shared, because the goal is to let semantically independent program fragments access disjoint pieces of memory. This obviates a cache coherence protocol in a multiprocessor system and improves the
utilization of the cache space. Some special-purpose platforms, for example, systolic arrays, disallow shared reading altogether; see the entry on systolic arrays. Consequently, all accesses – writing and reading – must be independent. For the example, there is such a transformation of the variables x, y, and z to new variables V and W:
var V : {0, 1}, W : {0, 1, 2, 3};
V = (x ∧ ¬z) ∨ (y ∧ z)
W = 2 ∗ z + (x ∧ z) ∨ (y ∧ ¬z)
The transformation is depicted in Fig. . The figure shows the same state graph as Fig.  but imposes two partitionings on the state space. The set ρ1 of the two amorphously encased partitions yields the two-valued variable V. As denoted in the figure, each partition models one distinct value of the variable. The set ρ2 of the four rectangularly encased partitions yields the four-valued variable W. The partitionings must – and do – satisfy certain constraints to be made precise below. The semantically equivalent program that results is:
var V : {0, 1}, W : {0, 1, 2, 3};
α1′ : V := 0
α2′ : W := 2 ∗ (W div 2)
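The independence claim for this example can be checked mechanically by enumerating all eight states. The little C program below is my own verification harness (not from the cited paper); it confirms that, in the transformed coordinates, α1 touches only V, α2 touches only W, and the two statements commute.

    #include <assert.h>
    #include <stdio.h>

    struct st { int x, y, z; };

    static struct st a1(struct st s) { s.x = s.x & s.z;  s.y = s.y & !s.z; return s; }
    static struct st a2(struct st s) { s.x = s.x & !s.z; s.y = s.y & s.z;  return s; }

    static int V(struct st s) { return (s.x & !s.z) | (s.y & s.z); }
    static int W(struct st s) { return 2 * s.z + ((s.x & s.z) | (s.y & !s.z)); }

    int main(void) {
        for (int x = 0; x < 2; x++)
          for (int y = 0; y < 2; y++)
            for (int z = 0; z < 2; z++) {
                struct st s = { x, y, z }, t1 = a1(s), t2 = a2(s);
                assert(W(t1) == W(s) && V(t1) == 0);              /* alpha1 acts as "V := 0"          */
                assert(V(t2) == V(s) && W(t2) == 2 * (W(s) / 2)); /* alpha2 acts as "W := 2*(W div 2)" */
                struct st u = a2(a1(s)), v = a1(a2(s));           /* and the two statements commute   */
                assert(u.x == v.x && u.y == v.y && u.z == v.z);
            }
        puts("independence of the example verified on all 8 states");
        return 0;
    }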
Semantic Independence. Fig. State relations of the program example
Semantic Independence. Fig. State relations of the program example, partitioned (the partitions correspond to V = 0, 1 and W = 0, 1, 2, 3)
As is plainly evident, the new statements α1′ and α2′ access distinct variable spaces and are, thus, trivially independent. In Fig. , this property is reflected by the fact that α1 (and, thus, α1′) transitions are exclusively inside partitions of partitioning ρ1 and α2 (and, thus, α2′) transitions exclusively inside partitions of partitioning ρ2. The notion of semantic independence illustrated above and formalized below specifies a transformation of the state space that induces two partitionings, each covering one update in the transformed state space, such that both updates can be performed in parallel, without any need for shared access to program variables: they are completely independent in the sense that no information flows between them. If the parallel updates depend on a common context, that is, they constitute only a part of the overall program, the transformations between the original and the transformed state space must become part of the implementation. Then, the transformation from the original to the transformed space may require multiple read accesses to variables in the original state space and the backtransformation may require multiple read accesses to variables in the transformed state space. Depending on the abilities of the hardware, the code for these transformations may or may not be parallelizable (see the section on disjoint parallelism and its limitations).
Formal Definition The first objective of this subsection is the independence of two imperative program fragments, as illustrated in the example. The generalization to more than two program fragments as well as to the interior of a single fragment follows subsequently.
Independence Between Two Statements A transformation of the state space can be modelled as a function
η : X → Y
from a state space X to a state space Y. In this subsection, consider two program fragments, R and S, which are modelled as relations on an arbitrary (finite, countable, or uncountable) state space X:
R ⊆ X × X and S ⊆ X × X.
η is supposed to expose binary independence. This leads to a partitioning of state space Y, that is, Y should be the Cartesian product of two nontrivial, that is, neither empty nor singleton, state spaces A and B: Y = A × B. Desired are two new program fragments R′ ⊆ Y × Y and S′ ⊆ Y × Y, such that R′ operates solely on A and S′ operates solely on B. Given two relations r ⊆ A × A and s ⊆ B × B, denote their product relation as follows:
r ⊗ s = {((a, b), (a′, b′)) ∣ (a, a′) ∈ r, (b, b′) ∈ s} ⊆ (A × B) × (A × B)
Also, write IM for the identity relation on set M, that is, the relation {(m, m) ∣ m ∈ M}. Then, the above requirement that R′ and S′ operate solely on A and B, respectively, amounts to requiring that
R′ = (r ⊗ IB) and S′ = (IA ⊗ s)
for appropriate relations r ⊆ A × A and s ⊆ B × B. Furthermore, the composition of R′ and S′ should exhibit the same input–output behavior on the image η(X) ⊆ Y of the transformation η as the composition of R and S on X. Actually, this requirement has been strengthened to a mutual correspondence in order to avoid useless parallelism in which either R′ or S′ does the full job of both R and S: R′ must behave on its part of η(X) like R on X, and S′ like S. With relational composition denoted as left-to-right juxtaposition, that is, r s = {(x, y) ∣ ∃z : (x, z) ∈ r ∧ (z, y) ∈ s}, the formal requirement is:
R η = η R′ and S η = η S′.
In order to guarantee a correspondence between the resulting output state in Y and its equivalent in X, there is one more requirement: η's relational inverse η−1 = {(y, x) ∣ y = η(x)}, when applied to η(X), must not incur a loss of information of a poststate generated by program fragments R or S. This is expressed by the two equations:
R = η R′ η−1 and S = η S′ η−1
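For finite state spaces, the relations above can be represented as Boolean adjacency matrices, which makes the product relation r ⊗ s and requirements such as R′ = (r ⊗ IB) easy to compute. The fragment below is only an illustration of the definition (the sizes |A| = 2 and |B| = 4 are taken from the running example; nothing here is part of the cited formalism's tooling).

    #include <stdbool.h>
    #include <stdio.h>

    #define NA 2   /* |A| */
    #define NB 4   /* |B| */

    /* r ⊆ A×A and s ⊆ B×B as adjacency matrices; the product relation relates
       (a,b) to (a2,b2) iff r relates a to a2 and s relates b to b2. */
    static bool product(const bool r[NA][NA], const bool s[NB][NB],
                        int a, int b, int a2, int b2) {
        return r[a][a2] && s[b][b2];
    }

    int main(void) {
        bool r[NA][NA] = {{false}}, idB[NB][NB] = {{false}};
        for (int a = 0; a < NA; a++) r[a][0] = true;    /* r: "first component := 0" */
        for (int b = 0; b < NB; b++) idB[b][b] = true;  /* identity relation I_B     */

        /* R' = r ⊗ I_B never changes the B component: */
        for (int a = 0; a < NA; a++)
          for (int b = 0; b < NB; b++)
            for (int a2 = 0; a2 < NA; a2++)
              for (int b2 = 0; b2 < NB; b2++)
                if (product(r, idB, a, b, a2, b2) && b2 != b)
                    printf("unexpected change of the B component\n");
        return 0;
    }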
Figure depicts the previous two requirements as commuting diagrams, on the left for relation R and on the right for relation S. The dashed arrows represent the former and the dashed-dotted arrows the latter requirement. Together they imply: R = R η η − and S = S η η− . It is neither a consequence that η− is a function when confined to the images of R or S, nor that it is total on these images. However, if η η− contains two states (x, y) and (x, y′ ), then y and y′ are R-equivalent (and S-equivalent) in the sense that (z, y) ∈ R iff (z, y′ ) ∈ R, for each z ∈ X (and likewise for S), that is, y and y′ occur as poststates of exactly the same prestates. If R (or S) is a total function, which is the case for deterministic programs, then η− restricted to the image of R (or S, respectively) must be a total function. Putting all requirements together, one arrives at the following definition of semantic independence between two imperative program fragments.
Semantic Independence. Fig. Two requirements on semantic independence

Definition (Semantic independence []) Two relations R and S on the state set X are called semantically independent if there are nontrivial (i.e., cardinality > 1) sets A and B, relations r on A and s on B, and a function η : X → A × B such that
R η = η (r ⊗ IB),  R η η−1 = R,
S η = η (IA ⊗ s),  S η η−1 = S.    (1)
Under the conditions of Eq. (1), R is simulated by r in the sense that
R = R η η−1 = η (r ⊗ IB) η−1
and, likewise, S by s due to
S = S η η−1 = η (IA ⊗ s) η−1.
That is, after applying state transformation η, an effect equivalent to R can be obtained by executing r, with the result to be extracted by transforming the state space back via η−1. The gain is that r only operates on the first component A of the transformed state space, not interfering with s which, in turn, simulates S on the second component B. An immediate consequence of Definition 1 is that semantically independent relations commute with respect to relational composition, which models sequential composition in relational semantics. Take again R and S:
R S = R S η η−1
    = R η (IA ⊗ s) η−1
    = η (r ⊗ IB)(IA ⊗ s) η−1
    = η (IA ⊗ s)(r ⊗ IB) η−1
    = S η (r ⊗ IB) η−1
    = S R η η−1
    = S R
However, the converse is not true: semantic independence is more discriminative than commutativity. Intuitively, it has to be, since semantic independence
must ensure truly parallel execution, without the risk of interference, while commutativity may hold for interfering program fragments. A case of two program fragments which commute but are not semantically independent was provided as Example  in the initial paper []:
var x : {0, 1, 2, 3};
α1 : x := (x + 1) mod 4
α2 : x := (x + 2) mod 4
To prove the dependence, consider the projection of a Cartesian product to its first component:
↓1 : A × B → A,  ↓1((a, b)) = a
Since α1 corresponds to a transition relation that is a total and surjective function on X = {0, 1, 2, 3}, η ↓1 is necessarily a bijection between X and A whenever η satisfies Eq. (1). Consequently, relation S′, which implements α2 in the transformed state space, cannot leave the first component A of the transformed state space A × B unaffected, that is, it cannot be of the form (IA ⊗ s), since α2 is not the identity on X. As pointed out before, the example in Fig.  is semantically independent in the sense of Definition 1. Taking A = {0, 1}, that is, as the domain of variable V from the transformation above, and B = {0, 1, 2, 3}, that is, as the domain of variable W, and taking
η(x, y, z) = ((x ∧ ¬z) ∨ (y ∧ z), 2 ∗ z + (x ∧ z) ∨ (y ∧ ¬z)),
the two relations r = {(a, 0) ∣ a ∈ A} and s = {(b, 2 ∗ (b div 2)) ∣ b ∈ B} satisfy Eq. (1). These two relations r and s are the relational models of the assignments α1′ and α2′ above.
Mutual Independence Between More than Two Fragments
Semantic independence is easily extended to arbitrarily many program fragments. To avoid subscripts, consider only three program fragments given as relations R, S, T on the state set X, which illustrates the general case sufficiently. These relations are called semantically independent if they correspond to relations r, s, t on three nontrivial sets A, B, C, respectively, such that there is a function
η : X → A × B × C
that can be used to represent R, S, and T by the product relations
r ⊗ IB ⊗ IC,  IA ⊗ s ⊗ IC,  IA ⊗ IB ⊗ t,
respectively. For R, this representation has the form R η = η (r ⊗ IB ⊗ IC). As before, the requirement is that R η η−1 = R. Relations S and T are represented analogously and are subject to corresponding requirements. These conditions are considerably stronger than requiring the three pairwise independences of R with S T, of S with R T, and of T with R S. The individual checks of the latter may use up to three different transformations η1 to η3 in Eq. (1). This freedom of choice does not capture semantic independence adequately.
Independence Inside a Single Statement
Up to this point, an underlying assumption has been that a limit of the granularity of parallelization is given by the imperative program fragments considered. At the finest possible level of granularity, these are individual assignment statements. If some such assignment statement contains an expression whose evaluation one would like to parallelize, one must break it up into several assignments to intermediate variables and demonstrate their independence. One can go one step further and ask for the potential parallelism within a single statement. The question then is whether there exists a transformation that permits a slicing of the statement into various independent ones in order to obtain a parallel implementation. An appropriate condition for semantic independence within a single statement given as a relation R ⊆ X × X is that X can be embedded into a nontrivial Cartesian product A × B via a mapping η : X → A × B such that
R η = η (r ⊗ s),  R η η−1 = R    (2)
for appropriate relations r ⊆ A × A and s ⊆ B × B. The generalization to more than two slices is straightforward.
Disjoint Parallelism and Its Limitations One notable consequence of the presented notion of semantic independence is that, due to accessing disjoint pieces of memory, the concurrent execution of the transformed statements neither requires shared reading nor shared writing. These operations are eliminated
by the transformation or, if the context requires a state space representation in the original format X rather than the transform Y, localized in code preceding the fork of the concurrent statements or following their join. This is illustrated with our initial example, where the shared reading and writing of x and y as well as the shared reading of z present in the original statements
α1 : (x, y) := (x ∧ z, y ∧ ¬z)
α2 : (x, y) := (x ∧ ¬z, y ∧ z)
disappear in the semantically equivalent form
α1′ : V := 0
α2′ : W := 2 ∗ (W div 2)
If the context requires the format of X, one will have to include the transformations η and η−1 in the implementation. For the example, this means that the code
var V : {0, 1}, W : {0, 1, 2, 3};
β1 : V := (x ∧ ¬z) ∨ (y ∧ z)
β2 : W := 2 ∗ z + (x ∧ z) ∨ (y ∧ ¬z)
(or any alternative code issuing the same side effects on V and W) has to be inserted before the concurrent statements α1′ and α2′ and that the code
γ1 : x := if (W div 2) = 1 then (W mod 2) else V
γ2 : y := if (W div 2) = 0 then (W mod 2) else V
γ3 : z := W div 2
endvar V, W
has to be appended behind their join. The keyword endvar indicates that, after the backtransformation, variables V and W are irrelevant to the remainder of the program. (The backtransformation γ3 of z from the transformed to the original state space could be omitted, since the original statements do not have a side effect on z, that is, γ3 turns out not to modify z.) Whether implementations such as β1 and β2 of the projections η ↓1 and η ↓2, which prepare the state spaces for the two parallel statements to be sparked, can themselves be executed in parallel depends on the ability to perform shared reading. Since these transformations have no side effects on the common state space X and only local ones on A or B, respectively, yet require shared reading on X, they can be executed in parallel if and only if the architecture supports shared reading on X – yet not necessarily on A or B, which need not
be accessible to the respective statement executed in parallel. Conversely, a parallel execution of the reverse transformation does, in general, require shared reads on A and B, yet no shared reads or writes of the individual variables spanning X, as illustrated by the statements γ1, γ2, and γ3 above.
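On a shared-memory machine, the fork/join structure just described can be written down directly. The sketch below is my own rendering of the example using POSIX threads: β1/β2 run sequentially, α1′ and α2′ are forked, and the join is followed by the backtransformation γ1–γ3.

    #include <pthread.h>
    #include <stdio.h>

    static int x = 1, y = 0, z = 1;   /* original state space X            */
    static int V, W;                  /* transformed state space A × B     */

    static void *alpha1p(void *arg) { V = 0;           return arg; } /* touches only V */
    static void *alpha2p(void *arg) { W = 2 * (W / 2); return arg; } /* touches only W */

    int main(void) {
        /* beta1, beta2: forward transformation (shared reads of x, y, z) */
        V = (x & !z) | (y & z);
        W = 2 * z + ((x & z) | (y & !z));

        pthread_t t1, t2;                       /* cobegin alpha1' || alpha2' */
        pthread_create(&t1, NULL, alpha1p, NULL);
        pthread_create(&t2, NULL, alpha2p, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* gamma1..gamma3: backtransformation to x, y, z */
        x = (W / 2 == 1) ? W % 2 : V;
        y = (W / 2 == 0) ? W % 2 : V;
        z = W / 2;

        printf("x=%d y=%d z=%d\n", x, y, z);
        return 0;
    }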
Practical Relevance In the parallelization of algorithms, the mapping η transforming the state space is often fixed, depending on the application, and, thus, left implicit. This leads to fixed parallelization schemes which can be applied uniformly, which makes the search for η and the calculation of the simulating relations r and s from the original state transformers R and S unnecessary. Instead, it suffices to justify the parallelization scheme once and for all. It is worth noting that, for many practical schemes, this can be done via the concept of semantic independence. Examples are the implementation of ∗-LISP on the Connection Machine [] and Google’s closely related MapReduce framework [], which are based on a very simple transformation, or the use of residue number systems (RNS) in arithmetic hardware and signal processing [], which exploits a much more elaborate transformation. In the former two cases, transformation η maps a list of elements to the corresponding tuple of its elements, after which all elements are subjected to the same mapping, say f , in parallel. The implementation of η distributes the list elements to different processors. Then, f is applied independently in parallel to each list element and, finally, by the implementation of η− , the local results are collected in a new list. In contrast, RNS, which has gained popularity in signal processing applications during the last two decades, transforms the state space rather intricately following the Chinese remainder theorem of modular arithmetic. In its simplest form, one considers assignments of the form x := y ⊙ z, where the variables are bounded-range, nonnegative integers and the operation ⊙ can be addition, subtraction, or multiplication. RNS transforms the variables in such assignments to vectors of remainders with co-primal bases as follows. (The numbers of a co-primal base have no common positive factors other than .)
Let x, y, z be variables with ranges N
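The passage above breaks off here, but the arithmetic idea can be sketched quickly: with pairwise co-prime bases, an addition (or multiplication) is performed independently on each residue digit, and the result is recovered with the Chinese remainder theorem. The moduli 5, 7, 9 below are an assumed example, not taken from this entry.

    #include <stdio.h>

    #define K 3
    static const int M[K] = { 5, 7, 9 };        /* pairwise co-prime bases, range 5*7*9 = 315 */

    static void to_rns(int x, int r[K]) { for (int i = 0; i < K; i++) r[i] = x % M[i]; }

    static int from_rns(const int r[K]) {       /* Chinese remainder by search (small range) */
        for (int x = 0; x < 5 * 7 * 9; x++) {
            int ok = 1;
            for (int i = 0; i < K; i++) if (x % M[i] != r[i]) ok = 0;
            if (ok) return x;
        }
        return -1;
    }

    int main(void) {
        int y = 123, z = 88, ry[K], rz[K], rx[K];
        to_rns(y, ry); to_rns(z, rz);
        for (int i = 0; i < K; i++)             /* x := y + z, one digit per base, independently */
            rx[i] = (ry[i] + rz[i]) % M[i];
        printf("%d + %d = %d (via RNS)\n", y, z, from_rns(rx));
        return 0;
    }

The per-digit operations never exchange information, which is exactly the semantic-independence property exploited by RNS hardware.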
Related Entries
Connection Machine
Dependences
Dependence Analysis
Parallelization, Automatic
Polyhedron Model
Systolic Arrays
Bibliographic Notes and Further Reading Semantic independence is a generalization of the syntactic independence criterion proposed by Arthur
J. Bernstein in []. This criterion states that independent program fragments are allowed to read shared variables at any time but must not update any shared variable. Lengauer introduced a notion of independence based on Hoare logic in the early s [, ]. It combines a semantic with a syntactic requirement. The semantic part of the independence relation is called full commutativity and permits an arbitrary interleaving of the program fragments at the statement level, the syntactic part is called non-interference and ensures the absence of shared updates within each statement. The independence relation based on a program state graph, as proposed by Best and Lengauer in [], is entirely semantic. Von Stengel [] ported the model to universal algebra and Fränzle et al. [] generalized it shortly thereafter to the notion presented here. Whether a general, automatic, and efficient method of finding η and constructing r and s in any practical programming language exists remains an open question. Zhang [] has ported the above notion of semantic independence to security. He adopts Definition as the definition of a set of mutually secure transitions. Transformation η defines the local views of users and r and s are the locally visible effects of the globally secure transitions R and S. The transformations of the state space, which make independence visible, reflect the same principle as the coordinate transformations of the iteration space in polyhedral loop parallelization. There are other correspondences between the two models, for example, the way in which one exposes parallelism in a single assignment statement. For a more detailed comparison, see the entry on the polyhedron model.
Bibliography . Bernstein AJ () Analysis of programs for parallel processing. IEEE Trans Electr Comput EC-():– . Best E, Lengauer C () Semantic independence. Sci Comput Program ():– . Dean J, Ghemawat S () MapReduce: simplified data processing on large clusters. Commun ACM, ():–, January . Fränzle M, von Stengel B, Wittmüss A () A generalized notion of semantic independence. Info Process Lett ():– . Hillis WD () The connection machine. MIT Press, Cambridge, MA
. Lengauer C () A methodology for programming with concurrency: the formalism. Sci Comput Program ():– . Lengauer C, Hehner ECR () A methodology for programming with concurrency: an informal presentation. Sci Comput Program ():– . Omondi A, Premkumar B () Residue number systems – theory and implementation, volume of Advances in computer science and engineering. Imperial College Press, United Kingdom . von Stengel B () An algebraic characterization of semantic independence. Info Process Lett ():– . Zhang K () A theory for system security. In: th IEEE computer security foundations workshop. IEEE Computer Society Press, June , pp –
Semaphores Synchronization
Sequential Consistency Memory Models
Server Farm Clusters Distributed-Memory Multiprocessor
Shared Interconnect Buses and Crossbars
Shared Virtual Memory Software Distributed Shared Memory
Shared-Medium Network Buses and Crossbars
Shared-Memory Multiprocessors
Luis H. Ceze
University of Washington, Seattle, WA, USA
Synonyms Multiprocessors
Definition A shared-memory multiprocessor is a computer system composed of multiple independent processors that execute different instruction streams. Using Flynn's classification [], an SMP is a multiple-instruction multiple-data (MIMD) architecture. The processors share a common memory address space and communicate with each other via memory. A typical shared-memory multiprocessor (Fig. ) includes some number of processors with local caches, all interconnected with each other and with common memory via an interconnection (e.g., a bus). Shared-memory multiprocessors can be either symmetric or asymmetric. Symmetric means that all processors that compose the system are identical. Conversely, asymmetric systems have different types of processors sharing memory. Most multicore chips are single-chip symmetric shared-memory multiprocessors [] (e.g., Intel Core Duo).
Discussion Communication between processors in shared-memory multiprocessors happens implicitly via read and write operations to a common memory address. For example, a data producer processor writes to memory location A, and a consumer processor reads from the same memory location:
P1           P2
-----        -----
...          ...
A = 5        ...
...          ...
...          tmp = A
...          ...
Shared-Memory Multiprocessors. Fig. Conceptual overview of a typical shared-memory multiprocessor: processors 0 through N−1, each with a data cache, connected to memory via an interconnection network
Processor P1 wrote to location A and later processor P2 read from A and therefore read the value "5," written by P1, into the local variable "tmp." One way to think about the execution of a parallel program in a shared-memory multiprocessor is as some global interleaving of the memory operations of all threads. This global interleaving of instructions is nondeterministic, as there is no guarantee that a given program executed multiple times with the same input will lead to the same interleaving and therefore the same result. Since communication happens nondeterministically, programs need to ensure explicitly that when communication happens, the data communicated is in a consistent state. This is done via synchronization, e.g., using mutual exclusion to prevent two threads from manipulating the same piece of data simultaneously. One of the key components of a shared-memory multiprocessor is the cache coherence protocol. The cache coherence protocol ensures that the caches in all processors hold consistent data. The main job of the coherence protocol is to keep track of when cache lines are written to and to propagate changes when that happens. Therefore, the granularity of data movement between processors is a cache line. There are many trade-offs in how cache coherence protocols are implemented, for example, whether updates should be propagated as they happen or only when a remote processor tries to read the corresponding data. Another very important aspect of shared-memory multiprocessors is the memory consistency model, which determines when (in what order) memory operations are made visible to the processors in the system [].
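A common way to make the communication above well defined is mutual exclusion around the shared location. The fragment below is a standard POSIX-threads illustration of the idea, not code from any of the referenced systems; note that which value the consumer observes is still nondeterministic, but it is always a consistent one.

    #include <pthread.h>
    #include <stdio.h>

    static int A;                                  /* shared memory location   */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *producer(void *arg) {
        pthread_mutex_lock(&lock);
        A = 5;                                     /* P1: write under the lock */
        pthread_mutex_unlock(&lock);
        return arg;
    }

    static void *consumer(void *arg) {
        pthread_mutex_lock(&lock);
        int tmp = A;                               /* P2: read under the lock  */
        pthread_mutex_unlock(&lock);
        printf("tmp = %d\n", tmp);                 /* 0 or 5, never a torn value */
        return arg;
    }

    int main(void) {
        pthread_t p1, p2;
        pthread_create(&p1, NULL, producer, NULL);
        pthread_create(&p2, NULL, consumer, NULL);
        pthread_join(p1, NULL);
        pthread_join(p2, NULL);
        return 0;
    }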
For the actual physical memory organization (and the interconnection network), it is common to have multiple partitions physically spread over the processors. This means that each partition is closer to some processors than to others. Therefore, some regions of memory have faster access than others. Such an organization is called non-uniform memory access, or NUMA. ccNUMA refers to NUMA multiprocessors with private caches per processor and cache coherence. Given the generally slower improvement in single-thread performance, processor manufacturers are now scaling raw compute performance in the form of more cores (or processors) on a single chip, forming a symmetric shared-memory multiprocessor. Another recent trend points towards using specialized processors to further improve performance and energy efficiency, leading towards asymmetric shared-memory multiprocessor systems.
Related Entries
Cache Coherence
Cedar Multiprocessor
Locality of Reference and Parallel Processing
Memory Models
Race Conditions
Race Detection Techniques
Bibliographic Notes and Further Reading There are many notable shared-memory multiprocessor systems worth reading about, starting from mainframes
in the s (e.g., Burroughs B), many machines built in academia (e.g., UIUC CEDAR, Stanford Flash, Stanford DASH), and many systems manufactured and sold by Sequent and DEC (now defunct), as well as IBM, Sun, SGI, Fujitsu, HP, among several others.
SHMEM OpenSHMEM - Toward a Unified RMA Model
SIGMA-
Kei Hiraki
The University of Tokyo, Tokyo, Japan
Synonyms Dataflow supercomputer
Definition SIGMA- is a large-scale computer based on a fine-grain dataflow architecture, designed to show the feasibility of fine-grain dataflow computers for highly parallel computation, compared with conventional von Neumann computers. The SIGMA- project was started in , and the processing element (PE) started working in . SIGMA- was designed and built at the Electrotechnical Laboratory (ETL for short), Ministry of International Trade and Industry, Japan. SIGMA- is still the largest-scale dataflow computer built so far, and it achieved more than  Mflops as a maximum measured performance of the total system. As the language for SIGMA-, ETL developed the Dataflow-C language as a subset of the C programming language. Dataflow-C is a single assignment language that can compile C-like source programs to highly parallel executable dataflow machine codes (Fig. ).
Architecture
Organization The global organization of the SIGMA- is shown in Fig. . The SIGMA- consists of processing elements, structure elements, local networks, a global network, maintenance microprocessors, a service processor, and a host computer (Figs. and ). The processing elements and structure elements are divided into groups, each of which consists of four processing elements, four structure elements, and a local network. All groups are connected via the global network. The global network consists of a twostage omega network with a new adaptive load distribution mechanism. Its transfer rate is G bytes per second. A local network is a ten-by-ten crossbar packet switching network, eight ports of which are used for communication between processing elements and structure elements within a group and two to interface the global network. The transfer rate of a local network is M bytes per second. The whole system is synchronous and operates under a single clock (Fig. ). Figure illustrates a processing element and a structure element. A processing element consists of an input buffer, a matching memory with a waiting matching function, an instruction memory, and a link register file. A processing element executes all the SIGMA- instructions except structure-handling instructions such as read, white, allocate, and deallocate used to access the structure memory in a structure element. A processing element works as a two-stage pipeline: the firing stage and the execution stage. In the firing stage, operand matching and instruction fetching are performed simultaneously. Input packets, stored in the input buffer, are sent to the matching memory. The matching memory enables the execution of instructions with two input operands. Successfully matched packets are sent to the execution stage with the instructions fetched from the instruction memory. If the match fails, the packet is stored in the matching memory and the instruction fetched is discarded. Chained hashing
SIGMA-. Fig. SIGMA- system
SIGMA-. Fig. Global architecture of the SIGMA- (PE: an instruction-level dataflow processing element; SE: a structure processing element)
hardware is used to speed up the operand matching operation, where a first-in-first-out queue is attached to each key to enable multiple entries with the same key Table . The link register is used to identify a parallel activity in a program and hold a base address of a procedure. In the execution stage, instructions are executed in parallel with packet identifier generation, where a packet identifier keeps the destination address (next instruction address) as well as a parallel activity identifier (a procedure identifier and an iteration counter). These
S operations are similar to those in a conventional computer. After combining the result value of the execution unit with the packet identifier, an output packet is produced and sent to the input buffer of the same or another processing element via the local network. If an array is treated as a set of scalar values, the matching memory provides automatic synchronization and space management. However, heavy packet traffic causes a serious bottleneck at the matching memory, since each element of the array must be handled separately. This is the reason additional structure elements
SIGMA-. Fig. SIGMA- Processing element (optional)
SIGMA-. Fig. SIGMA- Structure element (optional)
SIGMA-. Fig. SIGMA- Local network (optional)
SIGMA-. Fig. Processing element (PE) and SE of the SIGMA-
are introduced in the SIGMA-. A structure element consists of an input buffer, a structure memory, two flag logic functions, and a structure controller with a register file. The flag logic function is capable of test
and setting all flags in an area arbitrarily specified in a single clock. The structure controller is responsible for waiting queue control and memory management. A structure element handles arrays, which are the most
SIGMA-
SIGMA-. Table Hardware specification of the SIGMA-
Technology: CMOS Gate-array, SRAM, DRAM
Clock:  MHz
Input buffer: K words ( bits/word)
Matching memory: K words ( bits/word)
Instruction memory: K words ( bits/word)
Structure memory: K words ( bits/word)
Total amount of memory:  MByte
Total no. of gates:  gates
Total no. of ICs:
Total no. of PE/SE:
important data structure involved in numerical computations. Each structure cell is composed of a -bit data (data type and payload data) word and two flag bits used for read and write synchronization. Multiple read before write operations are realized by a waiting queue attached to each cell. The buddy system implemented in microprogramming is used to increase the speed of allocating and deallocating structures. The architectural features of the SIGMA- are designed to achieve high speed by decreasing the parallel processing overheads in dataflow computers. For example, the short pipeline architecture enables quick response to small-sized programs with low parallelism. It was anticipated that the matching memory and the communication network would play a significant role in the system design from the viewpoint of synchronization and data transmission overhead. Besides the newly proposed mechanisms, high-speed and compact gate-array LSI elements have been developed for this purpose. Testing, debugging, and maintenance are performed by a special purpose parallel architecture []. This maintenance architecture uses a set of conventional von Neumann microprocessors, since maintenance operations generally need history sensitivity utilizing side effects. The architectural features of the SIGMA- are summarized as follows: . A short (two-stage) pipeline in a processing element, reducing the gap between the maximum and average performance . Chained hashing hardware for the matching memory, giving efficient and high-speed synchronization
. An array-oriented structure element, minimizing structure handling overhead . A hierarchical network with a dynamic load distribution function, reducing performance degradation caused by load imbalance . A maintenance architecture for testing and debugging, providing high-speed input and output operations and performing precise performance measurements
Packet and Instruction Architecture The SIGMA- adopts packet communication architecture. Processing elements and structure elements communicate using fixed length packets. Input and output to the host computer and maintenance system are also in packet form. Therefore, hardware initialization and hardware maintenance are carried out in packet form. As shown in Fig. , a packet contains a destination processing element number, a packet identifier and tagged data, and miscellaneous control information. The -bit packet is divided into two -bit segments, a processing element number field ( bits), and a cancel bit. A packet is transferred in two consecutive segments; the first segment contains a processing element or structure element number, a destination address in the processing element or structure element ( bits), and packet type ( bits). The second segment consists of data ( bits), data type ( bits), and one cancel bit. The cancel bit is used for canceling the preceding segment when a malfunction occurs. When the cancel bit is on, the whole packet becomes invalid. A destination address consists of a procedure identifier ( bits) for specifying a parallel activity, a relative instruction address ( bits) within the activity, an iteration counter ( bits), and some control information determining the firing rule in the matching memory ( bits). The iteration counter is utilized to implement the loop construct efficiently. As in the ordinary model of dynamic dataflow, the concatenation of the procedure identifier and iteration counter is used to distinguish parallel activities in a program. The types of packets used are: result of instruction, procedure call and return, interrupt handling and system management, structure operation, and system initiation and maintenance for system resources.
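To make the packet description concrete, here is a plain C rendering of the fields named above. The field names follow the text and the packet-format figure; the types are deliberately ordinary integers rather than exact bit widths, since the bit layout is only partially legible in this copy, so treat the declaration as a schematic assumption rather than the hardware encoding.

    #include <stdint.h>
    #include <stdbool.h>

    /* Schematic SIGMA-1 packet (two consecutive segments). */
    typedef struct {
        /* first segment: packet identifier */
        uint16_t dest_pe;        /* destination PE/SE number                     */
        uint8_t  packet_type;    /* result, call/return, structure op, ...       */
        uint16_t proc_id;        /* link register number / procedure identifier  */
        uint16_t displacement;   /* instruction address within the procedure     */
        uint16_t iteration;      /* iteration counter distinguishing loop instances */
        uint8_t  match_flags;    /* firing-rule control for the matching memory  */
        /* second segment: tagged data */
        uint8_t  data_tag;       /* data type                                    */
        uint32_t data;           /* payload value                                */
        bool     cancel;         /* cancels the preceding segment on malfunction */
    } sigma_packet_t;

The concatenation of proc_id and iteration plays the role of the dynamic-dataflow activity identifier described above.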
SIGMA-. Fig. Packet format (fields: destination PE/SE number, packet type, packet identifier with procedure identifier/link register number, instruction displacement, iteration counter, matching condition flag, data type, payload data value, and cancel bit)
SIGMA-. Fig. Instruction format (fields: operation code, NDEST (number of destination fields), optional immediate data, destination field DF0, and optional destination fields DF1 and DF2)
The minimum length of an instruction is bits (one word) as illustrated in Fig. . The first bits indicate the operation to be performed ( bits) and the number of destination address fields ( bits). The next bits indicate the destination address of the result, which does not contain dynamically assigned information such as a procedure identifier and iteration counter. An immediate data operand can be located at the header part of each instruction. The maximum number of destination address fields is three, using two words. When an instruction contains an immediate data (constant) operand, another word is required. The objective of the SIGMA- instruction set design is to execute efficiently application programs that cannot be efficiently executed on a vector-type supercomputer or a parallel von Neumann computer. Low efficiency in these program executions is caused by frequent procedure calls and returns, large amount of scalar arithmetic operations, and wide fluctuations of parallelism. Fine-grain parallelism has the advantage
over coarse-grain parallelism in overcoming this inefficiency. As a result, instructions on a processing element and instructions on a structure element have been implemented (Fig. ).
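For illustration only, the instruction layout described above can be viewed as the following C structure. The three-destination limit and the optional immediate follow the text; the names and types are assumptions, not the machine's encoding.

/* Illustrative sketch of the variable-length SIGMA-1 instruction layout. */
#include <stdint.h>

struct sigma_instruction {
    uint16_t opcode;          /* operation to perform */
    uint8_t  ndest;           /* number of destination fields (up to three) */
    uint8_t  has_immediate;   /* nonzero if an immediate operand word follows */
    uint32_t immediate;       /* optional immediate (constant) operand */
    uint32_t dest[3];         /* destination fields DF0..DF2; these carry no
                                 dynamic information such as a procedure
                                 identifier or iteration counter */
};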
Software The DataFlow C (DFC) compiler is implemented using the language development tools of the UNIX operating system. The emphasis was not on efficiency but on ease of implementation. Efficiency and compactness are still to be considered. Basically, DFC is a subset of the C language with a single assignment rule. Therefore, assignment to a variable is generally allowed only once in each procedure. However, the loop construct, realized by the for statement, is an exception. A non-single assignment expression can be written as the third argument of the for statement as shown in the following summation program. The argument n of the main program is given by a trigger packet invoking the program.
System Performance
SIGMA-. Fig. SIGMA- PEs and SEs (optional)
#define N 10
main(n) int n;
{
    int i;
    float a[N], retval, sigma();
    for (i = 0; i < 5; i = i + 1)
        a[i] = i;
    retval = sigma(n, a);
    print(retval);
}

float sigma(n, a) int n; float a[];
{
    int i;
    float s;
    /* The accumulation appears as the third argument of the for statement;
       the remainder of this function was truncated in the source and is a
       reconstruction based on the surrounding description. */
    for (i = 0, s = 0; i < n; s = s + a[i], i = i + 1)
        ;
    return (s);
}
The whole system operates synchronously at a MHz clock frequency. The execution time of an instruction is determined by the maximum execution time of the two pipeline stages. Single operand instructions are executed every two cycles and two operand instructions every three cycles. The firing stage normally takes three cycles. Since non-division arithmetic operations take at most two cycles, the execution stage completes most of the instruction execution in two cycles. This implies that the maximum speed of a processing element is about MIPS and MFLOPS. For a structure element, each read and write instruction is carried out in two cycles. Since instructions for allocating and deallocating structures are implemented by a microprogrammed buddy system, their estimated execution times are between and cycles. The network transfers a packet every two cycles. Therefore, structure elements can utilize the maximum performance of the network. Performance is measured in terms of program execution time rather than raw machine cycles. The measurement study showed the following:
. There is no difference in computing ability between a single dataflow processing element and a von Neumann processor when both are constructed using the same device technology.
. A speed degradation of % occurs when structure elements are adopted. This is due to the extra index and address calculation for structure handling.
. The single processing element performance is almost constant over the vector length, because of the short pipeline architecture.
. The speedup ratio from a single processing element to the four processing element organization is . without structure elements and . with structure elements.
This performance evaluation shows that the average performance of the processing element organization is more than MFLOPS. For example, the SIGMA- computes matrix multiplication at Mflops.
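As a back-of-the-envelope illustration of how the cycle counts above translate into a peak rate, the following sketch uses placeholder values; the actual clock frequency and cycle counts are the ones quoted in the text, not these.

/* Peak-rate arithmetic sketch with assumed placeholder parameters. */
#include <stdio.h>

int main(void) {
    double clock_mhz = 10.0;        /* assumed clock frequency, for illustration */
    double cycles_per_instr = 2.0;  /* e.g., a single-operand instruction */
    double peak_mips = clock_mhz / cycles_per_instr;
    printf("peak rate: %.1f MIPS at %.1f MHz, %.1f cycles/instruction\n",
           peak_mips, clock_mhz, cycles_per_instr);
    return 0;
}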
Discussion Table shows the development history of major MIMD parallel computers and dataflow computers.
SIGMA-. Table History of dataflow parallel processors
Year | System name | #PE | Word length | Architecture
 | DDM | PE | | Static dataflow
 | TI DDP | PE | | Static dataflow
 | LAU | PE | | Static dataflow
 | Manchester dataflow machine | PE | | Dynamic dataflow
 | OKI DDDP | PE | | Dynamic dataflow
 | NTT Eddy | PE | | Simulated dynamic dataflow
 | NEC NEDIPS | PE | | Static dataflow
 | Q-p | PE | | Static dataflow, asynchronous
 | SIGMA- Prototype | PE | | Dynamic dataflow
 | SIGMA- | PE | | Dynamic dataflow
 | MIT Monsoon prototype | PE | | Dynamic dataflow
 | ETL EM | PE | | Dynamic/Hybrid dataflow
 | MIT Monsoon | PE | | Dynamic dataflow
 | ETL EM-X | PE | | Dynamic/Hybrid dataflow
Year denotes the date the system started working in its full configuration.
As shown in Table , the SIGMA- is the largest dataflow computer built so far.
SIMD (Single Instruction, Multiple Data) Machines
Cray Vector Computers
Bibliography
. Hiraki K, Nishida K, Sekiguchi S, Shimada T () Maintenance architecture and its LSI implementation of a dataflow computer with a large number of processors. Proceedings of International Conference on Parallel Processing, IEEE Computer Society, University Park, pp –
. Hiraki K, Sekiguchi S, Shimada T () System architecture of a dataflow supercomputer. Proceedings of IEEE Region Conference, IEEE Computer Society, Los Alamitos, pp –
. Hiraki K, Shimada T, Nishida K () A hardware design of the SIGMA- – a data flow computer for scientific computations. Proceedings of International Conference on Parallel Processing, IEEE Computer Society, Los Alamitos, pp –
. Hiraki K, Shimada T, Sekiguchi S () Empirical study of latency hiding on a fine-grain parallel processor. Proceedings of International Conference on Supercomputing, ACM, Tokyo, pp –
. Sekiguchi S, Shimada T, Hiraki K () Sequential description and parallel execution language DFCII dataflow supercomputers. Proceedings of International Conference on Supercomputing, ACM, New York, pp –
. Shimada T, Hiraki K, Sekiguchi S, Nishida K () Evaluation of a single processor of a prototype data flow computer SIGMA- for scientific computations. Proceedings of th Annual International Symposium on Computer Architecture, IEEE Computer Society, Chicago, pp –
Floating Point Systems FPS-B and Derivatives
Flynn's Taxonomy
Fujitsu Vector Computers
Illiac IV
MasPar MPP
NEC SX Series Vector Computers
Vector Extensions, Instruction-Set Architecture (ISA)
SIMD Extensions
Vector Extensions, Instruction-Set Architecture (ISA)
SIMD ISA
Vector Extensions, Instruction-Set Architecture (ISA)
Single System Image
Rolf Riesen, IBM Research, Dublin, Ireland
Arthur B. Maccabe, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Synonyms Distributed process management
Definition A Single System Image (SSI) is an abstraction that provides the illusion that a multicomputer or cluster is a single machine. There are individual instances of the Operating Systems (OSs) running on each node of a multicomputer, processes working together are spread across multiple nodes, and files may reside on multiple disks. An SSI provides a unified view of this collection to users, programmers, and system administrators. This unification makes a system easier to use and more efficient to manage.
Discussion Introduction Multicomputers consist of nodes, each with its own memory, CPUs, and a network interface. In the case of clusters, each node is a stand-alone computer made of commodity, off-the-shelf parts. Instead of viewing this collection of computers as individual systems, it is easier and more economical if users, programmers, and system administrators can treat the collection as a single machine. For example, users want to submit a single job to the system, even if it consists of multiple processes working in parallel. The task of an SSI is to provide the illusion of a single machine under the control of the users and system administrators. The SSI is a layer of abstraction that provides functions and utilities to use, program, and manage the system. An SSI consists of multiple pieces. Some of them are implemented as individual utilities, other pieces are found in system libraries, there are additional processes running to hold up the illusion, OSs are modified, and in some systems hardware enables some of the abstraction. No SSI provides a complete illusion. There are always ways to circumvent the abstraction, and sometimes it is necessary to interact with specific
components of the system directly. Many SSI in daily use provide a limited abstraction that is enough to make use of the system practical, but do not add too much overhead or impede scalability. An OS consists of a kernel, such as Linux, and a set of utilities and tools, such as a graphical user interface and tools to manipulate files. On a single computer, for example, a desktop workstation, the OS manages resources, such as CPUs, memory, disks, and other peripherals. It does that by providing a set of abstractions such as processes and files that allow users to interact with the system in a consistent manner independent of the specific hardware used in that system. For everyday tasks, it should not matter what particular CPU is installed, or whether files reside on a local or a network-attached disk drive. The goal of an SSI is to provide a similar experience to users of a parallel system. That is not that difficult to achieve in a Symmetric Multi-Processor (SMP) where all memory is shared and all CPUs have, more or less, equally fast access to it. The OS for an SMP has additional features to manage the multiple CPUs, but the view it presents to a user is mostly the same that a user of a single-CPU system sees. For example, when launching an application, a user of an SMP does not need to specify which CPU should run it. The OS will select the least busy CPU and schedules the process to run there. There are additional functions that an SMP OS provides, for example, to synchronize processes, pin them to specific CPUs, and allow them to share memory. Nevertheless, the abstraction is still that of a single, whole machine. In contrast, in a distributed memory system, such as a large-scale supercomputer or a cluster, each node runs its own copy of the OS. It is possible that not all nodes have the same capabilities; some may have access to a Graphic Processing Unit (GPU), the CPUs and memory sizes may be different, and some nodes may have connections to wide-area networks that are not available to all nodes in the system. The task of an SSI is to unify these individual components and present them to the user as if they were a single computer. This is difficult and sometimes not desired. Accessing data in the memory or disk of a remote node is much more time consuming than accessing data locally. It therefore makes sense for a user to specify that a process be run on the node where the data currently resides. It also makes
sense to give users the ability to specify that a process be run on a node that meets minimum memory requirements or has a certain type of CPU or GPU installed. On the other hand, for many tasks, or in a homogeneous system, the SSI should hide the complexity of the underlying system and present a view that lets users interact with it as if it were a large SMP. An SSI is meant to make interaction with a parallel system easier and more efficient for three different groups of users: Application users who use the system to run applications such as a database or a simulation, programmers who create applications for parallel systems, and administrators who manage these systems. Each group has its own requirements for, and uses of, an SSI; and the view an SSI presents to each group is different. Of course, there is some overlap since there are tasks that need to be done by members of more than one group. The following sections discuss an SSI from these three different viewpoints. Functions that overlap are discussed in earlier viewpoints.
Application User’s View A user of a parallel system uses it for an application such as computing the path of an asteroid, the weekly payroll, or maintaining the checking accounts of a bank. This type of user does this by running an application which provides the necessary functionality. In some cases, like the banking example, the user never directly interacts with the system. Those users only see the application interface, for example, the banking application that presents information about customers and their account balances. Parallel systems used in science and engineering cater to users who each have their own applications to solve their particular scientific problems. These users do interact with the system and an SSI assists them in making these interactions smooth and reliable. Large systems that are shared among many users usually dedicate several nodes as login nodes. Smaller clusters sometimes let users log into any of the nodes, though this is less efficient for running parallel applications. An SSI gathers these login nodes under the umbrella of a single address. Users remotely log into that address and the SSI chooses the least busy node as the actual login node. Once logged in, users need the ability to launch serial or parallel jobs (applications) and control them.
An SSI chooses the nodes to run the individual processes on and provides utilities to the user to obtain status information, stop, suspend, and resume jobs. While an application is running, users need to interact with it. The SSI transparently directs user input to the application, independent of which node it is running on. Similarly, output from the application is funneled (if it comes from multiple processes) back to the user. This is the same as on any computing system. However, in a distributed system, the SSI needs to provide that functionality even if a process migrates to another node. Input and output data for an application is often stored in files. Users must have the ability to see these files and manipulate them, for example, copy, move, and delete them. This requires a shared file system, since the application must be able to access the same files. Since the user, the application, and the files themselves may all reside on different nodes, a distributed file system is needed that presents a single root to all nodes in the system. That means the full path name of a file, that is, the hierarchy of directories (folders) above it and the file name itself, must be the same when viewed from any node. Peripherals should be visible as if they were connected to the local node. The SSI must provide a view of that peripheral and transfer data to and from it, if the device is not local.
Programmer’s View Programmers of parallel applications have many of the same requirements as application users. They also need to launch applications, to test them for example, and need access to the file system. The same is true for writing and compiling applications. For debugging, the SSI may need to route requests from the debugger to processes on another node to suspend them and to inspect their memory and registers. If this functionality is embedded in the debugger itself, then this is no longer a requirement for the SSI, but it does require services that are part of the OS. Programmers do have additional requirements that an SSI provides. The programs they write need to access low-level system services provided by the OS. That is done through system calls that allow applications to trap into an OS kernel to obtain a specific service. Typical services provided are calls to open a file for reading and
writing, creating and starting a new process, and interacting through Inter-Process Communication (IPC) with another process. In a distributed system, these functions become more complex. A process creation request in a desktop system is an operation carried out by the local OS. In a distributed system managed by an SSI, process creation may mean interaction with the OS of another node. If the original process and the new process need to communicate, then data has to move between nodes instead of just being copied locally into another memory location. For an application that is not aware of the distributed nature of the system it is running on, the SSI must provide this functionality so that it is transparent to the application. Many applications consist of multiple processes that work together by sharing memory locations. On a single system, that is an efficient way to move data from one process to another and lets multiple processes read the same data. If these processes, in a distributed system, no longer have access to the same memory, this becomes more difficult and expensive than it was before. Some SSI provide the illusion of a shared memory space. This is enabled by specialized hardware or through Distributed Shared Memory (DSM) implemented in software.
Administrator’s View An administrator of a parallel system is responsible for managing user accounts, keeping system software up to date, replacing failed components, and in general keeping the system available for its intended use. Many of the administrator’s tasks are similar to the ones a user performs, for example, manipulating files, but with a higher privilege level. In addition, an SSI should provide a single administrative domain. That means once the system administrator has logged in and has been authenticated by the system, the administrator should be able to perform all tasks for the entire system from that node. The SSI enables killing, suspending, and resuming any process, independent of its location. A system administrator further has the ability to trigger the migration of a process. This is sometimes necessary to vacate a portion of the system so it can be taken down for maintenance or repair. To do this, it must be possible to “down” network links, disks, CPUs, and whole nodes. Once down, they
won’t be allocated or used again until they have been fixed or their maintenance is complete. System administrators also need the ability to monitor nodes and links to troubleshoot problems or look for early signs of failures. An SSI can help with these tasks by providing utilities to search, filter, and rotate logs as an aggregate so an administrator does not need to log in to or access each node individually. Often, each node has its own copy of system files, libraries, and the OS installed on a local disk. This increases performance and enhances scalability. An SSI provides functions to keep the versions of all these files synchronized and allows a system administrator to update the system software on all nodes with a single command.
Implementation of an SSI The previous section looked at the functionality an SSI provides to users, programmers, and system administrators. This section provides more detail on how these abstractions are implemented and at what level of the system hierarchy. Figure shows the layers of a system that contain SSI functions. User level is a CPU mode of operation where hardware prevents certain resource accesses. Applications and utilities run at this level. At the kernel level, all (local) resources are fully accessible. The OS kernel runs at this privilege level and manages resources on behalf of the applications running above it. At the bottom of the hierarchy is the hardware. Some SSI features can only be implemented at that level. If the necessary hardware is missing, such a feature will not be available at all, or only inefficiently in the form of a software layer. Shared memory on a system without hardware support for it is such an example.
Single System Image. Fig. Privilege and abstraction layers in a typical system: applications and runtime/utilities run at the unprivileged (user) level, the kernel runs at the privileged (kernel) level, and the hardware sits below
One level above the hardware resides the OS kernel. Some SSI functionality, for example, process migration, needs to be implemented at that level. Above the OS are the runtime layer and system utilities. This layer is sometimes called middleware. A batch and job control system, which determines when and where (on which nodes) a job should run, can be found at this level. Of course, utilities at that layer need the support of the OS and ultimately the underlying hardware to do their job. Above that are the applications that make use of the system. Some SSI functionality, for example, application-level checkpoint and restart, can be implemented at that level, for example, in a library. Table shows various SSI features and at what level they are implemented. Some can be implemented at more than one level. Only the levels that have been traditionally used, and where a feature is most efficient, are marked.
Shared Memory Sharing memory that is distributed among several nodes requires hardware support to work well. Implementations in software of so-called Distributed Shared Memory (DSM) systems have been done, but they all suffer from poor performance and scalability.
Single System Image. Table Features of an SSI and at what layer of the system hierarchy they are most commonly implemented (implementation layers: HW, Kernel, Runtime, App.). Features listed: shared memory; up/down system components; debugger support; process migration; stdin/stdout; file system; system calls; checkpoint/restart; batch system; login load leveling; system log management; software maintenance.
Management of System Components In larger systems, it is desirable to mark components as being down so they won’t be used anymore. Marking a node down, for example, should prevent the job scheduling system from allocating that node. In some systems, it is possible to mark links as being down and have the system route messages around that link. This is useful for system maintenance. Components that need repair won’t be used until the system is brought down during the regular maintenance period, when all failed components can be repaired at the same time. Depending on the sophistication of this management feature, it requires hardware and kernel support. Simply telling the node allocator to avoid certain nodes can be done at user level in the runtime system. Debugger Support A debugger is a user-level program like any other application. However, in order to signal and control processes, and to inspect their memory and registers, the debugger needs support from the OS. Some SSI modify kernels such that standard Unix facilities, for example, signals and process control, are extended to any process in the system, independent of which node it is currently running on. In that case, the debugger does not need to change, although adding support to debug multiple processes at the same time makes it a more useful tool. If the OS is not capable of providing these services, they can be implemented in the debugger with the help of runtime system features such as daemons and remote procedure calls (RPC). Process Migration Migrating a process to another node is useful for load balancing and as a precautionary measure before a node is marked down. Migrating a process requires that all data structures, such as the Process Control Block (PCB), which the OS maintains for this process, migrate in addition to the code and user-level data structures the process has created. A completely transparent migration also requires that file handles to open files and those used for IPC are migrated as well. That, in turn, requires that some information remains in the local OS about where the process has migrated to. (Or that all connections to the migrating process are updated before the migration.) For example, another
process on the local node which was communicating with the migrated process via IPC will continue to do so. When data arrives for the migrated process, the OS needs to forward the data to the new location. If a process migrates multiple times, forwarding becomes more expensive and convoluted.
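A minimal sketch of the forwarding idea just described, assuming a simple table kept by the original node; this is illustrative only and not taken from any particular SSI implementation.

/* Forwarding table kept on the node a process migrated away from. */
#include <stddef.h>

struct forward_entry {
    int pid;        /* process that migrated away */
    int new_node;   /* node it now runs on */
};

#define MAX_FORWARDS 1024
static struct forward_entry forwards[MAX_FORWARDS];
static size_t nforwards;

/* Returns the node a message for 'pid' should be forwarded to, or -1 if
 * the process is still local and the message can be delivered here. */
int route_for_pid(int pid) {
    for (size_t i = 0; i < nforwards; i++)
        if (forwards[i].pid == pid)
            return forwards[i].new_node;
    return -1;
}

void record_migration(int pid, int new_node) {
    if (nforwards < MAX_FORWARDS)
        forwards[nforwards++] = (struct forward_entry){ pid, new_node };
}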
Stdin/Stdout The default channels for input to and output from a process under UNIX are called stdin and stdout. If a user is logged into one node and starts a process on another, then the OS transfers data between these two nodes on behalf of the user. If an application consists of multiple processes, the OS ensures that output from all of them gets funneled through the stdout channel to the user. Data from the stdin channel usually does not get replicated, and only the first process that reads it will receive a copy. The OS must maintain these data channels even if processes migrate to other nodes; otherwise the illusion of a single system image would be broken. File System In a system managed by an SSI, all processes should see the same directory (folder) and file hierarchy. This feature, called shared root, is what is available on a desktop system where multiple processes can access files on the local disk. In a parallel system, files are often distributed among multiple disks to enhance performance and scalability. Distributing files like that complicates things, since metadata (information about files and directories) may not be located on the same device as the files themselves. Yet, metadata must be kept up-to-date to preserve file semantics. For example, two processes appending data to the end of a file must not overwrite each other’s data. That requires that the file pointer always points to the true end of the file. Some files which the OS frequently accesses should be local to the node. However, some files, such as /etc/passwd, should be the same on all nodes, and a system administrator should have the ability to update only one of them and then propagate the changes to all copies. On the other hand, some (pseudo) files that describe local devices, for example, need to be unique on each node. If a process migrates and had one of these files open, it must continue to interact with the device on the original node, even though a file with the same name exists on the new node as well. Yet for other files, temporary files
for example, it may be desirable that they are only visible locally. File systems that can handle all these cases are complex and need to be tightly integrated with the OS.
System Calls System calls are functions an application uses to interact with the OS. Calls for opening, closing, reading, and writing a file are typical. So are calls to create and control other processes. An SSI that wants to maintain these calls unchanged in a distributed system requires changes to the OS kernel. For example, process creation may not be strictly local anymore, if the system tries to balance load by creating new processes on less busy nodes. Checkpoint/Restart Long-running applications often do not finish, due to an interruption by a faulty component or a software problem. This is especially true for parallel applications, where it is more likely that at least one of the many nodes it runs on will fail. To combat this problem, an application occasionally writes intermediate data and its current state to disk. When a failure occurs, the application is restarted on a different node, or on the same node after it has been rebooted or repaired. The application then reads the checkpoint data written earlier and continues its computation. This can be done by the application itself, in the runtime system, or in the OS kernel. The latter has the nice property that application writers do not need to worry about it, but it has the potential for waste, since the kernel cannot know what data the application will need to restart. Therefore, it has to checkpoint the complete application state and all of its data. Checkpoint/restart needs to be integrated with the SSI system, even if checkpointing is done at the application level. Batch System Many parallel systems use a batch scheduler to increase throughput of jobs through the system and to allow it to continue processing unattended, for example, at night or on weekends. Users submit scripts that describe the application they want to run, how much time they require, and how many nodes the application needs. The batch scheduler uses the job priority and knowledge about available nodes to select and run the jobs currently enqueued for processing.
A batch system is one of the most common SSI components in a cluster. It allows the user to treat the collection of machines as a single system and use a single command to submit jobs to it. A batch system provides additional commands to view status information about the currently enqueued and running jobs and to manage them. All these commands treat jobs as if they were a single process on the local system.
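The selection step of such a batch scheduler can be sketched as follows; the job record and the policy here are simplified assumptions, not the interface of any specific batch system.

/* Pick the highest-priority enqueued job whose node request currently fits. */
struct job {
    int id;
    int priority;         /* larger means more urgent */
    int nodes_requested;
};

/* Returns the index of the job to run next, or -1 if none fits. */
int select_next_job(const struct job *queue, int njobs, int free_nodes) {
    int best = -1;
    for (int i = 0; i < njobs; i++) {
        if (queue[i].nodes_requested > free_nodes)
            continue;                           /* does not fit right now */
        if (best < 0 || queue[i].priority > queue[best].priority)
            best = i;
    }
    return best;
}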
Login Load Leveling Parallel systems usually have more than one node where users can log in. In a partitioned system, there is a service partition comprised of the nodes dedicated to user login and common tasks such as editing files and compiling applications. Parallel jobs are then run on nodes that have been assigned to the compute partition. Often, nodes in the service partition are of a different type; they may have more memory or local disks attached. Smaller clusters may not use partitioning, though that is less efficient. In either case, the SSI should present a single address to the outside world. When a user logs in at that address, the SSI picks one of the service nodes for that user to log in. The choice is usually made based on the current work load of each node in the service partition. As far as the user is concerned, there is only one login node. System Log Management A lot of information about the current state of the system is stored in log files that are unique to each node of a parallel system. System administrators consult these logs to diagnose problems and to detect and prevent faults before they occur (e.g., warnings about the rising temperature inside a CPU). An SSI should provide functions to aggregate these individual files and let system administrators search through, purge, or archive them with a single command. Software Maintenance The software versions of the system utilities, libraries, and applications installed on each node must be the same system-wide or at least be compatible with each other. As with system log management, an SSI should provide functions to system administrators that allow them to upgrade and check software packages on multiple nodes with a single command.
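The login load-leveling policy described above amounts to mapping one public address to the least busy node in the service partition. A minimal sketch, with an assumed load metric, is shown below.

/* Choose the least loaded service node for an incoming login. */
struct service_node {
    int    node_id;
    double load;          /* assumed metric, e.g., recent run-queue length */
};

int pick_login_node(const struct service_node *nodes, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (nodes[i].load < nodes[best].load)
            best = i;                         /* least busy service node */
    return nodes[best].node_id;
}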
Imperfect SSI The previous sections listed the features an SSI must provide and a glance at what implementing them entails. Many of these features are independent components and useful even when none of the other components are available. In fact, no SSI running on more than a couple hundred of nodes implements all features described thus far. The reason for that is scalability. Some features work well when used on eight nodes, but become unusably slow when extended much beyond that number. Among the most costly features described earlier is process migration, especially if an SSI tries to achieve complete transparency. Once a process starts interacting with other processes on a node and obtains file handles to node-local resources, it becomes difficult to migrate that process. After migration, a forwarding mechanism must be in place to route information to the new location of the migrated process. If the original node retains information about the new location of the migrated process, a failure of that node will disrupt the flow of information. If no information is kept at the original location, then all processes that are communicating with the migrated process must be told about the new location. If migration is frequent, and interaction among processes is high, this becomes very costly. Furthermore, it does not always work, for example, if the migrated process was using a peripheral that is only available on the original node. Another high-cost SSI item is shared memory. Even when supported by hardware, the illusion of a single, large memory starts to fall apart as the number of nodes (CPUs) increases. Not all is lost, though. Some systems deploy additional hardware to assist with some SSI features. For example, some systems have an additional network and disks dedicated to system software maintenance. Furthermore, complete transparency is not always required, and users are willing to use different commands and utilities when dealing with parallel processes, in return for better performance and scalability. For example, using a command such as qsub to submit a parallel job does not seem a high burden compared to the transparent method of just naming the application on the command line to start it. The UNIX command ps is used to obtain process status information. In a transparent SSI system, it would return that information even if the process is not local, or return information
about many processes, if they are part of a parallel application. In contrast, users often only want to know whether their job is running and whether it is making progress. The batch system has a command that can answer at least the former question, and a ps output for hundreds of processes scattered throughout the system would not provide any additional useful information. Therefore, even a partial SSI can be of benefit. Application users benefit most from a batch scheduling system. This also assists system administrators, who need a way to efficiently manage the hardware and software of the system. Most other functionality, especially the less scalable features, is not strictly necessary to make a parallel system useful. That may mean additional utilities that expose the distributed aspects of the system and modifications to applications to allow them to work in parallel and with greater efficiency.
Future Directions Desktop systems with two or four CPU cores are common now. High-performance systems will be the first to receive CPUs with tens of cores, but those CPUs will soon be common. Desktop OSs will adapt to these systems and provide SSI features to manage and use the cores in a single system. Users will expect to see these features in parallel systems as well, which will drive the development of SSI features and integration into parallel OSs. At the very high-end, exascale systems are on the horizon. Implementing SSI for these systems will be a challenge due to the size of these systems. Displaying the status of a hundred-thousand nodes in a readable fashion without endless scrollbars is only feasible with intelligent software that zooms in on interesting events and features of the system. Commercial tools for applications that span that many nodes do not exist yet. Simple tools have a better chance at scaling to these future systems, but managing and debugging such large applications and systems will remain a challenge.
Related Entries Checkpointing Clusters Fault Tolerance Operating System Strategies Scalability
Bibliographic Notes and Further Reading Overviews of SSI are presented in [], Chapter of [], and []. The descriptions of actual systems are out of date or the systems are not in use anymore. Nevertheless, the fundamental features and issues described are still valid. Many papers discuss specific aspects of SSI. For example, [] and [] discuss partitioning and present job launch mechanisms. In [], the authors discuss specific SSI challenges for petascale systems. SSI tools to make system administration easier are presented in [, , ]. Process migration is discussed in detail in [] and specific implementations appear in MOSIX [], BProc [], and IBM’s WPAR [], among others. The survey in [] is a little bit dated, but it gives a good description of so-called worms, which are processes specifically designed to migrate. Implementations of SSI, most of them partial, abound even before the term SSI was coined. Examples include LOCUS [], Plan [], MOSIX [], and Kerrighed [].
Bibliography
. Barak A, La’adan O () The MOSIX multicomputer operating system for high performance cluster computing. Future Gener Comput Syst (–):–
. Brightwell R, Fisk LA, Greenberg DS, Hudson T, Levenhagen M, Maccabe AB, Riesen R () Massively parallel computing using commodity components. Parallel Comput (–):–
. Buyya R, Cortes T, Jin H () Single system image. Int J High Perform Comput Appl ():–
. Frachtenberg E, Petrini F, Fernandez J, Pakin S, Coll S () STORM: lightning-fast resource management. In: Supercomputing ’: proceedings of the ACM/IEEE conference on supercomputing, Baltimore. IEEE Computer Society, Los Alamitos, pp –
. Greenberg DS, Brightwell R, Fisk LA, Maccabe AB, Riesen R () A system software architecture for high-end computing. In: SC’: high performance networking and computing: proceedings of the ACM/IEEE SC conference, San Jose, Raleigh, – Nov . ACM/IEEE Computer Society, New York
. Hendriks E () BPROC: the Beowulf distributed process space. In: ICS ’: proceedings of the th international conference on supercomputing, pp –. ACM, New York
. Hendriks EA, Minnich RG () How to build a fast and reliable node cluster with only one disk. J Supercomput ():–
. Hwang K, Jin H, Chow E, Wang C-L, Xu Z () Designing SSI clusters with hierarchical check pointing and single I/O space. IEEE Concurr ():–
. Kharat S, Mishra R, Das R, Vishwanathan S () Migration of software partition in UNIX system. In: Compute ’: proceedings of the first Bangalore annual compute conference, Bangalore. ACM, New York, pp –
. Milojičić DS, Douglis F, Paindaveine Y, Wheeler R, Zhou S () Process migration. ACM Comput Surv ():–
. Morin C, Gallard P, Lottiaux R, Vallée G () Towards an efficient single system image cluster operating system. Future Gener Comput Syst ():–
. Ong H, Vetter J, Studham RS, McCurdy C, Walker B, Cox A () Kernel-level single system image for petascale computing. SIGOPS Oper Syst Rev ():–
. Pfister GF () In search of clusters: the ongoing battle in lowly parallel computing, nd edn. Prentice-Hall, Upper Saddle River
. Pike R, Presotto D, Thompson K, Trickey H, Winterbottom P () The use of name spaces in Plan . SIGOPS Oper Syst Rev ():–
. Skjellum A, Dimitrov R, Angaluri SV, Lifka D, Coulouris G, Uthayopas P, Scott SL, Eskicioglu R () Systems administration. Int J High Perform Comput Appl ():–
. Smith JM () A survey of process migration mechanisms. SIGOPS Oper Syst Rev ():–
. Walker B, Popek G, English R, Kline C, Thiel G () The LOCUS distributed operating system. In: SOSP ’: proceedings of the ninth ACM symposium on operating systems principles, Bretton Woods. ACM, New York, pp –
Singular-Value Decomposition (SVD)
Eigenvalue and Singular-Value Problems
Sisal
John Feo, Pacific Northwest National Laboratory, Richland, WA, USA
Definition Streams and Iterations in a Single Assignment Language (Sisal) was a general-purpose applicative language developed for shared-memory and vector supercomputer systems. It provided an hierarchical intermediate form, parallel runtime system, optimizing compiler,
and programming environment. The language was strongly typed. It supported both array and stream data structures, and had both iterative and parallel loop constructs.
Discussion Introduction Streams and Iterations in a Single Assignment Language (Sisal) was a general-purpose applicative language defined by Lawrence Livermore National Laboratory, Colorado State University, University of Manchester, and Digital Equipment Corporation in the early s []. Lawrence Livermore and Colorado State developed the language over the next decades. They maintained the language definition, compiler and runtime system, and programming tools, and provided education and user support services. The language version used widely in the s and s was Sisal . []. Three Sisal user conferences were held in , , and [, ]. The Sisal Language Project had four major accomplishments: () IF, an intermediate form used widely in research []; () OSC, an optimizing compiler that preallocated memory for most data structures and eliminated unnecessary copying and reference count operations []; () a highly efficient, threaded runtime system that supported most shared-memory, parallel computer systems []; and () execution performance and scalability equivalent to imperative languages [, ].
Language Definition Sisal’s applicative semantics and keyword-centric syntax made it a simple, easy-to-use, easy-to-read parallel programming language that facilitated algorithm development and simplified compilation. Since the value of any expression depended only on the values of its subexpressions and not on the order of evaluation, Sisal programs defined the data dependencies among operations, but left the scheduling of operations, the communication of data values, and the synchronization of concurrent operations to the compiler and runtime system. Sisal was strongly typed. Programmers specified the types of only function parameters; the compiler inferred all other types. All functions were mathematically sound. A function had access to only its arguments and there were no side effects. Each function invocation was independent and functions did not retain state
between invocations. Sisal programs were referentially transparent and single-assignment. Since a name was bound to one value and not a memory location, there was no aliasing. In general, a Sisal program defined a set of mathematical expressions where names stood for specific values, and computations progressed without state, making the transformation from source code to dataflow graph trivial. Sisal supported the standard scalar data types: boolean, character, integer, real, and double precision. It also included the aggregate types: array, stream, record, and union. All arrays were one-dimensional; multidimensional arrays were arrays of arrays. The type, size, and lower bound of an array were not defined explicitly; rather, they were a consequence of execution. Arrays were created by gathering elements, “modifying” existing arrays, or catenations. A stream was a sequence of values of uniform type. In Sisal, stream elements were accessible in order only; there was no random access. A stream could have only one producer, but many consumers. By definition, Sisal streams were nonstrict – each element was available as soon as it was produced. The Sisal runtime system supported the concurrent execution of the producer and consumer(s). The two major language constructs were for initial and for expressions. The former expression substituted for sequential iteration in conventional languages, but retained single assignment semantics. The rebinding of loop names to values was implicit and occurred between iterations. Loop names prefixed with old referred to previous values. A returns clause defined the results and arity of the expression. Each result was either the final value of some loop name or a reduction of the values assigned to a loop name during loop execution. The following for initial expression returns the prefix sum of A:

for initial
    i := 0;
    x := A[i]
while i < n repeat
    i := old i + 1;
    x := old x + A[i]
returns array of x
end for
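For comparison with imperative code, the following C rendering of the for initial expression above reflects one reading of its semantics (each value of x, including the initial binding, is gathered into the result array); it is an illustration, not compiler output.

/* Imperative rendering of the Sisal prefix-sum loop; assumes A has at
 * least n+1 elements, matching the loop bound used above. */
void prefix_sum(const double *A, int n, double *out) {
    double x = A[0];
    int i = 0;
    out[0] = x;             /* value bound before the first iteration */
    while (i < n) {
        i = i + 1;          /* i := old i + 1 */
        x = x + A[i];       /* x := old x + A[i] */
        out[i] = x;         /* "returns array of x" gathers each value */
    }
}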
Sisal supported seven intrinsic reductions: array of, stream of, catenate, sum, product, least, and greatest. The order of reduction was determinate. The for expression provided a way to specify independent iterations. The semantics of the expression did not allow references to values defined in other iterations. The expression was controlled by a range expression that could be simple, a dot product of ranges, or a cross product of ranges. The next two expressions compute the dot and cross product of two arrays of size n and m:

for i in 0 to n - 1 dot j in 0 to m - 1
    x := A[i] * B[j]
returns sum of x
end for

for i in 0 to n - 1 cross j in 0 to m - 1
    x := A[i] * B[j]
returns array of x
end for

Note that the expression does not indicate how the parallel iterations should be scheduled. The management of parallel threads was the responsibility of the compiler and runtime systems and was highly tuned for each architecture. This separation of responsibility greatly simplified Sisal programs, improved portability, and made the language easy to use.
Build-in-Place Analysis The Sisal compile process comprised eight phases shown in Fig. . The two most important phases concerned build-in-place (IFMEM) and update-in-place analysis (IFUP). IF was a superset of IF. It was not applicative as it included operations (AT-nodes) that directly referenced and manipulated memory. Prior to Sisal, functional programming languages were considered very inefficient. Since the size and structure of aggregate data types are a consequence of execution, without analysis aggregates must be built one element at a time possibly requiring elements to be copied many times as the aggregate grows in size. Moreover, strict adherence to single-assignment semantics requires construction of a new array whenever a single element is modified.
Sisal
Parser
IF1 graph
IF1 optimization
IF2 build-in-place
IF2 update-in-place
IF2 partition
Code Generation
CC compile
Sisal. Fig. OSC phases
Build-in-place analysis attacked the incremental construction problem []. The algorithm had two passes. Pass visited nodes in data flow order and built, where possible, an expression to calculate an array’s size at run time. The expression was a function of the semantics of the array constructor and the size expressions of its inputs. Determining the size of an array built during loop execution required an expression to calculate the number of loop iterations before the loop executed. Deriving this expression was not possible for all loops. Pass concluded by pushing, in the order of encounter, the nodes for the size expression onto a stack. Pass removed nodes from the stack, inserted them into the data flow graph, and appropriately wired memory references among them by inserting edges transmitting pointers to memory. If a node was the parent of an already converted node, it could build its result directly into the memory allocated to its child, thus eliminating the intermediate array. If the node was not the parent of an already converted node, it built its result in a new memory location. Note that the ordering of nodes on the stack guaranteed processing of children before parents.
Update-in-Place Analysis Update-in-place analysis attacked the aggregate update problem []. OSC introduced dependences in the data flow graph to schedule readers of an aggregate object before writers. It used reference counts to identify the last use of an object. If the last use was a write, then the write was executed in place. By re-ordering nodes and aggregating reference copy operations, OSC eliminated up to % of explicit reference count operations in larger programs without reducing parallelism. Consequently, the inefficiencies of reference counting largely disappeared and the benefits of updating aggregate objects were fully enjoyed by Sisal programs. OSC considered an array as two separately reference-counted objects: a dope vector defining the array’s logical extent and the physical space containing its constituents. As array elements are computed and stored, dope vectors cycle between dependent nodes to communicate the current status of the regions under construction. As a result, multiple dope vectors might reference the same or different regions of the physical space, and the individual participants in the construction would produce new dope vectors to communicate the current status of the regions to their predecessors. Without update-in-place optimization, each stage in construction would copy a dope vector.
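The essence of update-in-place can be illustrated with a small reference-count check, done here in C as a sketch; OSC's actual dope-vector representation is richer than this.

/* If the array has a single owner, write in place; otherwise copy first. */
#include <stdlib.h>
#include <string.h>

struct rc_array {
    int     refcount;
    size_t  len;
    double *data;
};

struct rc_array *update(struct rc_array *a, size_t idx, double value) {
    if (a->refcount > 1) {                       /* shared: must copy */
        struct rc_array *copy = malloc(sizeof *copy);
        copy->refcount = 1;
        copy->len = a->len;
        copy->data = malloc(a->len * sizeof *copy->data);
        memcpy(copy->data, a->data, a->len * sizeof *copy->data);
        a->refcount--;                           /* drop our reference */
        a = copy;
    }
    a->data[idx] = value;                        /* last use: update in place */
    return a;
}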
Runtime System The runtime software supported the parallel execution of Sisal programs, provided general purpose dynamic storage allocation, implemented operations on major data structures, and interfaced with the operating system for input/output and command line processing []. The Sisal runtime made modest demands of the host operating system. At program initiation a command line option specified the number of operating system processes (workers) to instantiate for the duration of the program. The runtime system maintained two queues of threads: the For Pool and the Ready List. A worker was always in one of three modes of operation: executing a thread from the For Pool, executing a thread from the Ready List, or idle. The runtime system allocated stacks for threads on demand, but every effort was made to reuse previously allocated stacks and thus reduce allocation and deallocation overhead.
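A schematic worker loop matching the three modes of operation just described might look as follows; the queue functions are hypothetical placeholders, not the real Sisal runtime interface.

/* Worker loop: prefer For Pool slices, then Ready List threads, then
 * deferred deallocation work; otherwise the worker is idle. */
struct thread;                          /* opaque thread descriptor */

extern struct thread *for_pool_take(void);
extern struct thread *ready_list_take(void);
extern int  storage_deallocation_pending(void);
extern void run(struct thread *t);
extern void do_deferred_deallocation(void);

void worker_main(void) {
    for (;;) {
        struct thread *t = for_pool_take();     /* slices of for expressions */
        if (!t)
            t = ready_list_take();              /* other runnable threads */
        if (t) {
            run(t);
        } else if (storage_deallocation_pending()) {
            do_deferred_deallocation();
        }
        /* otherwise idle until new work appears */
    }
}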
The For Pool maintained slices of for expressions. A worker would pick up a slice, execute it, and return to the pool for another slice. The worker that executed the last slice of an expression placed the parent thread on the Ready List, and returned to the For Pool to look for slices from other expressions. By persisting in executing slices of for expressions, a worker avoided deallocating and allocating its stack. The Ready List maintained all other threads made executable by events, such as the main program or function calls. If a worker found no work on either Pool, it examined the Storage Deallocation List for deferred storage deallocation work. Sisal relied on dynamic storage allocation for many data values such as arrays and streams, and for internal objects such as thread descriptors and execution stacks. It used a two-level allocation mechanism that was fast, efficient, and parallelizable. A standard boundary tag scheme was augmented with multiple entry points to a circular list of free blocks. Both the free list search and deallocation of blocks were parallelized, even though the former sometimes completely removed blocks from the list and the latter coalesced physically adjacent but logically distant blocks. To increase speed, an exact-fit caching mechanism was interposed before the boundary tag scheme. It used a working set of different sizes of recently freed blocks.
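The exact-fit cache described above can be sketched as a small table of recently freed blocks consulted before the general allocator; here malloc stands in for the boundary tag scheme, and the cache size is an arbitrary assumption.

/* Exact-fit cache in front of a general-purpose allocator. */
#include <stdlib.h>

#define CACHE_ENTRIES 32
static struct { size_t size; void *block; } cache[CACHE_ENTRIES];

void *cached_alloc(size_t bytes) {
    for (int i = 0; i < CACHE_ENTRIES; i++) {
        if (cache[i].block && cache[i].size == bytes) {   /* exact-fit hit */
            void *p = cache[i].block;
            cache[i].block = NULL;
            return p;
        }
    }
    return malloc(bytes);           /* miss: fall through to the allocator */
}

void cached_free(void *p, size_t bytes) {
    for (int i = 0; i < CACHE_ENTRIES; i++) {
        if (!cache[i].block) {      /* remember this block for reuse */
            cache[i].size = bytes;
            cache[i].block = p;
            return;
        }
    }
    free(p);                        /* cache full: release to the allocator */
}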
Sisal attracted a wide user community that developed many different types of parallel applications, including benchmark codes, mathematical methods, scientific simulations, systems optimization, and bioinformatics. Table lists the execution times of four applications and Table gives compilation statistics for each program. Columns – give the number of static arrays built, preallocated, and built-in-place; columns – list the number of copy, conditional copy and reference count operations after optimization, and the number of artificial dependency edges introduced by the compiler. It is instructive to examine two of the applications. SIMPLE was a two-dimensional Lagrangian hydrodynamics code developed at Lawrence Livermore National Laboratory to simulate the behavior of a fluid in a sphere. The hydrodynamic and heat condition equations are solved by finite difference methods. A tabular ideal gas equation determines the relation between state variables. The implementation of SIMPLE in Sisal . was straightforward and highly parallel. The compiler preallocated and built all arrays in place ( of them), eliminated all absolute copy operations, marked copy operations for run time check, and eliminated , out of , reference count operations. The conditional copy operations were a result of row sharing. Since they always tested false at runtime, no copy operations actually occured. For iterations of a × grid problem, the Sisal and Fortran versions of SIMPLE on one processor executed in ,. and ,. s, respectively. On ten processors, the Sisal code realized a speedup of .. An equivalent of . processors
Performance As noted above, most Sisal programs executed as fast as their imperative language counterparts. Consequently, Sisal. Table Execution times Program
Fortran
GaussJordan
Sisal (#p)
. s
Sisal (#p)
. s ()
Speedup
. s ()
.
RICARD
. h
. h
. h ()
.
SIMPLE
. s
. s()
. ()
.
Simulated annealing
. s
. s ()
. s ()
.
Sisal. Table OSC optimizations (columns: Program, Arrays, Preallocated, Built in place, Copy, CCopy, Ref counts, Artificial edges; rows: GaussJordan, RICARD, SIMPLE, Simulated annealing)
was lost in allocating and deallocating two-dimensional arrays. The implementation of true multi-dimensional arrays was proposed in both Sisal . [] and Sisal . [These language definitions were defined near the end of the project, but never implemented.] Simulated annealing is a generic Monte Carlo optimization technique for solving many difficult combinatorial problems. In the case of the school timetable problem, the objective is to assign a set of tuples to a fixed set of time slots (periods) such that no critical resource is scheduled more than once in any period. Each tuple is a record of four fields: class, room, subject, and teacher. Classes, rooms, and teachers are critical resources; subjects are not. At each step of the procedure, a tuple is chosen at random and moved to another period. If the new schedule has equivalent or lower cost, the move is accepted. If the new schedule has higher cost, the move is accepted with a certain exponential probability. If the move is not accepted, the tuple is returned to its original period. The procedure can be parallelized by simultaneously choosing one tuple from each nonempty period and applying the move criterion to each. The accepted moves are then carried out one at a time. OSC preallocated memory for all the arrays, and built all but four of the arrays in place. It removed all absolute copy operations, marked four copy operations for run time check, and removed all but reference count operations. The four arrays not built in place resulted from adding a tuple to a period. Although the compiler did not mark the new periods for build-in-place, the periods were rarely copied. When an element was removed from the high end of an array, the array’s logical size shrank by one, but its physical size remained constant. Thus, when an element was added to a period, there was often space to add the element without copying. Whenever copying occurred, extra space was allocated to accommodate future growth. This foresight saved over , copies in this application at the cost of a few hundred bytes of storage.
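The move-acceptance rule used by the simulated annealing application above is the standard exponential (Metropolis-style) criterion; a minimal sketch, with the temperature as a parameter, is:

/* Accept an equal or lower-cost move always; accept a higher-cost move
 * with probability exp(-(new - old) / temperature). */
#include <math.h>
#include <stdlib.h>

int accept_move(double old_cost, double new_cost, double temperature) {
    if (new_cost <= old_cost)
        return 1;                                    /* always accept */
    double p = exp(-(new_cost - old_cost) / temperature);
    return ((double)rand() / RAND_MAX) < p;          /* accept with prob. p */
}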
Bibliographic Notes and Further Reading For a summary of the project and details on the compiler and runtime system see []. The language definition can be found in []. IF is described in [].
Build-in-place analysis and update-in-place analysis are described in more detail in [] and [], respectively. In and then again in , two new versions of the language, Sisal . and Sisal , were defined. These versions included: higher-order functions, user-defined reductions, parameterized data types, foreign language modules, array syntax, and rectangular arrays. While compilers for these versions were never released, there is some information available on the web.
Bibliography
. Feo JT, Cann D, Oldehoeft R () A report of the Sisal Language Project. J Parallel Distrib Comput ():–
. McGraw J et al () Sisal: streams and iterations in a single-assignment language, reference manual version .. Lawrence Livermore National Laboratory Manual M-, Livermore, CA, September
. Feo J (ed) Proceedings of the nd Sisal User Conference. Lawrence Livermore National Laboratory, CONF-, San Diego, CA, October
. Feo J (ed) Proceedings of the rd Sisal User Conference. Lawrence Livermore National Laboratory, CONF-, San Diego, CA, October
. Skedzielewski S, Glauert J () IF – an intermediate form for applicative languages. Lawrence Livermore National Laboratory Manual M-, Livermore, CA, September
. Ranelletti J () Graph transformation algorithms for array memory optimization in applicative languages. PhD dissertation, University of California at Davis, CA, May
. Cann D () Compilation techniques for high performance applicative computation. PhD dissertation, Colorado State University, CO, May
. Cann D () Retire Fortran?: A debate rekindled. Commun ACM ():–
. Oldehoeft R et al () The Sisal . reference manual. Lawrence Livermore National Laboratory, Technical Report UCRL-MA, Livermore, CA, December
Small-World Network Analysis and Partitioning (SNAP) Framework
SNAP (Small-World Network Analysis and Partitioning) Framework
SNAP (Small-World Network Analysis and Partitioning) Framework
SNAP (Small-World Network Analysis and Partitioning) Framework Kamesh Madduri Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Synonyms Graph analysis software; Small-world network analysis and partitioning (SNAP) framework
Definition SNAP (Small-world Network Analysis and Partitioning) is a framework for exploratory analysis of large-scale complex networks. It provides a collection of optimized parallel implementations for common graph-theoretic problems.
Discussion Introduction Graphs are a fundamental abstraction for modeling and analyzing data, and are pervasive in real-world applications. Transportation networks (road and airline traffic), socio-economic interactions (friendship circles, organizational hierarchies, online collaboration networks), and biological systems (food webs, protein interaction networks) are a few examples of data that can be naturally represented as graphs. Understanding the dynamics and evolution of real-world network abstractions is an interdisciplinary research challenge with wide-ranging implications. Empirical studies on networks have led to a variety of models to characterize their topology and evolution. Quite surprisingly, it has been shown that graph abstractions arising from diverse systems such as the Internet, social interactions, and biological networks exhibit common structural features such as a low diameter, unbalanced degree distributions, self-similarity, and the presence of dense subgraphs [, , ]. Some of these topological features are captured by what is known as the small-world model or phenomenon. The analysis of large graph abstractions, particularly small-world complex networks, raises interesting computational challenges. Graph algorithms are typically
highly memory intensive, make heavy use of data structures such as lists, sets, queues, and hash tables, and exhibit a combination of data and task-level parallelism. On current workstations, it is infeasible to do exact in-core computations on graphs larger than million vertex/edge entities (large in general refers to the typical problem size for which the graph and the associated data structures do not fit in – GB of main memory). In such cases, parallel computers can be utilized to obtain solutions for memory and compute-intensive graph problems quickly. Due to power constraints and diminishing returns from instruction-level parallelism, the computing industry is rapidly converging towards widespread use of multicore chips and accelerators. Unfortunately, several known parallel algorithms for graph-theoretic problems do not easily map onto clusters of multicore systems. The mismatch arises because current systems lean towards efficient execution of regular computations with low memory footprints and working sets, and heavily penalize memory-intensive applications with irregular memory accesses; however, parallel graph algorithms in the past were mostly designed assuming an underlying, well-balanced compute-memory platform. The small-world characteristics of real networks, and the load-balancing constraints they impose during parallelization, represent an additional challenge to the design of scalable graph algorithms. SNAP [, ] is an open-source computational framework for graph-theoretic analysis of large-scale complex networks. It is intended to be an optimized collection of computational kernels (or algorithmic building-blocks) that the end-user could readily use and compose to answer higher-level, ad-hoc graph analysis queries. The target platforms for SNAP are shared-memory multicore and symmetric multiprocessor systems. SNAP kernels are implemented in C and use OpenMP for parallelization. On distributed memory systems, SNAP can be used for intra-node parallelization, and the user has to manage inter-node communication, and also identify and implement parallelism at a coarser node-level granularity. The parallel graph algorithms in SNAP are significantly faster than alternate implementations in other open-source graph software. This is due to a combination of the use of memory-efficient data structures, preprocessing kernels that are tuned for small-world
SNAP (Small-World Network Analysis and Partitioning) Framework. Fig. A schematic of the SNAP graph analysis framework []: graph representations (file formats, data structures) at the base, parallel graph kernels (e.g., BFS, MST, connected components) and graph metrics/preprocessing (e.g., centrality, clustering coefficient) above them, and advanced analytics (e.g., community detection, subgraph isomorphism) supporting exploratory analysis of a complex network
networks, as well as algorithms that are designed to specifically target cache-based multicore architectures. These issues are discussed in more detail in the following sections of this article. As the project title suggests, the initial design goals of SNAP were to provide scalable parallel solutions for community structure detection [], a problem variant of graph partitioning that is of great interest in social and, more generally, small-world network analysis. Community structure detection is informally defined as identifying densely connected sets of vertices in the network, thereby revealing latent structure in a large-scale network. It is similar to the problem of graph partitioning in scientific computing, and is usually formulated as a graph clustering problem. SNAP includes several different parallel algorithms for solving this problem of community detection.
Graph Representation The first issue that arises in the design of graph algorithms is the use of appropriate in-memory data structures for representing the graph. The data to be analyzed typically resides on disk in a database, or in multiple files. As the data is read from disk and loaded into main memory, a graph representation is simultaneously constructed. The minimal layout that would constitute an in-memory graph representation is a list of edge tuples with vertex identifiers indicating the source and destination vertices, and any attributes associated with the
edges and vertices. However, this does not give one easy access to lists of edges originating from a specific vertex. Thus, the next commonly used representation is to sort the edge tuple list by the source vertex identifier, and store all the adjacencies of a particular vertex in a contiguous array. This is the primary adjacency list representation of graphs that is supported in SNAP. Edges can have multiple attributes associated with them, in which case they can either be stored along with the corresponding adjacency vertex identifier, or in separate auxiliary arrays. This representation is space-efficient, has a low computational overhead for degree and membership queries, and provides cache-friendly access to adjacency iterators. In cases where one requires periodic structural updates to the graph, such as insertions and deletions of edges, SNAP uses alternate graph representations. An extension to the static representation is the use of dynamic, resizable adjacency arrays. Clearly, this would support fast insertions. Further, parallel insertions can be supported using non-blocking atomic increment operations on most of the current platforms. There are two potential parallel performance issues with this data structure. First, edge deletions are expensive in this representation, as one needs to scan the entire adjacency list in the worst case to locate the required tuple. The scan is particularly expensive for high-degree vertices. Second, there may be load-balancing issues with parallel edge insertions (for instance, consider the case where there is a stream of insertions to the adjacency list of
a single vertex). These problems can be alleviated by batching the updates, or by randomly shuffling the updates before scheduling the insertions. If one uses sorted adjacency lists to address the deletion problem, then the cost of insertions goes up due to the overhead of maintaining the sorted order. An alternative to arrays is the use of tree structures to support both quick insertions and deletions. Treaps are binary search trees with a priority (typically a random number) associated with each node. The priorities are maintained in heap order, and this data structure supports insertions, deletions, and searching in average-case logarithmic time. In addition, there are known efficient parallel algorithms for set operations on treaps such as union, intersection, and difference. Set operations are particularly useful to implement kernels such as graph traversal and induced subgraphs, and also for batch-processing updates. To process both insertions and deletions efficiently, and also given the power-law degree distribution for small-world networks, SNAP supports a hybrid representation that uses dynamically resizable arrays to represent adjacencies of low-degree vertices, and treaps for high-degree vertices. The threshold to determine low- and high-degree vertices can be varied based on the data set characteristics. By using dynamic arrays for low-degree vertices (which will be a majority of vertices in the graph), one can achieve good performance for insertions. Also, deletions are fast and cache-friendly for low-degree vertices, whereas they take logarithmic time for high-degree vertices represented using treaps. Madduri and Bader [] discuss the parallel performance and space-time trade-offs involved with each of these representations in more detail.
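As a concrete illustration of the static sorted-adjacency representation described above, the following C sketch builds a compact CSR-style structure from a list of edge tuples. It is a minimal sketch rather than SNAP's actual data structure; the type and function names (graph_t, build_csr) are illustrative assumptions.

#include <stdlib.h>

/* Minimal CSR-style static graph: adjacencies of vertex v are
 * stored contiguously in adj[xoff[v] .. xoff[v+1]-1].          */
typedef struct {
    long n;        /* number of vertices            */
    long m;        /* number of (directed) edges    */
    long *xoff;    /* n+1 offsets into adj          */
    long *adj;     /* concatenated adjacency arrays */
} graph_t;

/* Build the structure from an edge tuple list (src[i], dst[i]). */
graph_t *build_csr(long n, long m, const long *src, const long *dst)
{
    graph_t *G = malloc(sizeof(graph_t));
    G->n = n;  G->m = m;
    G->xoff = calloc(n + 1, sizeof(long));
    G->adj  = malloc(m * sizeof(long));

    /* Count the out-degree of each source vertex. */
    for (long i = 0; i < m; i++)
        G->xoff[src[i] + 1]++;
    /* Prefix sums turn the counts into offsets. */
    for (long v = 0; v < n; v++)
        G->xoff[v + 1] += G->xoff[v];
    /* Scatter destinations into the contiguous adjacency array. */
    long *pos = calloc(n, sizeof(long));
    for (long i = 0; i < m; i++) {
        long v = src[i];
        G->adj[G->xoff[v] + pos[v]++] = dst[i];
    }
    free(pos);
    return G;
}

With this layout, the degree of vertex v is xoff[v+1] − xoff[v], and iterating over adj[xoff[v]..xoff[v+1]−1] provides the cache-friendly adjacency access discussed above.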
Parallelization Strategies There is a wide variety of known approaches for exploiting parallelism in graph problems. Some of the easier ones to implement and analyze are computations involving iterations over vertex and edge lists, with few inter-iteration dependencies. For instance, queries such as determining the top-k high-degree vertices, or the maximum-weighted edge in the graph, can be easily parallelized. However, most parallel algorithms require the use of data structures such as priority queues and multisets. Further, one needs support for fast parallel operations on these structures, such as
parallel insertions, membership queries, and batched deletions. Fine-grained, low-overhead synchronization is an important requirement for several efficient parallel implementations. Further, the notion of partitioning a graph is closely related to several parallel graph algorithms. Given that inter-processor communication is expensive on distributed memory systems (compared to computation on a single node), several parallel graph approaches rely on a graph partitioning and vertex reordering preprocessing step to minimize communication in subsequent algorithm executions. Graph partitioning is relevant in the context of shared-memory SNAP algorithms as well, since it reduces parallel synchronization overhead and improves locality in memory accesses. Consider the example of breadth-first graph traversal (BFS) as an illustration of more complex paradigms for parallelism in graph algorithms. Several SNAP graph kernels are designed to exploit fine-grained thread-level parallelism in graph traversal. There are two common parallel approaches to breadth-first search: level-synchronous graph traversal, where the adjacencies of vertices at each level in the graph are visited in parallel; and path-limited searches, where multiple searches from vertices that are far apart are concurrently executed, and the independent searches are aggregated. The level-synchronous approach is particularly suited for small-world networks due to their low graph diameter. Support for fine-grained, efficient synchronization is critical in both these approaches. The SNAP implementation of BFS aggressively reduces locking and barrier constructs through algorithmic changes as well as architecture-specific optimizations. For instance, the SNAP BFS implementation uses a lock-free approach for tracking visited vertices and thread-local work queues, significantly reducing shared memory contention. While designing fine-grained algorithms for small-world networks, it is also important to take the unbalanced degree distributions into account. In a level-synchronized parallel BFS where vertices are statically assigned to multiple threads of execution, it is likely that there will be phases with severe work imbalance (due to the imbalance in vertex degrees). To avoid this, the SNAP implementation first estimates the processing work to be done at each vertex, and then assigns vertices accordingly to threads. Several other optimization techniques, and their relative performance
benefits, are discussed by Bader and Madduri in more detail []. Graph algorithms often involve performance tradeoffs associated with memory utilization and parallelization granularity. In cases where the input graph instance is small enough, one could create several replicates of the graph or the associated data structures for multiple threads to execute concurrently, and thus reduce synchronization overhead. SNAP utilizes this technique for the exact computation of betweenness centrality, which requires a graph traversal from each vertex. The fine-grained implementation parallelizing each graph traversal requires space linear in the number of vertices and edges, whereas the coarse-grained approach (the traversals are distributed among p threads) incurs a p-way multiplicative factor memory increase. Depending on the graph size, one could choose an appropriate number of replicates to reduce the synchronization overhead.
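To make the level-synchronous approach concrete, the sketch below gives a simplified parallel BFS over the CSR structure (graph_t) from the earlier sketch, written in C with OpenMP as SNAP is. It stands in for, but does not reproduce, SNAP's lock-free visited tracking, thread-local queues, and degree-aware work assignment; the GCC/Clang atomic builtins and all names are illustrative assumptions.

#include <stdlib.h>

/* Level-synchronous BFS: all vertices on the current frontier are
 * expanded in parallel, and newly discovered vertices are appended
 * to the next frontier with atomic operations instead of locks.    */
void bfs_level_sync(const graph_t *G, long root, long *dist)
{
    long *frontier = malloc(G->n * sizeof(long));
    long *next     = malloc(G->n * sizeof(long));
    long fsize = 0, nsize = 0, level = 0;

    for (long v = 0; v < G->n; v++) dist[v] = -1;
    dist[root] = 0;
    frontier[fsize++] = root;

    while (fsize > 0) {
        nsize = 0;
        #pragma omp parallel for schedule(dynamic, 64)
        for (long i = 0; i < fsize; i++) {
            long u = frontier[i];
            for (long e = G->xoff[u]; e < G->xoff[u + 1]; e++) {
                long w = G->adj[e];
                /* Claim w exactly once with a compare-and-swap. */
                if (dist[w] == -1 &&
                    __sync_bool_compare_and_swap(&dist[w], -1, level + 1)) {
                    long idx = __sync_fetch_and_add(&nsize, 1);
                    next[idx] = w;
                }
            }
        }
        /* Swap frontiers and move to the next level. */
        long *tmp = frontier; frontier = next; next = tmp;
        fsize = nsize;
        level++;
    }
    free(frontier);
    free(next);
}

Each iteration of the while loop corresponds to one BFS level, so the implicit barrier at the end of the parallel for realizes the level synchronization, while the compare-and-swap ensures every vertex is claimed by exactly one thread.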
SNAP Kernels for Exploratory Network Analysis Exploratory graph analysis often involves an iterative study of the structure and dynamics of a network, using a discriminating selection of topological metrics. SNAP supports fast exact and approximate computation of well-known social network analysis metrics, such as average vertex degree, clustering coefficient, average shortest path length, rich-club coefficient, and assortativity. Most of these metrics have a linear or sub-linear computational complexity, and are straightforward to implement. When used appropriately, they not only provide insight into the network structure, but also help speed up subsequent analysis algorithms. For instance, the average neighbor connectivity metric is a weighted average that gives the average neighbor degree of a degree-k vertex. It is an indicator of whether vertices of a given degree preferentially connect to high- or low-degree vertices. Assortativity coefficient is a related metric, which is an indicator of community structure in a network. Based on these metrics, it may be possible to say whether the input data is representative of a specific common graph class, such as bipartite graphs or networks with pronounced community structure. This helps one choose an appropriate community detection algorithm and also the right clustering measure to optimize for. Other preprocessing kernels that are beneficial
for exposing parallelism in the problem include computation of connected and biconnected components of the graph. If a graph is composed of several large connected components, it can be decomposed and individual components can be analyzed concurrently. A combination of these preprocessing kernels before the execution of a compute-intensive routine often leads to a substantial performance benefit, either by reducing the computation by pruning vertices or edges, or by exposing more coarse-grained concurrency and locality in the problem.
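As an example of the low-complexity metric kernels mentioned in this section, the following sketch computes the local clustering coefficient of every vertex over the CSR structure from the earlier sketches. It assumes an undirected graph with both directions of each edge stored and with sorted adjacency lists; it is an illustrative sketch, not SNAP's implementation.

/* Binary search: is x present in the sorted range a[lo..hi-1]? */
static int contains(const long *a, long lo, long hi, long x)
{
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        if (a[mid] == x) return 1;
        if (a[mid] < x)  lo = mid + 1;
        else             hi = mid;
    }
    return 0;
}

/* Local clustering coefficient:
 * cc[v] = (#edges among neighbors of v) / (deg(v) choose 2).   */
void clustering_coefficients(const graph_t *G, double *cc)
{
    #pragma omp parallel for schedule(dynamic, 16)
    for (long v = 0; v < G->n; v++) {
        long deg = G->xoff[v + 1] - G->xoff[v];
        if (deg < 2) { cc[v] = 0.0; continue; }
        long links = 0;
        /* Count connected pairs of neighbors (each pair checked once). */
        for (long i = G->xoff[v]; i < G->xoff[v + 1]; i++)
            for (long j = i + 1; j < G->xoff[v + 1]; j++)
                if (contains(G->adj, G->xoff[G->adj[i]],
                             G->xoff[G->adj[i] + 1], G->adj[j]))
                    links++;
        cc[v] = (2.0 * links) / ((double)deg * (deg - 1));
    }
}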
Community Identification Algorithms in SNAP Several routines in SNAP are devoted to solving the community identification problem in small-world networks. Graph partitioning and community identification are related problems, but with an important difference: the most commonly used objective function in partitioning is minimization of edge cut, while trying to balance the number of vertices in each partition. The number of partitions is typically an input parameter for a partitioning algorithm. Clustering, on the other hand, optimizes for an appropriate application-dependent measure, and the number of clusters is unknown beforehand. Multilevel and spectral partitioning algorithms (e.g., Chaco [] and Metis []) have been shown to be very effective for partitioning graph abstractions derived from physical topologies, such as finite-element meshes arising in scientific computing. However, the edge cut when partitioning small-world networks using these tools is nearly two orders of magnitude higher than the corresponding edge cut value for a nearly-Euclidean topology, for graph instances that are comparable in the number of vertices and edges. Clearly, small-world networks lack the topological regularity found in scientific meshes and physical networks, and current graph partitioning algorithms cannot be adapted as-is for the problem of community identification. The parallel community identification algorithms in SNAP optimize for a measure called modularity. Let C = (C_1, . . ., C_k) denote a partition of the set of vertices V such that C_i ≠ ∅ and C_i ∩ C_j = ∅ for i ≠ j. The cluster G(C_i) is identified with the induced subgraph G[C_i] := (C_i, E(C_i)), where E(C_i) := {⟨u, v⟩ ∈ E : u, v ∈ C_i}. Then, E(C) := ∪_{i=1}^{k} E(C_i) is the set of intra-cluster edges
and Ẽ(C) := E − E(C) is the set of inter-cluster edges. Let m(C_i) denote the number of intra-cluster edges in C_i. Then, the modularity measure q(C) of a clustering C is defined as q(C) = ∑_i [ m(C_i)/m − ( ∑_{v∈C_i} deg(v) / (2m) )² ], where m = |E| is the total number of edges. Intuitively, modularity captures the idea that a good division of a network into communities is one in which the inter-community edge count is less than what is expected by random chance. This measure does not try to minimize the edge cut in isolation, nor does it explicitly favor a balanced community partitioning. The general problem of modularity optimization has been shown to be NP-complete, and so there are several known heuristics to maximize modularity. Most of the known techniques fall into one of two broad categories: divisive and agglomerative clustering. In the agglomerative scheme, each vertex initially belongs to a singleton community, and communities whose amalgamation produces an increase in the modularity score are typically merged together. The divisive approach is a top-down scheme where the entire network is initially in one community, and the network is iteratively broken down into subcommunities. This hierarchical structure of community resolution can be represented using a tree (referred to as a dendrogram). The final list of the communities is given by the leaves of the dendrogram, and internal nodes correspond to splits (joins) in the divisive (agglomerative) approaches. SNAP includes several parallel algorithms for community identification that use agglomerative and divisive clustering schemes []. One divisive clustering approach is based on the greedy betweenness-based algorithm proposed by Newman and Girvan []. In this approach, the community structure is resolved by iteratively filtering edges with high edge betweenness centrality, and tracking the modularity measure as edges are removed. The compute-intensive step in the algorithm is the repeated computation of the edge betweenness scores in every iteration, and SNAP performs this computation in parallel. SNAP also supports several greedy agglomerative clustering approaches, and the main strategy to exploit parallelism in these schemes is to concurrently resolve communities that do not share inter-community edges. Performance results on real-world networks
indicate that the SNAP parallel community identification algorithms give substantial performance gains, without any loss in community structure (given by the modularity score) [, ].
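For concreteness, the sketch below evaluates the modularity score q(C) defined above for a given cluster assignment, using the CSR structure from the earlier sketches. The function and variable names are illustrative assumptions; the SNAP algorithms additionally track modularity incrementally as edges are removed or communities are merged, rather than recomputing it from scratch.

#include <stdlib.h>

/* Evaluate modularity q(C) for a cluster assignment comm[v] in 0..k-1.
 * The graph is undirected with each edge stored in both directions,
 * so G->m counts 2 * (number of undirected edges).                    */
double modularity(const graph_t *G, const long *comm, long k)
{
    double  m_edges = G->m / 2.0;               /* undirected edge count */
    long   *intra   = calloc(k, sizeof(long));  /* intra-cluster edges   */
    double *degsum  = calloc(k, sizeof(double));/* total degree per C_i  */

    for (long v = 0; v < G->n; v++) {
        degsum[comm[v]] += G->xoff[v + 1] - G->xoff[v];
        for (long e = G->xoff[v]; e < G->xoff[v + 1]; e++)
            if (comm[G->adj[e]] == comm[v])
                intra[comm[v]]++;               /* counted twice per edge */
    }

    double q = 0.0;
    for (long i = 0; i < k; i++) {
        double frac_intra = (intra[i] / 2.0) / m_edges;
        double frac_deg   = degsum[i] / (2.0 * m_edges);
        q += frac_intra - frac_deg * frac_deg;
    }
    free(intra);
    free(degsum);
    return q;
}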
Related Entries
Chaco
Graph Algorithms
Graph Partitioning
METIS and ParMETIS
PaToH (Partitioning Tool for Hypergraphs)
Social Networks
Bibliographic Notes and Further Reading The community detection algorithms in SNAP are discussed in more detail in [, ]. Other popular libraries and frameworks for large-scale network analysis include Network Workbench [], igraph [], and the Parallel Boost Graph Library [].
Bibliography
. Amaral LAN, Scala A, Barthélémy M, Stanley HE () Classes of small-world networks. Proc Natl Acad Sci USA ():–
. Bader DA, Madduri K (April ) SNAP: Small-world Network Analysis and Partitioning: an open-source parallel graph framework for the exploration of large-scale networks. In: Proceedings of the nd IEEE International Parallel and Distributed Processing Symposium (IPDPS ), IEEE, Miami, FL
. Csárdi G, Nepusz T () The igraph software package for complex network research. InterJournal Complex Systems. http://igraph.sf.net. Accessed May
. Fortunato S (Feb ) Community detection in graphs. Physics Reports (–):–
. Gregor D, Lumsdaine A (July ) The Parallel BGL: a generic library for distributed graph computations. In: Proceedings of the Workshop on Parallel/High-Performance Object-Oriented Scientific Computing (POOSC ’), IOS Press, Glasgow, UK
. Hendrickson B, Leland R (Dec ) A multilevel algorithm for partitioning graphs. In: Proceedings of the ACM/IEEE Conference on Supercomputing (SC ), ACM/IEEE Computer Society, New York
. Karypis G, Kumar V () A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput ():–
. Madduri K (July ) A high-performance framework for analyzing massive complex networks. Ph.D. thesis, Georgia Institute of Technology
. Madduri K, Bader DA (May ) Compact graph representations and parallel connectivity algorithms for massive dynamic network analysis. In: Proceedings of the rd IEEE International Parallel and Distributed Processing Symposium (IPDPS ), IEEE Computer Society, Rome, Italy
. Madduri K, Bader DA, Riedy EJ () SNAP: Small-world Network Analysis and Partitioning v.. http://snap-graph.sf.net. Accessed May
. Newman MEJ () The structure and function of complex networks. SIAM Rev ():–
. Newman MEJ, Girvan M () Finding and evaluating community structure in networks. Phys Rev E :
. NWB Team () Network Workbench Tool. Indiana University, Northeastern University, and University of Michigan. http://nwb.slis.indiana.edu. Accessed May
. Watts DJ, Strogatz SH () Collective dynamics of small world networks. Nature :–
SoC (System on Chip) Tanguy Risset INSA Lyon, Villeurbanne, France
Definition A System on Chip (SoC) refers to a single integrated circuit (chip) composed of all the components of an electronic system. A SoC is heterogeneous: in addition to classical digital components (processor, memory, bus, etc.), it may contain analog and radio components. The SoC market has been driven by embedded computing systems: mobile phones and handheld devices.
Discussion Historical View Gordon Moore predicted the exponential growth of silicon integration and its consequences on the application of integrated circuits. Following this growth, known as Moore's Law, the number of transistors integrated on a single silicon chip has doubled every months, leading to constant growth in the semiconductor industry for over years. This technological evolution implied constant changes in the design of digital circuits, with, for instance, the advent of gate-level simulation and logic synthesis. Amongst these changes, the advent of the System on Chip (SoC) represented a major technological shift. The term SoC became increasingly widespread in the s and is used to describe chips integrating on
a single silicon die what was previously spread over several circuits: processor, memory, bus, hardware device drivers, etc. SoC technology did not fundamentally change the functionality of the systems built, but drastically enlarged the design space: choosing the best way to assemble a complete system from beginning (system specification) to end (chip manufacturing) became a difficult task. In , a book entitled Surviving the SoC Revolution: A guide to platform-based design was published. It was written by a group of people from Cadence Design Systems, an important computer-aided design tool company: SoC was driving a revolution of the design methodologies for digital systems. The production of SoCs has been driven by the emerging market of embedded systems, more precisely embedded computing systems: cellular phones, pdas, digital cameras, etc. Embedded computing systems require strong integration, computing power, and energy saving, features that were provided by SoC integration. Moreover, most of these systems required radio communication features; hence, a SoC integrates digital components and analog components: the radio subsystem. The success of mobile devices has increased the economic pressure on SoCs; design cost and time-to-market have become major issues, leading to new design methodologies such as ip-based design and hardware/software codesign. Another consequence of integration is the exponential growth of embedded software, shifting the complexity of SoC design from hardware to software. Emulation, simulation at various levels of precision, and high-level design techniques are used because nowadays SoCs are incredibly complex: hundreds of millions of gates and millions of lines of code. Increasing digital computing power enables software radio, providing everything everywhere transparently. SoCs are now used to provide convergence between computing, telecommunication, and multimedia. Pervasive environments will also make extensive use of embedded "intelligence" with SoCs.
What Is a SoC? A system on chip or SoC is an integrated circuit composed of many components including: one or several processors, on-chip memories, hardware accelerators, device drivers, digital/analog converters, and analog components. It was initially named SoC because all the features of a complete "system" were integrated together
on the same chip. At that time, a system was dedicated to a single application: video processing or wireless communication, for instance. Thanks to the increasing role of software, SoCs are no longer specific; in fact, many of them are reused in several different telephony or multimedia devices. Processors
The processor is the core of the SoC. Unlike desktop machine processors, a wide variety of processors is integrated in SoCs: general-purpose processors, digital signal processors, micro-controllers, application-specific processors, fpga-based soft cores, etc. The International Technology Roadmap for Semiconductors (itrs) indicated that, in , more than % of application-specific integrated circuits (asics) contained a processor. Until , most SoCs were composed of a single processor, responsible for the control of the whole system, associated with hardware accelerators and direct memory access components (dma). In , the majority of SoCs were built around a processor with one of the following architectures: arm, mips, PowerPC, or x. Multi-processor SoCs (MPSoC) are progressively appearing and making use of the available chip area to sustain performance improvements. MPSoC appears to be a new major technological shift, and MPSoC designers are facing problems traditionally associated with supercomputing. Busses
The bus, or more generally the interconnection medium, is also a major component of the SoC. Many SoCs contain several busses, including a fast system bus connecting the major SoC components (processor, memory, hardware accelerators) and a slower bus for other peripherals. For instance, the Advanced Microcontroller Bus Architecture (Amba) proposes the ahb (Advanced High-performance Bus) and apb (Advanced Peripheral Bus) specifications; STBus (stMicroelectronics) and CoreConnect (ibm) are other examples of SoC busses. The use of commercial bus protocols has been a major obstacle to SoC standardization: a bus comes with a dedicated communication protocol which might be incompatible with other busses. Networks on Chip (NoC) are progressively used with MPSoCs; the major problem of NoCs today is their considerable power consumption.
Dedicated Components
A SoC will include standard control circuits such as a uart for the serial port, a usb controller, and controllers for specific devices: gps, accelerometer, touch pad, etc. It might also include more compute-intensive dedicated intellectual properties (ips), usually targeted at signal or image processing: fft, turbo decoder, H. coder, etc. Choosing between a dedicated component and a software implementation of the same functionality is done in the early stages of the design, in the so-called hardware/software partitioning. A hardware ip will save power and execution time compared to a software implementation of the same functionality. Dedicated ips are often mandatory to meet the performance requirements of the application (radio or multimedia), but they increase SoC price and complexity, and also reduce its flexibility. Other Components
Analog and mixed-signal components may be included to provide radio connection: audio front-end, radio-to-baseband interfaces, analog-to-digital converters. Other application domains may require very specific components such as microelectromechanical systems (mems), optical computing, biochips, and nanomaterials. Quality Criteria
The only objective indicator of the quality of a SoC is its economic success, which is obviously difficult to predict. Among the quality criteria, the most important are: power consumption, time to market, cost, and reusability. Power consumption has been a major issue for decades. Integrating all components on the same chip reduces power consumption because inter-chip wires disappear. On the other hand, increasing clock rate, program size, and number of computations has a negative impact on power consumption. Moreover, it is very difficult to statically predict the power consumption of a SoC, as it heavily depends on low-level technological details. A precise electrical simulation of a SoC is extremely slow and cannot be used in practice for a complete system. Minimizing memory footprint and memory traffic, gating clocks for idle components, and dynamically adapting the clock rate are techniques used to reduce power consumption in SoCs. In recent submicron technologies, static power consumption is significant, leading to power consumption even for idle devices.
It is usually said that, for a given product, the first commercially available SoC will capture half of the market. Hence, reducing the design time is a critical issue. Performing hardware and software design in parallel is a designer's dream: it becomes possible with virtual prototyping, that is, simulation of the software on the virtual hardware. High-level design and refinement are also widely used: a good decision at a high level may save weeks of design. The cost of a SoC is divided into two components: the nonrecurring engineering (nre) cost, which corresponds to the design cost, and the unit manufacturing cost. Depending on the number of units manufactured and the implementation technology, a complex trade-off between the two can be made, bearing in mind that some technological choices, such as field programmable gate arrays (fpga), can shorten the time to market. The nre cost includes, in addition to the design cost of the hardware and software parts of the SoC, the cost of computer-aided design tools, computing resources for simulation/emulation, and the development of tools for the SoC itself: compilers, linkers, debuggers, simulation models of the ips, etc. To give a very rough idea, in , the development cost of a complex SoC could reach several million dollars and could require more than one hundred man-years. In this context, it seems obvious that the reuse of ips, tools, and SoCs is a major issue because it shrinks both nre costs and time to market.
An Example
Figure shows the architecture of the Samsung SCAA SoC, designed to be associated with another chip containing processor and memory through the Cotulla interface. The Cotulla interface is associated with the Xscale processor (strong arm processor architecture manufactured by Intel). This chip was used, for instance, in association with the pxa chip of Intel in the Hewlett Packard iPaQ H pda. This SoC illustrates the different device driver components that can be found; one can also observe the hierarchy of busses mentioned before. It also shows that there is no standard SoC: this one has no processor inside. Thanks to the development of the open source software community, many SoC architectures have been detailed and Linux ports are available for many of them.
Why Is It a SoC "Revolution"? Although its advent was predicted, the SoC really caused a revolution in the semiconductor industry because it changed design methods and brought many software design problems into the microelectronics community. Hardware design has been impacted by the arrival of SoCs. The complexity of a chip mechanically increases with the number of gates on the chip. But the design and verification effort increases more than linearly with the number of gates. This is known as the design gap: How can the designer efficiently use the additional gates available without breaking design time and design reliability? A SoC requires different technologies to be integrated on the same silicon die (standard cells, memory, fpga). The clock distribution tree is a major source of power dissipation, leading to the advent of globally asynchronous, locally synchronous systems (gals). Routing wires became a real problem, partially solved by using additional metal layers. Hardware design methods have drastically changed with the constant arrival of new Electronic Design Automation (eda) tools. rtl design methods (rtl stands for register transfer level) have become the standard for hardware developers, replacing transistor-level circuit design. The hardware description languages vhdl and Verilog are widely used for hardware specification before synthesis. However, higher-level hardware description languages are needed, partly because SoC simulation time became prohibitive, requiring huge fpga-based emulators. In addition, the networked nature of embedded applications and the non-determinism introduced by parallelism in MPSoCs greatly complicate SoC simulation. As mentioned before, cost and time to market impose pressure on engineers' productivity, leading managers to new management methods. The hardware/software nature of a SoC requires a constant dialog between hardware and software engineers. Embedded software has been introduced in circuit design; this is, in itself, a revolution. In , the Siemens company employed more software developers than Microsoft did, and since the mid- s, more than half of the SoC design effort has been spent in software development. Originally composed of a very basic infinite loop waiting for interrupts, embedded software has evolved by adding many device drivers, standard or real-time operating system services: tasks or
SoC (System on Chip). Fig. Architecture of the Samsung SCAA SoC: the Intel Cotulla (Xscale) interface connected to a system bus (AHB), bridged to a peripheral bus (APB), with an internal 32 KByte buffer, two-channel DMA, and peripheral blocks including USB, UART, SD host/SD card interface, touch panel interface, PCMCIA expansion, LED controller, accelerometer interface, OneWire, and power management (ADC, control, reset, PLL, VCC, MCU-CLK)
threads, resource management, shared memory, communication protocols, etc. Even if an important part of embedded code is written in C, there is a trend to use component-based software design methodologies that reduce the conceptual difference between hardware and software objects. Verifying software reliability is very difficult; even if an important effort is dedicated to it during the design process, complete formal verification is impossible because of the complexity of the software. This explains the bugs that everybody has experienced on their mobile devices. Both hardware and software design methods have been drastically modified, but the most important novelty concerns the whole SoC design methodology, which tries to associate hardware and software design in a single framework.
SoC Design Methodology There are many names associated with SoC design methodologies; all of them refer more or less to the same goal: designing hardware and software in the same framework such that the designer can quickly produce the most efficient implementation for a given system specification. The term frequently used today is system level design. The generic scheme of system level design is the following: (1) derive hardware components and software components from the specification, (2) map the software part on the hardware components, (3) prototype the resulting implementation, and (4) possibly provide changes at some stage of the design if the performance or cost is not satisfactory. There is a global agreement on the fact that high-level specifications are very useful to reduce the
design time. Many design frameworks have been proposed for system level SoC design. The idea of refinement of the original specification down to hardware is present in many frameworks; high level synthesis (hls) was once seen as the solution but appears to be a very difficult problem in general. In hls, the hardware is completely derived from the initial specifications. During the s, the arrival of Intellectual Properties (ip) introduced a clear distinction between circuit fabrication and circuit design, leading some companies to concentrate on ip design (arm, for instance). On the other hand, ip-based design and later platform-based design propose to start from a fixed (or parameterizable) hardware library to reduce the design space exploration phase. Improvements in simulation techniques permit, with the use of the systemC language, an intermediate approach between the two through virtual prototypes: cycle-true simulation of a complete SoC before hardware implementation.
MPSoC and Future Trends An open question remains at the present time: What will be the dominant model for MPSoCs that are predicted to include more than one hundred processors? There are two main trends: heterogeneous MPSoCs following the current heterogeneous SoC architecture, or homogeneous spmd-like multiprocessors on a single chip. A heterogeneous MPSoC includes different types of processors with different instruction set architectures (isa): general-purpose processors, dsps, application-specific instruction set processors (asips). It also contains dedicated ips that might not be considered processors, dmas, and memories. All these components are connected together through a network on chip. Heterogeneous MPSoCs are a natural evolution of SoCs; their main advantage is that they are more energy efficient than homogeneous MPSoCs. They have been driven by specific application domains such as software radio or embedded video processing. But, for the future, they suffer from many drawbacks. As they are usually tuned for a particular application, they are difficult to reuse in another context. Because of incompatibility between isas, tasks cannot migrate between processors, inducing a much less flexible task mapping. Moreover, in the foreseen transistor technology, integrated circuits will have to be fault tolerant because of electronic defects. Heterogeneous MPSoCs are not
fault tolerant by nature. Finally, their scalability is not obvious and code reuse between different heterogeneous platforms is impossible. However, because heterogeneous MPSoCs provide better performance with lower power consumption and lower cost, there is a lot of commercial activity on this model. Homogeneous MPSoCs are more flexible; they can implement well-known operating system services such as code portability between different architectures, dynamic task migration, and virtual shared memory. Homogeneity can help in being fault tolerant, as any processor can be replaced by another. However, with more than a hundred processors on a chip in , the MPSoC architecture cannot be completely homogeneous; it has to be hierarchical: clusters of tightly linked processors, these clusters being interconnected with a network on chip. Memory must be physically distributed with non-uniform access. Ensuring cache coherency and providing efficient compilation and parallelization tools for these architectures are real challenges. In , the itrs road-map predicted the advent of a concurrent software compiler that will "Enable compilation and software development in highly parallel processing SoC"; this will be a critical step towards the advent of homogeneous MPSoCs. However, many unknowns remain concerning the programming model to use, the ability of compilation tools to efficiently use the available resources, the reusability of the chip and of the embedded code between platforms, and the real power consumption of these homogeneous multiprocessors on chip. A trade-off would be a homogeneous SoC with some level of heterogeneity: dedicated ips for domain-specific treatments. New technological techniques, such as D vlsi packaging technology or, more generally, the arrival of so-called systems in package, mixing various nanotechnologies on a single chip, might also open a new road to embedded computing systems. But, whatever the successful MPSoC model turns out to be, it will bring highly parallel computing on a chip and should lead to a revival of parallel computing engineering.
Bibliographic Notes and Further Reading A number of books have been published on SoCs; we have already mentioned "Surviving the SoC
revolution" []. Many interesting ideas were also present in the Polis project []. Daniel Gajski [] worked on SoCs from the beginning. Wayne Wolf published a general presentation []. More recently, Chris Rowen [] proposed an interesting view of SoC design. A good overview of the status of high-level synthesis can be found in the book by Coussy and Morawiec []. More practical details can be found in Steve Furber's book on ARM []. A good deal of useful information is available on the web, from the International Technology Roadmap for Semiconductors (itrs []), from analysts and engineers [], and from open source software developers for embedded devices [, ]. A simple introduction to the semiconductor industry can be found in Jim Turley's book []. A good introduction to MPSoCs is presented in [], but most of the interesting work on this subject is presented in the proceedings of the international conferences on the subject: Design Automation Conference (dac), Design Automation and Test in Europe (date), International Forum on Embedded MPSoC and Multicore (mpsoc), etc.
Bibliography
. Chang H, Cooke L, Hunt M, Martin G, McNelly AJ, Todd L () Surviving the SOC revolution: a guide to platform-based design. Kluwer Academic Publishers, Norwell, MA
. Coussy P, Morawiec A (eds) () High-level synthesis: from algorithm to digital circuit. Springer, Berlin, Germany
. Embedded System Design () http://www.embedded.com/
. Balarin F et al () Hardware-software co-design of embedded systems: the POLIS approach. Kluwer Academic Press, Dordrecht
. Linux for devices () http://www.linuxfordevices.com/
. Open Embedded: framework for embedded Linux () http://wiki.openembedded.net
. International Technology Roadmap for Semiconductors () http://www.itrs.net/
. Furber S () ARM system-on-chip architecture. Addison Wesley, Boston, MA
. Gajski DD, Abdi S, Gerstlauer A, Schirner G () Embedded system design: modeling, synthesis and verification. Kluwer Academic Publishers, Dordrecht
. Jerraya A, Wolf W () Multiprocessor systems-on-chips (The Morgan Kaufmann Series in Systems on Silicon). Morgan Kaufmann, San Francisco, CA
. Rowen C () Engineering the complex SOC: fast, flexible design with configurable processors. Prentice-Hall Press, Upper Saddle River, NJ
. Turley J () The essential guide to semiconductors. Prentice Hall Press, Upper Saddle River, NJ
. Wolf W () Modern VLSI design: system-on-chip design. Prentice Hall Press, Upper Saddle River, NJ
Social Networks Maleq Khan, V. S. Anil Kumar, Madhav V. Marathe, Paula E. Stretz Virginia Tech, Blacksburg, VA, USA
Introduction Networks are pervasive in today's world. They provide appropriate representations for systems comprised of individual agents interacting locally. Examples of such systems are urban regional transportation systems, national electrical power markets and grids, the Internet, ad-hoc communication and computing systems, public health, etc. According to Wikipedia, a social network is a social structure made of individuals (or organizations) called "nodes," which are tied (connected) by one or more specific types of interdependency, such as friendship, kinship, financial exchange, dislike, sexual relationships, or relationships of beliefs, knowledge or prestige. Formally, a social network induced by a set V of agents is a graph G = (V, E), with an edge e = (u, v) ∈ E between individuals u and v if they are interrelated or interdependent. Here, we use a more general definition of nodes and edges (that represent interdependency). The nodes represent living or virtual individuals. The edges can represent information flow, physical proximity, or any feature that induces a potential interaction. For example, in social contact networks, an edge signifies some form of physical co-location between the individuals – such a contact may either capture physical proximity, an explicit physical contact, or a co-location in the virtual world, e.g., Facebook. Socio-technical networks are a generalization of social networks and consist of a large number of interacting physical, technological, and human/societal agents. The links in socio-technical networks can be physically real or a matter of convention such as those imposed by law or social norms, depending on the specific system being represented. This entry primarily deals with social networks.
Social networks have been studied for at least years; see [] for a detailed account of the history and development of social networks and the analytical tools that followed them. Scientists have used social networks to uncover interesting insights related to societies and social interactions. Social networks, their structural analysis, and dynamics on these networks (e.g., spread of diseases over social contact networks) are now a key part of social, economic, and behavioral sciences. Importantly, these concepts and tools are also becoming popular in other scientific disciplines such as public health epidemiology, ecology, and computer science due to their potential applications. For example, in computer science, interest in social networks has been spurred by web search and other online communication and information applications. Google and other search engines crucially use the structure of the web graph. Social networking sites, such as Facebook and Twitter, and blogging sites have grown at an amazing rate and play a crucial role in advertising and the spread of fads, fashion, public opinion, and politics. The growing importance of social networks in the scientific community can be gauged by a report from the National Academies [] as well as a recent issue of Science []. The proliferation of Facebook, MySpace, Twitter, Blogsphere, and other online communities has made social networking a pervasive and essential part of our lingua franca. An important and almost universal observation is that the network structure seems to have a significant impact on these systems and their associated dynamics. For instance, many infrastructure networks (such as the Internet) have been observed to be highly vulnerable to targeted attacks, but are much more robust to random attacks []. In contrast, social contact networks are known to be very robust to all types of attacks []. These robustness properties have significant implications for control and policies on such systems - for instance, protecting high-degree nodes in Internet-like graphs has been found to be effective in stopping the spread of worms []. For many processes, e.g., the SIS model of diffusion (which models "endemic" diseases, such as malaria and Internet malware), the dynamics have been found to be characterized quite effectively by the spectral properties of the underlying network []. Similarly, classical and new graph theoretic metrics, e.g., centrality (which, informally, determines the fraction of shortest paths passing through a node), have been found to be
useful in identifying "important" nodes in networks and community structure. Therefore, computing the properties of these graphs, and simulating the dynamical processes defined on them, are important research areas. Most networks that arise in practice are very large and heterogeneous, and cannot be easily stored in memory - for instance, the social contact graph of a city could have millions of nodes and edges [], while the web graph has over billion nodes [] and several billion edges. Additionally, these networks change at very short time scales. They are also naturally labeled structures, which adds to the memory requirements. Consequently, simple and well-understood problems such as finding shortest paths, which have almost linear time algorithms, become challenging to solve on such graphs. High-performance computing has been successfully used in a number of scientific applications involving large data sets with complex processes, and is, therefore, a natural and necessary approach to overcome such resource limitations. Most of the successful applications of high-performance computing have been in physical sciences, such as molecular dynamics, radiation transport, n-body dynamics, and sequence analysis, and researchers routinely solve very large problems using massive distributed systems. Many techniques and general principles have been developed, which have made parallel computing a crucial tool in these applications. Social networks present fundamentally new challenges for parallel computing, since they have very different structure, and do not fit the paradigms that have been developed for other applications. As we discuss later, some of the key issues that arise in parallelization of social network problems include highly irregular structure, poor locality, variable granularity, and data-driven computations. Most social contact networks have a scale-free and "small world" structure (described formally below), which is very different from regular structures such as grids. As a result, social network problems often cannot be decomposed easily into smaller independent subproblems with low communication, making traditional parallel computing techniques very inefficient. The goal of this article is to highlight the challenges for high-performance computing posed by the growing area of social networks. New paradigms are needed at all levels of hardware architecture, software methodologies, and algorithmic optimization.
Section "Background and Notation" presents some background on social networks, their history, and important problems. Section "Petascale Computing Challenges for Social Network Problems" identifies the main challenges for parallel computing. We discuss some recent developments in Section "Techniques" and conclude in Section "Conclusions".
Background and Notation Many of the results in sociology that have now become folklore have been based on insights from social network analysis. One such classic result is the "six-degree of separation" experiment, in which Stanley Milgram [] asked randomly chosen people to forward a letter to a stock broker in Boston, by sending it to one of their acquaintances who was most likely to know the target. Milgram found that the median length of chains that successfully reached the target was about , suggesting that the global social contact network was very highly connected. Attempts to explain this phenomenon have led to the "small-world" graph models, which show that a small fraction of long-range contacts to individuals spatially located far away is adequate to bring the diameter of the network down. Another classic result is Granovetter's concept of the "strength of weak ties" [], which has become one of the most influential papers in sociology. He examined the role of information that individuals obtain as a result of their position in the social contact network, and found that information from acquaintances (weak ties), and not close friends, was the most useful during major lifestyle changes, e.g., finding jobs. This finding has led to the notions of "homophily" (i.e., nodes with similar attributes form contacts) and "triadic closure" (i.e., nodes with a common neighbor are more likely to form a link), and the idea that weak ties form bridges that span different parts of the network. Concepts such as these have been refined and formalized using graph theoretic measures; see [, ] for a formal discussion. Some of the key graph measures are: (1) (Weighted) Degree distribution: The degree of a node is the number of contacts it has, and the degree distribution is the frequency distribution of degrees. This measure is commonly used to characterize and distinguish various families of graphs. In particular, it has been found that many real networks have power law or lognormal degree distributions, instead of Poisson,
which has been used to develop models to explain the construction and evolution of these networks. (2) The Clustering coefficient, C(v), of a node v is the probability that two randomly picked neighbors of node v are connected, and models the notion of triadic closure and its extensions. (3) Centrality: Let f(s, t) denote the number of shortest paths from s to t, and let f_v(s, t) denote the number of shortest paths from s to t that pass through node v. The centrality of node v is defined as bc(v) = ∑_{s,t} f_v(s, t)/f(s, t), and captures its "importance". (4) Robustness to node and edge deletions, which is quantified in terms of the giant component size as a result of node and edge deletions, and has been found to be a useful measure to compare graphs. (5) Page Rank and Hubs have been used to identify "important" web pages for search queries, but have been extended to other applications. The Page Rank of a node is, informally, related to the stationary probability of a node in a random walk model of a web surfer. (6) A community or a cluster is a "loosely connected" set of nodes with similar attributes, which are different from nodes in other communities. There are many different methods for identifying communities, including modularity and spectral structure [, ]. In applications such as epidemiology and viral marketing, the network is usually associated with a diffusion process, e.g., the spread of disease or fads. There are many models of diffusion, and some of the most widely used classes of models are Stochastic cascade models [, ]. In these models, we are given a probability p(e) for each edge e ∈ E in a graph G = (V, E). The process starts at a node s initially. If node v becomes active at time t, it activates each neighbor w with probability p(v, w), independently (and also independent of the history); no node can be reactivated in this process. In the case of viral marketing, active nodes are those that adopt a certain product, while in the SIR model of epidemics, the process corresponds to the spread of a disease, and the active nodes are the ones that get infected. Another class of models that has been studied extensively involves Threshold functions; one of the earliest uses of threshold models was by Granovetter and Schelling [], who used it for modeling segregation. The Linear Threshold Model [] is at the core of many of these models. In this model, each node v has a threshold Θ_v (which may be chosen randomly), and each edge (v, w) has a weight b_{v,w}. A node v becomes active at time t if ∑_{w∈N(v)∩A_t} b_{v,w} ≥ Θ_v,
where N(v) denotes the set of neighbors of v and A_t denotes the set of active nodes at time t (in this model, an active node remains active throughout). Such diffusion processes are instances of more general models of dynamical systems, called Sequential Dynamical Systems (SDSs) [, ], which generalize other models, such as cellular automata, Hopfield networks, and communicating finite state machines. An SDS S is defined as a tuple (G, F, π), where: (a) G = (V, E) is the underlying graph on a set V of nodes; (b) F = (f_v) denotes a set of local functions for each node v ∈ V, on some fixed domain. Each node v computes its state by applying the local function f_v on the states of its neighbors. (c) π denotes a permutation (or a word) on V, and specifies the order in which the node states are to be updated by applying the local functions. One update of the SDS involves applying the local functions in the order specified by π. It is easy to see that the above diffusion models can be captured by a suitable choice of the local functions f_v and the order π. It is easy to extend the basic definition of SDS to accommodate local functions that are stochastic, edges that are dynamic, and a graph that is represented hierarchically (to capture organizational structure). Fundamental questions in social networks include (1) understanding the structure of the networks and the associated dynamics, especially how the dynamics is affected by the network properties; (2) techniques to control the dynamics; (3) identifying the most critical and vulnerable nodes crucial for the dynamics; and (4) coevolution between the network and dynamics – this issue brings in behavioral changes by individuals as a result of the diffusion process. The above mentioned problems require large-scale simulations, and parallel computing is a natural approach for designing them.
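To illustrate how such a diffusion model translates into computation, the sketch below performs one synchronous sweep of the Linear Threshold Model over a CSR-style adjacency structure like the one sketched in the SNAP entry; a synchronous update is only one possible choice of the order π. The arrays for thresholds, edge weights, and active flags, and all names, are illustrative assumptions rather than a reference implementation.

#include <stdlib.h>

/* Minimal CSR-style graph: neighbors of v are adj[xoff[v] .. xoff[v+1]-1]. */
typedef struct {
    long n, m;
    long *xoff, *adj;
} graph_t;

/* One synchronous sweep of the Linear Threshold Model.  An inactive
 * node v becomes active when the total weight of its already-active
 * neighbors reaches its threshold theta[v]; weight[e] holds b_{v,w}
 * for the adjacency entry at position e.  Returns the number of
 * newly activated nodes.                                            */
long threshold_step(const graph_t *G, const double *weight,
                    const double *theta, char *active)
{
    long newly = 0;
    char *next = malloc(G->n * sizeof(char));

    for (long v = 0; v < G->n; v++) {
        next[v] = active[v];
        if (active[v]) continue;               /* active nodes stay active  */
        double w_active = 0.0;
        for (long e = G->xoff[v]; e < G->xoff[v + 1]; e++)
            if (active[G->adj[e]])             /* neighbor is in A_t        */
                w_active += weight[e];
        if (w_active >= theta[v]) {            /* threshold Theta_v reached */
            next[v] = 1;
            newly++;
        }
    }
    for (long v = 0; v < G->n; v++) active[v] = next[v];
    free(next);
    return newly;
}

Repeatedly calling threshold_step until it returns zero yields the final set of active nodes for a given seed set and threshold assignment.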
Petascale Computing Challenges for Social Network Problems Lumsdaine et al. [] justifiably assert that traditional parallel computing techniques (e.g., those developed in the context of applications such as molecular dynamics and sequencing) are not well suited for large-scale social network problems for the following reasons: (1) Graph algorithms chiefly involve data-driven computations, in which the computations are guided
by the graph structure. Therefore, the structure and sequence of computations are not known in advance and are hard to predict, making parallelization difficult. (2) Social networks are typically very irregular and strongly connected, which makes partitioning into “independent” sub-problems difficult []. As observed by many researchers, these networks usually have a small-world structure with high clustering, and such graphs have low diameter and large separators (unlike regular graphs such as grids, which commonly arise in other applications). (3) Locality is one of the key properties that helps in parallelization; however, computations on social networks tend to exhibit low locality because of the irregular structure, leading to computations and data access patterns with global properties. (4) As noted in [, ], graph computations often involve exploration of the graph structure, which is highly memory intensive, and there is very little computation to hide the latency of memory accesses. Thus, the performance of the memory subsystem, rather than the processor clock frequency, dominates the performance of graph algorithms. Traditional parallel architectures have been developed for applications that do not have these constraints and, therefore, are not completely suited for social network algorithms. The most common paradigm of distributed computing, distributed memory architectures with the message-passing interface (MPI), leads to heavy message traffic because of the lack of locality and the high ratio of data access to computation. The shared memory model is much better suited for the kinds of data-driven computations that arise in graph algorithms, and multithreaded approaches have been found to be more effective in exploiting the parallelism that arises. Some of the main hardware and software challenges that arise are as follows: (1) In many graph problems, e.g., shortest paths, parallelism can be found at a fairly fine level of granularity, though in other problems, e.g., computing centrality, the parallelism is coarse grained, as each path computation is an independent task. (2) While multithreading is crucial for graph algorithms, the unstructured nature of these graphs implies that there are significant memory contention and cache-coherence problems. (3) It is difficult to achieve load balancing on graph computations, where the level of parallelism varies with time, e.g., as in breadth-first search.
Techniques The design of parallel algorithms for social network problems is an active area of research; we now discuss some of the key techniques that have been found to be effective for problems that arise in analyzing social networks. These techniques remain application specific, and developing general paradigms is still an open research area.
Massive Multithreading Techniques In [], Hendrickson and Berry discussed how massively multithreaded architectures, such as the Cray MTA- and its successor the XMT, can be used to boost the performance of parallel graph algorithms. Instead of trying to reduce the latency of a single memory access, the MTA- tolerates it by ensuring that the processor has other work to do while waiting for a memory request to be satisfied: it supports many concurrent threads in a single processor and switches between them in a single clock cycle. When a thread issues a memory request, the processor immediately switches to another ready-to-execute thread. The MTA- supports fast and dynamic thread creation and destruction, allowing the program to dynamically determine the number of threads based on the data to be processed. Support for thread virtualization allows adaptive parallelism and dynamic load balancing. The MTA- also supports word-level locking of data items, which decreases access contention and minimizes the impact on the execution of other threads. Drawbacks of massively multithreaded machines include a higher price and a much slower clock rate than mainstream systems, as well as the difficulty of porting code to other architectures. Hendrickson and Berry [], using these massively multithreaded architectures, extended a small subset of the Boost Graph Library (BGL) into the Multithreaded Graph Library (MTGL). This library, which is ongoing work, retains the BGL's look and feel, yet encapsulates the use of nonstandard features of massively multithreaded architectures. Madduri et al. [] also discussed architectural features of the massively multithreaded Cray XMT system and presented implementation details and optimizations of betweenness centrality computations. They further showed how the parallel algorithm for betweenness centrality can be modified so that locking is not necessary
to avoid concurrent access to a memory unit, an approach they call a lock-free parallel algorithm.
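To make the coarse-grained parallelism concrete, the following is a small sketch (not the implementation of Madduri et al.) of Brandes-style betweenness centrality in which each source vertex is an independent task and each task accumulates into a private dictionary, so partial results can be merged afterward without locking; the structure and names are illustrative assumptions.

```python
from collections import deque
from multiprocessing import Pool

def _single_source_contribution(args):
    graph, s = args
    # Brandes-style shortest-path counting and dependency accumulation for source s.
    sigma = {v: 0 for v in graph}     # number of shortest paths from s
    dist = {v: -1 for v in graph}
    preds = {v: [] for v in graph}
    sigma[s], dist[s] = 1, 0
    order, queue = [], deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if dist[w] < 0:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = {v: 0.0 for v in graph}
    partial = {v: 0.0 for v in graph}  # private accumulator: no locking needed
    for w in reversed(order):
        for v in preds[w]:
            delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w])
        if w != s:
            partial[w] += delta[w]
    return partial

def betweenness(graph, workers=4):
    # Coarse-grained parallelism: each source is an independent task;
    # per-task accumulators are merged afterward instead of sharing state.
    with Pool(workers) as pool:
        partials = pool.map(_single_source_contribution,
                            [(graph, s) for s in graph])
    bc = {v: 0.0 for v in graph}
    for partial in partials:
        for v, val in partial.items():
            bc[v] += val
    return bc

if __name__ == "__main__":
    g = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
    print(betweenness(g, workers=2))
```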
Distributed Streaming Algorithms Data streaming is a useful approach for dealing with memory constraints, in which processors keep a “sketch” of the data while making one or more passes through the entire data (or stream), which is assumed to be too big to store. Streaming is usually sequential, and Google's MapReduce [] and Apache's Hadoop [] provide a generic programming framework for distributed streaming. MapReduce/Hadoop provide a transparent implementation that takes care of issues like data distribution, synchronization, load balancing, and processor failures. This can thus greatly simplify computation over large-scale data sets, and MapReduce has proven to be a useful abstraction for solving many problems, especially those related to web applications. We briefly describe the MapReduce framework here. In the map step, the master node takes a task, partitions it into smaller independent subtasks, and passes them down to the worker nodes. The worker nodes may recursively do the same, or process the subtask and pass it back to the master node. The master node then collects the solutions to each subtask that it had assigned and processes these solutions to obtain the final solution for the original task. More formally, the input to the MapReduce system is represented as a set of (key, value) pairs, which can be defined in a completely general manner. There are two functions, Map and Reduce, which need to be specified by the user. A Map function processes a key/value pair (k1, v1) and produces a list of zero or more intermediate key/value pairs (k2, v2). Once the Map function completes processing of the input key/value pairs, the system groups the intermediate pairs by each key k2 and makes a list of all values associated with k2. These lists are then provided as input to the Reduce function. The Reduce function is then applied independently to each key/value-list pair to obtain a list of final values. The whole sequence can be repeated as needed. Programs written in this functional style are automatically parallelized and executed on a large cluster of machines, insulating users from all the run-time issues. The MapReduce framework is quite powerful, and can be used to efficiently compute any symmetric
order-invariant function [], a large subclass of the problems that can be solved by streaming algorithms. This class includes many stream statistics that arise in web applications. Kang et al. [] use this approach to estimate the diameter of a massive instance of the web graph with over a billion edges. The (key, value) pairs in their algorithm capture adjacencies of nodes and estimates of the number of nodes within a d-neighborhood of a node; the algorithm is run in multiple stages in order to keep the keys and associated values simple and small, and also uses efficient sketches for representing set unions. The uniqueness of a MapReduce framework lies in its ability to hide the underlying computing architecture from the end-user. Thus, the user can compute using a MapReduce framework on parallel clusters as well as on loosely coupled machines in a cloud or over volunteer computing resources composed of independent, heterogeneous machines.
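The map/group/reduce structure described above can be illustrated with a minimal, single-process sketch (this is not Hadoop or Google's MapReduce itself; the driver and function names are illustrative). Here the user-supplied functions compute node degrees from an edge list.

```python
from collections import defaultdict
from itertools import chain

# Map: one input record in, a list of intermediate (key, value) pairs out.
def map_edge(edge):
    u, v = edge
    return [(u, 1), (v, 1)]

# Reduce: a key and the list of all values grouped under that key.
def reduce_degree(node, counts):
    return (node, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = chain.from_iterable(map_fn(r) for r in records)  # map phase
    groups = defaultdict(list)                                      # shuffle: group by key
    for key, value in intermediate:
        groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]  # reduce phase

if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
    print(run_mapreduce(edges, map_edge, reduce_degree))
```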
Dynamical Processes on Social Networks So far, much of the discussion has focused on computational considerations related to social network structure. But many more interesting questions arise when one studies dynamical processes on these networks; in fact, it is fair to say that social networks exist to serve the function of one or more dynamical processes on them. While researchers have worked extensively on structural properties of social networks, the literature on the study of dynamical processes on social networks is relatively sparse. Where such work exists, it is usually in the context of very small networks or of stylized and regular networks. Unfortunately, many of these results do not scale or apply to large and realistic social networks. In [], the authors discuss new algorithms and their implementations for modeling certain classes of dynamical processes on large social networks. In general, the problem is computationally hard; it becomes even harder when the social network structure, the dynamical processes, and individual node behavior coevolve. For a certain class of dynamical processes that capture a number of interesting social, economic, and behavioral theories of collective behavior, it is possible to develop fast parallel algorithms. Intuitively, such dynamical processes can be expressed as SDSs with a certain symmetry condition imposed on the local transition functions.
Conclusions As our society is becoming more connected, there is an increasing need to develop innovative computational tools to synthesize, analyze, and reason about large and progressively realistic social networks. Applications of social networks for analyzing real-world problems will require the study of multi-network, multi-theory systems – systems composed of multiple networks among agents whose behavior is governed by multiple behavioral models. Advances in computing and information are also giving rise to new classes of social networks. Two examples of these networks are (1) botnet networks, in which individual nodes are software agents (bots), and (2) wireless-social networks, in which social networks between individuals are supported by wireless devices that allow them to communicate and interact anytime and anywhere. As society becomes more connected, these applications require support for real-time individualized decision making over progressively larger evolving social networks. This same technology will form the basis of new modeling and data processing environments. These environments will allow us to leverage the next generation of computing and communication resources, including cloud computing, massively parallel peta-scale machines, and pervasive wireless sensor networks. The dynamic models will generate new synthetic data sets that cannot be created in any other way (e.g., by direct measurement). This will enable social scientists to investigate entirely new research questions about functioning societal infrastructures and the individuals interacting with these systems. Together, these advances also offer scientists, policy makers, educators, planners, and emergency responders unprecedented opportunities for multi-perspective reasoning.
Acknowledgments This work has been partially supported by NSF Grant CNS-, SES-, CNS-, and OCI, and DTRA Grant HDTRA-- and HDTRA--C-.
Bibliography . Apache Hadoop. Code and documentation are available at http://developer.yahoo.com/hadoop/ . Barabasi A, Albert R () Emergence of scaling in random networks. Science :–
. Barrett C, Bisset K, Marathe A, Marathe M () An integrated modeling environment to study the co-evolution of networks, individual behavior and epidemics. AI Magazine, . Barrett C, Eubuank S, Marathe M () Modeling and simulation of large biological, information and socio-technical systems: an interaction based approach interactive computation. In: Goldin D, Smolka S, Wegner P () The new paradigm, Springer, Berlin, Heidelberg, New York . Baur M, Brandes U, Lerner J, Wagner D (), Group-level analysis and visualization of social networks. In: Lerner J, Wagner D, Zweig K (eds) Algorithmics of large and complex networks, vol , LNCS. Springer, Berlin, Heidelberg, New York, pp – . Dean J, Ghemawat S () Mapreduce: Simplified data processing on large clusters. In: Proceedings of the sixth symposium on operating system design and implementation (OSDI), San Francisco, December . Eubank S, Guclu H, Anil Kumar VS, Marathe MV, Srinivasan A, Toroczkai Z, Wang N () Modelling disease outbreaks in realistic urban social networks. Nature ():– . Feldman J, Muthukrishnan S, Sidiropoulos A, Stein C, Svitkina Z () On distributing symmetric streaming computations. In: Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms (SODA), San Francisco, pp – . Freeman L () The development of social network analysis: a study in the sociology of science. Empirical Press, Vancouver . Granovetter M () Threshold models of collective behavior. Am J Sociològy ():– . Grimmett G () Percolation. Springer, New York . Hendrickson B, Berry J () Graph analysis with highperformance computing. Comput Sci Eng :– . Kang U, Tsourakakis CE, Appel AP, Faloutsos C, Leskovec J () Hadi: fast diameter estimation and mining in massive graphs with hadoop. Technical Report CMU-ML--, Carnegie Mellon University . Kempe D, Kleinberg JM, Tardos E () Maximizing the spread of influence through a social network. In: SIGKDD ‘, Washington . Lumsdaine A, Gregor D, Hendrickson B, Berry J () Challenges in parallel graph processing. Parallel Process Lett :– . Madduri K, Ediger D, Jiang K, Bader DA, Chavarra-Miranda DG () A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. In: Proceedings of the rd Workshop on Multithreaded Architectures and Applications (MTAAP), Miami, May . Mortveit HS, Reidys CM () Discrete sequential dynamical systems. Discrete Math :– . National Research Council of the National Academies () Network Science. The National Academies Press, Washington, DC . Newman M () The structure and function of complex networks. SIAM Rev :– . Newman MEJ () Detecting community structure in networks. European Phy J B, :–
. Special issue on complex systems and networks, Science, July , ():– . Travers J, Milgram S () An experimental study of the small world problem. Sociometry ():–
Software Autotuning Autotuning
Software Distributed Shared Memory Sandhya Dwarkadas University of Rochester, Rochester, NY, USA
Synonyms Implementations of shared memory in software; Shared virtual memory; Virtual shared memory
Definition Software distributed shared memory (SDSM) refers to the implementation of shared memory in software on systems that do not provide hardware support for data coherence and consistency across nodes (and the memory therein).
Discussion Introduction Executing an application in parallel requires the coordination of computation and the communication of data to respect dependences among tasks. Historically, multiprocessor machines have provided either a shared memory or a message passing model for data communication in hardware (with models such as partitioned global address spaces that fall in between the two extremes). In the message passing model, communication and coordination across processes are achieved via explicit messages. In the shared memory model, communication and coordination are achieved by directly reading and writing memory that is accessible by multiple processes and mapped into their address space. Either model can be emulated in software if the hardware does not
directly implement it. This entry focuses on implementations of shared memory in software on machines that do not support shared memory in hardware across all nodes. In the shared memory model, data communication is implicitly performed when shared data is accessed. Programmers (or compilers) must still ensure, however, that shared data users synchronize (coordinate) with each other in order to respect program dependencies. The message passing model typically requires explicit communication management by the programmer or compiler – software must specify what data to communicate, with whom to communicate the data, and when the communication must be performed (typically on both the sender and the receiver). Coordination is often implicit in the data communication. While parallel programming has proven inherently more difficult than sequential programming, shared memory is considered conceptually easier as a parallel programming model than message passing. In contrast to message passing, the shared memory model hides the need for explicit data communication from the user, thereby avoiding the need to answer the “what, when, and whom” questions. In terms of implementation, typical shared memory-based hardware requires support for coherence – ensuring that any modifications made by a processor propagate to all copies of the data – and consistency – ensuring a well-defined ordering in terms of when modifications become visible to other processors. Such support and the need for access to the physically shared memory impact the scalability of the design. In contrast, message-based hardware is typically easier to build and to scale to large numbers of processors. Software distributed shared memory (SDSM) attempts to combine the conceptual appeal of shared memory with the scalability of distributed message-based communication by allowing shared memory programs to run on message-based machines. SDSM systems target the knee of the price-performance curve by using commodity components as the nodes in a network of machines. SDSM provides processes with the illusion of shared address spaces on machines that do not physically share memory. Kai Li's [, ] pioneering work in the mid-1980s was the first to consider
implementing SDSM on a network of workstations. Since then, there has been a tremendous amount of work exploring the state space of possible implementations and programming models, targeted at reducing the software overheads required and hiding the larger communication overheads of message-based hardware.
Implementation Issues In order to implement coherence, accesses to shared data must be detected and controlled. Implementing coherence in software can typically be accomplished either using virtual memory hardware [, , ], using instrumentation [], or using language-level hooks [, ]. The former two approaches assume that memory is a linear untyped array of bytes. The latter approach takes advantage of object- and type-specific information to optimize the protocol. The trade-offs are discussed below.
Implementations Using Virtual Memory The virtual memory mechanisms available on general-purpose processors may be used to implement coherence [, , ]. The virtual memory framework combines protection with address translation at the granularity of a page, which is typically on the order of a few KBytes on today's machines. Hardware protection modes can be exercised to ensure that read or write accesses to pages that are shared are trapped if protection is violated. Software handlers installed by the runtime can then determine the necessary actions required to maintain coherence. The advantage of the virtual memory approach is that there is no overhead for data that is private or already cached. The disadvantage is the large sharing granularity, which can result in data being falsely shared (i.e., different processors reading and writing independent locations that are colocated on the same page), and the large cost of handling a miss. Implementations Using Instrumentation An alternative approach is to instrument the program in order to monitor every load and store operation to shared data. The advantage of this approach is that coherence can be maintained at any desired granularity [], resulting in lower miss penalties. The disadvantage is that overhead is incurred regardless of whether
data is actually shared, resulting in higher hit latency. Optimizations that identify and instrument accesses to only shared data and that hide the instrumentation behind existing program computation [] help reduce this overhead. Alternatively, hardware support, such as ECC (error-correcting code) bits in memory [], can be leveraged to eliminate the software instrumentation by using the ECC bits to raise an exception when the data is not valid. However, complications do arise if these exceptions are not precise.
Language-Level Implementations Language-level programming systems (e.g., [, ]) provide the opportunity to use objects and types to improve the efficiency of recognizing accesses to shared data. Coherence is usually maintained at the granularity of an object, providing application developers with control over how data is managed. The DOSA system [] shows how a handle-based implementation (an implementation in which all object references are redirected through a handle) enables efficient fine- and coarse-grain data sharing. Handle-based implementations work transparently when used in conjunction with safe languages (ones in which no pointer arithmetic is allowed). The handles make it easy to relocate objects in memory, making it possible to use virtual memory techniques for access and modification detection and control. They also interact well with the garbage collection algorithms used in these type-safe languages, allowing good performance for these systems. The redirection allows the implementation to avoid false sharing for fine-grained access patterns.
Single- Versus Multiple-Writer Protocols Traditional hardware-based cache coherence usually maintains the coherence invariant of allowing only a single writer for each coherence granularity unit of data. This approach has the advantage of simplicity, works well under the intuitive assumption that two processes will not intentionally write to the same location concurrently without some form of coordination, and makes it easy to ensure write serialization. However, coherence granularities are often larger than a single word. Even with hardware coherence, it is possible for a cache line to bounce (“ping-pong”) between caches because processes modify logically disjoint regions of
the same cache line. This problem is further exacerbated by the larger coherence granularities of software coherence, particularly when using a virtual memory implementation. In order to combat the resulting performance penalties caused by such false sharing, multiple-writer protocols have been proposed []. In these protocols, the single-writer invariant of traditional hardware coherence no longer holds. At any given time, multiple processes may be writing to the same coherence unit. Write serialization is ensured by making sure that modifications (determined by keeping track of exactly what locations in the coherence unit were written) are propagated to all valid copies or that these copies are invalidated. Modified addresses can be determined as in the previous section, either by using instrumentation or by using virtual memory techniques. In the latter case, a twin, or copy, of the virtual memory page is made prior to modification; the twin is subsequently compared with the modified page in order to create a diff, an encoding of only the data that has changed on the page. Multiple-writer protocols work particularly well when combined with a more relaxed memory consistency model that allows coherence actions to be delayed to program synchronization points (elaborated on in the next section).
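A minimal sketch of the twin/diff idea follows, with pages modeled as byte arrays; this illustrates the mechanism only and is not the data layout of any particular SDSM system.

```python
def make_twin(page):
    # Before the first write to a page, keep a pristine copy (the "twin").
    return bytes(page)

def make_diff(twin, page):
    # Encode only the bytes that changed, as (offset, new bytes) runs.
    diff, i, n = [], 0, len(page)
    while i < n:
        if page[i] != twin[i]:
            start = i
            while i < n and page[i] != twin[i]:
                i += 1
            diff.append((start, bytes(page[start:i])))
        else:
            i += 1
    return diff

def apply_diff(page, diff):
    # Merge another writer's modifications into a local copy of the page.
    for offset, data in diff:
        page[offset:offset + len(data)] = data

# Example: two "processors" modify disjoint parts of the same 16-byte page.
twin = make_twin(bytearray(16))
page_a = bytearray(16); page_a[0:4] = b"AAAA"
page_b = bytearray(16); page_b[8:12] = b"BBBB"
merged = bytearray(16)
apply_diff(merged, make_diff(twin, page_a))
apply_diff(merged, make_diff(twin, page_b))
print(merged)   # both writers' updates survive despite false sharing
```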
Memory Models In order to help hide long communication latencies, SDSM systems typically use some form of relaxed consistency model, which enables aggregation of coherence traffic. In release consistency [], ordinary accesses to shared data are distinguished from synchronization. Synchronization is further split into acquires and releases. An acquire roughly corresponds to a request for access to data, such as a lock operation. A release roughly corresponds to making such data available, such as an unlock operation. SDSM systems leverage the release consistency model to delay coherence actions on writes until the point of a release []. This enables both aggregation of coherence messages and a reduction in the false sharing penalty. The former is beneficial because of the high per-message costs on typical network-based platforms. The latter matters because false sharing is a result of data “ping-ponging” between sharers due to the fact that logically distinct data resides in the same coherence
unit. By delaying coherence actions to the release point, the number of such “ping-pongs” is reduced. Release consistency mandates, however, that before the release is visible to any other process, all ordinary data accesses made prior to the release are visible at all other processes. This results in extra messages between processes at the time of a release. The TreadMarks system [] proposed the use of a lazy release consistency model [] by making the observation that for data race free [] programs, such ordinary data accesses need only be visible at the time of an acquire. Communication can thus be further aggregated and limited to between an acquirer and a releaser. While lazy release consistency reduces the number of coherence messages used to the minimum possible, it comes at a cost in terms of implementation complexity. Since data accesses are not made visible everywhere at the time of a release, the runtime system must keep track of the ordering of such coherence events and transmit this information when a process eventually performs an acquire. TreadMarks accomplishes this via the use of vector timestamps. The execution of each process is divided into intervals delineated by synchronization operations (either an acquire or a release). Each interval is assigned a monotonically increasing number. Intervals of different processes are partially ordered: Intervals on a single process are totally ordered by program order and intervals on different processes are ordered by acquire–release causality. This partial order is what is captured by assigning a vector timestamp to each interval. Write notices for all data written in an interval are associated with its vector timestamp. This information is piggybacked on release messages to ensure coherence. Each process must then ensure that write notices from all intervals that precede its current interval are applied prior to execution of the current interval. On system area networks such as Infiniband [], where the latency and CPU occupancy of communication is an order of magnitude lower than on traditional networks, it can sometimes be beneficial to overlap communication with computation. Cashmere [] leverages this observation in a moderately lazy release consistent implementation on the Memory Channel [] network. Write notices are pushed to all processes at the time of a release. They are only applied at the time of the next acquire, thereby avoiding the need to interrupt the remote process. Another
advantage of this implementation is that it removes the need to maintain vector timestamps along with the associated metadata management complexity. An alternative consistency model, entry consistency, was defined by Bershad and Zekauskas []. All shared data are required to be explicitly associated with some synchronization (lock) variable. On a lock acquisition, only the shared data associated with the lock is guaranteed to be made coherent. This model has some implementation and efficiency advantages. However, the cost is a change to the programming model, requiring explicit association of shared data with synchronization and additional synchronization beyond what is usually required for data race free programs. Scope consistency [] provides a model similar to entry consistency, but helps eliminate the burden of explicit binding by implicitly associating data with the locks under which they are modified.
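Returning to the interval and vector-timestamp bookkeeping used by lazy release consistency above, the following is a simplified sketch of the idea for illustration only; it is not TreadMarks' actual data structures, and all names are illustrative.

```python
def covered(vt_acquirer, interval):
    # An interval (process, index) is already known to the acquirer if its
    # vector timestamp has advanced at least that far for that process.
    proc, index = interval
    return vt_acquirer[proc] >= index

def notices_needed(vt_acquirer, write_notices):
    # At acquire time, the acquirer must obtain and apply write notices for
    # every interval its vector timestamp does not yet cover.
    return [(interval, pages) for interval, pages in write_notices
            if not covered(vt_acquirer, interval)]

def merge_timestamps(vt_a, vt_b):
    # After applying the notices, the acquirer's knowledge is the
    # component-wise maximum of the two vector timestamps.
    return [max(a, b) for a, b in zip(vt_a, vt_b)]

# Example with 3 processes; process 1 acquires a lock last released by process 0.
acquirer_vt = [2, 5, 1]
releaser_vt = [4, 3, 1]
releaser_notices = [((0, 3), ["page 7"]),
                    ((0, 4), ["page 2", "page 9"]),
                    ((2, 1), ["page 5"])]
print(notices_needed(acquirer_vt, releaser_notices))   # intervals (0, 3) and (0, 4)
print(merge_timestamps(acquirer_vt, releaser_vt))      # [4, 5, 1]
```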
Data and Metadata Location In order to get an up-to-date copy of data, the runtime system must determine the location of the latest version(s). Most implementations fall into two categories – ones in which the information is distributed across all nodes at the time of synchronization, and ones in which each coherence and metadata unit has a home that keeps track of where the latest version resides, similar to a hardware directory-based protocol. The former implies that every process knows where the latest version is or has the latest version of the data propagated directly to it. The latter implies a level of indirection (through the home) in order to locate the latest version of the data or the owner of the metadata. Variations of the latter protocol include making sure that the home node is kept up to date so that any process requesting a copy can retrieve it directly from the home, and allowing migration of the home to (one of) the most active user(s).
Leveraging Hardware Coherence As multiprocessor desktop machines become more common, and with the advent of multicore chips, a cluster of multicore multiprocessor machines is a fairly common computing platform. In implementing shared memory on these platforms, the challenge is to take advantage of hardware-based sharing whenever possible and ensure that software overhead is incurred only when actively sharing data across nodes in a cluster.
SoftFLASH [] is a kernel-level implementation of a two-level coherence protocol on a cluster of symmetric multiprocessors (SMPs). SoftFLASH implements coherence using virtual memory techniques, which implies that coherence is implemented by changing read/write permissions in the process's page table. These permissions are normally cached in the translation lookaside buffer (TLB) in each of the processors of a single node. Since the TLB is not typically hardware coherent, in order to ensure that all threads/processes on a single node have appropriate levels of permission to access shared data, the TLBs in each processor within a node must be examined and flushed if necessary. This process, called TLB shootdown, is usually accomplished with costly inter-processor interrupts. Cashmere-2L [] avoids the need for these expensive TLB shootdowns through a combination of a relaxed memory model and the use of a novel two-way diffing technique. Leveraging the observation that in a data race free model, accesses by different processes between two synchronization points will be to different memory locations, Cashmere-2L avoids the need to interrupt other processes during a software coherence operation. Updates to a page are applied through a reverse diffing process that identifies and updates only the bytes that have changed, allowing concurrent processes to continue accessing and modifying unrelated data. Invalidations are applied individually by each process and once again leverage the use of diffing in order to avoid potential conflicts with concurrent writers on the node. Shasta [] uses instrumentation to implement a finer-granularity coherence protocol across SMP nodes. The lack of atomicity of coherence state checks with respect to the actual load or store (the two actions consist of multiple instructions) results in protocol race conditions that require extra overhead when data is shared by processes within an SMP node. A naive solution would involve sufficient synchronization to avoid the race. Shasta avoids this costly synchronization through the selective use of explicit messages. Since protocol metadata is visible to all processes, per-process state tables are used to determine processes that are actively sharing data. Messages are sent only to these processes and overhead is incurred only with active sharing.
Leveraging Additional Hardware Support Systems such as Ivy, TreadMarks, and Munin are implemented under the assumption that memory on remote machines is not directly accessible and must be read/accessed by requesting service from a handler on the remote machine. This is typically accomplished by sending a message over the network, which can be detected and received at the responding end either by periodically polling or by using interrupts. On traditional local area networks, interrupts are typically expensive because of the need for operating system intervention. However, the advantage is that overhead is incurred only when communication is required. Polling (periodically checking the status of the network interface), on the other hand, is relatively cheap (on the order of a few tens of processor cycles), but because the runtime system has no knowledge of when communication will occur, the frequency of checks can result in a significant overhead that is incurred regardless of whether or not data is actively shared. TreadMarks attempted to minimize the number of messages using a lazy protocol on the assumption that the per message costs (both network and processor occupancy as well as the cost to interrupt a remote processor in order to elicit a response) are at least two orders of magnitude higher than memory read and write latencies. High-performance network technologies, such as those that conform to the Infiniband [] standard (as well as research prototypes such as Princeton’s Shrimp [] and earlier commercial offerings such as DEC’s Memory Channel, Myrinet, and Quadrics QsNet), provide low latency and high bandwidth communication. These networks achieve low latency by virtualizing the network interface. Issues of protection are separated from actual data communication. The former involves the operating system and is performed once at setup time. The latter is performed at user level, thereby achieving lower latency. The majority of these networks allow the possibility of direct access (reads and/or writes) to remote memory, changing the equation in terms of the trade-off in the number of messages used and the eagerness and laziness of the protocol. Protocols such as Cashmere and HLRC [] leverage such direct access to perform protocol actions eagerly without the need to interrupt the remote processor.
Sharing in the Wide Area As “computing in the cloud,” i.e., taking advantage of geographically distributed compute resources, becomes more ubiquitous, the ability to seamlessly share information across the wide area will improve programmability. Several projects [, , , , , ] have examined techniques to allow data sharing in the wide area. Most enforce a strongly object-oriented programming model. Specifically, InterWeave [] supports both language and machine heterogeneity, using an intermediate format to provide persistent data storage and to communicate seamlessly across heterogeneous machines. InterWeave leverages existing hardware and network characteristics (including support for coherence in hardware) whenever possible. The memory model is further relaxed to incorporate application-specific tolerances for delays in consistency. Writes are, however, serialized using a centralized approach (per object) for coordination.
Compiler and Language-Level Support Early work in incorporating compiler support with SDSMs [] examined techniques by which data communication could be aggregated or eliminated entirely by understanding the specific access patterns of the application. In particular, the shortcoming of a generalized runtime system for shared memory relative to a program written for message passing is that, in order to minimize the volume of data communicated, data communication is often separated from synchronization and data is fetched on demand at the granularity of the coherence unit. Compile-time analysis can identify data that will be accessed and inform the runtime system so that the data can be explicitly prefetched. Similarly, by using appropriate directives to the runtime to identify when entire coherence units are being written without being read, coherence communication can be eliminated entirely. Subsequently, several efforts have examined the use of SDSM as a back end for programming models such as OpenMP [, ]. They discuss the trade-offs between using a threaded model (with a default of sharing the entire address space) versus a process model (with a default of private address spaces, with shared data being specifically identified). Their work shows that naive implementations can perform poorly, and that identification of shared data is important to
the feasibility and scalability of using programming environments such as OpenMP on SMP clusters.
Future Directions Despite intense research and significant advances in the development of SDSM systems in the s and s, they remain in limited use due to their trade of scalability for the platform transparency they achieve. As multicore platforms become more ubiquitous, and as the number of cores increases, the scalability of pure hardware-based coherence is also in question, and SDSM systems may see a bigger role. Several researchers continue to explore the possibility of combining hardware and software coherence, in terms of hardware assists for a software-based coherence implementation [] and in terms of alternating between automatic hardware coherence and software-managed incoherence based on application or compiler knowledge []. There is also renewed interest in the use of SDSM techniques in order to support heterogeneous platforms composed of a combination of general-purpose CPUs and accelerators [, ].
Related Entries
Cache Coherence
Distributed-Memory Multiprocessor
Linda
Memory Models
Network of Workstations
POSIX Threads (Pthreads)
Processes, Tasks, and Threads
Shared-Memory Multiprocessors
SPMD Computational Model
Synchronization
Bibliographic Notes and Further Reading Protic et al. [] published a compendium of works in the area of distributed shared memory circa . Iftode and Singh [] wrote an excellent survey article that encompasses both network interface advances and the incorporation of multiprocessor nodes.
Acknowledgment This work was supported in part by NSF grants CCF-, CCF-, CNS-, and CNS. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the granting agencies.
Bibliography . Adve S, Hill M () Weak ordering: a new definition. In: Proceedings of the th annual international symposium on computer architecture, May . ACM, New York, pp – . Amza C, Cox A, Dwarkadas S, Keleher P, Lu H, Rajamony R, Zwaenepoel W () Tread-marks: shared memory computing on networks of workstations. IEEE Comput ():– . Bal H, Kaashoek M, Tanenbaum A () Orca: a language for parallel programming of distributed systems. IEEE Trans Softw Eng ():– . Becchi M, Cadambi S, Chakradhar S () Enabling legacy applications on heterogeneous platforms. Poster paper, nd USENIX workshop on hot topics in parallelism (HOTPAR), Berkley, June . Bershad B, Zekauskas M () Midway: shared memory parallel programming with entry consistency for distributed memory multiprocessors. Technical Report CMU-CS--, CarnegieMellon University, Sept . Bilas A, Jiang D, Singh JP () Accelerating shared virtual memory via general-purpose network interface support. ACM Trans Comput Syst :– . Blumrich M, Li K, Alpert R, Dubnicki C, Felten E, Sandberg J () Virtual memory mapped network interface for the SHRIMP multicomputer. In: Proceedings of the st annual international symposium on computer architecture. ACM, New York, pp – . Carter J, Bennett J, Zwaenepoel W () Implementation and performance of Munin. In: Proceedings of the th ACM symposium on operating systems principles, ACM Press, New York, pp – . Chen D, Tang C, Chen X, Dwarkadas S, Scott ML () Beyond S-DSM: shared state for distributed systems. Technical report , University of Rochester, Mar . Chen D, Tang C, Chen X, Dwarkadas S, Scott ML () Multilevel shared state for distributed systems. In: International conference on parallel processing, Aug , Vancouver . Dwarkadas S, Cox A, Zwaenepoel W () An integrated compile-time/run-time software distributed shared memory system. In: Proceedings of the th symposium on architectural support for programming languages and operating systems, pp –, Oct . Erlichson A, Nuckolls N, Chesson G, Hennessy J () SoftFLASH: analyzing the performance of clustered distributed virtual shared memory. In: Proceedings of the th symposium on architectural support for programming languages and operating systems, Oct . ACM Press, New York, pp –
. Fensch C, Cintra M () An OS-based alternative to full hardware coherence on tiled CMPs. In: Proceedings of the fourteenth international symposium on high-performance computer architecture symposium, February , Phoenix . Foster I, Kesselman C () Globus: a metacomputing infrastructure toolkit. Int J Supercomputer Appl ():– . Gharachorloo K, Lenoski D, Laudon J, Gibbons P, Gupta A, Hennessy J () Memory consistency and event ordering in scalable shared-memory multiprocessors. In: Proceedings of the th annual international symposium on computer architecture, May . ACM, New York, pp – . Gillett R () Memory channel: an optimized cluster interconnect. IEEE Micro ():– . Hu Y, Lu H, Cox AL, Zwaenepoel W () OpenMP for networks of SMPs. In: Proceedings of the th international parallel processing symposium (IPPS/SPDP), Apr . IEEE, New York, pp – . Hu YC, Yu W, Cox AL, Wallach D, Zwaenepoel W () Runtime support for distributed sharing in sage languages. ACM Trans Comput Syst ():– . Iftode L, Singh JP () Shared virtual memory: progress and challenges. Proceedings of the IEEE ():– . Iftode L, Singh JP, Li K () Scope consistency: a bridge between release consistency and entry consistency. In: ACM symposium on parallelism in algorithms and architectures, June . ACM Press, New York, pp – . Association () InfiniBand. http://www.infinibandta.org . Keleher P, Cox AL, Zwaenepoel W () Lazy release consistency for software distributed shared memory. In: Proceedings of the th annual international symposium on computer architecture, May . ACM Press, New York, pp – . Keleher P, Dwarkadas S, Cox A, Zwaenepoel W () Treadmarks: distributed shared memory on standard workstations and operating systems. In: Proceedings of the winter Usenix conference, Jan . USENIX Association, Berkeley, pp – . Kelm JH, Johnson DR, Tuohy W, Lumetta SS, Patel SJ () Cohesion: a hybrid memory model for accelerators. In: Proceedings of the international symposium on computer architecture (ISCA), June , St Malo . Kontothanassis L, Hunt G, Stets R, Hardavellas N, Cierniak M, Parthasarathy S, Meira W, Dwarkadas S, Scott M () VMbased shared memory on low-latency, remote-memory-access networks. In: th international symposium on computer architecture, June . ACM Press, New York, pp – . Lewis M, Grimshaw A () The core legion object model. In: Proceedings of the th high performance distributed computing conference, Aug , Syracuse . Li K () Shared virtual memory on loosely coupled multiprocessors. Ph.D. thesis, Yale University . Li K, Hudak P () Memory coherence in shared virtual memory systems. ACM Trans Comput Syst ():– . Min S-J, Basumallik A, Eigenmann R () Optimizing openmp programs on software distributed shared memory systems. Int J Parallel Prog :– . Protic J, Tomasevic M, Milutinovic V () Distributed shared memory: concepts and systems. IEEE Computer Society Press, Piscataway, p
. Rogerson D () Inside COM. Microsoft Press, Redmond . Saha B, Zhou X, Chen H, Gao Y, Yan S, Rajagopalan M, Fang J, Zhang P, Ronen R, Mendelson () A Programming Model for a Heterogeneous x Platform. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June . Scales D, Gharachorloo K, Aggarwal A () Fine-grain software distributed shared memory on smp clusters. In: Proceedings of the fourth international symposium on high-performance computer architecture symposium, Feb . ACM, New York, pp – . Scales D, Gharachorloo K, Thekkath C () Shasta: A low overhead, software-only approach for supporting fine-grain shared memory. In: Proceedings of the th symposium on architectural support for programming languages and operating systems, Oct , pp – . Schoinas I, Falsafi B, Lebeck AR, Reinhardt SK, Larus JR, Wood DA () Fine-grain access control for distributed shared memory. In: Proceedings of the th symposium on architectural support for programming languages and operating systems, Oct . ACM, New York, pp – . Stets R, Dwarkadas S, Hardavellas N, Hunt G, Kontothanassis L, Parthasarathy S, Scott M () Cashmere-L: software coherent shared memory on a clustered remote-write network. In: Proceedings of the th ACM symposium on operating systems principles, Oct . ACM, New York, pp – . van Steen M, Homburg P, Tanenbaum AS () Globe: a widearea distributed system. IEEE Concurr ():–
Sorting Laxmikant V. Kalé , Edgar Solomonik University of Illinois at Urbana-Champaign, Urbana, IL, USA University of California at Berkeley, Berkeley, CA, USA
Definition Parallel sorting is a process that, given n keys distributed over p processors (numbered 0 through p − 1), migrates the keys so that all keys on processor k, for k ∈ [0, p − 1], are sorted locally and are smaller than or equal to all keys on processor k + 1.
Discussion Introduction Parallel sorting algorithms have been studied in a variety of contexts. Early studies in parallel sorting addressed the theoretical problem of sorting n keys
distributed over n processors, using fixed interconnection networks. Modern parallel sorting algorithms focus on the scenario where n is much larger than the number of processors, p. However, even contemporary algorithms need to satisfy an array of use-cases and architectures, so various parallel sorting techniques have been analyzed for GPU-based sorting, shared memory sorting, distributed memory sorting, and external memory sorting. On most if not all of these architectures, parallel sorting is typically dominated by communication, in particular, the movement of data values associated with the keys. Parallel sorting has a wide breadth of practical applications. Some scientific applications in the field of high-performance computing (e.g., ChaNGa) perform sorting in each iteration, placing high demand on good scalability and adaptivity of parallel sorting algorithms. Parallel sorting is also utilized in the commercial field for processing of numeric as well as nonnumeric data (e.g., in parallel database queries). Moreover, Integer Sort is one of the NAS parallel benchmarks.
Parallel Sorting Algorithms There are a few important parallel sorting algorithms that dominate multiple use-cases and serve as building blocks for more specialized sorting algorithms. The algorithms will be described for the distributed memory paradigm, though most are also applicable to other architectures and models.
Parallel Quicksort Parallelization of Quicksort can be done in a variety of ways. A simplified, recursive version is described below. A more thorough analysis can be found in [] or [].
1. A processor broadcasts a pivot element to all p processors.
2. Each processor then splits its keys into two sections (smaller keys and larger keys) according to the pivot.
3. Two prefix sums calculate the total number of smaller keys and larger keys on the first k processors, for k ∈ [0, p − 1]. If small_k and large_k are, respectively, the total numbers of smaller and larger keys on processors 0 through k, then, after this operation, processor k knows small_{k−1} and large_{k−1} as well as small_k and large_k.
4. Processor p − 1 knows the total sums, small_{p−1} and large_{p−1}. It can therefore divide the set of processors
in proportion with small_{p−1} and large_{p−1}, thus determining two sets of processors (P_s and P_l) that should be given the smaller and larger keys. This processor should also broadcast the average number of keys these two sets of processors should receive (small_avg and large_avg).
5. Each processor k can now decide where its keys need to be sent. For example, the smaller keys should go to processor ⌊small_{k−1}/small_avg⌋ through processor ⌊small_k/small_avg⌋.
6. After the communication, it is sufficient for the Parallel Quicksort procedure to recurse on the two processor sets (P_s and P_l) until all data values are on the correct processors and can be sorted locally.
This algorithm generally achieves good load balance by determining the correct proportions of processors that should receive smaller and larger keys. However, the necessity of moving half the data for every iteration is costly on large distributed systems. In the average case, Parallel Quicksort necessitates Θ(n log p) data movement. Moreover, both the quality of load balance achieved for small processor sets and the total number of recursive levels are dependent on pivot selection. Efficient implementations of Parallel Quicksort typically use more complex pivoting techniques and multiple pivots []. Nevertheless, Parallel Quicksort is easy to implement and can achieve good performance on some shared memory and smaller distributed systems. Additionally, the number of messages sent by each processor in step 5 is constant – typically no more than four. Therefore, Parallel Quicksort has a relatively small message latency cost of Θ(p log p) messages.
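The splitting and prefix-sum logic of one round (steps 2 through 5) can be sketched as follows, simulated with a Python list of per-processor chunks; the rounding choices and names are illustrative assumptions rather than a prescribed implementation.

```python
from itertools import accumulate

def quicksort_step(chunks, pivot):
    """One round of the parallel quicksort described above, simulated with
    chunks[k] as the keys on processor k (assumes p >= 2 and some keys).
    Returns the chunk lists of the two new processor groups."""
    p = len(chunks)
    small = [[x for x in c if x <= pivot] for c in chunks]   # step 2: local split
    large = [[x for x in c if x > pivot] for c in chunks]
    small_pfx = list(accumulate(len(c) for c in small))      # step 3: prefix sums
    large_pfx = list(accumulate(len(c) for c in large))
    total_small, total_large = small_pfx[-1], large_pfx[-1]
    total = total_small + total_large
    p_small = min(max(1, round(p * total_small / total)), p - 1)  # step 4
    p_large = p - p_small
    small_avg = max(1, -(-total_small // p_small))           # ceiling division
    large_avg = max(1, -(-total_large // p_large))
    dest_small = [[] for _ in range(p_small)]
    dest_large = [[] for _ in range(p_large)]
    # step 5: each processor uses the prefix sums to pick destinations so
    # that the destination processors end up roughly balanced.
    for k in range(p):
        offset = small_pfx[k] - len(small[k])
        for i, key in enumerate(small[k]):
            dest_small[min((offset + i) // small_avg, p_small - 1)].append(key)
        offset = large_pfx[k] - len(large[k])
        for i, key in enumerate(large[k]):
            dest_large[min((offset + i) // large_avg, p_large - 1)].append(key)
    return dest_small, dest_large

chunks = [[5, 1, 9, 3], [7, 2, 8, 4], [6, 0, 11, 10]]
lo, hi = quicksort_step(chunks, pivot=5)
print(lo, hi)   # recurse on each group, then sort locally
```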
Bitonic Sort Introduced by Batcher [], Bitonic Sort is one of the oldest parallel sorting algorithms. This algorithm is based on the sorting of bitonic sequences. A bitonic sequence is a sequence S, or any cyclic shift of S, such that S = S_1 S_2, where S_1 is monotonically nondecreasing and S_2 is monotonically nonincreasing. Further, any unsorted sequence can be treated as a series of bitonic subsequences of length two. A bitonic merge turns a bitonic subsequence into a fully sorted subsequence. Bitonic Sort works by application of a series of bitonic merges until the entire sequence is sorted. Applying bitonic merges on a series of bitonic subsequences effectively doubles the length of each sorted subsequence and cuts the number of bitonic subsequences in half.
Therefore, for an unsorted sequence of length k, where k is a power of two, Bitonic Sort requires log k merges. The main insight of Bitonic Sort is in the bitonic merge operation. A bitonic merge recursively slices a bitonic sequence into two bitonic sequences, with all elements in one sequence larger than all elements in the other. When the bitonic sequence is sliced into pieces of unit length, the entire input is sorted. Given a bitonic sequence of length s, where s is a power of two, every slice operation compares and swaps each element k, where k ∈ [0, s/2), with element k + s/2. These swaps result in two bitonic sequences of length s/2, with the elements in one being larger than the elements of the other. Thus, if s is a power of two, a bitonic merge requires log s slices, which amounts to log s comparison operations on every element. In the case of n = p on a hypercube network, a bitonic merge requires Θ(log n) swaps, one swap in each hypercube dimension. The algorithm requires log n merges on such a network, for a composed running time of Θ(log² n). Adaptive Bitonic Sorting [] avoids the redundant comparisons in Bitonic Sort and achieves a runtime complexity of Θ(log n). Further, Bitonic Sort is also asymptotically optimal on mesh-connected networks []. As proposed by Blelloch [], Bitonic Sort can be extended to the case of n ≥ p by introducing virtual hypercube dimensions within each processor. Alternatively, a more efficient sequential sorting algorithm can be used for the sequential sorting work. Though historically significant and elegant in nature, Bitonic Sort is not widely used on modern supercomputers, since the algorithm requires data to be migrated through the network Θ(log² p) times or, in the case of Adaptive Bitonic Sorting, Θ(log p) times. Nevertheless, Bitonic Sort has been used in a wide variety of applications, ranging from network router hardware to sorting on GPUs [–]. A more thorough analysis and justification of correctness of this algorithm can be found in the Bitonic Sort article.
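The recursive structure of the bitonic merge is easy to see in a small sequential sketch (shown below for a power-of-two input); this illustrates the comparison pattern rather than a distributed implementation.

```python
def bitonic_sort(keys, ascending=True):
    """In-place Bitonic Sort of a list whose length is a power of two."""
    n = len(keys)
    assert n & (n - 1) == 0, "length must be a power of two"
    _sort(keys, 0, n, ascending)

def _sort(a, lo, n, ascending):
    if n > 1:
        half = n // 2
        _sort(a, lo, half, True)            # build an ascending half...
        _sort(a, lo + half, half, False)    # ...and a descending half: a bitonic sequence
        _merge(a, lo, n, ascending)

def _merge(a, lo, n, ascending):
    # One bitonic merge: slice the bitonic sequence of length n into two
    # bitonic sequences, then recurse until the slices have unit length.
    if n > 1:
        half = n // 2
        for k in range(lo, lo + half):
            if (a[k] > a[k + half]) == ascending:
                a[k], a[k + half] = a[k + half], a[k]
        _merge(a, lo, half, ascending)
        _merge(a, lo + half, half, ascending)

data = [10, 30, 11, 20, 4, 330, 21, 110]
bitonic_sort(data)
print(data)   # [4, 10, 11, 20, 21, 30, 110, 330]
```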
Parallel Radix Sort Radix Sort is a counting-based sorting algorithm that relies on the bitwise representation of keys. An r-bit radix looks at r bits of each key at a time and permutes the keys to move to one of 2^r buckets, where the r-bit value of each key corresponds to its destination bucket. If each key has b bits, by looking at the r least-significant bits first, Radix Sort can sort the entire
dataset in ⌈b/r⌉ permutations. Therefore, this algorithm has a complexity of Θ((b/r)n). Notably, this complexity is linear with respect to n, a feature that cannot be matched by comparison-based sorting algorithms, which must do at least Θ(n log n) comparison operations. However, Radix Sort is inherently cache-inefficient since each key may need to go to any one of the 2^r buckets in each iteration, independent of which bucket it resided in during the previous iteration []. A further limitation of Radix Sort is its reliance on the bitwise representation of keys, which is satisfied for integers and requires a simple transformation for floats, but is not necessarily possible or simple for strings and other data types. Parallel Radix Sort is implemented by assigning a subset of the buckets to each processor []. Thus, each of the ⌈b/r⌉ permutations results in a step of all-to-all communication. Load balancing is also achieved relatively easily by computing a histogram of the number of keys headed for each bucket at the beginning of each step. These histograms can be computed using local data on each processor first, then summed up using a reduction. Given a summed-up histogram, a single processor can then adaptively decide the set of buckets to assign to each processor and broadcast this information to all other processors. Each processor then sends all of its keys to the processor that owns the appropriate bucket for each key, using all-to-all communication. A good way to improve the efficiency of local histogram computation in Radix Sort is to use an auxiliary set of low-bit counters []. Given a large r, it is unlikely that an array of 2^r full-width counters can fit into cache. However, by using an auxiliary array of 2^r low-bit counters and incrementing the full-width counter array only when the low-bit counters are about to overflow, cache performance can be significantly improved. The simplicity of Radix Sort as well as its relatively good all-around performance and scalability have made it a popular parallel sorting algorithm in a variety of contexts. However, Radix Sort still suffers from cache-efficiency problems and requires multiple steps of all-to-all communication, which can be an extremely costly operation on a large enough system.
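A minimal sequential sketch of the counting pass and stable permutation for one r-bit digit is shown below; it is illustrative only, and a parallel version would compute the per-bucket histograms locally, combine them with a reduction, and exchange keys with all-to-all communication as described above.

```python
def radix_sort(keys, b=32, r=8):
    """LSD Radix Sort of non-negative integers with b-bit keys, r bits per pass."""
    mask = (1 << r) - 1
    for shift in range(0, b, r):              # ceil(b/r) passes
        buckets = [0] * (1 << r)              # histogram over the 2^r digit values
        for k in keys:
            buckets[(k >> shift) & mask] += 1
        starts = [0] * (1 << r)               # prefix sums give each bucket's offset
        for d in range(1, 1 << r):
            starts[d] = starts[d - 1] + buckets[d - 1]
        out = [0] * len(keys)
        for k in keys:                        # stable permutation into buckets
            d = (k >> shift) & mask
            out[starts[d]] = k
            starts[d] += 1
        keys = out
    return keys

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```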
S
determining a range bounded by two splitters for each processor. These splitters are simply values meant to subdivide the entire key range into p approximately equal chunks, so that each processor can be assigned a roughly even-sized chunk. After the splitters have been determined and broadcasted to all processors, a single all-to-all communication step suffices in giving each processor the correct data. Parallel Sorting by Regular Sampling is a Sample Sorting algorithm introduced by Shi and Schaeffer []. This algorithm determines the correct splitters by collecting a sample of data from each processor. Sorting by Regular Sampling uses a regular sample of size p − and generally operates as follows: . Sort local data on each processor. . Collect sample of size p − on each processor with the kth element of each sample as element np × k+ of p the local data. . Merge the p samples to form a combined sorted sample of size p × (p − ). . Define p − splitters with the kth splitter as element p × (k + ) of the sorted sample. . Broadcast splitters to all processors. . On each processor, subdivide the local keys into p chunks (numbered through p − ) according to the splitters. Send each chunk to equivalently numbered processor. . Merge the incoming data on each processor. Collecting a regular sample of size p − from each processor has been proven to guarantee no more than n p elements on any processor [] and shown to achieve almost perfect load balance for most practical distributions. The algorithm has also been shown to be asymptotically optimal as long as n ≥ p . Regular Sample Sort is easy to implement, insensitive to key distribution, and optimal in terms of the data movement required (there is a single all-to-all step, so each key gets moved only once). The algorithm has been shown to perform very well for n ≫ p and has been the parallel sorting algorithm of choice for many modern applications. One important issue with the traditional Sorting by Regular Sampling technique is the requirement of a combined sample of size Θ(p ). For a small-enough p this is not a major cost; however, for high-performance computing applications running on thousands of processors, this cost begins to overwhelm
the running time of the sorting algorithm, and the n ≥ p³ assumption crumbles. Sorting by Random Sampling [] is a parallel sorting technique that can potentially alleviate some of the drawbacks of Parallel Sorting by Regular Sampling. Instead of selecting an evenly distributed sample of size p − 1 from each processor, random samples of size s are collected from the initial local datasets to form a combined sample of size s × p. The parameter s has to be chosen carefully, but sometimes sufficient load balance can be achieved for s < p. Additionally, Sorting by Random Sampling allows for better potential overlap between computation and communication since the sample can be collected before the local sorting is done. Another interesting variation of Sample Sort was introduced by Helman et al. []. Instead of collecting a sample, this sorting procedure first permutes the elements randomly with a randomized data transpose, then simply selects the splitters on one processor. To execute the transpose, each processor randomly assigns each of its local keys to one of p buckets, then sends the jth bucket to the jth processor. With high probability, this permutation guarantees that any processor's elements will be representative of the entire key set. Thus, this algorithm avoids the cost of collecting and analyzing a large sample on a single processor. One processor still needs to select and broadcast splitters, but this is a relatively cheap operation. The main disadvantage of this technique is the extra all-to-all communication round, which is very expensive on a large system. Additionally, as p grows toward n/p, the load balance achieved by the algorithm deteriorates.
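As a concrete illustration of the regular-sampling steps listed earlier, the following minimal Python sketch (indexing conventions are an assumption; this is not a parallel implementation) derives the p − 1 splitters from the per-processor samples:

def regular_sample(local_sorted, p):
    """Step 2: pick p - 1 evenly spaced elements from locally sorted data."""
    n_local = len(local_sorted)
    return [local_sorted[(n_local * (k + 1)) // p] for k in range(p - 1)]

def choose_splitters(samples, p):
    """Steps 3-4: merge the samples and pick every pth element as a splitter."""
    combined = sorted(x for s in samples for x in s)     # size p * (p - 1)
    return [combined[p * (k + 1) - 1] for k in range(p - 1)]

p = 3
local_data = [sorted([5, 1, 9, 3, 7, 2]),
              sorted([8, 4, 6, 0, 11, 10]),
              sorted([13, 12, 15, 14, 17, 16])]
samples = [regular_sample(d, p) for d in local_data]
print(choose_splitters(samples, p))      # p - 1 = 2 splitters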
Histogram Sort
Histogram Sort [] is another splitter-based method for parallel sorting. Like Sample Sort, Histogram Sort determines a set of p − 1 splitters to divide the keys into p evenly sized sections. However, it achieves this task by taking an iterative guessing approach rather than simply collecting one big sample. Each set of splitter-guesses, called the probe, is matched against the data and then adjusted, until satisfactory values for all splitters have been determined. A satisfactory value for the kth splitter needs to divide the data so that approximately ((k + 1)/p) × n keys are smaller than the splitter value. Typically, a threshold range is established for each splitter so that the splitter-guesses can converge more quickly. The
kth splitter must divide the data within the range of keys (nk/p − nT/p, nk/p + nT/p), where T is the given threshold. A basic implementation of Histogram Sort operates as follows:
1. Sort the local data on each processor.
2. Define a probe of p − 1 splitter-guesses distributed evenly over the key data range.
3. Broadcast the probe to all processors.
4. Produce local histograms by determining how much of each processor's local data falls between each pair of consecutive splitter-guesses.
5. Sum up the histograms from each processor using a reduction to form a complete histogram.
6. Analyze the complete histogram on a single processor, determining any splitter values satisfied by a splitter-guess, and bounding any unsatisfied splitter values by the closest splitter-guesses.
7. If any splitters have not been satisfied, produce a new probe and go back to step 3.
8. Broadcast the splitters to all processors.
9. On each processor, subdivide the local keys into p chunks (numbered 0 through p − 1) according to the splitters. Send each chunk to the equivalently numbered processor.
10. Merge the incoming data on each processor.
This iterative technique can refine the splitter-guesses to an arbitrarily narrow threshold range. Quick convergence can be guaranteed by defining each new probe to contain a guess in the middle of the bounded range for each unsatisfied splitter. Like any splitter-based sort, Histogram Sort is optimal in terms of the data movement required. However, unlike all previously described sorting algorithms, the running time of Histogram Sort depends on the distribution of the data through the data range. The iterative approach employed by this sorting procedure guarantees a desired level of load balance, which Sample Sort and Radix Sort cannot. Additionally, the probing technique is flexible and does not require that local data be sorted immediately []. This advantage allows an excellent opportunity for the exploitation of communication and computation overlap. However, Histogram Sort is generally more difficult to implement than common alternatives such as Radix Sort or Sample Sort.
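The probing loop sketched below (a minimal, sequential Python illustration with hypothetical parameter choices; the real algorithm distributes the histogram computation and uses a reduction) evaluates a probe against sorted local chunks, checks which splitter-guesses meet the threshold, and bisects the bounds of the rest:

import bisect

def global_histogram(chunks, probe):
    """counts[i] = number of keys <= probe[i], summed over all 'processors'."""
    return [sum(bisect.bisect_right(c, g) for c in chunks) for g in probe]

def histogram_splitters(chunks, p, tol, lo, hi, max_iters=50):
    n = sum(len(c) for c in chunks)
    targets = [n * (k + 1) // p for k in range(p - 1)]      # ideal global ranks
    probe = [lo + (hi - lo) * (k + 1) / p for k in range(p - 1)]
    bounds = [(lo, hi)] * (p - 1)
    for _ in range(max_iters):
        counts = global_histogram(chunks, probe)
        if all(abs(c - t) <= tol for c, t in zip(counts, targets)):
            return probe                                    # all splitters satisfied
        new_probe, new_bounds = [], []
        for g, c, t, (blo, bhi) in zip(probe, counts, targets, bounds):
            if abs(c - t) <= tol:
                new_probe.append(g); new_bounds.append((g, g))
            else:                                           # bisect the bounded range
                blo, bhi = (g, bhi) if c < t else (blo, g)
                new_probe.append((blo + bhi) / 2); new_bounds.append((blo, bhi))
        probe, bounds = new_probe, new_bounds
    return probe

chunks = [sorted([1, 4, 9, 16, 25]), sorted([2, 3, 5, 7, 11]), sorted([6, 8, 10, 12, 14])]
print(histogram_splitters(chunks, p=3, tol=1, lo=0, hi=30))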
Architectures and Theoretical Models Parallelism is exhibited by a variety of computer architectures. Shared memory multiprocessors, distributed systems, supercomputers, sorting networks, and GPUs are all fundamentally different parallel computing constructions. As such, parallel sorting algorithms need to be designed and tuned for each of these architectures specifically.
Sorting Networks and Early Theoretical Models
Traditional parallel sorting targeted the problem of sorting n numbers on n processors using a fixed interconnection network. Bitonic Sort [] was an early success, as it provided a Θ(log² n)-depth sorting network. The bitonic sorting network also yielded a practical algorithm for sorting n keys in parallel using n processors in Θ(log² n) time on network topologies such as the hypercube and shuffle-exchange. Ajtai et al. [] introduced a Θ(log n)-depth sorting network capable of sorting n keys in Θ(log n) time using Θ(n log n) comparators. However, this construction was shown to lead to less-efficient networks than Bitonic Sort for reasonable values of n. Leighton [] introduced the Column Sort algorithm. He showed that, based on Column Sort, for any sorting network with Θ(n log n) comparators and Θ(log n) depth, one can construct a corresponding constant-degree network of n processors that can sort in Θ(log n) time. Much work has targeted the complexity of parallel sorting on the less-restrictive PRAM model, where each processor can access the memory of all other processors in constant time. Cole [] introduced an efficient and practical parallel sorting algorithm with versions for the CREW (concurrent read, exclusive write) and EREW (exclusive read, exclusive write) PRAM models. This algorithm was based on a simple tree-based Mergesort, but was elegantly pipelined to achieve a Θ(log n) complexity for sorting n keys using n processors. Cole's merge sort was built on a Θ(log n)-time merging algorithm, which by itself naturally leads to a Θ(log² n) sorting algorithm. However, his sorting algorithm used results from lower levels of the merge tree to partially precompute the merging done in higher levels of the merge tree. Thus, the sorting algorithm was designed so that at every node of the merge tree only a constant amount of work needed to be done, yielding a Θ(log n) overall sorting complexity.
Sorting networks have also been extensively studied for the VLSI model of computation. The VLSI model focuses on area-time complexity, that is, the area of the chip on which the network is constructed and the running time. A good analysis of lower bounds for the complexity of such VLSI sorters can be found in []. These theoretical methods have been studied intensively in the literature and have yielded many elegant parallel sorting algorithms. However, the theoretical machine models they were designed for are no longer representative or directly useful for current parallel computer architectures. Nevertheless, these studies provide valuable groundwork for modern and future parallel sorting algorithms. Moreover, sorting networks may prove to be useful for emerging architectures such as GPUs and chip multicores.
GPU-Based Sorting
Early GPU-based sorting algorithms utilized a limited graphics API which, among other restrictions, did not allow scatter operations and made Bitonic Sort the dominant choice. GPUTeraSort [] is an early efficient hybrid algorithm that uses Radix Sort and Bitonic Sort. GPUTeraSort was designed for GPU-based external sorting, but is also applicable to in-memory GPU-based sorting. A weakness of the GPUTeraSort algorithm is its Θ(n log² n) running time, typical of parallel sorting algorithms based on Bitonic Sort. GPU-ABiSort [] improved over GPUTeraSort by using Adaptive Bitonic Sorting [], lowering the theoretical complexity to Θ(n log n) and often demonstrating a lower practical running time. Newer GPUs, assisted by the CUDA software environment, allow for efficient scan primitives and a much broader set of parallel sorting algorithms. Efficient versions of Radix Sort and a parallel Mergesort are presented by Satish et al. []. Newer GPU-based sorting algorithms also exploit instruction-level parallelism by performing steps such as merging using custom vector operations. An array of various other GPU-based sorting algorithms, which are not detailed here, can be found in the literature. Current results on modern GPUs suggest that Radix Sort typically performs best, particularly when the key size is small []. Radix Sort is well suited to GPU execution since keys can be processed independently and synchronization is almost purely in the form of prefix sums,
which can be executed with high efficiency on GPUs. Radix Sort also requires few or no low-level branches, unlike comparison-based sorting algorithms. Finally, the cache inefficiency of Radix Sort has been less costly on GPUs since most GPUs have no cache. However, newer GPUs, such as more recent NVIDIA architectures, already have small caches and are rapidly evolving. It is hard to predict whether Radix Sort or a different sorting algorithm will prove most efficient on emerging GPU and accelerator architectures.
Shared Memory Sorting
Parallel merging techniques have commonly been used to produce simple shared memory parallel sorting algorithms. Francis and Mathieson [] present a k-processor merge that allows all processors to participate in a parallel merge tree, with two processors participating in each merge during the first merging stage and all processors participating in the final merge at the head of the tree. Good performance is achieved by subdividing each of the two arrays of size a and b being merged into k sections, where k is the number of processors participating in the merge. The subdivisions are selected so that the ith processor merges the ith section of each of the two arrays and produces elements (i/k) × (a + b) through ((i + 1)/k) × (a + b) of the merged array. Merge-based algorithms, such as the one detailed above, as well as parallel versions of Quicksort and Mergesort, are predominant on contemporary shared memory multiprocessor architectures. However, with the advent of increased parallelism in chip architectures, techniques such as sampling and histogramming may become more viable.
Distributed Memory Sorting
Parallel sorting algorithms for distributed memory architectures are typically used in the high-performance computing field, and often require good scaling on modern supercomputers. In earlier decades, Radix Sort and Bitonic Sort were widely used. However, as architectures evolved, these sorting techniques proved insufficient. Modern machines have thousands of cores, so, to achieve good scaling, interprocessor communication needs to be minimized. Therefore, splitter-based algorithms such as Sample Sort and Histogram Sort are now more commonly used for distributed memory sorting.
Splitter-based parallel sorting algorithms have minimal communication since they only move data once.
Future Directions Despite the extraordinarily large amount of literature on parallel sorting, the demand for optimized parallel sorting algorithms continues to motivate novel algorithms and more research on the topic. Moreover, the continuously changing and, more recently, diverging nature of parallel architectures has made parallel sorting an evolving problem. High-performance computer architectures are rapidly growing in size, creating a premium on parallel sorting algorithms that have minimal communication and good overlap between computation and communication. Since parallel sorting necessitates communication between all processors, the creation and optimization of topology-aware all-to-all personalized communication strategies is extremely valuable for many splitter-based parallel sorting algorithms. As previously mentioned, algorithms similar to Sample Sort and Histogram Sort are the most viable candidates for this field due to the minimal nature of the communication they need to perform. Shared memory sorting algorithms are beginning to face a challenge that most modern sequential algorithms have to endure. The demand for good cache efficiency is now a key constraint for all algorithms due to the increasing relative cost of memory accesses. Parallel sorting is being studied under the cache-oblivious model [] in an attempt to reevaluate previously established algorithms and introduce better ones. Currently, parallel sorting using accelerators, such as GPUs, is probably the most active of all parallel sorting research areas due to the rapid changes and advances happening in the accelerator architecture field. Little can be said about which algorithms will dominate this field in the future, due to the influx of novel GPU-based sorting algorithms and newer accelerators in recent years. The wide use of sorting in computer science along with the popularization of parallel architectures and parallel programming necessitates the implementation of parallel sorting libraries. Such libraries can be difficult to standardize, however, especially
for the efficiency-sensitive field of high-performance computing. Nevertheless, these libraries are quickly emerging, especially under the shared memory computing paradigm.
Related Entries
Algorithm Engineering
All-to-All
Bitonic Sort
Bitonic Sorting, Adaptive
Collective Communication
Data Mining
NAS Parallel Benchmarks
PRAM (Parallel Random Access Machines)
Bibliographic Notes and Further Reading The literature on parallel sorting is very large and this entry is forced to cite only a select few. The sources cited are a mixture of the largest impact publications and the most modern publications. A few of the sources also give useful information on multiple parallel sorting algorithms. Blelloch et al. [] provide an in-depth experimental and theoretical comparative analysis of a few of the most important distributed memory parallel sorting algorithms. Satish et al. [] provide good analysis of a few of the most modern GPU-sorting algorithms. Vitter [] presents a good analysis of external memory parallel sorting.
Bibliography . Kumar V, Grama A, Gupta A, Karypis G () Introduction to parallel computing: design and analysis of algorithms. BenjaminCummings, Redwood City, CA . Sanders P, Hansch T () Efficient massively parallel quicksort. In: IRREGULAR ’: Proceedings of the th International Symposium on Solving Irregularly Structured Problems in Parallel, pp –. Springer-Verlag, London, UK . Batcher K () Sorting networks and their application. Proc. SICC, AFIPS :– . Bilardi G, Nicolau A () Adaptive bitonic sorting: an optimal parallel algorithm for shared memory machines. Technical report, Ithaca, NY . Thompson CD, Kung HT () Sorting on a mesh-connected parallel computer. Commun ACM ():–
. Blelloch G et al. () A comparison of sorting algorithms for the Connection Machine CM-. In: Proceedings of the Symposium on Parallel Algorithms and Architectures, July . Govindaraju N, Gray J, Kumar R, Manocha D () Gputerasort: high performance graphics co-processor sorting for large database management. In: SIGMOD ’: Proceedings of the ACM SIGMOD international conference on Management of data, pp –. ACM, New York . Greb A, Zachmann G () Gpu-abisort: optimal parallel sorting on stream architectures. In: Parallel and Distributed Processing Symposium, . IPDPS , th International, pp –, April . Satish N, Harris M, Garland M () Designing efficient sorting algorithms for manycore gpus. Parallel and Distributed Processing Symposium, International, Rome, pp – . LaMarca A, Ladner RE () The influence of caches on the performance of sorting. In: SODA ’: Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms, pp –. Society for Industrial and Applied Mathematics, Philadelphia, PA . Thearling K, Smith S () An improved supercomputer sorting benchmark. In: Proceedings of the Supercomputing, November . . Huang JS, Chow YC () Parallel sorting and data partitioning by sampling. In: Proceedings of the Seventh International Computer Software and Applications Conference, November . Shi H, Schaeffer J () Parallel sorting by regular sampling. J Parallel Distrib Comput :– . Li X, Lu P, Schaeffer J, Shillington J, Wong PS, Shi H () On the versatility of parallel sorting by regular sampling. Parallel Comput ():– . Helman DR, Bader DA, JáJá J () A randomized parallel sorting algorithm with an experimental study. J Parallel Distrib Comput ():– . Kale LV, Krishnan S () A comparison based parallel sorting algorithm. In: Proceedings of the nd International Conference on Parallel Processing, pp –, St. Charles, IL, August . Solomonik E, Kale LV () Highly scalable parallel sorting. In: Proceedings of the th IEEE International Parallel and Distributed Processing Symposium (IPDPS). Urbana, IL, April . Ajtai M, Komlós J, Szemerédi E () Sorting in c log n parallel steps. Combinatorica ():– . Leighton T () Tight bounds on the complexity of parallel sorting. In: STOC ’: Proceedings of the sixteenth annual ACM symposium on Theory of computing, pp –. ACM, New York . Cole R () Parallel merge sort. In: SFCS ’: Proceedings of the th Annual Symposium on Foundations of Computer Science, pp –. IEEE Computer Society, Washington, DC . Bilardi G, Preparata FP () Area-time lower-bound techniques with applications to sorting. Algorithmica ():– . Francis RS, Mathieson ID () A benchmark parallel sort for shared memory multiprocessors. Comput IEEE Trans ():–
. Blelloch GE, Gibbons PB, Simhadri HV () Brief announcement: low depth cache-oblivious sorting. In: SPAA ’: Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, pp –. ACM, New York . Vitter JS () External memory algorithms and data structures: dealing with massive data. ACM Comput Surv ():–
Space-Filling Curves Michael Bader , Hans-Joachim Bungartz , Miriam Mehl Universität Stuttgart, Stuttgart, Germany Technische Universität München, Garching, Germany
Synonyms
FASS (space-filling, self-avoiding, simple, and self-similar) curves
Definition
A space-filling curve is a continuous and surjective mapping from a 1D parameter interval, say [0, 1], onto a higher-dimensional domain, say the unit square in 2D or the unit cube in 3D. Although this, at first glance, seems to be of purely mathematical interest, space-filling curves and their recursive construction process have obtained a broad impact on scientific computing in general and on the parallelization of numerical algorithms for spatially discretized problems in particular.
Discussion
Introduction
Space-filling curves (SFC) were presented at the end of the nineteenth century – first by Peano (1890) and Hilbert (1891), and later by Moore, Lebesgue, Sierpinski, and others. The idea that some curves (i.e., something actually one-dimensional) may completely cover an area or a volume sounds somewhat strange and formerly caused mathematicians to call them "topological monsters." The construction of all SFC follows basically the same principle: start with a generator indicating an order of traversal through the similar first-level substructures of the initial domain (the unit square, unit cube, etc.), and produce the next iterates by successively subdividing the domain in the same way as well as placing and connecting shrunken, rotated, or reflected versions of the generator in the next-level subdomains. This has to happen in an appropriate way, ensuring the two properties neighborhood (neighboring subintervals are mapped to neighboring subdomains) and inclusion (subintervals of an interval are mapped to subdomains of the interval's image). If all is done properly, it can be proven that the limit of this recursive process in fact defines a curve that completely fills the target domain and, hence, results in an SFC. While the SFC itself, as the asymptotic result of a continuous limit process, is more a playground of mathematics, the iterative or recursive construction or, to be precise, the underlying mapping can be used for sequentializing higher-dimensional domains and data. Roughly speaking, these higher-dimensional data (elements of a finite element mesh, particles in a molecular dynamics simulation, stars in an astrophysics simulation, pixels in an image, voxels in a geometric model, or even entries in a data base, e.g.) now appear as pearls on a thread. Thus, via the locality properties of the SFC mapping, clusters of data can be easily identified (interacting finite elements, neighboring stars, similar data base entries, etc.). This helps for the efficient processing of tasks such as answering data base queries, defining hardware-aware traversal strategies through adaptively refined meshes or heterogeneous particle sets, and, of course, subdividing the data across cores or processors in the sense of (static or dynamic) load distribution in parallel computing. The main idea for the latter is that the linear (1D) arrangement of the data via the mapping basically reduces the load distribution problem to the sorting of indices.
Construction
Space-filling curves are typically constructed via an iterative or recursive process. The source interval and the target domain are recursively substructured into smaller intervals and domains. From each level of recursion to the next, a mapping between subintervals and subdomains is constructed, where the child intervals of a subinterval are typically mapped to the children of the image of the parent interval. The SFC is then defined as the image of the limit of these mappings. Figure illustrates this recursive construction process for the 2D Hilbert curve. From each level to the next, the subintervals are split into four congruent
Space-Filling Curves. Fig. The first three iterations of the Hilbert curve
subintervals. Likewise, the square subdomains are split into four subsquares. In the nth iteration, an interval [i ⋅ 4^−n, (i + 1) ⋅ 4^−n] is mapped to the ith subsquare, as indicated in the figure. The curves in Fig. connect the subsquares of the nth level according to their source intervals and are called iterations of the Hilbert curve. For the limit n → ∞, the iterations shall, in an intuitive sense, converge to the Hilbert curve. More formally: for any given parameter t ∈ [0, 1], there exists a sequence of nested intervals [i_n ⋅ 4^−n, (i_n + 1) ⋅ 4^−n] that all contain t. The corresponding subsquares converge to a point h(t) ∈ [0, 1]^2. The image of the mapping h defined by that construction is called the Hilbert curve. h shall be called the Hilbert mapping.
Computation of Mappings
Figure shows that the nth Hilbert iteration consists of the connection of four (n − 1)th Hilbert iterations, which are scaled, rotated, and translated appropriately. For n → ∞, this turns into a fix-point argument: the Hilbert curve consists of the connection of four suitably scaled, rotated, and translated Hilbert curves. The respective transformations shall be given by operations H_q, where q ∈ {0, 1, 2, 3} determines the relative position of the transformed Hilbert curve. This leads to the following recursive equation:

h(0.q1 q2 q3 q4 . . .) = H_q1 ∘ h(0.q2 q3 q4 . . .).   (1)

If the parameter t is given as a quaternary fraction, i.e., t = 0.q1 q2 q3 ⋯ = ∑_n q_n 4^−n, then the interval numbers i_n and the relative position of subintervals within their parent can be obtained from the quaternary digits q_n. For finite quaternary fractions, successive application of Eq. 1 leads to the following formula to compute h:

h(0.q1 q2 . . . qn) = H_q1 ∘ H_q2 ∘ . . . ∘ H_qn ∘ h(0).   (2)

For the Hilbert curve, the operators H_q are defined as:

H_0 : (x, y) ↦ (y/2, x/2)
H_1 : (x, y) ↦ (x/2, y/2 + 1/2)
H_2 : (x, y) ↦ (x/2 + 1/2, y/2 + 1/2)
H_3 : (x, y) ↦ (−y/2 + 1, −x/2 + 1/2)
Equations 1 and 2 may be easily turned into an algorithm to compute the image point h(t) from any given parameter t. Inverting this process leads to algorithms that find a parameter t that is mapped to a given point p = h(t). However, note that the Hilbert mapping h is not bijective; hence, an inverse mapping h^−1 does not exist. Still, it is possible to construct mappings h̄^−1 that return a uniquely defined parameter t = h̄^−1(p) with p = h(t). Note that for practical applications, only the discrete orders induced by h are of interest. These orders are usually bijective – for example, the relation between
subsquares and subintervals during the construction of the Hilbert mapping is a bijective one. Bijectivity is only lost with n → ∞.
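To make the use of Eq. 2 concrete, the following minimal Python sketch (not part of the original article; it hard-codes the operators H_0, ..., H_3 listed above) maps a finite quaternary fraction to its point on the unit square:

def hilbert_point(quaternary_digits):
    """Apply H_q1 o H_q2 o ... o H_qn to h(0) = (0, 0), as in Eq. 2."""
    H = [
        lambda x, y: (y / 2,         x / 2),           # H_0
        lambda x, y: (x / 2,         y / 2 + 1 / 2),   # H_1
        lambda x, y: (x / 2 + 1 / 2, y / 2 + 1 / 2),   # H_2
        lambda x, y: (-y / 2 + 1,    -x / 2 + 1 / 2),  # H_3
    ]
    x, y = 0.0, 0.0                         # h(0)
    for q in reversed(quaternary_digits):   # innermost operator is applied first
        x, y = H[q](x, y)
    return x, y

# Example: t = 0.123 (base 4) lies in the second first-level subsquare.
print(hilbert_point([1, 2, 3]))             # -> (0.5, 0.875)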
Examples of Space-Filling Curves
Figure illustrates the construction of different SFC. All of them are constructed in a similar way as the Hilbert curve and can be computed via an analogous approach. The Hilbert-Moore curve combines four regular, scaled-down Hilbert curves to a closed curve. The βΩ-curve is a Hilbert-like curve that uses nonuniform refinement patterns throughout the iterations. Similar to the Hilbert-Moore curve, it is a closed curve. Morton order and the Z-curve result from a bit-interleaving code for 2D grids. They lead to discontinuous mappings from the unit interval to the unit square. However, the Lebesgue curve uses the same construction, but maps the Cantor set to the unit square in order to obtain continuity.
The Sierpinski curve is a curve that is generated via recursive substructuring of triangles. The combination of two triangle-filling Sierpinski curves leads to a curve that fills the unit square. The H-index follows a construction compatible to that for the Sierpinski curve, but generates a discrete order on Cartesian grids. Infinite refinement of the H-index, then, leads to the Sierpinski curve. The Peano curve, finally, as well as its variant, the Peano-Meander curve, are square-filling curves that are based on a recursive 3 × 3-refinement of the unit square. Figure provides snapshots of the construction processes of 3D Hilbert and Peano curves.
Space-Filling Curves. Fig. Several examples for the construction of space-filling curves on the unit square
Space-Filling Curves. Fig. Second iterations of three-dimensional Hilbert and Peano curves
Locality Properties of Space-Filling Curves
The recursive construction of SFC leads to locality properties that can be exploited for efficient load distribution and load balancing. The Hilbert curve, for example, and curves that follow a similar construction, can be shown to be Hölder continuous, i.e., for two parameters t1 and t2, the distance of the images h(t1) and h(t2) is bounded by

∥h(t1) − h(t2)∥ ≤ C ∣t1 − t2∣^(1/d),   (3)

where d denotes the dimension. As ∣t1 − t2∣ is also equal to the area covered by the space-filling curve segment defined by the parameter interval [t1, t2] (i.e., h is parameterized by area or volume), Eq. 3 gives a relation between the area covered by a curve segment and the distance between the end points of the curve. It is thus a measure for the compactness of partitions defined by curve segments. For d-dimensional objects such as spheres or cubes, the volume typically grows with the dth power of the extent (diameter, e.g.) of the object. Hence, an exponent of 1/d is asymptotically optimal, and the constant C characterizes the compactness of an SFC.
High Performance Computing and Load Balancing with Space-Filling Curves
The ability of recursively substructuring and linearizing high-dimensional domains is mainly responsible for the revival of SFC – or their arrival in computational applications – starting in the later decades of the last century. As some milestones, the Z-order (Morton-order) curve has been proposed for file sequencing in geodetic data
bases in 1966 by G.M. Morton []; quadtrees (nothing else than 2D Lebesgue SFC) were introduced in image processing [, ]; octrees, their 3D counterparts, marked the crucial idea for breakthroughs in the complexity of particle simulations, the Barnes-Hut [] as well as the Fast Multipole algorithms []. Octrees were also successfully used in CAD [, ], as well as for organizational tasks such as grid generation [, ] or steering volume-oriented simulations [], and the portfolio of SFC of practical relevance is still widening. The locality properties of SFC yield a high locality of data access and, therewith, a high efficiency of cache usage for various kinds of applications. This will be a crucial aspect for future computing architectures, in particular multi-core architectures, where the memory bottleneck will become even more severe than it already is for today's high-performance computers. For example, the Peano curve is used for highly efficient block-structured matrix–matrix products [], and it can serve as a general paradigm for PDE frameworks, where all the steps of geometry representation, adaptive grid generation, grid traversal, data structure design, and parallelization follow an SFC-based strategy [, ]. While all the SFC discussed so far refer to structured Cartesian grids or subdomain structures, the Sierpinski curve is defined on a triangular master domain and has been recently used to benefit from SFC characteristics also in the context of managing, traversing, and distributing triangular finite element meshes [].
Space-Filling Curves. Fig. Partitioning of a triangulation according to the Sierpinski curve and a Cartesian grid according to the Peano curve
Hilbert or Hilbert-Peano curves were probably the first SFC to see a broad application for load distribution and load balancing [, ]. Meanwhile, SFC-based strategies have become a well-established tool, frequently outperforming alternatives such as graph partitioning. The classical SFC-based load-balancing approach is to queue up grid elements like pearls on a thread and to cut this queue into pieces with equal workload [, , , , , , ]. Fig. shows examples for the partitioning of a triangulation according to the Sierpinski curve and of a Cartesian grid according to the Peano curve. This reduces the load-balancing problem to a problem of sorting data according to their positions on a space-filling curve. The locality properties of SFC yield connected partitions with quasi-minimal surfaces, that is, quasi-minimal communication costs. However, the constants involved in the quasi-optimality statement can be rather large []. More recent approaches combine the space-filling curve ordering with a tree-based domain decomposition [, , ]. This approach has the advantage that the domain decomposition itself as well as dynamic rebalancing can easily be done fully in parallel, and that it fits in a natural way with highly efficient multilevel numerical methods such as multigrid solvers. Summarizing, it is obvious that although load balancing is the most attractive application of SFC in the context of parallel computing, the scope of SFC has become much wider, with their parallelization characteristics often being just one side effect.
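A minimal Python sketch of this classical approach is given below (names and the unit-weight example are illustrative assumptions, not a particular library's API): cells of a 2D grid are ordered by their Morton (Z-order) key and the resulting sequence is cut into p pieces of roughly equal workload.

def morton_key(ix, iy, bits=16):
    """Interleave the bits of the integer cell coordinates (ix, iy)."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b) | ((iy >> b) & 1) << (2 * b + 1)
    return key

def sfc_partition(cells, weights, p):
    """Cut the SFC-ordered cell sequence into p chunks of ~equal total weight."""
    order = sorted(range(len(cells)), key=lambda i: morton_key(*cells[i]))
    total = sum(weights)
    chunks, current, acc = [], [], 0.0
    for i in order:
        current.append(cells[i])
        acc += weights[i]
        # Cut whenever the running weight passes the next 1/p quantile.
        if len(chunks) < p - 1 and acc >= (len(chunks) + 1) * total / p:
            chunks.append(current)
            current = []
    chunks.append(current)
    return chunks

cells = [(x, y) for x in range(4) for y in range(4)]
print(sfc_partition(cells, [1.0] * len(cells), p=4))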
Related Entries
Domain Decomposition
Hierarchical Data Format
Bibliographic Notes and Further Reading
. Sagan H () Space-filling curves. Springer, New York
. Bader M Space-filling curves – an introduction with applications in scientific computing. Texts in Computational Science and Engineering, Springer, submitted
Bibliography . Bader M, Schraufstetter S, Vigh CA, Behrens J () Memory efficient adaptive mesh generation and implementation of multigrid algorithms using Sierpinski curves. Int J Comput Sci Eng ():– . Bader M, Zenger Ch () Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra Appl (–):– . Barnes J, Hut P () A hierarchical O(n log n) force-calculation algorithm. Nature :– . Brázdová V, Bowler DR () Automatic data distribution and load balancing with space-filling curves: implementation in conquest. J Phy Condens Matt . Brenk M, Bungartz H-J, Mehl M, Muntean IL, Neckel T, Weinzierl T () Numerical simulation of particle transport in a drift ratchet. SIAM J Sci Comput ():– . Griebel M, Zumbusch GW () Parallel multigrid in an adaptive PDE solver based on hashing and space-filling curves. Parallel Comput :–
. Günther F, Krahnke A, Langlotz M, Mehl M, Pögl M, Zenger Ch () On the parallelization of a cache-optimal iterative solver for PDES based on hierarchical data structures and space-filling curves. In: Recent Advances in Parallel Virtual Machine and Message Passing Interface: th European PVM/MPI Users Group Meeting Budapest, Hungary, September –, . Proceedings, vol of Lecture Notes in Computer Science. Springer, Heidelberg . Günther F, Mehl M, Pögl M, Zenger C () A cache-aware algorithm for PDEs on hierarchical data structures based on space-filling curves. SIAM J Sci Comput ():– . Hunter GM, Steiglitz K () Operations on images using quad trees. IEEE Trans Pattern Analy Machine Intell PAMI-(): – . Jackings C, Tanimoto SL () Octrees and their use in representing three-dimensional objects. Comp Graph Image Process ():– . Rokhlin V, Greengard L () A fast algorithms for particle simulations. J Comput Phys :– . Meagher D () Octree encoding: A new technique for the representation, manipulation and display of arbitrary d objects by computer. Technical Report, IPL-TR-- . Mehl M, Weinzierl T, Zenger C () A cache-oblivious selfadaptive full multigrid method. Numer Linear Algebr (– ):– . Mitchell WF () A refinement-tree based partitioning method for dynamic load balancing with adaptively refined grids. J Parallel Distrib Comput ():– . Morton GM () A computer oriented geodetic data base and a new technique in file sequencing. Technical Report, IBM Ltd., Ottawa, Ontario . Mundani R-P, Bungartz H-J, Niggl A, Rank E () Embedding, organisation, and control of simulation processes in an octreebased cscw framework. In: Proceedings of the th International Conference on Computing in Civil and Building Engineering, Montreal, pp – . Patra A, Oden JT () Problem decomposition for adaptive hp finite element methods. Comput Syst Eng ():– . Roberts S, Klyanasundaram S, Cardew-Hall M, Clarke W () A key based parallel adaptive refinement technique for finite element methods. In: Proceedings of the Computational Techniques and Applications: CTAC ’, Singapore, pp – . Samet H () Region representation: quadtrees from binary arrays. Comput Graph Image Process ():– . Saxena M, Finnigan PM, Graichen CM, Hathaway AF, Parthasarathy VN () Octree-based automatic mesh generation for non-manifold domains. Eng Comput (): – . Schroeder WJ, Shephard MS () A combined octree/delaunay method for fully automatic -d mesh generation. Int J Numer Methods Eng ():– . Sundar H, Sampath RS, Biros G () Bottom-up construction and : balance refinement of linear octrees in parallel. SIAM J Sci Comput ():–
. Weinzierl T () A framework for parallel PDE solvers on multiscale adaptive Cartesian grids. Verlag Dr. Hut . Zumbusch GW () On the quality of space-filling curve induced partitions. Z Angew Math Mech :–
SPAI (SParse Approximate Inverse) Thomas Huckle, Matous Sedlacek Technische Universität München, Garching, Germany
Synonyms Sparse approximate inverse matrix
Definition
For a given sparse matrix A a sparse matrix M ≈ A− is computed by minimizing ∥AM − I∥F in the Frobenius norm over all matrices with a certain sparsity pattern. In the SPAI algorithm the pattern of M is updated dynamically to improve the approximation until a certain stopping criterion is reached.
Discussion Introduction For applying an iterative solution method like the conjugate gradient method (CG), GMRES, BiCGStab, QMR, or similar algorithms, to a system of linear equations Ax = b with sparse matrix A, it is often crucial to include an efficient preconditioner. Here, the original problem Ax = b is replaced by the preconditioned system MAx = Mb or Ax = A(My) = b. In a parallel environment a preconditioner should satisfy the following conditions: ● ●
M can be computed efficiently in parallel. Mc can be computed efficiently in parallel for any given vector c. ● The iterative solver applied on AMx = b or MAx = Mb converges much faster than for Ax = b (e.g., it holds cond(MA) ≪ cond(A)). The first two conditions can be easily satisfied by using a sparse matrix M as approximation to A− . Note, that the inverse of a sparse A is nearly dense, but in many
S
cases the entries of A− are rapidly decaying, so most of the entries are very small []. Benson and Frederickson [] were the first to propose a sparse approximate inverse preconditioner in a static way by computing min ∥AM − I∥F M
()
for a prescribed a priori chosen sparsity pattern for M. The computation of M can be split into n independent subproblems minMk ∥AMk − ek ∥ , k = , ..., n with Mk the columns of M and ek the k-th column of the identity matrix I. In view of the sparsity of these Least Squares (LS) problems, each subproblem is related to a ˆ k := A(Ik , Jk ) with index set Jk which is small matrix A given by the allowed pattern for Mk and the so-called shadow Ik of Jk , that is, the indices of nonzero rows in A(:, Jk ). These n small LS problems can be solved independently, for example, based on QR decompositions of ˆ k by using the Householder method or the matrices A the modified Gram-Schmidt algorithm.
The SPAI Algorithm The SPAI algorithm is an additional feature in this Frobenius norm minimization that introduces different strategies for choosing new profitable indices in Mk that lead to an improved approximation. Assume that, by solving () for a given index set J, an optimal solution Mk (Jk ) has been already determined resulting in the sparse vector Mk with residual rk . Dynamically there will be defined new entries in Mk . Therefore, () has to be solved for this enlarged index set ˜Jk such that a reduction in the norm of the new residual ˜rk = A(˜Ik , ˜Jk )Mk (˜Jk ) − ek (˜Ik ) is achieved. Following Cosgrove, Griewank, Díaz [], and Grote, Huckle [], one possible new index j ∈ Jnew out of a given set of possible new indices Jnew is tested to improve Mk . Therefore, the reduced D problem min ∥A(Mk + λ j ej ) − ek ∥ = min ∥λj Aj + rk ∥ λj
λj
()
has to be considered. The solution of this problem is given by λj = −
rkT Aej ∥Aej ∥
which leads to an improved squared residual norm ρ j = ∥rk ∥ −
(rkT Aej ) . ∥Aej ∥
Obviously, for improving Mk one has to consider only indices j in rows of A that are related to the nonzero entries in the old residual rk ; otherwise they do not lead to a reduction in the residual norm. Thus, the column indices j have to be determined that satisfy rkT Aej ≠ with the old residual rk . Let the index set of nonzero entries in rk be denoted by L. Furthermore, let ˜Ji denote the set of new indices that are related to the nonzero elements in the i-th row of A, and let Jnew = ∪i∈L ˜Ji denote the set of all possible new indices that can lead to a reduction of the residual norm. Then, one or more indices Jc are chosen as a subset of Jnew that corresponds to a large reduction in rk . For this enlarged index set Jk ∪ Jc the QR decomposition of the related LS submatrix has to be updated and solved for the new column Mk . Inside SPAI there are different parameters that steer the computation of the preconditioner M: ● ● ● ● ● ●
How many entries are added in one step How many steps of adding new entries are allowed Start pattern Maximum allowed pattern What residual ∥rk ∥ should be reached How to solve the LS problems
Modifications of SPAI A different and more expensive way to determine a new profitable index j with ˜Jk := Jk ∪ {j} considers the more accurate problem min ∥A(:, ˜Jk )Mk (˜Jk ) − ek ∥
Mk (˜J k )
introduced by Gould and Scott []. For ˜Jk the optimal reduction of the residual is determined for the full minimization problem instead of the D minimization in SPAI. Chow [] showed ways to prescribe an efficient static pattern a priori and developed the software package PARASAILS. Holland, Shaw, and Wathen [] have generalized this ansatz allowing a sparse target matrix on the right side in the form minM ∥AM − B∥F . This approach is
SPAI (SParse Approximate Inverse)
useful in connection with some kind of two-level preconditioning: First compute a standard sparse preconditioner B for A and then improve this preconditioner by an additional Frobenius norm minimization with target B. From the algorithmic point of view the minimization with target matrix B instead of I introduces no additional difficulties. Only the pattern of M should be chosen more carefully with respect to A and B. Zhang [] introduced an iterative form of SPAI where in each step a thin M is derived starting with minM ∣∣AM − I∣∣F . In the second step the sparse matrix AM is used and minM ∣∣(AM )M − I∣∣F is solved, and so on. The advantage is, that because of the very sparse patterns in Mi the Least Squares problems are very cheap. Chan and Tang [] applied SPAI not to the original matrix but first used a Wavelet transform W and computed the sparse approximate inverse preconditioner for WAW T that is assumed to be more diagonal dominant. Yeremin, Kolotilina, Nikishin, and Kaporin [, ] introduced factorized sparse approximate inverses of the form A− ≈ LU. Huckle generalized the factorized preconditioners adding new entries dynamically like in SPAI []. Grote and Barnard [] developed a software package for SPAI and also introduced a block version of SPAI. Huckle and Kallischko [] generalized SPAI and the target approach. They combined SPAI with the probing method [] in the form min (∣∣AM − I∣∣F + ρ ∣∣eT AM − eT ∣∣ ) M
for probing vectors e on which the preconditioner should be especially improved. Furthermore, they developed a software package for MSPAI.
Properties and Applications Advantages of SPAI: ● ●
Good parallel scalability. SPAI allows modifications like factorized approximation or including probing conditions to improve the preconditioner relative to certain subspaces, for example, as smoother in Multigrid or for regularization []. ● It is especially efficient for preconditioning dense problems (see Benzi [] et al.).
S
Disadvantages of SPAI: ●
● SPAI is sequentially more expensive, especially for denser patterns of M.
● Sometimes it shows poor approximation of A^−1 and slow convergence as a preconditioner.
Related Entries Preconditioners for Sparse Iterative Methods
Bibliographic Notes and Further Reading Books . Axelsson O () Iterative solution methods. Cambridge University Press, Cambridge . Saad Y () Iterative methods for sparse linear systems. SIAM Philadelpha, PA . Bruaset AM () A survey of preconditioned iterative methods. Longman Scientific & Technical, Harlow, Essex . Chen K () Matrix preconditioning techniques and applications. Cambridge University Press, Cambridge
Software
. Chow E. ParaSails, https://computation.llnl.gov/casc/parasails/parasails.html
. Barnard S, Bröker O, Grote M, Hagemann M. SPAI and Block SPAI, http://www.computational.unibas.ch/software/spai
. Huckle T, Kallischko A, Sedlacek M. MSPAI, http://www.in.tum.de/wiki/index.php/MSPAI
Bibliography . Alleon G, Benzi M, Giraud L () Sparse approximate inverse preconditioning for dense linear systems arising in computational electromagnetics. Numer Algorith ():– . Barnard S, Grote M () A block version of the SPAI preconditioner. Proceedings of the th SIAM conference on Parallel Processing for Scientific Computing, San Antonio, TX . Barnard ST, Clay RL () A portable MPI implemention of the SPAI preconditioner in ISIS++. In: Heath M, et al (eds) Proceedings of the eighth SIAM conference on parallel processing for scientific computing, Philadelphia, PA
. Benson MW, Frederickson PO () Iterative solution of large sparse linear systems arising in certain multidimensional approximation problems. Utilitas Math :– . Bröker O, Grote M, Mayer C, Reusken A () Robust parallel smoothing for multigrid via sparse approximate inverses. SIAM J Scient Comput ():– . Bröker O, Grote M () Sparse approximate inverse smoothers for geometric and algebraic multigrid. Appl Num Math (): – . Chan TFC, Mathew TP () The interface probing technique in domain decomposition. SIAM J Matrix Anal Appl (): – . Chan TF, Tang WP, Wan WL () Wavelet sparse approximate inverse preconditioners. BIT ():– . Chow E () A priori sparsity patterns for parallel sparse approximate inverse preconditioners. SIAM J Sci Comput ():– . Cosgrove JDF, D´iaz JC, Griewank A () Approximate inverse preconditionings for sparse linear systems. Int J Comput Math :– . Demko S, Moss WF, Smith PW () Decay rates of inverses of band matrices. Math Comp :– . Gould NIM, Scott JA () On approximate-inverse preconditioners. Technical Report RAL-TR--, Rutherford Appleton Laboratory, Oxfordshire, England . Grote MJ, Huckle T () Parallel preconditioning with sparse approximate inverses. SIAM J Sci Comput ():– . Huckle T () Factorized sparse approximate inverses for preconditioning. J Supercomput :– . Huckle T, Kallischko A () Frobenius norm minimization and probing for preconditioning. Int J Comp Math ():– . Huckle T, Sedlacek M () Smoothing and regularization with modified sparse approximate inverses. Journal of Electrical and Computer Engineering – Special Issue on Iterative Signal Processing in Communications, Appearing () . Holland RM, Shaw GJ, Wathen AJ () Sparse approximate inverses and target matrices. SIAM J Sci Comput ():– . Kaporin IE () New convergence results and preconditioning strategies for the conjugate gradient method. Numer Linear Algebra Appl :– . Kolotilina LY, Yeremin AY () Factorized sparse approximate inverse preconditionings I: Theory. SIAM J Matrix Anal Appl ():– . Kolotilina LY, Yeremin AY () Factorized sparse approximate inverse preconditionings II: Solution of D FE systems on massively parallel computers. Inter J High Speed Comput (): – . Tang W-P () Toward an effective sparse approximate inverse preconditioner. SIAM J Matrix Anal Appl ():– . Tang WP, Wan WL () Sparse approximate inverse smoother for multigrid. SIAM J Matrix Anal Appl ():– . Zhang J () A sparse approximate inverse technique for parallel preconditioning of general sparse matrices. Appl Math Comput ():–
Spanning Tree, Minimum Weight David A. Bader , Guojing Cong Georgia Institute of Technology, Atlanta, GA, USA IBM, Yorktown Heights, NY, USA
Definition Given an undirected connected graph G with n vertices and m edges, the minimum-weight spanning tree (MST) problem consists in finding a spanning tree with the minimum sum of edge weights. A single graph can have multiple MSTs. If the graph is not connected, then it has a minimum spanning forest (MSF) that is a union of minimum spanning trees for its connected components. MST is one of the most studied combinatorial problems with practical applications in VLSI layout, wireless communication, and distributed networks, recent problems in biology and medicine such as cancer detection, medical imaging, and proteomics. With regard to any MST of graph G, two properties hold: Cycle property: the heaviest edge (edge with the maximum weight) in any cycle of G does not appear in the MST. Cut property: if the weight of an edge e of any cut C of G is smaller than the weights of other edges of C, then this edge belongs to all MSTs of the graph. When all edges of G are of unique weights, the MST is unique. When the edge weights are not unique, they can be made unique by numbering the edges and break ties using the edge number.
Sequential Algorithms Three classical sequential algorithms, Prim, Kruskal, and Bor˚uvka, are known for MST. Each algorithm grows a forest in stages, adding at each stage one or more tree edges, whose membership in the MST is guaranteed by the cut property. They differ in how the tree edges are chosen and the order that they are added. Prim starts with one vertex and takes a greedy approach in growing the tree. In each step, it always maintains a connected tree by choosing the edge of the smallest weight that connects the current tree to a vertex that is outside the tree.
Kruskal starts with isolated vertices. As the algorithm progresses, multiple trees may appear, and eventually they merge into one. Bor˚uvka selects for each vertex the incident edge of the smallest weight as a tree edge. It then compacts the graph by contracting each connected component into a super-vertex. Note that finding the tree edges can be done in parallel over the vertices. Bor˚uvka's algorithm lends itself naturally to parallelization. These algorithms can easily be made to run in O(m log n) time. Better complexities can be achieved with Prim's algorithm if a Fibonacci heap is used instead of a binary heap. Graham and Hell [] gave a good introduction to the history of MST algorithms. More complex algorithms with better asymptotic run times have since been proposed. For example, Gabow et al. designed an algorithm that runs in almost linear time, i.e., O(m log β(m, n)), where β(m, n) = min{i ∣ log^(i) n ≤ m/n}. Karger, Klein, and Tarjan presented a randomized linear-time algorithm to find minimum spanning trees. This algorithm uses the random sampling technique together with a linear-time verification algorithm (e.g., King's verification algorithm). Pettie and Ramachandran presented an optimal MST algorithm that runs in O(T*(m, n)), where T* is the minimum number of edge-weight comparisons needed to determine the solution. Moret and Shapiro [] presented a comprehensive experimental study on the performance of MST algorithms. Prim's algorithm with a binary heap is found in general to be a fast solution. Katriel et al. [] have developed a pipelined algorithm that uses the cycle property and provide an experimental evaluation on the special-purpose NEC SX vector computer. Cache-oblivious MST is presented by Arge et al. []. Their algorithm is based on a cache-oblivious priority queue that supports insertion, deletion, and delete-min operations in O((1/B) log_{M/B}(N/B)) amortized memory transfers, where M and B are the memory and block transfer sizes. The cache-oblivious implementation of MST runs in O(sort(m) log log(n/c) + c) memory transfers (c = m/B). The MST algorithm combines two phases: Bor˚uvka and Prim. The memory access behavior of some MST algorithms is characterized in [].
Parallel Algorithms
Fast theoretical parallel MST algorithms also exist in the literature. Pettie and Ramachandran [] designed a randomized, time-work optimal MST algorithm for the EREW PRAM and, using EREW-to-QSM and QSM-to-BSP emulations from [], mapped the performance onto the QSM and BSP models. Cole et al. [, ] and Poon and Ramachandran [] earlier had randomized linear-work algorithms on the CRCW and EREW PRAM. Chong et al. [] gave a deterministic EREW PRAM algorithm that runs in logarithmic time with a linear number of processors. On the BSP model, Adler et al. [] presented a communication-optimal MST algorithm. Most parallel and some fast sequential MST algorithms employ the Bor˚uvka iteration. Three steps characterize a Bor˚uvka iteration: find-min, connect-components, and compact-graph.
1. find-min: for each vertex v, label the incident edge with the smallest weight to be in the MST.
2. connect-components: identify the connected components of the induced graph with the edges found in Step 1.
3. compact-graph: compact each connected component into a single supervertex, remove self-loops and multiple edges, and relabel the vertices for consistency.
The Bor˚uvka algorithm iterates until no new tree edges can be found. After each Bor˚uvka iteration, the number of vertices in the graph is reduced at least by half. Some algorithms invoke several rounds of the Bor˚uvka iteration to reduce the input size (e.g., []) and/or to increase the edge density (e.g., []).
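For concreteness, here is a minimal sequential Python sketch of these three steps (it assumes distinct edge weights and uses a simple union-find in place of the parallel connected-components and compaction routines discussed below):

def boruvka_mst(n, edges):
    """edges: list of (weight, u, v) with distinct weights. Returns MST edges."""
    parent = list(range(n))

    def find(x):                            # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, components = [], n
    while components > 1:
        # find-min: cheapest edge leaving each current component
        cheapest = {}
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue                    # self-loop after contraction
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][0]:
                    cheapest[r] = (w, u, v)
        if not cheapest:
            break                           # remaining graph is disconnected
        # connect-components / compact-graph: contract along selected edges
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((u, v, w))
                components -= 1
    return mst

print(boruvka_mst(4, [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 0, 2)]))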
Implementation of Parallel Bor˚uvka
Many of the fast theoretical MST algorithms are considered impractical for input of realistic size because they are too complicated and have large constant factors hidden in the asymptotic complexity. Complex MST algorithms are hard to implement and usually do not achieve good parallel speedups on current architectures. Most existing implementations are based on the Bor˚uvka algorithm. For a Bor˚uvka iteration, find-min and connect-components are simple and straightforward to implement. The compact-graph step performs bookkeeping
that is often left as a trivial exercise to the reader. JáJá [] describes a compact-graph algorithm for dense inputs. For sparse graphs, though, the compact-graph step often is the most expensive step in the Bor˚uvka iteration. Implementations and data structures for parallel Bor˚uvka on shared-memory machines are described in [].
Edge List Representation This implementation of Bor˚uvka’s algorithm (designated Bor-EL) uses the edge list representation of graphs, with each edge (u, v) appearing twice in the list for both directions (u, v) and (v, u). An elegant implementation of the compact-graph step sorts the edge list (using an efficient parallel sample sort []) with the supervertex of the first endpoint as the primary key, the supervertex of the second endpoint as the secondary key, and the edge weight as the tertiary key. When sorting completes, all of the self-loops and multiple edges between two supervertices appear in consecutive locations and can be merged efficiently using parallel prefix-sums. Adjacency List Representation With the adjacency list representation, each entry of an index array of vertices points to a list of its incident edges. The compact-graph step first sorts the vertex array according to the supervertex label, then concurrently sorts each vertex’s adjacency list using the supervertex of the other endpoint of the edge as the key. After sorting, the set of vertices with the same supervertex label are contiguous in the array, and can be merged efficient. This approach is designated as Bor-AL. v1 v1
1
5
2
v4
v3
3
a
v5
v2
6 4
v6
b
nil
Both Bor-EL and Bor-AL achieve the same goal that self-loops and multiple edges are moved to consecutive locations to be merged. Bor-EL uses one call to sample sort, while Bor-AL calls a smaller parallel sort and then a number of concurrent sequential sorts.
Flexible Adjacency List Representation
The flexible adjacency list augments the traditional adjacency list representation by allowing each vertex to hold multiple adjacency lists instead of just a single one; in fact, it is a linked list of adjacency lists. During initialization, each vertex points to only one adjacency list. After the connect-components step, each vertex appends its adjacency list to its supervertex's adjacency list by sorting together the vertices that are labeled with the same supervertex. The compact-graph step is simplified, allowing each supervertex to have self-loops and multiple edges inside its adjacency list. Thus, the compact-graph step now uses a smaller parallel sort plus several pointer operations instead of costly sortings and memory copies, while the find-min step gets the added responsibility of filtering out the self-loops and multiple edges. This approach is designated as Bor-FAL. Figure illustrates the use of the flexible adjacency list for a six-vertex input graph. After one Bor˚uvka iteration, vertices v1, v2, and v3 form one supervertex, and vertices v4, v5, and v6 form a second supervertex. Two vertex labels represent the supervertices and receive the adjacency lists of the remaining vertices of their components, and the supervertices are relabeled consecutively. Note that most of the original data structure is kept intact. Instead of relabeling vertices in the adjacency list, a separate lookup table is maintained that holds the
Spanning Tree, Minimum Weight. Fig. Example of flexible adjacency list representation. (a) Input graph. (b) Initialized flexible adjacency list. (c) Flexible adjacency list after one iteration
supervertex label for each vertex. The find-min step uses this table to filter out self-loops and multiple edges.
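The following is a minimal C++ sketch of the flexible adjacency list and of a find-min step that uses the supervertex lookup table to filter out self-loops; the type and function names are assumptions for illustration and are not taken from the Bor-FAL code.

#include <limits>
#include <utility>
#include <vector>

struct AdjSegment {
    std::vector<std::pair<int, double>> edges;  // (neighbor, weight)
    AdjSegment* next = nullptr;                 // next segment in the chain
};

struct FlexVertex {
    AdjSegment* head = nullptr;  // first adjacency-list segment
    AdjSegment* tail = nullptr;  // last segment, for O(1) appends
};

// Append v's whole chain of segments to supervertex s: pointer operations only,
// no sorting of edge data and no memory copies.
void appendToSupervertex(FlexVertex& s, FlexVertex& v) {
    if (!v.head) return;
    if (!s.head) { s.head = v.head; s.tail = v.tail; }
    else { s.tail->next = v.head; s.tail = v.tail; }
    v.head = v.tail = nullptr;
}

// Find-min for supervertex s: skip edges whose other endpoint maps to s itself
// in the lookup table 'super'. Returns (neighbor supervertex, weight).
std::pair<int, double> findMin(int s, const FlexVertex& sv, const std::vector<int>& super) {
    std::pair<int, double> best{-1, std::numeric_limits<double>::infinity()};
    for (AdjSegment* seg = sv.head; seg != nullptr; seg = seg->next)
        for (const auto& [nbr, w] : seg->edges) {
            int t = super[nbr];          // lookup table gives the supervertex label
            if (t == s) continue;        // self-loop inside the supervertex: filter out
            if (w < best.second) best = {t, w};
        }
    return best;
}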
Analysis of Implementations Helman and JáJá's SMP complexity model [] provides a reasonable framework for realistic analysis; it favors cache-friendly algorithms by penalizing noncontiguous memory accesses. Under this model, there are two parts to an algorithm's complexity: M_E, the memory access complexity, and T_C, the computation complexity. The M_E term is the number of noncontiguous memory accesses and the T_C term is the running time; the M_E term captures the effect that memory accesses have on an algorithm's performance. Parameters of the model include the problem size n and the number of processors p. For a sparse graph G with n vertices and m edges, the number of vertices decreases by at least half in each iteration as the algorithm proceeds, so there are at most log n iterations for all of the Bor˚uvka variants. Hence, the complexity of Bor-EL is

T(n, p) = ⟨M_E ; T_C⟩ = ⟨((m + n + n log n)/p + mc log(m/p)/(p log z)) log n ; O((m/p) log m log n)⟩,

where c and z are constants related to cache size and sampling ratio [].
Bor-AL is a faster algorithm than Bor-EL, as expected: the input for Bor-AL is "bucketed" into adjacency lists, whereas the input for Bor-EL is an unordered list of edges, and sorting each bucket first in Bor-AL avoids unnecessary comparisons between edges that have no vertices in common. The complexity of Bor-EL can be considered an upper bound on that of Bor-AL. As Bor-AL and Bor-EL compute similar results in different ways in each iteration, it suffices to compare the complexity of the first iteration. For Bor-AL, the complexity of the first iteration is

T(n, p) = ⟨M_E ; T_C⟩ = ⟨(n + m + n log n)/p + (nc log(n/p) + mc log(m/n))/(p log z) ; O((n/p) log m + (m/p) log(m/n))⟩,

while for Bor-EL, the complexity of the first iteration is

T(n, p) = ⟨M_E ; T_C⟩ = ⟨(m + n + n log n)/p + mc log(m/p)/(p log z) ; O((m/p) log m)⟩.

In Bor-FAL, n decreases by at least half in each iteration while m stays the same. Compact-graph first sorts the n vertices, then assigns O(n) pointers to append each vertex's adjacency list to its supervertex's list. For each processor, sorting takes O((n/p) log n) time and assigning pointers takes O(n/p) time, assuming each processor assigns roughly the same number of pointers. Updating the lookup table costs each processor O(n/p) time. With Bor-FAL, to find the smallest-weight edge for the supervertices, all m edges are checked, with each processor covering O(m/p) edges. The aggregate running time of find-min is T_C(n, p)_fm = O((m log n)/p), and its memory access complexity is M_E(n, p)_fm = m/p. For the connected-components step, each processor takes T_cc = O((n log n)/p) time, and M_E(n, p)_cc ≤ n log n. The complexity of the whole Bor˚uvka's algorithm is

T(n, p) = T(n, p)_fm + T(n, p)_cc + T(n, p)_cg ≤ ⟨(n + n log n + m log n)/p + cn log(n/p)/(p log z) ; O((m + n log n)/p)⟩.
A Hybrid Parallel MST Algorithm An MST algorithm has been proposed in [] that marries Prim's algorithm (known as an efficient sequential algorithm for MST) with the naturally parallel Bor˚uvka approach. In this algorithm, each processor essentially runs Prim's algorithm simultaneously from a different starting vertex. A tree is said to be growing when there exists a light-weight edge that connects the tree to a vertex not yet in another tree, and mature otherwise. When all of the vertices have been incorporated into mature subtrees, the algorithm contracts each subtree into a supervertex and calls the approach recursively until only one supervertex remains. When the problem size is small enough, one processor solves the remaining problem using the best sequential MST algorithm.
This parallel MST algorithm possesses an interesting feature: when run on one processor the algorithm behaves as Prim's, on n processors it becomes Bor˚uvka's, and it runs as a hybrid combination for 1 < p < n, where p is the number of processors. Each of the p processors finds, for its starting vertex, the smallest-weight incident edge, contracts that edge, and then finds the smallest-weight edge again for the contracted supervertex. It does not find all the smallest-weight edges for all vertices, synchronize, and then compact, as in the parallel Bor˚uvka's algorithm. The algorithm thus adapts to any number p of processors in a way that is practical for SMPs, where p is often much less than n, unlike parallel implementations of Bor˚uvka's approach that appear as PRAM emulations with p coarse-grained processors emulating n virtual processors.
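A toy sequential C++ sketch of one processor's role in the hybrid algorithm is shown below; it ignores the synchronization a real SMP implementation needs, and the adjacency format, the shared claim array, and the function name are assumptions for illustration only.

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Grow a tree Prim-style from 'start' until its lightest outgoing edge touches a
// vertex already claimed by another tree; the tree is then mature and growth stops.
void growTree(int start, int myId,
              const std::vector<std::vector<std::pair<int, double>>>& adj,  // (neighbor, weight)
              std::vector<int>& claim,                                      // -1 = unclaimed
              std::vector<std::pair<int, int>>& treeEdges) {
    using Item = std::pair<double, std::pair<int, int>>;  // (weight, (from, to))
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    claim[start] = myId;
    for (const auto& [u, w] : adj[start]) pq.push({w, {start, u}});
    while (!pq.empty()) {
        auto e = pq.top().second;          // lightest candidate edge
        pq.pop();
        int v = e.second;
        if (claim[v] == myId) continue;    // edge stays inside this tree: skip it
        if (claim[v] != -1) return;        // lightest edge reaches another tree: mature
        claim[v] = myId;                   // otherwise the tree keeps growing
        treeEdges.push_back(e);
        for (const auto& [u, wu] : adj[v]) pq.push({wu, {v, u}});
    }
}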
Implementation with Fine-Grained Locks Most parallel graph algorithms are designed without locks. Indeed, it is hard to measure contention for these algorithms. Yet proper use of locks can simplify implementation and improve performance. Cong and Bader [] presented an implementation of Bor˚uvka's algorithm (Bor-spinlock) that uses locks and avoids modifying the input data structure. In Bor-spinlock, the compact-graph step is completely eliminated. The main idea is that instead of compacting connected components, each vertex carries an associated supervertex label showing to which supervertex it belongs. In each iteration, all the vertices are partitioned as evenly as possible among the processors. Processor p finds the adjacent edge with the smallest weight for a supervertex v′. As the graph is not compacted, the adjacent edges for v′ are scattered among the adjacent edges of all vertices that share the same supervertex v′, and different processors may work on these edges simultaneously. Now the problem is that these processors need to synchronize properly in order to find the edge with the minimum weight. Figure illustrates the specific problem for the MST case. On the top in Fig. is an input graph with six vertices. Suppose there are two processors P and P . Vertices , , and are partitioned onto processor P , and vertices , , and are partitioned onto processor P . It takes two iterations for Bor˚uvka's algorithm to find the MST. In the first iteration, the find-min step of Bor-spinlock labels < , >, < , >, < , >, and < , >,
Spanning Tree, Minimum Weight. Fig. Example of the race condition between two processors when Bor˚uvka’s algorithm is used to solve the MST problem
to be in the MST. Connected-components finds vertices , , and in one component, and vertices , , and in another component. The MST edges and components are shown in the middle of Fig. . Vertices connected by dashed lines are in one component, and vertices connected by solid lines are in the other component. At this time, vertices , , and belong to supervertex ′ , and vertices , , and belong to supervertex ′ . In the second iteration, processor P again inspects vertices , , and , and processor P inspects vertices , , and . Previous MST edges < , >, < , >, < , >, and < , > are found to be edges inside supervertices and are ignored. On the bottom of Fig. are the two supervertices with two edges between them. Edges < , > and < , > are found by P to be the edges between supervertices ′ and ′ , while edge < , > is found by P to be the edge between the two supervertices. For supervertex ′ , P tries to label < , > as the MST edge, while P tries to label < , >. This is a race condition between the two processors, and locks are used in Bor-spinlock to ensure correctness. In addition to locks and barriers, recent developments in transactional memory provide a new mechanism for synchronization among processors. Kang and Bader implemented minimum spanning forest algorithms with transactional memory []. Their
implementation achieved good scalability on some current architectures.
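A minimal C++ sketch of the synchronization Bor-spinlock needs when several processors propose candidate lightest edges for the same supervertex is given below; the spinlock on std::atomic_flag and all names are illustrative assumptions, not the original SMP code.

#include <atomic>
#include <vector>

struct Candidate {
    std::atomic_flag lock = ATOMIC_FLAG_INIT;  // per-supervertex spinlock
    double weight = 1e300;                     // weight of the current best edge
    int edgeId = -1;                           // index of the current best edge
};

// Called by any thread that finds edge 'edgeId' of weight 'w' incident to
// supervertex 's'. Without the lock, two threads could interleave the compare
// and the write and record a non-minimal edge -- the race described above.
void proposeEdge(std::vector<Candidate>& best, int s, int edgeId, double w) {
    Candidate& c = best[s];
    while (c.lock.test_and_set(std::memory_order_acquire)) { /* spin */ }
    if (w < c.weight) { c.weight = w; c.edgeId = edgeId; }
    c.lock.clear(std::memory_order_release);
}

The candidate table would be created once, for example as std::vector<Candidate> best(numSupervertices), and its winning edges harvested after each find-min phase.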
Implementation on Distributed-Memory Machines Partitioned global address space (PGAS) languages such as UPC and X10 [, ] have been proposed that present a shared-memory abstraction to the programmer for distributed-memory machines. They allow the programmer to control the data layout and work assignment for the processors. Mapping shared-memory graph algorithms onto distributed-memory machines is straightforward with PGAS languages. Figure shows both the SMP implementation and the UPC implementation of parallel Bor˚uvka; the two implementations are almost identical, with the differences shown underscored. Performance-wise, a straightforward PGAS implementation of an irregular graph algorithm does not usually achieve high performance, due to the aggregate startup cost of many small messages. Cong, Almasi, and Saraswat presented
their study of optimizing the UPC implementation of graph algorithms in []. They apply communication coalescing together with other techniques to improve performance. The idea is to merge the small messages to/from a processor into a single, large message. As all operations in each step of a typical PRAM algorithm are parallel, reads and writes can be scheduled in an order such that communication coalescing is possible. After communication coalescing, the data can be accessed in one communication round in which one processor sends at most one message to another processor.
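A small C++ sketch of the bucketing step behind communication coalescing is shown below; the block-distribution owner() function and the Request type are assumptions, and the actual exchange of one aggregated message per destination (for example, a single all-to-all round) is left out.

#include <vector>

struct Request { int vertex; double payload; };

// Block distribution of n vertices over p processes (an assumed data layout).
int owner(int vertex, int n, int p) {
    return static_cast<int>(static_cast<long long>(vertex) * p / n);
}

// Gather all fine-grained remote requests of one step into per-destination
// buckets; each bucket is later shipped as one large message, so a process
// sends at most one message to every other process per communication round.
std::vector<std::vector<Request>>
coalesce(const std::vector<Request>& reqs, int n, int p) {
    std::vector<std::vector<Request>> buckets(p);
    for (const Request& r : reqs)
        buckets[owner(r.vertex, n, p)].push_back(r);
    return buckets;
}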
Experimental Results Chung and Condon [] implement parallel Bor˚uvka’s algorithm on the TMC CM-. On a -processor machine, for geometric, structured graphs with , vertices and average degree and graphs with fewer vertices but higher average degree, their code achieves a relative parallel speedup of about , on -processors, over the sequential Bor˚uvka’s algorithm, which was already – times slower than their sequential Kruskal
Spanning Tree, Minimum Weight. Fig. UPC implementation and SMP implementation of MST: the main loop bodies
algorithm. Dehne and Götz [] studied practical parallel algorithms for MST using the BSP model. They implemented a dense Bor˚uvka parallel algorithm, on a -processor Parsytec CC-, that works well for sufficiently dense input graphs. Using a fixed-sized input graph with , vertices and , edges, their code achieves a maximum speedup of . using processors for a random dense graph. Their algorithm is not suitable for the more challenging sparse graphs. Bader and Cong presented their studies of parallel MST on symmetric multiprocessors (SMPs) in [, ]. Their implementation achieved, for the first time, good parallel speedups over a wide range of inputs on SMPs. Their experimental results show that for Bor-EL and Bor-AL the compact-graph step dominates the running time. Bor-EL takes much more time than Bor-AL, and only gets worse when the graphs get denser. In contrast, the execution time of the compact-graph step of Bor-FAL is greatly reduced: in their experiments with a random graph of M vertices and M edges, it is over times faster than Bor-EL and over times faster than Bor-AL. In fact, the execution time of the compact-graph step of Bor-FAL is almost the same for the three input graphs because it depends only on the number of vertices. As predicted, the execution time of the find-min step of Bor-FAL increases, and the connected-components step takes only a small fraction of the execution time for all approaches. Cong, Almasi, and Saraswat presented a UPC implementation of distributed MST in []. For input graphs with billions of edges, the distributed implementation achieved significant speedups over the SMP implementation and the best sequential implementation.
Bibliography . Adler M, Dittrich W, Juurlink B, Kutyłowski M, Rieping I () Communication-optimal parallel minimum spanning tree algorithms (extended abstract). In: SPAA ’: proceedings of the tenth annual ACM symposium on parallel algorithms and architectures, Puerto Vallarta, Mexico. ACM, New York, pp – . Arge L, Bender MA, Demaine ED, Holland-Minkley B, Munro JI () Cache-oblivious priority queue and graph algorithm applications. In: Proceedings of the th annual ACM symposium on theory of computing, Montreal, Canada. ACM, New York, pp – . Bader DA, Cong G () Fast shared-memory algorithms for computing the minimum spanning forest of sparse graphs. J Parallel Distrib Comput :–
. Charles P, Donawa C, Ebcioglu K, Grothoff C, Kielstra A, Van Praun C, Saraswat V, Sarkar V () X: an object-oriented approach to non-uniform cluster computing. In: Proceedings of the ACM SIGPLAN conference on object-oriented programming systems, languages and applications (OOPSLA), San Diego, CA, pp – . Chong KW, Han Y, Lam TW () Concurrent threads and optimal parallel minimum spanning tree algorithm. J ACM : – . Chung S, Condon A () Parallel implementation of Bor˚uvka’s minimum spanning tree algorithm. In: Proceedings of the th international parallel processing symposium (IPPS’), Honolulu, Hawaii, pp – . Cole R, Klein PN, Tarjan RE () Finding minimum spanning forests in logarithmic time and linear work using random sampling. In: Proceedings of the th annual symposium parallel algorithms and architectures (SPAA-), Newport, RI. ACM, New York, pp – . Cole R, Klein PN, Tarjan RE () A linear-work parallel algorithm for finding minimum spanning trees. In: Proceedings of the th annual ACM symposium on parallel algorithms and architectures, Cape May, NJ, ACM, New York, pp – . Cong G, Almasi G, Saraswat V () Fast PGAS implementation of distributed graph algorithms. In: Proceedings of the ACM/IEEE international conference for high performance computing, networking, storage and analysis (SC ’), IEEE Computer Society, Washington, DC, pp – . Cong G, Bader DA () Lock-free parallel algorithms: an experimental study. In: Proceeding of the rd international conference on high-performance computing (HiPC ), Banglore, India . Cong G, Sbaraglia S () A study of the locality behavior of minimum spanning tree algorithms. In: The th international conference on high performance computing (HiPC ), Bangalore, India. IEEE Computer Society, pp – . Dehne F, Götz S () Practical parallel algorithms for minimum spanning trees. In: Proceedings of the seventeenth symposium on reliable distributed systems, West Lafayette, IN. IEEE Computer Society, pp – . Gibbons PB, Matias Y, Ramachandran V () Can sharedmemory model serve as a bridging model for parallel computation? In: Proceedings th annual symposium parallel algorithms and architectures (SPAA-), Newport, RI, ACM, New York pp – . Graham RL, Hell P () On the history of the minimum spanning tree problem. IEEE Ann History Comput ():– . Helman DR, JáJá J () Designing practical efficient algorithms for symmetric multiprocessors. In: Algorithm engineering and experimentation (ALENEX’), Baltimore, MD, Lecture notes in computer science, vol . Springer-Verlag, Heidelberg, pp – . JáJá J () An Introduction to parallel algorithms. AddisonWesley, New York . Kang S, Bader DA () An efficient transactional memory algorithm for computing minimum spanning forest of sparse graphs. In: Proceedings of the th ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP), Raleigh, NC
. Karger DR, Klein PN, Tarjan RE () A randomized linear-time algorithm to find minimum spanning trees. J ACM ():– . Katriel I, Sanders P, Träff JL () A practical minimum spanning tree algorithm using the cycle property. In: th Annual European symposium on algorithms (ESA ), Budapest, Hungary, Lecture notes in computer science, vol . Springer-Verlag, Heidelberg, pp – . Moret BME, Shapiro HD () An empirical assessment of algorithms for constructing a minimal spanning tree. In: DIMACS monographs in discrete mathematics and theoretical computer science: computational support for discrete mathematics vol , American Mathematical Society, Providence, RI, pp – . Pettie S, Ramachandran V () A randomized time-work optimal parallel algorithm for finding a minimum spanning forest. SIAM J Comput ():– . Poon CK, Ramachandran V () A randomized linear work EREW PRAM algorithm to find a minimum spanning forest. In: Proceedings of the th international symposium algorithms and computation (ISAAC’), Lecture notes in computer science, vol . Springer-Verlag, Heidelberg, pp – . Carlson WW, Draper JM, Culler DE, Yelick K, Brooks E, Warren K () Introduction to UPC and Language Specification. CCS-TR-. IDA/CCS, Bowie, Maryland
Sparse Approximate Inverse Matrix
SPAI (SParse approximate inverse)
Sparse Direct Methods
Anshul Gupta
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Synonyms Gaussian elimination; Linear equations solvers; Sparse gaussian elimination
Definition Direct methods for solving linear systems of the form Ax = b are based on computing A = LU, where L and U are lower and upper triangular, respectively. Computing the triangular factors of the coefficient matrix A is also known as LU decomposition. Following the factorization, the original system is trivially solved by solving
the triangular systems Ly = b and Ux = y. If A is symmetric, then a factorization of the form A = LL^T or A = LDL^T is computed via Cholesky factorization, where L is a lower triangular matrix (unit lower triangular in the case of A = LDL^T factorization) and D is a diagonal matrix. One set of common formulations of LU decomposition and Cholesky factorization for dense matrices is shown in Figs. and , respectively. Note that other mathematically equivalent formulations are possible by rearranging the loops in these algorithms. These algorithms must be adapted for sparse matrices, in which a large fraction of entries are zero. For example, if A[j, i] in the division step is zero, then this operation need not be performed. Similarly, the update steps can be avoided if either A[j, i] or A[i, k] (A[k, i] if A is symmetric) is zero. When A is sparse, the triangular factors L and U typically have nonzero entries in many more locations than A does. This phenomenon is known as fill-in, and results in a superlinear growth in the memory and time requirements of a direct method to solve a sparse system with respect to the size of the system. Despite a high memory requirement, direct methods are often used in many real applications due to their generality and robustness. In applications requiring solutions with respect to several right-hand side vectors and the same coefficient matrix, direct methods are often the solvers of choice because the one-time cost of factorization can be amortized over several inexpensive triangular solves.
Discussion The direct solution of a sparse linear system typically involves four phases. The two computational phases, factorization and triangular solution, have already been mentioned. The number of nonzeros in the factors and sometimes their numerical properties are functions of the initial permutation of the rows and columns of the coefficient matrix. In many parallel formulations of sparse factorization, this permutation can also have an effect on load balance. The first step in the direct solution of a sparse linear system, therefore, is to apply heuristics to compute a desirable permutation of the matrix. This step is known as ordering. A sparse matrix can be viewed as the adjacency matrix of a graph. Ordering heuristics typically use the graph view of the matrix
begin LU_Decomp(A, n)
  for i = 1, n
    for j = i + 1, n
      A[j, i] = A[j, i] / A[i, i];   /* division step, computes column i of L */
    end for
    for k = i + 1, n
      for j = i + 1, n
        A[j, k] = A[j, k] - A[j, i] × A[i, k];   /* update step */
      end for
    end for
  end for
end LU_Decomp
Sparse Direct Methods. Fig. A simple column-based algorithm for LU decomposition of an n × n dense matrix A. The algorithm overwrites A by L and U such that A = LU, where L is unit lower triangular and U is upper triangular. The diagonal entries after factorization belong to U; the unit diagonal of L is not explicitly stored

begin Cholesky(A, n)
  for i = 1, n
    A[i, i] = √A[i, i];
    for j = i + 1, n
      A[j, i] = A[j, i] / A[i, i];   /* division step, computes column i of L */
    end for
    for k = i + 1, n
      for j = k, n
        A[j, k] = A[j, k] - A[j, i] × A[k, i];   /* update step */
      end for
    end for
  end for
end Cholesky
Sparse Direct Methods. Fig. A simple column-based algorithm for Cholesky factorization of an n × n dense symmetric positive definite matrix A. The lower triangular part of A is overwritten by L, such that A = LL^T
and label the vertices in a particular order that is equivalent to computing a permutation of the coefficient matrix with desirable properties. In the second phase, known as symbolic factorization, the nonzero pattern of the factors is computed. Knowing the nonzero pattern of the factors before actually computing them is useful for several reasons. The memory requirements of numerical factorization can be predicted during symbolic factorization. With the number and locations of nonzeros known beforehand, a significant amount of indirect addressing can be avoided during numerical factorization, thus boosting performance. In a parallel implementation, symbolic factorization helps in the distribution of data and computation among processing units. The ordering and symbolic factorization phases are also referred to as preprocessing or analysis steps.
Of the four phases, numerical factorization typically consumes the most memory and time. Many applications involve factoring several matrices with different numerical values but the same sparsity structure. In such cases, some or all of the results of the ordering and symbolic factorization steps can be reused. This is also advantageous for parallel sparse solvers because parallel ordering and symbolic factorization are typically less scalable. Amortization of the cost of these steps over several factorization steps helps maintain the overall scalability of the solver close to that of numerical factorization. The parallelization of the triangular solves is highly dependent on the parallelization of the numerical factorization phase. The parallel formulation of numerical factorization dictates how the factors are distributed among parallel tasks. The subsequent triangular solution steps must use a parallelization scheme
that works on this data distribution, particularly in a distributed-memory parallel environment. Given its prominent role in the parallel direct solution of sparse linear systems, the numerical factorization phase is the primary focus of this entry. The algorithms used for preprocessing and factoring a sparse coefficient matrix depend on the properties of the matrix, such as symmetry, diagonal dominance, positive definiteness, etc. However, there are common elements in most sparse factorization algorithms. Two of these, namely, task graphs and supernodes, are key to the discussion of parallel sparse matrix factorization of all types for both practical and pedagogical reasons.
Task Graph Model of Sparse Factorization A parallel computation is usually the most efficient when running at the maximum possible level of granularity that ensures a good load balance among all the processors. Dense matrix factorization is computationally rich and requires O(n³) operations for factoring an n × n matrix. Sparse factorization involves a much smaller overall number of operations per row or column of the matrix than its dense counterpart. The sparsity results in additional challenges, as well as additional opportunities to extract parallelism. The challenges are centered around finding ways of orchestrating the unstructured computations in a load-balanced fashion and of containing the overheads of interaction between parallel tasks in the face of a relatively small number of operations per row or column of the matrix. The added opportunity for parallelism results from the fact that, unlike the dense algorithms of Figs. and , the columns of the factors in the sparse case do not need to be computed one after the other. Note that in the algorithms shown in Figs. and , row and column i are updated by rows and columns 1 . . . i−1. In the sparse case, column i is updated by a column j < i only if U[j, i] ≠ 0, and a row i is updated by a row j < i only if L[i, j] ≠ 0. Therefore, as the sparse factorization begins, the division step can proceed in parallel for all columns i for which A[i, j] = 0 and A[j, i] = 0 for all j < i. Similarly, at any stage in the factorization process, there could be a large pool of columns that are ready for the division step. Any unfactored column i would belong to this pool iff all columns j < i with a nonzero entry in row
i of L and all rows j < i with a nonzero entry in column i of U have been factored. A task dependency graph is an excellent tool for capturing parallelism and the various dependencies in sparse matrix factorization. It is a directed acyclic graph (DAG) whose vertices denote tasks and the edges specify the dependencies among the tasks. A task is associated with each row and column (column only in the symmetric case) of the sparse matrix to be factored. The vertex i of the task graph denotes the task responsible for computing column i of L and row i of U. A task is ready for execution if and only if all tasks with incoming edges to it have completed. Task graphs are often explicitly constructed during symbolic factorization to guide the numerical factorization phase. This permits the numerical factorization to avoid expensive searches in order to determine which tasks are ready for execution at any given stage of the parallel factorization process. The task graphs corresponding to matrices with a symmetric structure are trees and are known as elimination trees in the sparse matrix literature. Figure shows the elimination tree for a structurally symmetric sparse matrix and Fig. shows the task DAG for a structurally unsymmetric matrix. Once a task graph is constructed, then parallel factorization (and even parallel triangular solution) can be viewed as the problem of scheduling the tasks onto parallel processes or threads. Static scheduling is generally preferred in a distributed-memory environment and dynamic scheduling in a shared-memory environment. The shape of the task graph is a function of the initial permutation of rows and columns of the sparse matrix, and is therefore determined by the outcome of the ordering phase. Figure shows the elimination tree corresponding to the same matrix as in Fig. a, but with a different initial permutation. The structure of the task DAG usually affects how effectively it can be scheduled for parallel factorization. For example, it may be intuitively recognizable to readers that the elimination tree in Fig. is more amenable to parallel scheduling than the tree corresponding to a different permutation of the same matrix in Fig. . Figure illustrates that the matrices in Figs. and have the same underlying graph. The only difference is in the labeling of the vertices of the graph, which results in a different permutation of the rows and columns of the matrix, different amount
Sparse Direct Methods. Fig. A structurally symmetric sparse matrix and its elimination tree. An X indicates a nonzero entry in the original matrix and a box denotes a fill-in
Sparse Direct Methods. Fig. An unsymmetric sparse matrix and the corresponding task DAG. An X indicates a nonzero entry in the original matrix and a box denotes a fill-in
Sparse Direct Methods. Fig. A permutation of the sparse matrix of Fig. a and its elimination tree
of fill-in, and different shapes of task graphs. In general, long and skinny task graphs result in limited parallelism and a long critical path. Short and broad task graphs have a high degree of parallelism and shorter critical paths.
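For a matrix with a symmetric structure, the elimination tree that serves as the task graph can be computed directly from the nonzero pattern. The C++ sketch below follows the standard construction with path compression; the input format (the strictly lower-triangular pattern of each row) is an assumption.

#include <vector>

// rowPattern[j] holds the column indices i < j with A[j][i] != 0 (the strictly
// lower-triangular pattern of row j), in any order.
std::vector<int> eliminationTree(const std::vector<std::vector<int>>& rowPattern) {
    int n = static_cast<int>(rowPattern.size());
    std::vector<int> parent(n, -1), ancestor(n, -1);
    for (int j = 0; j < n; ++j) {
        for (int i : rowPattern[j]) {          // A[j][i] != 0, i < j
            // Walk from i up to the root of its current subtree, compressing
            // the path so that later walks are short.
            int k = i;
            while (ancestor[k] != -1 && ancestor[k] != j) {
                int next = ancestor[k];
                ancestor[k] = j;               // path compression
                k = next;
            }
            if (ancestor[k] == -1) {           // k was a root: j becomes its parent
                ancestor[k] = j;
                parent[k] = j;
            }
        }
    }
    return parent;                             // the parent of the root stays -1
}

The returned parent array directly encodes the tree of task dependencies discussed above.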
Since the shape, and hence the amenability to efficient parallel scheduling of the task graph is sensitive to ordering, heuristics that result in balanced and broad task graphs are preferred for parallel factorization. The best known ordering heuristic in this class is called
Sparse Direct Methods. Fig. An illustration of the duality between graph vertex labeling and row/column permutation of a structurally symmetric sparse matrix. Grid (a), with vertices labeled based on nested dissection, is the adjacency graph of the matrix in Fig. a and grid (b) is the adjacency graph of the matrix in Fig. a
nested dissection. Nested dissection is based on recursively computing balanced bisections of a graph by finding small vertex separators. The vertices in the two disconnected partitions of the graph are labeled before the vertices of the separator. The same heuristic is applied recursively for labeling the vertices of each partition. The ordering in Fig. is actually based on nested dissection. Note that the vertex set 6, 7, 8 forms a separator, dividing the graph into two disconnected components, 0, 1, 2 and 3, 4, 5. Within the two components, vertices 2 and 5 are the separators, and hence have the highest label in their respective partitions.
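As a toy illustration of the labeling rule (partitions first, separator last), here is a recursive C++ sketch of nested dissection for the simplest possible graph, a 1-D chain of vertices; real orderings use graph partitioners on general graphs, and the function name is an assumption.

#include <vector>

// Label the chain segment lo..hi: the middle vertex is this segment's separator,
// so both halves are labeled recursively before it.
void ndChain(int lo, int hi, int& nextLabel, std::vector<int>& label) {
    if (lo > hi) return;
    int mid = lo + (hi - lo) / 2;             // separator vertex of this segment
    ndChain(lo, mid - 1, nextLabel, label);   // label the left part first
    ndChain(mid + 1, hi, nextLabel, label);   // then the right part
    label[mid] = nextLabel++;                 // the separator gets the highest label
}

// Usage: std::vector<int> label(n); int next = 0; ndChain(0, n - 1, next, label);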
Supernodes In sparse matrix terminology, a set of consecutive rows or columns that have the same nonzero structure is loosely referred to as a supernode. The notion of supernodes is crucial to efficient implementation of sparse factorization for a large class of sparse matrices arising in real applications. Coefficient matrices in many applications have natural supernodes. In graph terms, there are sets of vertices with identical adjacency structures. Graphs like these can be compressed by having one supervertex represent the whole set that has the same adjacency structure. When most vertices of a graph belong to supernodes and the supernodes are of roughly the same size (in terms of the number of vertices in them) with an average size of, say, m, then it can be shown that the compressed graph has O(m) fewer vertices and O(m²) fewer edges than the original graph. It can also be shown that an ordering of the original graph can be derived from an ordering of the compressed graph,
while preserving the properties of the ordering, by simply labeling the vertices of the original graph consecutively in the order of the supernodes of the compressed graph. Thus, the space and the time requirements of ordering can be dramatically reduced. This is particularly useful for parallel sparse solvers because parallel ordering heuristics often yield orderings of lower quality than their serial counterparts. For matrices with highly compressible graphs, it is possible to compute the ordering in serial with only a small impact on the overall scalability of the entire solver because ordering is performed on a graph with O(m²) fewer edges. While the natural supernodes in the coefficient matrix, if any, can be useful during ordering, it is the presence of supernodes in the factors that has the biggest impact on the performance of the factorization and triangular solution steps. Although there can be multiple ways of defining supernodes in matrices with an unsymmetric structure, the most useful form involves groups of indices with identical nonzero pattern in the corresponding columns of L and rows of U. Even if there are no supernodes in the original matrix, supernodes in the factors are almost inevitable for matrices in most real applications. This is due to fill-in. Examples of supernodes in factors include indices – in Fig. a, indices – and – in Fig. a, and indices – and – in Fig. . Some practitioners prefer to artificially increase the size (i.e., the number of member rows and columns) of supernodes by padding the rows and columns that have only slightly different nonzero patterns, so that they can be merged into the same supernode. The supernodes in the factors are typically detected and recorded as they emerge during symbolic
factorization. In the remainder of this entry, the term supernode refers to a supernode in the factors. It can be seen from the algorithms in Figs. and that there are two primary computations in a column-based factorization: the division step and the update step. A supernode-based sparse factorization, too, has the same two basic computation steps, except that these are now matrix operations on row/column blocks corresponding to the various supernodes. Supernodes impart efficiency to numerical factorization and triangular solves because they permit floating point operations to be performed on dense submatrices instead of individual nonzeros, thus improving memory hierarchy utilization. Since rows and columns in supernodes share the nonzero structure, indirect addressing is minimized because the structure needs to be stored only once for these rows and columns. Supernodes help to increase the granularity of tasks, which is useful for improving the computation-to-overhead ratio in a parallel implementation. The task graph model of sparse matrix factorization was introduced earlier with a task k defined as the factorization of row and column k of the matrix. With supernodes, a task can be defined as the factorization of all rows and columns associated with a supernode. Actual task graphs in practical implementations of parallel sparse solvers are almost always supernodal task graphs. Note that some applications, such as power grid analysis, in which the basis of the linear system is not a finite-element or finite-difference discretization of a physical domain, can give rise to sparse matrices that incur very little fill-in during factorization. The factors of these matrices may have very small supernodes.
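A minimal C++ sketch of how supernodes can be detected once the column patterns of L are known from symbolic factorization is shown below; the input format and names are assumptions, and padding (amalgamation) of nearly identical columns is not shown.

#include <vector>

// colPattern[j] holds the sorted row indices of the strictly lower-triangular
// nonzeros of L(:, j). Column j joins the supernode of column j-1 when the
// pattern of column j-1 equals {j} followed by the pattern of column j.
std::vector<int> detectSupernodes(const std::vector<std::vector<int>>& colPattern) {
    int n = static_cast<int>(colPattern.size());
    std::vector<int> firstCol(n);      // first column of each column's supernode
    for (int j = 0; j < n; ++j) {
        bool extends = false;
        if (j > 0) {
            const std::vector<int>& prev = colPattern[j - 1];
            const std::vector<int>& cur  = colPattern[j];
            extends = !prev.empty() && prev[0] == j &&
                      std::vector<int>(prev.begin() + 1, prev.end()) == cur;
        }
        firstCol[j] = extends ? firstCol[j - 1] : j;
    }
    return firstCol;
}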
An Effective Parallelization Strategy The task graphs for sparse matrix factorization have some typical properties that make scheduling somewhat different from traditional DAG scheduling. Note that the task graphs corresponding to irreducible matrices have a distinct root; that is, one node that has no outgoing edges. This corresponds to the last (rightmost) supernode in the matrix. The number of member rows and columns in supernodes typically increases away from the leaves and toward the root of the task graph. The reason is that a supernode accumulates fill-in from all its predecessors in the task graph. As a result, the portions of the factors that correspond to task graph
nodes with a large number of predecessors tend to get denser. Due to their larger supernodes, the tasks that are relatively close to the root tend to have more work associated with them. On the other hand, the width of the task graph shrinks close to the root. In other words, a typical task graph for sparse matrix factorization tends to have a large number of small independent tasks closer to the leaves, but a small number of large tasks closer to the root. An ideal parallelization strategy that would match the characteristics of the problem is as follows. Starting out, the relatively plentiful independent tasks at or near the leaves would be scheduled to parallel threads or processes. As tasks complete, other tasks become available and would be scheduled similarly. This could continue until there are enough independent tasks to keep all the threads or processes busy. When the number of available parallel tasks becomes smaller than the number of available threads or processes, then the only way to keep the latter busy would be to utilize more than one of them per task. The number of threads or processes working on individual tasks would increase as the number of parallel tasks decreases. Eventually, all threads or processes would work on the root task. The computation corresponding to the root task is equivalent to factoring a dense matrix of the size of the root supernode.
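The bookkeeping behind this "leaves first" strategy is simple to sketch: a supernode becomes ready when all of its children in the task tree have completed. The C++ sketch below produces one valid execution order and omits the switch to data parallelism near the root; all names are illustrative.

#include <queue>
#include <vector>

// parent[v] is the elimination-tree parent of supernode v, or -1 for the root.
std::vector<int> scheduleLeavesFirst(const std::vector<int>& parent) {
    int n = static_cast<int>(parent.size());
    std::vector<int> pending(n, 0);            // number of unfinished children
    for (int v = 0; v < n; ++v)
        if (parent[v] != -1) ++pending[parent[v]];
    std::queue<int> ready;                     // initially the leaves
    for (int v = 0; v < n; ++v)
        if (pending[v] == 0) ready.push(v);
    std::vector<int> order;
    while (!ready.empty()) {
        int v = ready.front(); ready.pop();
        order.push_back(v);                    // "factor" supernode v here
        int p = parent[v];
        if (p != -1 && --pending[p] == 0) ready.push(p);
    }
    return order;                              // a leaves-to-root execution order
}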
Sparse Factorization Formulations Based on Task Roles So far in this entry, the tasks have been defined somewhat ambiguously. There are multiple ways of defining the tasks precisely, which can result in different parallel implementations of sparse matrix factorization. Clearly, a task is associated with a supernode and is responsible for computing that supernode of the factors; that is, performing the computation equivalent to the division steps in the algorithms in Figs. and . However, a task does not own all the data that is required to compute the final values of its supernode’s rows and columns. The data for performing the update steps on a supernode may be contributed by many other supernodes. Based on the tasks’ responsibilities, sparse LU factorization has traditionally been classified into three categories, namely, left-looking, right-looking, and Crout. These variations are illustrated in Fig. . The traditional left-looking variant uses nonconforming supernodes made up of columns of both L and U, which are not very
Sparse Direct Methods. Fig. The left-looking (a), right-looking (b), and Crout (c) variations of sparse LU factorization. Different patterns indicate the parts of the matrix that are read, updated, and factored by the task corresponding to a supernode. Blank portions of the matrix are not accessed by this task
common in practice. In this variant, a task is responsible for gathering all the data required for its own columns from other tasks and for updating and factoring its columns. The left-looking formulation is rarely used in modern high-performance sparse direct solvers. In the right-looking variation of sparse LU, a task factors the supernode that it owns and performs all the updates that use the data from this supernode. In the Crout variation, a task is responsible for updating and factoring the supernode that it owns. Only the right-looking and Crout variants have symmetric counterparts. A fourth variation, known as the multifrontal method, incorporates elements of both right-looking and Crout formulations. In the multifrontal method, the task that owns a supernode computes its own contribution to updating the remainder of the matrix (like the right-looking formulation), but does not actually apply the updates. Each task is responsible for collecting all relevant precomputed updates and applying them to its supernode (like the Crout formulation) before factoring the supernode. The supernode data and its update contribution in the multifrontal method is organized into small dense matrices called frontal matrices. Integer arrays maintain a mapping of the local contiguous indices of the frontal matrices to the global indices of the sparse factor matrices. Figure illustrates the complete supernodal multifrontal Cholesky factorization of the symmetric matrix shown in Fig. a. Note that, since rows and columns with indices – form a supernode,
there would be only one task (Fig. g) corresponding to these in the supernodal task graph (elimination tree). When a task is ready for execution, it first constructs its frontal matrix by accumulating contributions from the frontal matrices of its children and from the coefficient matrix. It then factors its supernode, which is the portion of the frontal matrix that is shaded in Fig. . After factorization, the unshaded portion (this submatrix of a frontal matrix is called the update matrix) is updated based on the update step of the algorithm in Fig. . The update matrix is then used by the parent task to construct its frontal matrix. Note that Fig. illustrates a symmetric multifrontal factorization; hence, the frontal and update matrices are triangular. For general LU decomposition, these matrices would be square or rectangular. In symmetric multifrontal factorization, a child's update matrix in the elimination tree contributes only to its parent's frontal matrix. The task graph for general matrices is usually not a tree, but a DAG, as shown in Fig. . Apart from the shape of the frontal and update matrices, the unsymmetric-pattern multifrontal method differs from its symmetric counterpart in two other ways. First, an update matrix can contribute to more than one frontal matrix. Second, the frontal matrices receiving data from an update matrix can belong to the contributing supernode's ancestors (not necessarily parents) in the task graph.
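The core data movement of the multifrontal method, assembling a child's update matrix into its parent's frontal matrix, is often called extend-add. Below is a minimal C++ sketch for dense row-major storage; the index-map arrays and all names are assumptions used for illustration.

#include <vector>

// parentFront is a parentDim x parentDim dense frontal matrix (row-major);
// childUpdate is a cDim x cDim update matrix; childRows maps each local index
// of the child's update matrix to a global index; parentPos[g] gives the local
// position of global index g within the parent's frontal matrix.
void extendAdd(std::vector<double>& parentFront, int parentDim,
               const std::vector<double>& childUpdate,
               const std::vector<int>& childRows,
               const std::vector<int>& parentPos) {
    int cDim = static_cast<int>(childRows.size());
    for (int i = 0; i < cDim; ++i) {
        int pi = parentPos[childRows[i]];
        for (int j = 0; j < cDim; ++j) {
            int pj = parentPos[childRows[j]];
            // Scatter-add the child's contribution into the parent's frontal matrix.
            parentFront[pi * parentDim + pj] += childUpdate[i * cDim + j];
        }
    }
}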
Sparse Direct Methods. Fig. Frontal matrices and data movement among them in the supernodal multifrontal Cholesky factorization of the sparse matrix shown in Fig. a
The multifrontal method is often the formulation of choice for highly parallel implementations of sparse matrix factorization. This is because of its natural data locality (most of the work of the factorization is performed in the well-contained dense frontal matrices) and the ease of synchronization that it permits. In general, each supernode is updated by multiple other supernodes and it can potentially update many other supernodes during the course of factorization. If implemented naively, all these updates may require excessive locking and synchronization in a shared-memory environment or generate excessive message traffic in a distributed environment. In the multifrontal method, the updates are accumulated and channeled along the paths from the leaves of the task graph to its root. This gives a manageable structure to the potentially haphazard interaction among the tasks. Recall that the typical supernodal sparse factorization task graph is such that the size of tasks generally increases and the number of parallel tasks generally
diminishes on the way to the root from the leaves. The multifrontal method is well suited for both task parallelism (close to the leaves) and data parallelism (close to the root). Larger tasks working on large frontal matrices close to the root can readily employ multiple threads or processes to perform parallel dense matrix operations, which not only have well-understood data-parallel algorithms, but also a well-developed software base.
Pivoting in Parallel Sparse LDL^T and LU Factorization The discussion in this entry so far has focused on the scenario in which the rows and columns of the matrix are permuted during the ordering phase and this permutation stays static during numerical factorization. While this assumption is valid for a large class of practical problems, there are applications that generate matrices that could encounter a zero or a very small entry on the diagonal during the factorization process.
This will cause the division step of the LU decomposition algorithm to fail or to result in numerical instability. For nonsingular matrices, this problem can be solved by interchanging rows and columns of the matrix by a process known as partial pivoting. When a small or zero entry is encountered at A[i, i] before the division step, then row i is interchanged with another row j (i < j ≤ n) such that A[j, i] (which would occupy the location A[i, i] after the interchange) is sufficiently greater in magnitude compared to other entries A[k, i] (i < k ≤ n, k ≠ j). Similarly, instead of row i, column i could be exchanged with a suitable column j (i < j ≤ n). In symmetric LDL^T factorization, both row and column i are interchanged simultaneously with a suitable row–column pair to maintain symmetry. Until recently, it was believed that, owing to unpredictable changes in the structure of the factors caused by partial pivoting, a priori ordering and symbolic factorization could not be performed, and these steps needed to be combined with numerical factorization. Keeping the analysis and numerical factorization steps separate has substantial performance and parallelization benefits, which would be lost if these steps were combined. Fortunately, modern parallel sparse solvers are able to perform partial pivoting and maintain numerical stability without mixing the analysis and numerical steps. The multifrontal method permits effective implementation of partial pivoting in parallel and keeps its effects as localized as possible. Before computing a fill-reducing ordering, the rows or columns of the coefficient matrix are permuted such that the product of the magnitudes of the diagonal entries is maximized. Special graph matching algorithms are used to compute this permutation. This step ensures that the diagonal entries of the matrix have relatively large magnitudes at the beginning of factorization. It has been observed that once the matrix has been permuted this way, in most cases, very few interchanges are required during the factorization process to keep it numerically stable. As a result, factorization can be performed using the static task graph and the static structures of the supernodes of L and U predicted by symbolic factorization. When an interchange is necessary, the resulting changes in the data structures are registered. Since such interchanges are rare, the resulting disruption and overhead are usually well contained.
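A minimal C++ sketch of the threshold test used to pick a pivot within one dense frontal matrix (row-major, m rows by n columns) is given below; the threshold value, the return convention for delayed pivots, and the names are assumptions for illustration.

#include <cmath>
#include <vector>

// Returns the row to use as the pivot for column k among rows k..m-1, or -1 if
// the whole column is numerically zero and the pivot must be delayed.
int choosePivot(const std::vector<double>& F, int m, int n, int k, double threshold = 0.1) {
    int best = k;
    double maxMag = 0.0;
    for (int i = k; i < m; ++i) {
        double mag = std::fabs(F[i * n + k]);
        if (mag > maxMag) { maxMag = mag; best = i; }
    }
    if (maxMag == 0.0) return -1;                       // candidate for delayed pivoting
    if (std::fabs(F[k * n + k]) >= threshold * maxMag)
        return k;                                       // diagonal entry acceptable: no interchange
    return best;                                        // interchange rows k and best
}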
The first line of defense against numerical instability is to perform partial pivoting within a frontal matrix. Exchanging rows or columns within a supernode is local, and if all rows and columns of a supernode can be successfully factored by simply altering their order, then nothing outside the supernode is affected. Sometimes, a supernode cannot be factored completely by local interchanges. This can happen when all candidate rows or columns for interchange have indices greater than that of the last row–column pair of the supernode. In this case, a technique known as delayed pivoting is employed. The unfactored rows and columns are simply removed from the current supernode and passed on to the parent (or parents) in the task graph. Merged with the parent supernode, these rows and columns have additional candidate rows and columns available for interchange, which increases the chances of their successful factorization. The process of upward migration of unsuccessful pivots continues until they are resolved, which is guaranteed to happen at the root supernode for a nonsingular matrix. In the multifrontal framework, delayed pivoting simply involves adding extra rows and columns to the frontal matrices of the parents of a supernode with failed pivots. The process is straightforward for the supernodes whose tasks are mapped onto individual threads or processes. For the tasks that require data-parallel involvement of multiple threads or processes, the extra rows and columns can be partitioned using the same strategy that is used to partition the original frontal matrix.
Parallel Solution of Triangular Systems As mentioned earlier, solving the original system after factoring the coefficient matrix involves solving a sparse lower triangular and a sparse upper triangular system. The task graph constructed for factorization can be used for the triangular solves too. For matrices with an unsymmetric pattern, a subset of edges of the task DAG may be redundant in each of the solve phases, but these redundant edges can be easily marked during symbolic factorization. Just like factorization, the computation for the lower triangular solve phase starts at the leaves of the task graph and proceeds toward the root. In the upper triangular solve phase, computation starts at the root and fans out toward the leaves (in other
words, the direction of the edges in the task graph is effectively reversed).
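As a reference for the numerical kernel behind the lower triangular solve, here is a minimal sequential C++ sketch for a unit lower triangular factor stored by columns; the storage format and names are assumptions, and the parallel versions simply execute these column updates in the task-graph order described above.

#include <utility>
#include <vector>

// Lcol[j] lists the below-diagonal nonzeros of column j of L as (row, value)
// pairs; the diagonal of L is implicitly 1. Solves L y = b and returns y.
std::vector<double> lowerSolve(const std::vector<std::vector<std::pair<int, double>>>& Lcol,
                               std::vector<double> b) {
    int n = static_cast<int>(Lcol.size());
    for (int j = 0; j < n; ++j) {
        double yj = b[j];                     // y[j] is final once columns 0..j-1 are applied
        for (const auto& [i, lij] : Lcol[j])
            b[i] -= lij * yj;                 // eliminate y[j] from later equations
    }
    return b;                                 // b now holds y
}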
Related Entries
Dense Linear System Solvers
Multifrontal Method
Reordering
Bibliographic Notes and Further Reading Books by George and Liu [] and Duff et al. [] are excellent sources for a background on sparse direct methods. A comprehensive survey by Demmel et al. [] sums up the developments in parallel sparse direct solvers until the early s. Some remarkable progress was made in the development of parallel algorithms and software for sparse direct methods during a decade starting in the early s. Gupta et al. [] developed the framework for highly scalable parallel formulations of symmetric sparse factorization based on the multifrontal method (see tutorial by Liu [] for details), and recently demonstrated scalable performance of an industrial strength implementation of their algorithms on thousands of cores []. Demmel et al. [] developed one of the first scalable algorithms and software for solving unsymmetric sparse systems without partial pivoting. Amestoy et al. [, ] developed parallel algorithms and software that incorporated partial pivoting for solving unsymmetric systems with (either natural or forced) symmetric pattern. Hadfield [] and Gupta [] laid the theoretical foundation for a general unsymmetric pattern parallel multifrontal algorithm with partial pivoting, with the latter following up with a practical implementation [].
Bibliography . Amestoy PR, Duff IS, Koster J, L’Excellent JY () A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J Matrix Anal Appl ():– . Amestoy PR, Duff IS, L’Excellent JY () Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput Methods Appl Mech Eng :– . Demmel JW, Gilbert JR, Li XS () An asynchronous parallel supernodal algorithm for sparse Gaussian elimination. SIAM J Matrix Anal Appl ():– . Demmel JW, Heath MT, van der Vorst HA () Parallel numerical linear algebra. Acta Numerica :–
. Duff IS, Erisman AM, Reid JK () Direct methods for sparse matrices. Oxford University Press, Oxford, UK . George A, Liu JW-H () Computer solution of large sparse positive definite systems. Prentice-Hall, NJ . Gupta A () Improved symbolic and numerical factorization algorithms for unsymmetric sparse matrices. SIAM J Matrix Anal Appl ():– . Gupta A () A shared- and distributed-memory parallel general sparse direct solver. Appl Algebra Eng Commun Comput ():– . Gupta A, Karypis G, Kumar V () Highly scalable parallel algorithms for sparse matrix factorization. IEEE Trans Parallel Distrib Syst ():– . Gupta A, Koric S, George T () Sparse matrix factorization on massively parallel computers. In: SC Proceedings, ACM, Portland, OR, USA . Hadfield SM () On the LU factorization of sequences of identically structured sparse matrices within a distributed memory environment. PhD thesis, University of Florida, Gainsville, FL . Liu JW-H () The multifrontal method for sparse matrix solution: theory and practice. SIAM Rev ():–
Sparse Gaussian Elimination
Sparse Direct Methods
SuperLU
Sparse Iterative Methods, Preconditioners for
Preconditioners for Sparse Iterative Methods
SPEC Benchmarks
Matthias Müller, Brian Whitney, Robert Henschel, Kalyan Kumaran
Technische Universität Dresden, Dresden, Germany
Oracle Corporation, Hillsboro, OR, USA
Indiana University, Bloomington, IN, USA
Argonne National Laboratory, Argonne, IL, USA
Synonyms SPEC HPC; SPEC HPC; SPEC MPI; SPEC OMP
Definition Application-based Benchmarks measure the performance of computer systems by running a set of applications with a well defined configuration and workload.
Discussion Introduction The Standard Performance Evaluation Corporation (SPEC [Product and service names mentioned herein may be the trademarks of their respective owners]) is an organization for creating industry-standard benchmarks to measure various aspects of modern computer system performance. SPEC was founded in . In January , the High-Performance Group of the Standard Performance Evaluation Corporation (SPEC HPG) was founded with the mission to establish, maintain, and endorse a suite of benchmarks representative of real-world, high-performance computing applications. Several efforts joined forces to form SPEC HPG and to initiate a new benchmarking venture that is supported broadly. Founding partners included the member organizations of SPEC, former members of the Perfect Benchmark effort, and representatives of areaspecific benchmarking activities. Other benchmarking organizations have joined the SPEC HPG committee since its formation. SPEC HPG has developed various benchmark suites and its run rules over the last few years. The purpose of those benchmarks and their run rules is to further the cause of fair and objective benchmarking of highperformance computing systems. Results obtained with the benchmark suites are to be reviewed to see whether the individual run rules have been followed. Once they are accepted, the results are published on the SPEC web site (http://www.spec.org). All results, including a comprehensive description of the hardware they were produced on, are freely available. SPEC believes that the user community benefits from an objective series of tests which serve as a common reference. The development of the benchmark suites includes obtaining candidate benchmark codes, putting these codes into the SPEC harness, testing and improving the codes’ portability across as many operating systems, compilers, interconnects, runtime libraries as possible, and testing the codes for correctness and scalability.
The codes are put into the SPEC harness. The SPEC harness is a set of tools that allow users of the benchmark suite to easily run the suite, and obtain validated and publishable results. The users then only need to submit the results obtained to SPEC for review and publication on the SPEC web site. The goals of the run rules of the benchmark suites are to ensure that published results are meaningful, comparable to other results, and reproducible. A result must contain enough information to allow another user to reproduce the result. The performance tuning methods employed when attaining a result should be more than just “prototype” or “experimental” or “research” methods; there must be a certain level of maturity and general applicability in the performance methods employed, e.g., the used compiler optimization techniques should be beneficial for other applications as well and the compiler should be generally available and supported. Two set of metrics can be measured with the benchmark suite: “Peak” and “Base” metrics. “Peak” metrics may also be referred to as “aggressive compilation,” e.g., they may be produced by building each benchmark in the suite with a set of optimizations individually selected for that benchmark, and running them with environment settings individually selected for that benchmark. The optimizations selected must adhere to the set of general benchmark optimization rules. Base optimizations must adhere to a stricter set of rules than the peak optimizations. For example, the “Base” metrics must be produced by building all the benchmarks in the suite with a common set of optimizations and running them with environment settings common to all the benchmarks in the suite.
SPEC HPC The efforts of SPEC HPG began in when a group from industry and academia came together to try and provide a benchmark suite based upon the principles that had started with SPEC. Two of the more popular benchmarks suites at the time were the NAS Parallel Benchmarks and the PERFECT Club Benchmarks. The group built upon the direction these benchmarks provided to produce their first benchmark, SPEC HPC. The benchmark SPEC HPC came out originally with two components, SPECseis and SPECchem, with a third component SPECclimate available later.
Each of these benchmarks provided a set of rules, code, and validation which allowed benchmarking across a wide variety of hardware, including parallel platforms. SPECseis is a benchmark application that was originally developed at Atlantic Richfield Corporation (ARCO). This benchmark was designed to test computation that was of interest to the oil and gas industry, in particular, time and depth migrations which are used to locate gas and oil deposits. SPECchem is a benchmark based upon the application GAMESS (General Atomic and Molecular Electronic Structure System). This computational chemistry code was used in the pharmaceutical and chemical industries for drug design and bonding analysis. SPECclimate is a benchmark based upon the application MM, the PSU/NCAR limited-area, hydrostatic or non-hydrostatic, sigma-coordinate model designed to simulate or predict mesoscale and regional-scale atmospheric circulation. MM was developed by the Pennsylvania State University (Penn State) and the University Corporation for Atmospheric Research (UCAR). SPEC HPC was retired in February , a little after the introduction of SPEC HPC and SPEC OMP. The results remain accessible on the SPEC web site, for reference purposes.
SPEC HPC The benchmark SPEC HPC was a follow-on to SPEC HPC. The update involved using newer versions of some of the software, as well as additional parallelism models. The use of MM was replaced with the application WRF. The benchmark was suitable for shared and distributed memory machines or clusters of shared memory nodes. SPEC HPC applications have been collected from among the largest, most realistic computational applications that are available for distribution by SPEC. In contrast to SPEC OMP, they were not restricted to any particular programming model or system architecture. Both shared-memory and message passing methods are supported. All codes of the current SPEC HPC suite were available in an MPI and an OpenMP programming model and they included two data set sizes.
The benchmark consisted of three scientific applications: SPECenv (WRF) is based on the WRF weather model, a state-of-the-art, non-hydrostatic mesoscale weather model, see http://www.wrf-model.org. The code consists of , lines of C and , lines of F. SPECseis was developed by ARCO to gain an accurate measure of performance of computing systems as it relates to the seismic processing industry, for procurement of new computing resources. The code is written in F and C and has approximately , lines. SPECchem is used to simulate molecules ab initio, at the quantum level, and to optimize atomic positions. It is maintained as a research interest under the name of GAMESS by the Gordon Research Group at Iowa State University and is of interest to the pharmaceutical industry. It consists of , lines of F and C. The SPEC HPC suite was retired in June . The results remain accessible on the SPEC web site, for reference purposes.
SPEC OMP SPEC's benchmark suite that measures performance using applications based on the OpenMP standard for shared-memory parallel processing. Two levels of workload (OMPM and OMPL) characterize the performance of medium- and large-sized systems. Benchmarks running under SPEC OMPM use up to . GB of memory, whereas the applications of SPEC OMPL require about . GB in a -thread run. The SPEC OMPM benchmark suite consists of large application programs, which represent the type of software used in scientific technical computing. The applications include modeling and simulation programs from the fields of chemistry, mechanical engineering, climate modeling, and physics. Of the application programs, are written in Fortran and (AMMP, ART, and EQUAKE) are written in C. The benchmarks require a virtual address space of about . GB in a -processor execution. The rationale for this size was to provide data sets fitting in a -bit address space. SPEC OMPL consists of application programs, of which are written in Fortran and (ART and
EQUAKE) are written in C. The benchmarks require a virtual address space of about . GB in a -processor run. The rationale for this size was to provide data sets significantly larger than those of the SPEC OMPM benchmarks, with a requirement for a -bit address space. The following is a short description of the application programs of SPEC OMP:
APPLU: Solves five coupled non-linear PDEs on a -dimensional logically structured grid, using the Symmetric Successive Over-Relaxation implicit time-marching scheme.
APSI: Lake environmental model, which predicts the concentration of pollutants. It solves the model for the mesoscale and synoptic variations of potential temperature, wind components, and for the mesoscale vertical velocity, pressure, and distribution of pollutants.
MGRID: Simple multigrid solver, which computes a -dimensional potential field.
SWIM: Weather prediction model, which solves the shallow water equations using a finite difference method.
FMAD: Crash simulation program. It simulates the inelastic, transient dynamic response of -dimensional solids and structures subjected to impulsively or suddenly applied loads. It uses an explicit finite element method.
ART: (Adaptive Resonance Theory) Neural network, which is used to recognize objects in a thermal image. The objects in the benchmark are a helicopter and an airplane.
GAFORT: Computes the global maximum fitness using a genetic algorithm. It starts with an initial population and then generates children who go through crossover, jump mutation, and creep mutation with certain probabilities.
EQUAKE: An earthquake-modeling program. It simulates the propagation of elastic seismic waves in large, heterogeneous valleys in order to recover the time history of the ground motion everywhere in the valley due to a specific seismic event. It uses a finite element method on an unstructured mesh.
WUPWISE: (Wuppertal Wilson Fermion Solver) A program in the field of lattice gauge theory. Lattice gauge theory is a discretization of quantum chromodynamics. Quark propagators are computed within a chromodynamic background field. The inhomogeneous lattice-Dirac equation is solved.
GALGEL: This problem is a particular case of the GAMM (Gesellschaft fuer Angewandte Mathematik und Mechanik) benchmark devoted to numerical analysis of oscillatory instability of convection in low-Prandtl-number fluids. This program is only part of OMPM.
AMMP: (Another Molecular Modeling Program) A molecular mechanics, dynamics, and modeling program. The benchmark performs a molecular dynamics simulation of a protein-inhibitor complex, which is embedded in water. This program is only part of OMPM.
SPEC MPI SPEC MPI is SPEC's benchmark suite for evaluating MPI-parallel, floating-point, compute-intensive performance across a wide range of cluster and SMP hardware. SPEC MPI continues the SPEC tradition of giving users the most objective and representative benchmark suite for measuring and comparing high-performance computer systems. SPEC MPI focuses on the performance of compute-intensive applications using the Message-Passing Interface (MPI), which means these benchmarks emphasize the performance of the type of computer processor (CPU), the number of computer processors, the MPI library, the communication interconnect, the memory architecture, the compilers, and the shared file system. It is important to remember the contribution of all these components; SPEC MPI performance intentionally depends on more than just the processor. SPEC MPI is not intended to stress other computer components such as the operating system, graphics, or the I/O system. The table below lists the codes, together with information on the benchmark suite (data set size), programming language, and application area.
SPEC Benchmarks. Table List of applications in SPEC MPI

Benchmark    Suite          Language   Application domain
.milc        medium         C          Physics: Quantum Chromodynamics (QCD)
.leslied     medium         Fortran    Computational Fluid Dynamics (CFD)
.GemsFDTD    medium         Fortran    Computational Electromagnetics (CEM)
.fds         medium         C/Fortran  Computational Fluid Dynamics (CFD)
.pop         medium, large  C/Fortran  Ocean Modeling
.tachyon     medium, large  C          Graphics: Parallel Ray Tracing
.RAxML       large          C          DNA Matching
.lammps      medium, large  C++        Molecular Dynamics Simulation
.wrf         medium         C/Fortran  Weather Prediction
.GAPgeofem   medium, large  C/Fortran  Heat Transfer using Finite Element Methods (FEM)
.tera_tf     medium, large  Fortran    D Eulerian Hydrodynamics
.socorro     medium         C/Fortran  Molecular Dynamics using Density-Functional Theory (DFT)
.zeusmp      medium, large  C/Fortran  Physics: Computational Fluid Dynamics (CFD)
.lu          medium, large  Fortran    Computational Fluid Dynamics (CFD)
.dmilc       large          C          Physics: Quantum Chromodynamics (QCD)
.dleslie     large          Fortran    Computational Fluid Dynamics (CFD)
.lGemsFDTD   large          Fortran    Computational Electromagnetics (CEM)
.lwrf        large          C/Fortran  Weather Prediction

.milc, .dmilc stands for MIMD Lattice Computation and is a quantum chromodynamics (QCD) code for lattice gauge theory with dynamical quarks. Lattice gauge theory involves the study of some of the fundamental constituents of matter, namely
quarks and gluons. In this area of quantum field theory traditional perturbative expansions are not useful, and introducing a discrete lattice of spacetime points is the method of choice. .leslied, .dleslie The main purpose of this code is to model chemically reacting (i.e., burning) turbulent flows. Various different physical models are available in this algorithm. For SPEC MPI, the program has been set up to solve a test problem which represents a subset of such flows, namely the temporal mixing layer. This type of flow occurs in the mixing regions of all combustors that employ fuel injection (which is nearly all combustors). Also, this sort of mixing layer is a benchmark problem used to understand the physics of turbulent mixing. LESlied uses a strongly conservative, finite-volume algorithm with the MacCormack Predictor-Corrector time integration scheme. The accuracy is fourth-order spatially and second-order temporally. .GemsFDTD, .lGemsFDTD GemsFDTD solves the Maxwell equations in D in the time domain using the finite-difference time-domain (FDTD)
method. The radar cross section (RCS) of a perfectly conducting (PEC) object is computed. GemsFDTD is a subset of the code GemsTD developed in the General ElectroMagnetic Solvers (GEMS) project. The core of the FDTD method consists of second-order accurate central-difference approximations of Faraday's and Ampere's laws. These central differences are employed on a staggered Cartesian grid, resulting in an explicit finite-difference method. The FDTD method is also referred to as the Yee scheme. It is the standard time-domain method within computational electrodynamics (CEM). An incident plane wave is generated using so-called Huygens' surfaces. This means that the computational domain is split into a total-field part and a scattered-field part, where the scattered-field part surrounds the total-field part. A time-domain near-to-far-field transformation computes the RCS according to Martin and Pettersson. Fast Fourier transforms (FFT) are employed in the postprocessing. .lGemsFDTD contains extensive performance improvements as compared with .GemsFDTD.
.fds is a computational fluid dynamics (CFD) model of fire-driven fluid flow. The software solves numerically a form of the Navier-Stokes equations appropriate for low-speed, thermally driven flow, with an emphasis on smoke and heat transport from fires. It uses the block test case as its dataset. This dataset is similar to ones used to simulate fires in the World Trade Center; the code authors' agency conducted the investigation of the collapse. .pop The Parallel Ocean Program (POP) is a descendant of the Bryan-Cox-Semtner class of ocean models first developed by Kirk Bryan and Michael Cox at the NOAA Geophysical Fluid Dynamics Laboratory in Princeton, NJ, in the late s. POP had its origins in a version of the model developed by Semtner and Chervin. POP is the ocean component of the Community Climate System Model. Time integration of the model is split into two parts. The three-dimensional vertically varying (baroclinic) tendencies are integrated explicitly using a leapfrog scheme. The very fast vertically uniform (barotropic) modes are integrated using an implicit free-surface formulation in which a preconditioned conjugate gradient solver is used to solve for the two-dimensional surface pressure. .tachyon is a ray tracing program. It implements all of the basic geometric primitives such as triangles, planes, spheres, cylinders, etc. Tachyon is nearly embarrassingly parallel. As a result, MPI usage tends to be much lower as compared to other types of MPI applications. The scene to be rendered is partitioned into a fixed number of pieces, which are distributed by the master process to each processor participating in the computation. Each processor then renders its piece of the scene in parallel, independent of the other processors. Once a processor completes the rendering of its particular piece of the scene, it waits until the other processors have rendered their pieces of the scene, and then transmits its piece back to the master process. The process is repeated until all pieces of the scene have been rendered. .lammps is a classical molecular dynamics simulation code designed to run efficiently on parallel computers. It was developed at Sandia National Laboratories, a US Department of Energy facility, with
funding from the DOE. LAMMPS divides D space into D sub-volumes, e.g., a P = A×B×C grid of processors, where P is the total number of processors. It tries to make the sub-volumes as cubic as possible, since the volume of data exchanged is proportional to the surface of the sub-volume. .wrf, .lwrf is a weather forecasting code based on the Weather Research and Forecasting (WRF) Model, which is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. The code is written in Fortran. WRF features multiple dynamical cores, a dimensional variational (DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility. Multi-level parallelism support includes distributed memory (MPI), shared memory (OpenMP), and hybrid shared/distributed modes of execution. In the SPEC MPI version of WRF, all OpenMP directives have been switched off. WRF version .. is used in the .wrf benchmark; WRF version .. is used in the .lwrf benchmark. .GAPgeofem is a GeoFEM-based parallel finite-element code for transient thermal conduction with gap radiation and very heterogeneous material properties. GeoFEM is an acronym for Geophysical Finite Element Methods; it is software used on the Japanese Earth Simulator system for the modeling of solid earth phenomena such as mantle-core convection, plate tectonics, and seismic wave propagation and their coupled phenomena. A backward Euler implicit time-marching scheme has been adopted. Linear equations are solved by parallel CG (conjugate gradient) iterative solvers with point Jacobi preconditioning. .tera_tf is a three-dimensional Eulerian hydrodynamics application using a second-order Godunov-type scheme and a third-order remapping. It uses mostly point-to-point messages, and some reductions use non-blocking messages. The global domain is a cube with N cells in each direction, which amounts to a total of N^3 cells. To set up the problem, one needs to define the number of cells in each direction and the number of blocks in each direction. Each block corresponds to an MPI task.
.socorro is a modular, object-oriented code for performing self-consistent electronic-structure calculations utilizing the Kohn-Sham formulation of density-functional theory. Calculations are performed using a plane-wave basis and either norm-conserving pseudopotentials or projector augmented wave functions. Several exchange-correlation functionals are available for use, including the local-density approximation (Perdew-Zunger or Perdew-Wang parameterizations of the Ceperley-Alder QMC correlation results) and the generalized-gradient approximation (PW, PBE, and BLYP). Both Fourier-space and real-space projectors have been implemented, and a variety of methods are available for relaxing atom positions, optimizing cell parameters, and performing molecular dynamics calculations. .zeusmp is a computational fluid dynamics code developed at the Laboratory for Computational Astrophysics (NCSA, SDSC, University of Illinois at Urbana-Champaign, UC San Diego) for the simulation of astrophysical phenomena. The program solves the equations of ideal (non-resistive), non-relativistic hydrodynamics and magnetohydrodynamics, including externally applied gravitational fields and self-gravity. The gas can be adiabatic or isothermal, and the thermal pressure is isotropic. Boundary conditions may be specified as reflecting, periodic, inflow, or outflow. .zeusmp is based on ZEUS-MP Version , which is a Fortran rewrite of ZEUS-MP under development at UCSD (by John Hayes). It includes new physics models such as flux-limited radiation diffusion (FLD), and multispecies fluid advection is added to the physics set. ZEUS-MP divides the computational space into D tiles that are distributed across processors. The physical problem solved in SPEC MPI is a D blastwave simulated with the presence of a uniform magnetic field along the x-direction. A Cartesian grid is used and the boundaries are “outflow.” .lu has a rich ancestry in benchmarking. Its immediate predecessor is the LU benchmark in NPB.MPI, part of the NAS Parallel Benchmark suite. It is sometimes referred to as APPLU (a version of which was .applu in CPU) or NASLU. The NAS-LU code is a simplified compressible
Navier-Stokes equation solver. It does not perform an LU factorization, but instead implements a symmetric successive over-relaxation (SSOR) numerical scheme to solve a regular-sparse, block lower and upper triangular system. The code computes the solution of five coupled nonlinear PDEs on a -dimensional logically structured grid, using an implicit pseudo-time marching scheme based on a two-factor approximate factorization of the sparse Jacobian matrix. This scheme is functionally equivalent to a nonlinear block SSOR iterative scheme with lexicographic ordering. Spatial discretization of the differential operators is based on a second-order accurate finite volume scheme. The code insists on strict lexicographic ordering during the solution of the regular sparse lower and upper triangular matrices. As a result, the degree of exploitable parallelism during this phase is limited to O(N^2), as opposed to O(N^3) in other phases, and its spatial distribution is non-homogeneous. This fact also creates challenges during loop re-ordering to enhance cache locality.
Related Entries
Benchmarks
HPC Challenge Benchmark
NAS Parallel Benchmarks
Perfect Benchmarks
Bibliographic Notes and Further Reading There are numerous efforts to create benchmarks for different purposes. One of the early application benchmark efforts is the so-called “Perfect Club Benchmarks” [], an effort that, among others, initiated the SPEC High Performance Group (HPG) activities []. The goal of SPEC HPG is to create application benchmarks to measure the performance of high-performance computers. There are also other efforts that share this goal. Often such a collection is assembled during a procurement process. However, it is normally not used outside this specific process, nor does such a collection claim to be representative of a wider community. Two of the collections that are in wider use are the NAS Parallel Benchmarks [] and the HPC Challenge benchmark []. The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel
supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. HPCC consists of synthetic kernels measuring different aspects of a parallel system, like CPU, memory subsystem, and interconnect. Two other benchmark collections that consist of applications are the ASCI benchmarks used in the procurement process of various ASCI machines. ASCI-Purple [] consists of nine benchmarks and three stress tests. Its last update was in . The ASCI-Sequoia benchmark [] is its successor; it consists of applications and synthetic benchmarks, but only seven codes are MPI parallel. A similar collection is the DEISA benchmark suite []. It contains codes, but for licensing reasons, three of the benchmarks must be obtained directly from the code authors and placed in the appropriate location within the benchmarking framework. There are also a number of publications with more details about the SPEC HPG benchmarks. Details of SPEC HPC are available in []. Characteristics of the SPEC OMP benchmark suite are described by Saito et al. [] and Müller et al. []. Aslot et al. [] have presented the benchmark suite. Aslot et al. [] and Iwashita et al. [] have described performance characteristics of the benchmark suite. The SPEC High Performance Group published a more detailed description of SPEC MPI in []. Some performance characteristics are depicted in [].
Bibliography . ASCI-Purple home page () https://asc.llnl.gov/computing_ resources/purple/archive/benchmarks . ASCI-Sequoia home page. https://asc.llnl.gov/sequoia/ benchmarks. . Aslot V, Domeika M, Eigenmann R, Gaertner G, Jones WB, Parady B () SPEComp: a new benchmark suite for measuring parallel computer performance. In: Eigenmann R, Voss MJ (eds) WOMPAT’: workshop on openmp applications and tools. LNCS, vol . Springer, Heidelberg, pp – . Aslot V, Eigenmann R () Performance characteristics of the SPEC OMP benchmarks. In: rd European workshop on OpenMP, EWOMP’, Barcelona, Spain, September . Bailey D, Harris T, Saphir W, van der Wijngaart R, Woo A, Yarrow M () The NAS parallel benchmarks .. Technical report NAS--, NASA Ames Research Center, Moffett Field, CA. http://www.nas.nasa.gov/Software/NPB . Berry M, Chen D, Koss P, Kuck D, Lo S, Pang Y, Pointer, Rolo R, Sameh A, Clementi E, Chin S, Schneider D, Fox G, Messina P,
Walker D, Hsiung C, Schwarzmeier J, Lue K, Orszag S, Seidl F, Johnson O, Goodrum R, Martin J () The perfect club benchmarks: effective performance evaluation of supercomputers. Int J High Perform Comp Appl ():– DEISA benchmark suite home page. http://www.deisa.eu/ science/benchmarking Dongarra J, Luszczek P () Introduction to the hpcchallenge benchmark suite. ICL technical report ICL-UT--, ICL Eigenmann R, Hassanzadeh S () Benchmarking with real industrial applications: the SPEC high-performance group. IEEE Comp Sci & Eng ():– Eigenmann R, Gaertner G, Jones W () SPEC HPC: the next high-performance computer benchmark. In: Lecture notes in computer science, vol . Springer, Heidelberg, pp – Iwashita H, Yamanaka E, Sueyasu N, van Waveren M, Miura K () The SPEC OMP benchmark on the Fujitsu PRIMEPOWER system. In: rd European workshop on OpenMP, EWOMP’, Barcelona, Spain, September Müller MS, Kalyanasundaram K, Gaertner G, Jones W, Eigenmann R, Lieberman R, van Waveren M, Whitney B () SPEC HPG benchmarks for high performance systems. Int J High Perform Comp Netw ():– Müuller MS, van Waveren M, Lieberman R, Whitney B, Saito H, Kumaran K, Baron J, Brantley WC, Parrott C, Elken T, Feng H, Ponder C () SPEC MPI – an application benchmark suite for parallel systems using MPI. Concurrency Computat: Pract Exper ():– Saito H, Gaertner G, Jones W, Eigenmann R, Iwashita H, Lieberman R, van Waveren M, Whitney B () Large system performance of SPEC OMP benchmarks. In: Zima HP, Joe K, Sata M, Seo Y, Shimasaki M (eds) High performance computing, th international symposium, ISHPC . Lecture notes in computer science, vol . Springer, Heidelberg, pp – Szebenyi Z, Wylie BJN, Wolf F () SCALASCA parallel performance analyses of SPEC MPI applications. In: Proceedings of the st SPEC international performance evaluation workshop (SIPEW). LNCS, vol . Springer, Heidelberg, pp –
SPEC HPC SPEC Benchmarks
SPEC HPC SPEC Benchmarks
SPEC MPI SPEC Benchmarks
SPEC OMP SPEC Benchmarks
Special-Purpose Machines
Anton, a Special-Purpose Molecular Simulation Machine
GRAPE
JANUS FPGA-Based Machine
QCD apeNEXT Machines
QCDSP and QCDOC Computers
Speculation
Speculative Parallelization of Loops
Speculation, Thread-Level
Transactional Memory
Speculation, Thread-Level
Josep Torrellas
University of Illinois at Urbana-Champaign, Urbana, IL, USA
Synonyms Speculative multithreading (SM); Speculative parallelization; Speculative run-time parallelization; Speculative threading; Speculative thread-level parallelization; Thread-level data speculation (TLDS); Thread level speculation (TLS) parallelization; TLS
Definition Thread-Level Speculation (TLS) refers to an environment where execution threads operate speculatively, performing potentially unsafe operations, and temporarily buffering the state they generate in a buffer or cache. At a certain point, the operations of a thread are declared to be correct or incorrect. If they are correct, the thread commits, merging the state it generated with the correct state of the program; if they are incorrect,
the thread is squashed and typically restarted from its beginning. The term TLS is most often associated with a scenario where the purpose is to execute a sequential application in parallel. In this case, the compiler or the hardware breaks down the application into speculative threads that execute in parallel. However, strictly speaking, TLS can be applied to any environment where threads are executed speculatively and can be squashed and restarted.
Discussion Basic Concepts in Thread-Level Speculation In its most common use, Thread-Level Speculation (TLS) consists of extracting units of work (i.e., tasks) from a sequential application and executing them on different threads in parallel, hoping not to violate sequential semantics. The control flow in the sequential code imposes a relative ordering between the tasks, which is expressed in terms of predecessor and successor tasks. The sequential code also induces a data dependence relation on the memory accesses issued by the different tasks that parallel execution cannot violate. A task is Speculative when it may perform or may have performed operations that violate data or control dependences with its predecessor tasks. Otherwise, the task is nonspeculative. The memory accesses issued by speculative tasks are called speculative memory accesses. When a nonspeculative task finishes execution, it is ready to Commit. The role of commit is to inform the rest of the system that the data generated by the task is now part of the safe, nonspeculative program state. Among other operations, committing always involves passing the Commit Token to the immediate successor task. This is because maintaining correct sequential semantics in the parallel execution requires that tasks commit in order from predecessor to successor. If a task reaches its end and is still speculative, it cannot commit until it acquires nonspeculative status and all its predecessors have committed. Figure shows an example of several tasks running on four processors. In this example, when task T executing on processor finishes the execution, it cannot commit until its predecessor tasks T, T, and T also finish and commit. In the meantime, depending on
Speculation, Thread-Level. Fig. A set of tasks executing on four processors. The figure shows the nonspeculative task timeline and the transfer of the commit token
the hardware support, processor may have to stall or may be able to start executing speculative task T. The example also shows how the nonspeculative task status changes as tasks finish and commit, and the passing of the commit token. Memory accesses issued by a speculative task must be handled carefully. Stores generate Speculative Versions of data that cannot simply be merged with the nonspeculative state of the program. The reason is that they may be incorrect. Consequently, these versions are stored in a Speculative Buffer local to the processor running the task – e.g., the first-level cache. Only when the task becomes nonspeculative are its versions safe. Loads issued by a speculative task try to find the requested datum in the local speculative buffer. If they miss, they fetch the correct version from the memory subsystem, i.e., the closest predecessor version from the speculative buffers of other tasks. If no such version exists, they fetch the datum from memory. As tasks execute in parallel, the system must identify any violations of cross-task data dependences. Typically, this is done with special hardware or software support that tracks, for each individual task, the data that the task wrote and the data that the task read without first writing it. A data-dependence violation is flagged when a task modifies a datum that has been read earlier by a successor task. At this point, the consumer
task is squashed and all the data versions that it has produced are discarded. Then, the task is re-executed. Figure shows an example of a data-dependence violation. In the example, each iteration of a loop is a task. Each iteration issues two accesses to an array, through an un-analyzable subscripted subscript. At run-time, iteration J writes A[] after its successor iteration J+ reads A[]. This is a Read After Write (RAW) dependence that gets violated due to the parallel execution. Consequently, iteration J+ is squashed and restarted. Ordinarily, all the successor tasks of iteration J+ are also squashed at this time because they may have consumed versions generated by the squashed task. While it is possible to selectively squash only tasks that used incorrect data, it would involve extra complexity. Finally, as iteration J+ reexecutes, it will re-read A[]. However, at this time, the value read will be the version generated by iteration J. Note that WAR and WAW dependence violations do not need to induce task squashes. The successor task has prematurely written the datum, but the datum remains buffered in its speculative buffer. A subsequent read from a predecessor task (in a WAR violation) will get a correct version, while a subsequent write from a predecessor task (in a WAW violation) will generate a version that will be merged with main memory before the one from the successor task.
Speculation, Thread-Level. Fig. Example of a data-dependence violation
However, many proposed TLS schemes, to reduce hardware complexity, induce squashes in a variety of situations. For instance, if the system has no support to keep different versions of the same datum in different speculative buffers in the machine, cross-task WAR and WAW dependence violations induce squashes. Moreover, if the system only tracks accesses on a per-line basis, it cannot disambiguate accesses to different words in the same memory line. In this case, false sharing of a cache line by two different processors can appear as a data-dependence violation and also trigger a squash. Finally, while TLS can be applied to various code structures, it is most often applied to loops. In this case, tasks are typically formed by a set of consecutive iterations. The rest of this article is organized as follows: First, the article briefly classifies TLS schemes. Then, it describes the two major problems that any TLS scheme has to solve, namely, buffering and managing speculative state, and detecting and handling dependence violations. Next, it describes the initial efforts in TLS, other uses of TLS, and machines that use TLS.
there have been several proposals for TLS compilers (e.g., [, , , ]). Very few schemes rely on the hardware to identify the tasks (e.g., []). Several schemes, especially in the early stages of TLS research, proposed software-only approaches to TLS (e.g., [, , , ]). In this case, the compiler typically generates code that causes each task to keep shadow locations and, after the parallel execution, checks if multiple tasks have updated a common location. If they have, the original state is restored. Most proposed TLS schemes target small sharedmemory machines of about two to eight processors (e.g., [, , , ]). It is in this range of parallelism that TLS is most cost effective. Some TLS proposals have focused on smaller machines and have extended a superscalar core with some hardware units that execute threads speculatively [, ]. Finally, some TLS proposals have targeted scalable multiprocessors [, , ]. This is a more challenging environment, given the longer communication latencies involved. It requires applications that have significant parallelism that cannot be analyzed statically by the compiler.
Classification of Thread-Level Speculation Schemes
There have been many proposals of TLS schemes. They can be broadly classified depending on the emphasis on hardware versus software, and the type of target machine. The majority of the proposed schemes use hardware support to detect cross-task dependence violations that result in task squashes (e.g., [, , , , , , , , , , , , , , , ]). Typically, this is attained by using the hardware cache coherence protocol, which sends coherence messages between the caches when multiple processors access the same memory line. Among all these hardware-based schemes, the majority rely on a compiler or a software layer to identify and prepare the tasks that should be executed in parallel. Consequently,
Buffering and Managing Speculative State
The state produced by speculative tasks is unsafe, since such tasks may be squashed. Therefore, any TLS scheme must be able to identify such state and, when necessary, separate it from the rest of the memory state. For this, TLS systems use structures such as caches [, , , , ], special buffers [, , , ], or undo logs [, , ]. This section outlines the challenges in buffering and managing speculative state. A more detailed analysis and a taxonomy are presented by Garzaran et al. [].
Multiple Versions of the Same Variable in the System Every time that a task writes for the first time to a variable, a new version of the variable appears in the
system. Thus, two speculative tasks running on different processors may create two different versions of the same variable [, ]. These versions need to be buffered separately, and special actions may need to be taken so that a reader task can find the correct version out of the several coexisting in the system. Such a version will be the version created by the producer task that is the closest predecessor of the reader task. A task has at most a single version of any given variable, even if it writes to the variable multiple times. The reason is that, on a dependence violation, the whole task is undone. Therefore, there is no need to keep intermediate values of the variable.
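To make this lookup concrete, the following C sketch (an illustrative data structure with invented names, not the design of any particular TLS system) tags each buffered version with the ID of its producer task. A load by a given task returns the task's own version if it has one, otherwise the version created by the closest predecessor task, falling back to the committed value when no such version exists.

#include <stdio.h>

#define MAX_VERSIONS 8

/* One buffered speculative version of a variable, tagged with the ID of the
 * task that produced it. */
typedef struct {
    int task_id;
    int value;
    int valid;   /* 1 if this slot holds a live, uncommitted version */
} Version;

typedef struct {
    int committed;                   /* safe value in main memory       */
    Version versions[MAX_VERSIONS];  /* coexisting speculative versions */
} SpeculativeVar;

/* Value seen by task 'reader': its own version if present, otherwise the
 * version created by the closest predecessor task, otherwise the committed
 * value from memory. */
int speculative_read(const SpeculativeVar *v, int reader)
{
    int best_id = -1, best_val = v->committed;
    for (int i = 0; i < MAX_VERSIONS; i++) {
        if (v->versions[i].valid &&
            v->versions[i].task_id <= reader &&   /* own task or predecessor */
            v->versions[i].task_id > best_id) {   /* closest one wins        */
            best_id  = v->versions[i].task_id;
            best_val = v->versions[i].value;
        }
    }
    return best_val;
}

int main(void)
{
    SpeculativeVar x = { .committed = 0 };
    x.versions[0] = (Version){ .task_id = 2, .value = 20, .valid = 1 };
    x.versions[1] = (Version){ .task_id = 5, .value = 50, .valid = 1 };

    printf("task 4 reads %d\n", speculative_read(&x, 4)); /* task 2's version: 20 */
    printf("task 7 reads %d\n", speculative_read(&x, 7)); /* task 5's version: 50 */
    printf("task 1 reads %d\n", speculative_read(&x, 1)); /* committed value:   0 */
    return 0;
}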
Multiple Speculative Tasks per Processor When a processor finishes executing a task, the task may still be speculative. If the TLS buffering support is such that the processor can only hold state from a single speculative task, the processor stalls until the task commits. However, to better tolerate task load imbalance, the local buffer may have been designed to buffer state from several speculative tasks, enabling the processor to execute another speculative task. In this case, the state of each task must be tagged with the ID of the task. Multiple Versions of the Same Variable in a Single Processor When a processor buffers state from multiple speculative tasks, it is possible that two such tasks create two versions of the same variable. This occurs in load-imbalanced applications that exhibit private data patterns (i.e., WAW dependences between tasks). In this case, the buffer will have to hold multiple versions of the same variable. Each version will be tagged with a different task ID. This support introduces complications in the buffer or cache. Indeed, on an external request, extra comparisons will need to be done if the cache has two versions of the same variable. Merging of Task State The state produced by speculative tasks is typically merged with main memory at task commit time; however, it can instead be merged as it is being generated. The first approach is called Architectural Main Memory (AMM) or Lazy Version Management; the second one is called Future Main Memory (FMM) or Eager Version Management. These schemes differ in whether the main
memory contains only safe data (AMM) or it can also contain speculative data (FMM). In AMM systems, all speculative versions remain in caches or buffers that are kept separate from the coherent memory state. Only when a task becomes nonspeculative can its buffered state be merged with main memory. In a straightforward implementation, when a task commits, all the buffered dirty cache lines are merged with main memory, either by writing back the lines to memory [] or by requesting ownership for them to obtain coherence with main memory []. In FMM systems, versions from speculative tasks are merged with the coherent memory when they are generated. However, to enable recovery from task squashes, when a task generates a speculative version of a variable, the previous version of the variable is saved in a log. Note that, in both approaches, the coherent memory state can temporarily reside in caches, which function in their traditional role of extensions of main memory.
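The eager (FMM) policy can be illustrated with a small software sketch of an undo log (an illustrative structure only, not the interface of an actual TLS implementation): a speculative store saves the overwritten value before updating memory in place, a squash replays the log backwards to restore the pre-speculation state, and a commit simply discards the log.

#include <stdio.h>

#define LOG_CAPACITY 1024

/* Undo log for one speculative task (eager version management): the task
 * writes memory in place and saves the overwritten values here. */
typedef struct {
    int *addr[LOG_CAPACITY];     /* location that was overwritten */
    int  old_val[LOG_CAPACITY];  /* value it held before the store */
    int  n;
} UndoLog;

/* Speculative store: log the previous value, then update memory in place. */
void spec_store(UndoLog *log, int *addr, int new_val)
{
    if (log->n < LOG_CAPACITY) {
        log->addr[log->n]    = addr;
        log->old_val[log->n] = *addr;
        log->n++;
    }
    *addr = new_val;             /* memory now holds the speculative value */
}

/* Squash: undo the task's writes in reverse order. */
void squash(UndoLog *log)
{
    for (int i = log->n - 1; i >= 0; i--)
        *log->addr[i] = log->old_val[i];
    log->n = 0;
}

/* Commit: the in-place values are already the correct ones; drop the log. */
void commit(UndoLog *log) { log->n = 0; }

int main(void)
{
    int a[3] = { 1, 2, 3 };
    UndoLog log = { .n = 0 };

    spec_store(&log, &a[0], 10);
    spec_store(&log, &a[2], 30);
    printf("speculative: %d %d %d\n", a[0], a[1], a[2]);  /* 10 2 30 */

    squash(&log);                                         /* violation detected */
    printf("after squash: %d %d %d\n", a[0], a[1], a[2]); /* 1 2 3 */
    return 0;
}

Under the lazy (AMM) policy, by contrast, the new values would go to a separate buffer and the log would be unnecessary; commit, rather than squash, would be the operation that touches main memory.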
Detecting and Handling Dependence Violations Basic Concepts The second aspect of TLS involves detecting and handling dependence violations. Most TLS proposals focus on data dependences, rather than control dependences. To detect (cross-task) data-dependence violations, most TLS schemes use the same approach. Specifically, when a speculative task writes a datum, the hardware sets a Speculative Write bit associated with the datum in the cache; when a speculative task reads a datum before it writes to it (an event called Exposed Read), the hardware sets an Exposed Read bit. Depending on the TLS scheme supported, these accesses also cause a tag associated with the datum to be set to the ID of the task. In addition, when a task writes a datum, the cache coherence protocol transaction that sends invalidations to other caches checks these bits. If a successor task has its Exposed Read bit set for the datum, the successor task has prematurely read the datum (i.e., this is a RAW dependence violation), and is squashed []. If the Speculative Write and Exposed Read bits are kept on a per-word basis, only dependences on the same word can cause squashes. However, keeping and maintaining such bits on a per-word basis in caches, network
messages, and perhaps directory modules is costly in hardware. Moreover, it does not come naturally to the coherence protocol of multiprocessors, which operate at the granularity of memory lines. Keeping these bits on a per-line basis is cheaper and compatible with mainstream cache coherence protocols. However, the hardware cannot then disambiguate accesses at word level. Furthermore, it cannot combine different versions of a line that have been updated in different words. Consequently, cross-task RAW and WAW violations, on both the same word and different words of a line (i.e., false sharing), cause squashes. Task squash is a very costly operation. The cost is threefold: overhead of the squash operation itself, loss of whatever correct work has already been performed by the offending task and its successors, and cache misses in the offending task and its successors needed to reload state when restarting. The latter overhead appears because, as part of the squash operation, the speculative state in the cache is invalidated. Figure a shows an example of a RAW violation across tasks i and i+j+. The consumer task and its successors are squashed.
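The following C sketch models this bookkeeping in software for a single shared array, using per-word bits (the names, the per-word granularity, and the simple task numbering are assumptions made for illustration, not the mechanism of any specific coherence protocol). A write by a task checks, much as the invalidation in the coherence protocol would, whether a successor task already has an exposed read of the same word, which signals a RAW violation.

#include <stdio.h>
#include <stdbool.h>

#define N 16
#define MAX_TASKS 8

/* Per-word access bits kept per task, mimicking the Speculative Write and
 * Exposed Read bits that a TLS cache would maintain. */
static bool spec_write[MAX_TASKS][N];
static bool exposed_read[MAX_TASKS][N];

/* Read of word i by task t: it is "exposed" if t has not written word i yet. */
void task_read(int t, int i)
{
    if (!spec_write[t][i])
        exposed_read[t][i] = true;
}

/* Write of word i by task t: set the write bit and check whether any
 * successor task already performed an exposed read of the same word (a RAW
 * violation). Returns the ID of the first task to squash, or -1 if none. */
int task_write(int t, int i)
{
    spec_write[t][i] = true;
    for (int succ = t + 1; succ < MAX_TASKS; succ++)
        if (exposed_read[succ][i])
            return succ;          /* squash succ and all its successors */
    return -1;
}

int main(void)
{
    task_read(3, 5);               /* task 3 reads word 5 before writing it */
    int victim = task_write(2, 5); /* predecessor task 2 then writes word 5 */
    if (victim >= 0)
        printf("RAW violation: squash task %d and its successors\n", victim);
    return 0;
}

With per-line instead of per-word bits, the same check would fire even when the read and the write touch different words of the line, which is exactly the false-sharing squash described above.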
Speculation, Thread-Level. Fig. RAW data-dependence violation that results in a squash (a) or that does not cause a squash due to false sharing or value prediction (b), or consumer stall (c and d)
Techniques to Avoid Squashes Since squashes are so expensive, there are techniques to avoid them. If the compiler can conclude that a certain pair of accesses will frequently cause a data-dependence violation, it can statically insert a synchronization operation that forces the correct task ordering at run-time. Alternatively, the machine can have hardware support that records, at run-time, where dependence violations occur. Such hardware may record the program counters of the reads or writes involved, or the address of the memory location being accessed. Based on this information, when these program counters are reached or the memory location is accessed, the hardware can try one of several techniques to avoid the violation. This section outlines some of the techniques that can be used. A more complete description of the choices is presented by Cintra and Torrellas []. Without loss of generality, a RAW violation is assumed. Based on past history, the predictor may predict that the pair of conflicting accesses is engaged in false sharing. In this case, it can simply allow the read to proceed and then the subsequent write to execute silently, without sending invalidations. Later, before the
consumer task is allowed to commit, it is necessary to check whether the sections of the line read by the consumer overlap with the sections of the line written by the producer. This can be easily done if the caches have per-word access bits. If there is no overlap, it was false sharing and the squash is avoided. Figure b shows the resulting time line. When there is a true data dependence between tasks, a squash can be avoided with effective use of value prediction. Specifically, the predictor can predict the value that the producer will produce, speculatively provide it to the consumer’s read, and let the consumer proceed. Again, before the consumer is allowed to commit, it is necessary to check that the value provided was correct. The timeline is also shown in Fig. b. In cases where the predictor is unable to predict the value, it can avoid the squash by stalling the consumer task at the time of the read. This case can use two possible approaches. An aggressive approach is to release the consumer task and let it read the current value as soon as the predicted producer task commits. The time line is shown in Fig. c. In this case, if an intervening task between the first producer and the consumer later writes the line, the consumer will be squashed. A more conservative approach is not to release the consumer task until it becomes nonspeculative. In this case, the presence of multiple predecessor writers will not squash the consumer. The time line is shown in Fig. d.
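As an illustration of the compiler-inserted synchronization mentioned at the start of this section, the sketch below shows the kind of post/wait ordering a compiler could place between a store in a predecessor task and the conflicting load in a successor task, so that the consumer stalls briefly instead of being squashed. This is an illustrative reconstruction in C with POSIX threads, not code emitted by an actual TLS compiler.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int shared_datum;          /* location with a predicted RAW dependence */
static atomic_int produced = 0;   /* flag set once the producer has stored    */

/* Predecessor task: performs the store, then signals ("post"). */
static void *producer_task(void *arg)
{
    (void)arg;
    shared_datum = 42;                               /* the dependent store */
    atomic_store_explicit(&produced, 1, memory_order_release);
    return NULL;
}

/* Successor task: waits ("wait") before the load that would otherwise risk a
 * violation and a squash. */
static void *consumer_task(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&produced, memory_order_acquire))
        ;                                            /* stall instead of squash */
    printf("consumer read %d without a violation\n", shared_datum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer_task, NULL);
    pthread_create(&p, NULL, producer_task, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}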
Initial Efforts in Thread-Level Speculation An early proposal for hardware support for a form of speculative parallelization was made by Knight [] in the context of functional languages. Later, the Multiscalar processor [] was the first proposal to use a form of TLS within a single-chip multithreaded architecture. A software-only form of TLS was proposed in the LRPD test []. Early proposals of hardware-based TLS include the work of several authors [, , , , ].
Other Uses of Thread-Level Speculation TLS concepts have been used in environments that have goals other than trying to parallelize sequential programs. For example, they have been used to speed up explicitly parallel programs through Speculative Synchronization [], or for parallel program
debugging [] or program monitoring []. Similar concepts to TLS have been used in systems supporting hardware transactional memory [] and continuous atomic-block operation [].
Machines that Use Thread-Level Speculation Several machines built by computer manufacturers have hardware support for some form of TLS – although the specific implementation details are typically not disclosed. Such machines include systems designed for Java applications such as Sun Microsystems’ MAJC chip [] and Azul Systems’ Vega processor []. The most high-profile system with hardware support for speculative threads is Sun Microsystems’ ROCK processor []. Other manufacturers are rumored to be developing prototypes with similar hardware.
Bibliography . Akkary H, Driscoll M () A dynamic multithreading processor. In: International symposium on microarchitecture, Dallas, November . Azul Systems. Vega Processor. http://www.azulsystems.com/ products/vega/processor . Chaudhry S, Cypher R, Ekman M, Karlsson M, Landin A, Yip S, Zeffer H, Tremblay M () Simultaneous speculative threading: a novel pipeline architecture implemented in Sun’s ROCK Processor. In: International symposium on computer architecture, Austin, June . Cintra M, Martínez JF, Torrellas J () Architectural support for scalable speculative parallelization in shared-memory multiprocessors. In: International symposium on computer architecture, Vancouver, June , pp – . Cintra M, Torrellas J () Eliminating squashes through learning cross-thread violations in speculative parallelization for multiprocessors. In: Proceedings of the th High-Performance computer architecture conference, Boston, Feb . Figueiredo R, Fortes J () Hardware support for extracting coarse-grain speculative parallelism in distributed sharedmemory multiprocesors. In: Proceedings of the international conference on parallel processing, Valencia, Spain, September . Frank M, Lee W, Amarasinghe S () A software framework for supporting general purpose applications on raw computation fabrics. Technical report, MIT/LCS Technical Memo MIT-LCSTM-, July . Franklin M, Sohi G () ARB: a hardware mechanism for dynamic reordering of memory references. IEEE Trans Comput ():–
. Garcia C, Madriles C, Sanchez J, Marcuello P, Gonzalez A, Tullsen D () Mitosis compiler: An infrastructure for speculative threading based on pre-computation slices. In: Conference on programming language design and implementation, Chicago, Illinois, June . Garzarán M, Prvulovic M, Llabería J, Viñals V, Rauchwerger L, Torrellas J () Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors. ACM Trans Archit Code Optim . Garzaran MJ, Prvulovic M, Llabería JM, Viñals V, Rauchwerger L, Torrellas J () Using software logging to support multiversion buffering in thread-level speculation. In: International conference on parallel architectures and compilation techniques, New Orleans, Sept . Gopal S, Vijaykumar T, Smith J, Sohi G () Speculative versioning cache. In: International symposium on high-performance computer architecture, Las Vegas, Feb . Gupta M, Nim R () Techniques for speculative run-time parallelization of loops. In: Proceedings of supercomputing , ACM Press, Melbourne, Australia, Nov . Hammond L, Willey M, Olukotun K () Data speculation support for a chip multiprocessor. In: International conference on architectural support for programming languages and operating systems, San Jose, California, Oct , pp – . Herlihy M, Moss E () Transactional memory: architectural support for lock-free data structures. In: International symposium on computer architecture, IEEE Computer Society Press, San Diego, May . Knight T () An architecture for mostly functional languages. In: ACM lisp and functional programming conference, ACM Press, New York, Aug , pp – . Krishnan V, Torrellas J () Hardware and software support for speculative execution of sequential binaries on a chipmultiprocessor. In: International conference on supercomputing, Melbourne, Australia, July . Krishnan V, Torrellas J () A chip-multiprocessor architecture with speculative multithreading. IEEE Trans Comput ():– . Liu W, Tuck J, Ceze L, Ahn W, Strauss K, Renau J, Torrellas J () POSH: A TLS compiler that exploits program structure. In: International symposium on principles and practice of parallel programming, San Diego, Mar . Marcuello P, Gonzalez A () Clustered speculative multithreaded processors. In: International conference on supercomputing, Rhodes, Island, June , pp – . Marcuello P, Gonzalez A, Tubella J () Speculative multithreaded processors. In: International conference on supercomputing, ACM, Melbourne, Australia, July . Martinez J, Torrellas J () Speculative synchronization: applying thread-level speculation to explicitly parallel applications. In: International conference on architectural support for programming languages and operating systems, San Jose, Oct . Prvulovic M, Garzaran MJ, Rauchwerger L, Torrellas J () Removing architectural bottlenecks to the scalability
of speculative parallelization. In: Proceedings of the th international symposium on computer architecture (ISCA’), New York, June , pp – Prvulovic M, Torrellas J () ReEnact: using thread-level speculation to debug data races in multithreaded codes. In: International symposium on computer architecture, San Diego, June Rauchwerger L, Padua D () The LRPD test: speculative runtime parallelization of loops with privatization and reduction parallelization. In: Conference on programming language design and implementation, La Jolla, California, June Rundberg P, Stenstrom P () Low-cost thread-level data dependence speculation on multiprocessors. In: Fourth workshop on multithreaded execution, architecture and compilation, Monterrey, Dec Sohi G, Breach S, Vijaykumar T () Multiscalar processors. In: International Symposium on computer architecture, ACM Press, New York, June Steffan G, Colohan C, Zhai A, Mowry T () A scalable approach to thread-level speculation. In: Proceedings of the th Annual International symposium on computer architecture, Vancouver, June , pp – Steffan G, Mowry TC () The potential for using threadlevel data speculation to facilitate automatic parallelization. In: International symposium on high-performance computer architecture, Las Vegas, Feb Torrellas J, Ceze L, Tuck J, Cascaval C, Montesinos P, Ahn W, Prvulovic M () The bulk multicore architecture for improved programmability. Communications of the ACM, New York Tremblay M () MAJC: microprocessor architecture for java computing. Hot Chips, Palo Alto, Aug Tsai J, Huang J, Amlo C, Lilja D, Yew P () The superthreaded processor architecture. IEEE Trans Comput ():– Vijaykumar T, Sohi G () Task selection for a multiscalar processor. In: International symposium on microarchitecture, Dallas, Nov , pp – Zhai A, Colohan C, Steffan G, Mowry T () Compiler optimization of scalar value communication between speculative threads. In: International conference on architectural support for programming languages and operating systems, San Jose, Oct Zhang Y, Rauchwerger L, Torrellas J () Hardware for speculative run-time parallelization in distributed shared-memory multiprocessors. In: Proceedings of the th International symposium on high-performance computer architecture (HPCA), Phoenix, Feb , pp – Zhang Y, Rauchwerger L, Torrellas J () Hardware for speculative parallelization of partially-parallel loops in DSM multiprocessors. In: Proceedings of the th international symposium on high-performance computer architecture, Orlando, Jan , pp – Zhou P, Qin F, Liu W, Zhou Y, Torrellas () iWatcher: efficient architectural support for software debugging. In: International symposium on computer architecture, IEEE Computer society, München, June
Speculative Multithreading (SM) Speculation, Thread-Level
Speculative Parallelization
Speculation, Thread-Level
Speculative Parallelization of Loops
Speculative Parallelization of Loops
Lawrence Rauchwerger
Texas A&M University, College Station, TX, USA
Synonyms Optimistic loop parallelization; Parallelization; Speculative Parallelization; Speculative Run-Time Parallelization; Thread-level data speculation (TLDS); TLS
Definition Speculative loop (thread) level parallelization is a compiler run-time technique that executes optimistically parallelized loops, verifies the correctness of their execution and, when necessary, backtracks to a safe state for possible re-execution. This technique includes a compiler (static) component for the transformation of the loop for speculative parallel execution as well as a run-time component which verifies correctness and re-executes when necessary.
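The overall cycle can be pictured with the C skeleton below. It is only a hypothetical outline: the helper functions are placeholders for the checkpointing, speculative parallel execution, run-time dependence test, and safe re-execution components discussed in the rest of this entry.

#include <stdio.h>
#include <stdbool.h>

/* Placeholder components of the speculative parallelization run-time. */
static void checkpoint_state(void)                 { /* save state the loop may modify */ }
static void restore_state(void)                    { /* roll back to the checkpoint    */ }
static void run_loop_speculatively_parallel(void)  { /* optimistic parallel execution  */ }
static void run_loop_sequentially(void)            { /* safe re-execution              */ }
static bool run_time_test_passed(void)             { return true; /* dependence test   */ }

/* Driver implementing the speculate -> test -> commit-or-restore cycle. */
void speculative_loop_execution(void)
{
    checkpoint_state();
    run_loop_speculatively_parallel();
    if (run_time_test_passed()) {
        /* parallel results are sequentially equivalent: keep them */
    } else {
        restore_state();
        run_loop_sequentially();
    }
}

int main(void)
{
    speculative_loop_execution();
    printf("loop executed\n");
    return 0;
}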
Discussion Introduction The correctness of optimizing program transformations, such as loop parallelization, relies on compiler or programmer analysis of the code. Compiler-performed analysis is preferred because it is more productive and is not subject to human error. However, it usually involves a very complex symbolic analysis which often fails to produce an optimizing transformation. Sometimes the
outcome of the code analysis depends on the input values of the considered program block or on computed values, neither of which is available during compilation. Manual code analysis relies on a higher level of semantic analysis, which is usually more powerful but not applicable if the outcome depends on run-time computed or input values. To overcome this limitation, the analysis can be (partially) performed at run-time. Run-time analysis can succeed where static analysis fails because it has access to the actual values with which the program's symbolic expressions are instantiated and thus can make aggressive, instance-specific optimization (e.g., loop parallelization) decisions. There are essentially two ways of performing the run-time analysis that can (in)validate loop parallelization (and, in general, any other optimization):
● Before executing the parallel version of a loop
● During execution of the parallel version of the loop
Fundamentals of Loop Parallelization A loop can be executed in parallel, without synchronizations, if and only if the desired outcome of the loop does not depend upon the relative execution order of the data accesses across its iterations. This problem has been modeled with the help of data dependence analysis [, , , , ]. There are three possible types of dependences between two statements that access the same memory location: flow (read after write – RAW), anti (write after read – WAR), and output (write after write – WAW). Flow dependences express a fundamental relationship about the data flow in the program. Anti and output dependences, also known as memory-related
Speculative Parallelization of Loops. Fig. Speculative run-time parallelization
dependences, are caused by the reuse of storage for program variables. When flow data dependences exist between loop iterations, the loop cannot be executed in parallel because the original (sequential) semantics of the loop cannot be preserved. For example, the iterations of the loop in Fig. a must be executed in sequential order because iteration i+ needs the value that is produced in iteration i, for ≤ i < n. The simplest and most desired outcome of the data dependence analysis is when there are no anti, output, or flow dependences. In this case, all the iterations of the loop are independent and can be executed in any order, e.g., concurrently. Such loops are known as DOALL loops. In the absence of flow dependences, which are fundamental to a program's algorithms, the anti and/or output dependences can be removed through privatization (or renaming) [], a very effective loop transformation. Privatization creates, for each processor cooperating on the execution of the loop, private copies of the program variables that can give rise to anti- or output dependences (see, e.g., [, , , , ]). Figure b exemplifies a loop that can be executed in parallel after applying the privatization transformation: the anti-dependences between statement S2 of iteration i and statement S1 of iteration i + , for ≤ i < n/, are removed by privatizing the temporary variable tmp.
This transformation can be applied to a loop variable if it can be proven, statically or dynamically, that every read access to it (e.g., elements of array A) is preceded by a write access to the same variable within the same iteration. Of course, variables that are never written (read only) cannot generate any data dependence. Intuitively, variables are privatizable when they are used as workspace (e.g., temporary variables) within an iteration. A semantically higher level transformation is the parallelization of reduction operations. Reductions are operations of the form x = x ⊗ exp, where ⊗ is an associative operator and x does not occur in exp or anywhere else in the loop. A simple, but typical example of a reduction is statement S1 in Fig. c. The operator ⊗ is exemplified by the + operator, the access to the array A(:) is a read, modify, write sequence, and the function performed by the loop is a prefix sum of the values stored in A. This type of reduction is sometimes called an update and occurs quite frequently in programs. A reduction can be readily transformed into a parallel operation using, e.g., a recursive doubling algorithm [, ]. When the operator takes the form x = x + exp, the values taken by variable x can be accumulated in private storage followed by a global reduction operation. There are also other, less scalable methods that use unordered critical sections [, ]. When the operator is also commutative, its substitution with a parallel algorithm can be done with fewer restrictions (e.g., dynamic scheduling of DOALL loops can be employed). Thus, the difficulty encountered by compilers in parallelizing loops with reductions arises not from transforming the loop for parallel execution but from correctly identifying and validating reduction patterns in loops. This problem has been handled at compile time mainly by syntactically pattern matching the loop statements with a template of a generic reduction, and then performing a data dependence analysis of the variable under scrutiny to guarantee that it is not used anywhere else in the loop except in the reduction statement and thus does not cause additional dependences [].
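Both transformations can be illustrated in C, as shown below. The two loops are simplified analogues of the patterns in Fig. b and c (not the original codes), and OpenMP is used only as convenient notation: the temporary tmp is private to each iteration, and the update-style reduction is computed by accumulating into private partial results that are combined at the end (the array-section reduction clause assumes OpenMP 4.5 or later).

#include <stdio.h>

#define N 1000
#define M 64

int main(void)
{
    static double A[2 * N + 1];
    static double R[M];
    for (int i = 0; i < 2 * N + 1; i++) A[i] = (double)i;
    for (int j = 0; j < M; j++) R[j] = 0.0;

    /* Privatization (analogue of Fig. b): tmp is declared inside the loop
     * body, so every iteration -- and hence every thread -- has its own
     * private copy, and the anti-dependences caused by reusing tmp vanish. */
    #pragma omp parallel for
    for (int i = 1; i <= N; i++) {
        double tmp = A[2 * i];       /* S1 */
        A[2 * i] = A[2 * i - 1];
        A[2 * i - 1] = tmp;          /* S2 */
    }

    /* Reduction (analogue of Fig. c): each R[j] is updated with an
     * associative operator, so partial results are accumulated in private
     * storage and combined in a final global reduction step. */
    #pragma omp parallel for reduction(+ : R[:M])
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            R[j] += A[j] * 0.5;      /* stands in for "exp()" in Fig. c */

    printf("A[1] = %g, R[0] = %g\n", A[1], R[0]);
    return 0;
}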
Compiler Limitation and Run-time Parallelization
In essence, the parallelization of DO (for) loops depends on proving that their memory reference patterns do not carry data dependences or that there are legal transformations that can remove possible data dependences.
do i = 1, n
   A(K(i)) = A(K(i)) + A(K(i-1))
   if (A(K(i)) .eq. 0) then
      B(i) = A(L(i))
   endif
enddo

a

do i = 1, n/2
S1:   tmp = A(2*i)
      A(2*i) = A(2*i-1)
S2:   A(2*i-1) = tmp
enddo

b

do i = 1, n
   do j = 1, m
S1:   A(j) = A(j) + exp()
   enddo
enddo

c

Speculative Parallelization of Loops. Fig. Examples of representative loops targeted by automatic parallelization
The burden of proof falls on the static analysis performed by autoparallelizing compilers. However, there are many situations when this is not possible because either the symbolic analysis is not powerful enough or the memory reference pattern is input or data dependent. The technique that can overcome such problems is run-time parallelization because, at run-time, all input values are known and the symbolic evaluation of complex expressions is vastly simplified. In this scenario, the compiler generates, conceptually at least, a parallel and a serial version of the loop as well as code for the dynamic dependence analysis using the loop's memory references. If the decision about the serial or parallel character of the loop is (and can be) made before its execution, then the results computed and written by the loop to memory can be considered to be always correct. Such a technique is called "inspector/executor" [–] because the memory references are first "inspected" and then executed in a safe manner, whether sequentially or concurrently. In this context, an "inspector" is obtained by extracting a loop slice that generates (and perhaps records) the relevant memory addresses, which can then reveal possible loop-carried data dependences. It is important for such "inspectors" to be scalable and not to become serial bottlenecks. If no dependences are found, then the loop can be scheduled (and executed) as a DOALL, i.e., all its iterations can be executed concurrently. If dependences are found, then the "executor" of the loop has to enforce them using ordered synchronizations at the memory or iteration level such that sequential semantics is preserved. A frequently used solution [, , ] to this problem has been to first construct the loop dependence graph from the memory trace obtained by the inspector and then to use it to compute a parallel execution schedule.
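A minimal sketch of such an inspector is shown below for a loop that reads A(K(i)) and writes A(L(i)); the index arrays K and L, the 1-based index values, and the simple DOALL/serial decision are assumptions made for the example, and a full inspector would instead build a dependence graph or a wavefront schedule from the recorded trace.

#include <stdlib.h>

/* Inspector sketch: replay only the address computation of the loop and record,
   for each element of A, the last iteration that read or wrote it.  Any element
   touched by two different iterations in a read/write or write/write pattern
   signals a loop-carried dependence; the executor then runs the loop serially
   (or with synchronization).  Returns 1 if the loop can be run as a DOALL. */
int inspect(const int *K, const int *L, int n, int sizeA)
{
    int *last_read  = malloc(sizeA * sizeof *last_read);
    int *last_write = malloc(sizeA * sizeof *last_write);
    int  doall = 1;

    for (int e = 0; e < sizeA; e++) last_read[e] = last_write[e] = -1;
    for (int i = 0; i < n; i++) {
        int r = K[i] - 1;                                          /* element read    */
        int w = L[i] - 1;                                          /* element written */
        if (last_write[r] != -1 && last_write[r] != i) doall = 0;  /* flow dependence   */
        last_read[r] = i;
        if (last_read[w]  != -1 && last_read[w]  != i) doall = 0;  /* anti dependence   */
        if (last_write[w] != -1 && last_write[w] != i) doall = 0;  /* output dependence */
        last_write[w] = i;
    }
    free(last_read);
    free(last_write);
    return doall;
}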
Finally, the loop is executed in parallel according to the schedule. The computation of this execution schedule can also be interleaved with the executor []. If the reference pattern does not change within a larger scope than the considered loop, then the schedule can be reused, thus reducing the impact of its overhead. There have been several variations of this technique which were reported in []. If a loop is executed in parallel before its data dependences are uncovered, then it can cause out-of-order memory references which may lead to incorrect computation. Such an execution model is called speculative execution, also known as optimistic execution, because its performance is based on the optimistic assumption that such dependences do not, in fact, materialize or are quite infrequent. To ensure that, even when dependences occur, the final computation produces sequentially equivalent results, the speculative execution model includes a mechanism to restart from a safe (correct) state. Such a mechanism implies either saving state (checkpointing) before its speculative modification or writing into a temporary memory which has to be later merged (committed) into the global state of the program. For example, the references to array A in the loop in Fig. a depend on some input values stored in array K and cannot be statically analyzed by the compiler. An inspector for this loop would analyze the contents of array K and decide whether the loop is parallel and then execute it accordingly. A speculative approach is to execute the loop in parallel while at the same time recording all the references to A. After the loop has ended, the memory trace is checked for data dependences and, if any are found, the results are discarded and the loop is re-executed sequentially or in some other safer mode. Alternatively, the memory references can be checked as they occur ("on-the-fly") and, if dependences are
detected, the execution can be aborted at that point, the program state repaired, and the loop restarted in a safe manner. There are advantages and disadvantages to using either of the run-time parallelization methods. In the "inspector/executor" approach, the inspector does not modify the program state and thus does not require a restart mechanism with its associated memory overhead. On the other hand, a speculative approach may need to discard its results and restart from a safe state. This implies the allocation of additional memory for checkpointing or buffering state. Inspectors always add to the critical path of the program because they have to be inserted serially before the parallelized loop. Speculative parallelization performs the needed inspection of the memory references during the parallel execution, largely independently of the actual computation, and thus can be almost completely overlapped with it (assuming available resources). On the other hand, the checkpoint or commit phase introduces some overhead which may add to the critical path. What is perhaps the most important feature of speculative parallelization is its general applicability, even when the memory reference pattern is dependent on the computation of the loop. For example, the code snippet in Fig. shows that the reference to array NUSED is dependent on its own value, which in turn may have been modified in a different iteration (because the indirection array IHITS is not known at compile time). There are two ways to look at speculation: optimistically, assuming that a loop is fully parallel and can be executed as a DOALL, or, pessimistically, assuming that the loop has dependences and must be executed as a DOACROSS, i.e., with synchronizations. When a DOALL is expected, the data dependence checking can be done once, after the speculative loop has finished.
read (IHITS(:))
do k = 1, LST
   j = IHITS(1,k)
   if (NUSED(j) .LE. 1) then
      NUSED(j) = NUSED(j) - 1
   endif
enddo
Speculative Parallelization of Loops. Fig. A loop example where only speculative parallelization is possible: array indexes are computed during loop execution
If, however, dependences are expected, then they need to be detected early so that the resulting incorrect computation can be minimized. Early detection means frequent memory reference checks, well before the speculative loop has finished. These speculative approaches are bridged (transformed into one another) by varying the frequency (granularity) of the memory reference checks from once per loop to once per reference. A related, but different, limitation of compilers is their inability to analyze and parallelize most while loops. The reason is twofold:
● The classical loop dependence analysis looks for dependences in a bounded iteration space. However, the upper bound of a while loop can only be conservatively established at compile time, which results in overly restrictive decisions, possibly inhibiting parallelization.
● The fully parallel execution of a while loop without data dependences (e.g., do loops with possible premature exits) may not be limited to the original iteration space. Iterations may be executed beyond their sequential upper bound, i.e., "overshoot," and thus incorrectly modify the global state.
A possible solution is to use a speculative parallel execution framework which allows discarding any unnecessary work and its effects. Speculation can also be used to estimate an upper bound of the iteration space of while loops.
DOALL Speculative Parallelization: The LRPD Test
The optimistic version of speculative parallelization executes a loop in parallel and subsequently tests whether any data dependences could have occurred. If this validation test fails, then the loop is re-executed in a safe manner, starting from a safe state, e.g., sequentially from a previous checkpoint. This approach, known as the LRPD Test (Lazy Reduction and Privatization Doall Test) [, ], is sketched in the next paragraph. To qualify more loops as parallel, array privatization and reduction parallelization can be speculatively applied and their validity tested after loop termination. Consider a do loop for which the compiler cannot statically determine the access pattern of a shared array A (Fig. a). The compiler allocates shadow arrays for marking the write accesses, Aw, and the read accesses, Ar, and an array, Anp, for flagging non-privatizable elements.
do i = 1, 5
   z = A(K(i))
   if (B(i) .eq. .true.) then
      A(L(i)) = z + C(i)
   endif
enddo
B(1:5) = (1 0 1 0 1)
K(1:5) = (1 2 3 4 1)
L(1:5) = (2 2 4 4 2)

a

doall i = 1, 5
   markread(K(i))
   z = A(K(i))
   if (B(i) .eq. .true.) then
      markwrite(L(i))
      A(L(i)) = z + C(i)
   endif
enddoall

b

(Panel c: the shadow arrays Aw, Ar, and Anp after the speculative execution, the vectors Aw(:) ∧ Ar(:) and Aw(:) ∧ Anp(:) used by the PD test, and the counts tw and tm.)

Speculative Parallelization of Loops. Fig. Do loop (a) transformed for speculative execution, (b) the markwrite and markread operations update the appropriate shadow arrays, (c) shadow arrays after loop execution. In this example, the test fails
The loop is augmented with code (Fig. b) that, during the speculative execution, will mark the shadow arrays every time A is referenced (based on specific rules). The result of the marking can be seen in Fig. c. The first time an element of A is written during an iteration, the corresponding element in the write shadow array Aw is marked. If, during any iteration, an element in A is read, but never written, then the corresponding element in the read shadow array Ar is marked. Another shadow array Anp is used to flag the elements of A that cannot be privatized: an element in Anp is marked if, in any iteration, the corresponding element in A has been written only after it has been read. A post-execution analysis, illustrated in Fig. c, determines whether there were any cross-iteration dependences between statements referencing A as follows. If any(Aw(:) ∧ Ar(:)) is true (any returns the "OR" of its vector operand's elements, i.e., any(v(1 : n)) = (v(1) ∨ v(2) ∨ . . . ∨ v(n))), then there is at least one flow- or anti-dependence that was not removed by privatizing A (some element is read and written in different iterations). If any(Anp(:)) is true, then A is not privatizable (some element is read before being written in an iteration). If tw, the total number of writes marked during the parallel execution, is not equal to tm, the total number of marks computed after the parallel execution, then there is at least one output dependence (some element is overwritten); however, if A is privatizable (i.e., if any(Anp(:)) is false), then these dependences were removed by privatizing A. The addition of an Arx field to the shadow structure and some simple marking logic can extend the previous
algorithm to validate parallel reductions at run-time [, ]. For this speculative technique, two safe restart methods can be used [, ]:
● The compiler either generates a checkpoint of all global, modified variables before starting the speculative loop or does it, on demand, before a variable is modified for the first time.
● The compiler generates code to allocate temporary, private storage for the global, modified variables. If the test fails, the private storage is de-allocated; otherwise its contents are merged into the global storage.
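The marking operations and the post-execution analysis described above can be sketched as follows; the fixed array size, the sequential (unsynchronized) shadow updates, and the helper names are simplifying assumptions, whereas a real implementation keeps per-processor shadow structures that are merged before the analysis.

#include <string.h>

#define SIZE_A 1024                           /* assumed size of the shared array A  */

static unsigned char Aw[SIZE_A];              /* element written in some iteration   */
static unsigned char Ar[SIZE_A];              /* element read-only in some iteration */
static unsigned char Anp[SIZE_A];             /* element read before being written   */
static unsigned char wr_iter[SIZE_A], rd_iter[SIZE_A];  /* per-iteration flags       */
static long tw;                               /* first-writes marked during the run  */

void start_iteration(void)
{
    memset(wr_iter, 0, sizeof wr_iter);
    memset(rd_iter, 0, sizeof rd_iter);
}

void markwrite(int e)                         /* called before A(e) is written */
{
    if (!wr_iter[e]) { Aw[e] = 1; wr_iter[e] = 1; tw++; }
}

void markread(int e)                          /* called before A(e) is read */
{
    if (!wr_iter[e]) Anp[e] = 1;              /* read not preceded by a write */
    rd_iter[e] = 1;
}

void end_iteration(void)                      /* elements read but never written */
{
    for (int e = 0; e < SIZE_A; e++)
        if (rd_iter[e] && !wr_iter[e]) Ar[e] = 1;
}

/* Post-execution analysis of the LRPD test. */
int lrpd_passes(void)
{
    long tm = 0;
    int any_w_and_r = 0, any_np = 0;
    for (int e = 0; e < SIZE_A; e++) {
        tm += Aw[e];
        any_w_and_r |= (Aw[e] && Ar[e]);
        any_np      |= Anp[e];
    }
    if (any_w_and_r) return 0;                /* a flow or anti dependence remains      */
    if (tw != tm && any_np) return 0;         /* output dependence, A not privatizable  */
    return 1;                                 /* speculative DOALL execution was valid  */
}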
The data structures used for the checkpointing and shadowing can be chosen appropriately depending on whether the reference pattern of the loop is dense (e.g., arrays) or sparse (e.g., hash tables, linked lists) []. Compiler analysis can reduce the overhead of speculation by establishing equivalence classes (in the data dependence sense) of the interesting memory references and then tracking only one representative per class []. The speculative LRPD test has later been modified into the R-LRPD (Recursive LRPD) test [] to improve its performance for loops with dependences. When a speculative parallel loop is block scheduled, the R-LRPD test can detect the iteration that is the source of the earliest dependence and thus validate the execution of the loop up to it. The remainder of the loop is then speculatively re-executed in a recursive manner until all work has finished, thus guaranteeing, at worst, a serial time complexity.
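The first restart method listed above (checkpointing before the speculative loop) can be sketched as follows; the single-array scope and the helper names are assumptions made for the example.

#include <stdlib.h>
#include <string.h>

/* Sketch of checkpoint/restore for speculative execution: the shared data that
   the loop may modify is copied before the speculative run, restored if the
   post-execution test fails, and discarded if it succeeds. */
typedef struct { double *saved; size_t n; } checkpoint_t;

checkpoint_t checkpoint(const double *A, size_t n)
{
    checkpoint_t c = { malloc(n * sizeof *c.saved), n };
    memcpy(c.saved, A, n * sizeof *A);           /* save the safe state       */
    return c;
}

void restore(double *A, const checkpoint_t *c)   /* test failed: roll back    */
{
    memcpy(A, c->saved, c->n * sizeof *A);
}

void discard(checkpoint_t *c)                    /* test passed: keep results */
{
    free(c->saved);
    c->saved = NULL;
}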
DOACROSS Speculative Parallelization
If dependences are expected, then speculation should be verified frequently, so that incorrect iterations can be restarted as soon as violations have occurred, thus reducing the overall speculation overhead. Approaches that track dependences at a low level (access/iteration) usually simulate the operation of a cache-coherence-protocol-based (hardware-based) TLS (Thread-Level Speculation). Figure depicts a sliding-window approach: the updates are recorded in per-iteration private storage and are merged (committed) to global storage by the oldest executing iteration, referred to as the master. Since the master represents a nonspeculative thread, i.e., it cannot be invalidated by any future
iteration, its updates are known to be correct and the copy-out operation is safe. Then, the next iteration becomes nonspeculative and can eventually commit its updates, etc. While implementations are diverse, one approach [, ] could be that (1) a read access returns the value written by the closest predecessor iteration (forwarding), or the nonspeculative value if no such write exists, and (2) a write access to memory location l signals a flow dependence violation if a successor iteration has already read from location l. This behavior is achieved by inspecting the shadow vectors Aw and Ar, which record per-iteration write/read accesses. For example, in Fig. , the call of the function validate by iteration 1,
P == Processor; It == Iteration

doacross i = 1, 5
   markread(K(i))
   z = A(K(i))
   if (B(i) .eq. .true.) then
      A(L(i)) = z + C(i)
      markwrite(L(i))
      if (.not. validate(i, L(i))) then
         rollback()
      endif
   endif
   wait_until_master()
   commit()
enddo

(Timeline panel: iterations 1-5 execute in private space on processors 1-4 and are committed serially by the master iteration; iterations 2, 3, 4 and later 4, 5 are rolled back after read-after-write (RAW) violations detected through the per-iteration shadow vectors Aw and Ar.)

Speculative Parallelization of Loops. Fig. Serial-commit, sliding-window-based Thread-Level Speculation execution. Dashed, red arrows starting from Aw/Ar represent the source/sink of flow dependences
which writes A[2], detects a flow dependence violation because iteration 2 has already read A[2]. The master is always correct, so iteration 1 commits its updates, but restarts iterations 2, 3, and 4. Similarly, iterations 3 and 4 are the source and sink of another flow dependence violation. While maintaining per-iteration shadow vectors is prohibitively expensive in many cases, a sliding window of size C reduces the memory overhead to more manageable levels. In particular, since only C consecutive iterations may execute concurrently, only O(C) sets of shadow vectors need to be maintained, and then recycled. Speculative DOACROSS methods are best suited for loops with more frequent data dependences because they can detect dependences earlier, thus reducing wasted computation. However, verifying the dependence violation for each memory reference can be quite expensive because it requires global synchronizations. Furthermore, the merge (commit) phase is done in iteration order, which constitutes a serial bottleneck. Loops with frequent dependences are not candidates for scalable parallelization regardless of the methods used to detect and process them. They cause, aside from lack of parallelism, frequent backtracking, which rapidly degrades performance, possibly to below sequential levels.
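The per-access check performed by validate can be sketched as follows; the window size, the shadow-vector layout, and the slot bookkeeping are assumptions, and the synchronization, value forwarding, and rollback machinery are omitted.

#include <stdbool.h>

#define WINDOW 8                        /* assumed number of concurrently active iterations */
#define MEM    1024                     /* assumed number of tracked locations of A         */

/* Per-iteration read shadow vectors (the Ar of the figure): read_mark[s][l] is
   set when the iteration currently occupying window slot s has read location l. */
static bool read_mark[WINDOW][MEM];
static int  iter_of_slot[WINDOW];       /* filled by the (omitted) scheduler */

void mark_read(int iter, int loc)
{
    read_mark[iter % WINDOW][loc] = true;
}

/* Called by iteration 'iter' right after it writes location 'loc'.  A flow
   (read-after-write) dependence is violated if a later iteration in the current
   window has already read the location: that iteration consumed a stale value
   and it, together with its successors, must be rolled back. */
bool validate(int iter, int loc)
{
    for (int s = 0; s < WINDOW; s++)
        if (iter_of_slot[s] > iter && read_mark[s][loc])
            return false;               /* violation: trigger rollback */
    return true;
}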
While Loop Speculative Parallelization
While loops and do loops with conditional premature exits arise frequently in practice, and techniques for extracting their available parallelism [, ] are highly desirable. In the most general form, a while loop can be defined as a loop that includes at least one recurrence, a remainder, and at least one termination condition. The dominating recurrence, which precedes the rest of the computation, is called the dispatching recurrence, or simply the dispatcher (Fig. a). Sometimes the termination conditions form part of one of the recurrences, but they can also occur in the remainder, e.g., conditional exits from do loops. Assuming, for simplicity, that the remainder is fully parallelizable, there are two potential problems in the parallelization of while constructs:
● Evaluating the recurrences. If the recurrences cannot be evaluated in parallel, then the iterations of the loop must be started sequentially, leading in the best case to a pipelined execution.
● Evaluating the termination conditions. If the termination conditions (loop exits) cannot be evaluated independently by all iterations, the parallelized while loop could continue to execute beyond the point where the original sequential loop would stop, i.e., it can overshoot.
Evaluating the recurrences concurrently. In general, the terms of the dispatcher must be evaluated sequentially, e.g., the pointer chasing when traversing a linked list. Because the values of the dispatcher (the pointer) must be evaluated in sequential order, iteration i of the loop cannot be initiated until the dispatcher for iteration i − 1 has been computed (see Fig. b). However, if the dispatching recurrence is associative, then it can be evaluated in parallel using, e.g., a parallel prefix algorithm (see Fig. c). Better yet, when the dispatcher takes the form of an induction, its values can be computed concurrently by evaluating its symbolic (closed) form, thus allowing all iterations of the while loop to execute in parallel. A typical example is represented by a do loop (see Fig. d, e).
Evaluating the termination conditions in parallel. Another difficulty with parallelizing while loops is that the termination condition (terminator) of the loop may be overshot, i.e., the loop would continue to execute beyond its sequential (original) counterpart. The terminator is defined as remainder invariant, or RI, if it depends only on the dispatcher and on values that are computed outside the loop. If it depends on some value computed in the loop, then it is considered to be remainder variant, or RV. If the terminator is RV, then iterations beyond the last (sequentially) valid iteration could be performed in a parallel execution of the loop, i.e., iteration i cannot decide whether the terminator is satisfied in the remainder of some iteration i′ < i. Overshooting may also occur if the dispatcher is an induction, or an associative recurrence, and the terminator is RI. An exception in which overshooting would not occur is if the dispatcher is a monotonic function and the terminator is a threshold on this function, e.g., d(i) = i and tc(i) = (d(i) < V), where V is a constant, and d(j) and tc(j) denote the dispatcher and the terminator, respectively, for the jth iteration. Overshooting can also
initialize dispatcher
while (not termination condition)
   do work associated with current dispatcher
   dispatcher = next dispatcher (increment)
endwhile

a

pointer tmp = head(list)
while (tmp .ne. null)
   WORK(tmp)
   tmp = next(tmp)
endwhile

b

integer r = 1
while (f(r) .lt. V)
   WORK(r)
   r = a*r + b
endwhile

c

do i = 1, n
   WORK(i)
   if (f(i) .lt. V) exit
enddo

d

Equivalent while loop:
integer i = 1
while ((f(i) .lt. V) .and. (i .le. n))
   WORK(i)
   i = i + 1
endwhile

e

Speculative Parallelization of Loops. Fig. (a) The general structure of while loops (b) Pointer chasing (c) Threshold terminator with monotonic dispatcher (d) DO loop with premature exit and (e) its equivalent while loop
be avoided when the dispatcher is a general recurrence, and the terminator is RI. For example, the dispatcher tmp is a pointer used to traverse a linked list, and the terminator is (tmp = null) (see Fig. b). The parallelization potential of the dispatcher is summarized by the taxonomy of while loops given in Table . Speculative while loop Parallelization. Executing a loop outside its intended iteration space may result in undesired global state modification which has to be later undone. Such parallel execution can be regarded as speculative and is handled similarly to the previously mentioned speculative do loop parallelization. The effects of overshooting in speculative parallel while loops can be undone after loop termination by restoring the (sequentially equivalent) correct state from a trace of the time (iteration) stamped memory references recorded during speculative execution. This solution may, however, have a large memory overhead for the time-stamped memory trace. Alternatively, shared variables can be written into temporary storage, and then copied out only if their time stamps are less than or equal to the last valid iteration. There are various techniques for reducing memory overhead that can take advantage of the specific memory reference pattern, e.g., sparse patterns can be stored
in hash tables. Further optimizations include strip mining the loop, or using a "sliding window" of iterations whose advance can be throttled adaptively. An alternative to time-stamping is to attempt to extract a slice from the original while loop that can precompute the iteration space. When such a transformation is possible (e.g., traversal of a linked list), then the remainder of the while loop can be executed as a doall.
While loops with statically unknown cross-iteration dependences. When the data dependences of a while loop cannot be conclusively analyzed by a static compiler, the loop parallelization can be attempted by combining the LRPD test (applied to the remainder loop) with the techniques for while loop parallelization described above. When it can be determined statically (see the taxonomy in Table ) that the parallelized while loop will not overshoot, then the shadow variable tracing instrumentation for the LRPD test can be inserted directly into the while loop. In the more complex case when overshooting may occur (see the taxonomy in Table ), the LRPD test can be combined with the while loop parallelization methods by augmenting its shadow arrays with the minimum iteration that marked each element.
Speculative Parallelization of Loops. Table A taxonomy of while loops and their dispatcher's potential for parallel execution. In the table, mono is monotonic, OV is overshoot, P is parallel, N is no, Y is yes, and pp means parallelizable with a parallel prefix computation. RV is remainder variant and RI is remainder invariant

                         Dispatcher
                   Induction                   Recurrence
                   mono          other         associative       general
Loop terminator    OV     P      OV     P      OV     P          OV     P
RI                 N      Y      Y      Y      Y      Y-pp       N      N
RV                 Y      Y      Y      Y      Y      Y-pp       Y      N
The post-execution analysis of the LRPD test has to ignore the shadow array entries whose minimum time stamps are greater than the last valid iteration. A special and unwelcome situation arises when the termination condition of the while loop is dependent (data or control) upon a variable which has not been found by the LRPD test to be independent. The speculative parallel execution of such a while loop may incorrectly compute its iteration space or, even worse, the termination condition might never be met (an infinite loop). Such loops are very hard to parallelize efficiently. In general, while loops do not lend themselves to scalable parallelization due to their inherent nontrivial recurrences and overshooting potential, which are expensive to mitigate.
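For the pointer-chasing loop of Fig. b, the slice-extraction alternative mentioned earlier can be sketched as follows (the node type, WORK, the capacity bound, and the use of OpenMP are assumptions): the dispatcher recurrence is executed sequentially once to precompute the iteration space, after which the remainder runs as a doall.

#include <stddef.h>

struct node { struct node *next; /* ... payload ... */ };

void WORK(struct node *p);              /* remainder; assumed independent across nodes */

void traverse_in_parallel(struct node *head, struct node **slots, size_t capacity)
{
    /* Sequential slice: evaluate the dispatcher (pointer chasing) and record
       the iteration space. */
    size_t count = 0;
    for (struct node *p = head; p != NULL && count < capacity; p = p->next)
        slots[count++] = p;

    /* Remainder executed as a doall over the precomputed iteration space. */
    #pragma omp parallel for
    for (long i = 0; i < (long)count; i++)
        WORK(slots[i]);
}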
Speculative Parallelization as a Parallel Programming Paradigm
The speculative techniques presented thus far concern themselves with the automatic transformation of sequential loops into parallel ones. They can validate, at run-time, that an instance of a loop executed in parallel does not have any cross-iteration dependences and thus has exactly the same output as its sequential original. This validation does not, and could not, realistically check whether the parallel and sequential executions have the same results. However, it imposes conditions on the memory reference trace which can ensure that the parallel
loop respects the same dependence graph as the sequential one. In general, the speculative parallelization presented thus far represents a gamble that its results are equal to those of the sequential execution because it executes the same fine-grain algorithm, i.e., the same fine-grain dependence graph as the original sequential loop. This approach has lent itself well to automatic compiler implementation. There is, however, a higher semantic level, albeit more complex, use of speculative parallelization. Instead of speculating on the execution of a fine-grain dependence graph (constructed from memory references), the programmer can generate the same results by following the correct execution of a coarser, semantically higher-level dependence graph. For example, the nodes of such a coarse graph could represent the methods of a container and the edges the order in which they will be invoked. The programmer may also specify high-level properties of the methods involved. For example, the order of "inquire" type method invocations may be declared as interchangeable (commutative), similar to the read operations on an array. Other methods may be required to respect the program order, e.g., add and delete. In general, the cause-and-effect relation of the operations can be user defined. The result is a partial order of operations, a coarse dependence graph. A speculative parallelization of a program at this level of abstraction will ensure that its execution respects the higher-level dependence graph. It implements a more relaxed execution model, thus possibly improving performance. The Galois system implements this approach to speculative parallel programming []. It is based on two key abstractions: (1) optimistic iterators that can iterate over sets of iterations in any order and (2) a collection of assertions about properties of methods in class libraries. In addition, a run-time scheme can detect and undo potentially unsafe operations due to speculation. The set iterator abstracts the possibility of selecting for execution elements of a set of iterations in any serial order (similar to the DOANY construct []). When an iteration executes, it may add new iterations to the iteration space, hence allowing for fully dynamic execution. This is inherently possible because new work can be inserted anywhere in the unordered set without changing the semantics of the loop. The set iterator
can become an optimistic iterator if the generated work can create conflicts at the iteration level. To exploit parallelism, the system relies on the semantic commutativity property of methods which must be declared by the programmer through special class interface method declarations. Finally, when the program executes, the methods are invoked like transactions (in isolation) and the commutativity property of the sequence of invoked methods is asserted. If it fails, then it is possible that operations have been executed in an illegal order and therefore have to be undone. The execution is rolled back using anti-methods, i.e., methods that undo the effects of the offending method. For example, an add operation can be undone by a delete operation. The entire system is reminiscent of Jefferson’s Virtual Time system []. Speculating about higher level operations using higher level abstractions allows the exploitation of algorithmic parallelism otherwise impossible to exploit at the memory reference level. It requires, however, the user to reason more about the employed algorithm.
Future Directions
Run-time parallelization in general, and speculative parallelization in particular, are likely to be essential techniques for both automatic and manual parallelization. Speculation has its drawbacks: the success rate of speculation is rather unpredictable and thus affects performance in a nondeterministic manner. Speculation may waste resources, including power, when it does not produce a speedup. However, given the ubiquitous parallelism encountered today, it is sometimes the only avenue of performance improvement. It would be of great benefit if meaningful statistics about the success rate of speculative parallelization could be found. Speculation will continue to be used for manual parallelization of irregular programs [], at least until good parallel algorithms are found. In this case, high-level speculation is likely to produce better results and more expressive programs. Speculative parallelization has been integrated into the Hybrid Analysis compiler framework [, ]. This framework seamlessly integrates static and dynamic analysis to extract the minimum sufficient conditions that need to be evaluated at run-time in order to validate a parallel execution of a loop. By carefully staging these sufficient conditions in order of their run-time
complexity, the compiler often minimizes the need for a full speculative parallelization, thus reducing the nondeterminism of the code performance. In conclusion, speculative parallelization is a globally applicable parallelization technique that can make the difference between a fully and a partially parallelized program, i.e., between scalable and non-scalable performance.
Related Entries
Debugging
Dependences
Dependence Analysis
Parallelization, Automatic
Race Conditions
Run Time Parallelization
Speculation, Thread-Level
Bibliographic Notes and Further Reading
Speculative run-time parallelization was first mentioned in the context of processes of parallel discrete event simulations by D. Jefferson []. In his virtual time model, concurrent processes are launched asynchronously and are tagged with their logical time stamp. When such processes communicate, they compare time stamps to check if their order respects the logical clock of the program (i.e., if data dependences are not violated). If a violation is detected, then anti-messages will recursively undo the effects of the incorrect computations and restart from a safe point. The LRPD test, i.e., the speculative parallelization of loops, was introduced in []. It has later been modified to parallelize loops with dependences []. Compiler optimizations [, ] have lowered its overhead. The Hybrid Analysis framework has integrated the LRPD test in a compiler framework []. A significant amount of later work [, , ] has followed the hardware-based approach to speculative parallelization presented in [, , , ]. Other related work reduces communication overhead via a master–slave model [] in which the master executes an optimistic (fast) approximation of the code, while the slaves verify the master's correctness. Another framework [, ] exploits method-level parallelism for Java applications. The design space of speculative
parallelization has been further widened by allowing tunable memory overhead [] by mapping more than one memory reference to the same "shadow" structure, but with the penalty of generating false dependence violations. Software Transactional Memory [, ] can be, and often is, implemented using speculative parallelization techniques. Debugging of parallel programs in general, and detecting memory reference "anomalies" in particular [], are topics related to speculative parallelization and data dependence violation detection.
Bibliography
Banerjee U () Dependence analysis for supercomputing. Kluwer, Boston
Burke M, Cytron R, Ferrante J, Hsieh W () Automatic generation of nested, fork-join parallelism. J Supercomput :–
Chen MK, Olukotun K () Exploiting method level parallelism in single threaded Java programs. In: International conference on parallel architectures and compilation techniques PACT, IEEE, Paris, pp –
Cintra M, Llanos DR () Toward efficient and robust software speculative parallelization on multiprocessors. In: International conference on principles and practice of parallel programming PPoPP, ACM, San Diego, pp –
Collard J-F () Space-time transformation of while-loops using speculative execution. In: Scalable high performance computing conference, IEEE, Knoxville, pp –
Dang F, Yu H, Rauchwerger L () The R-LRPD test: speculative parallelization of partially parallel loops. In: International parallel and distributed processing symposium, Florida
Eigenmann R, Hoeflinger J, Li Z, Padua D () Experience in the automatic parallelization of four Perfect-benchmark programs. Lecture notes in computer science. Proceedings of the fourth workshop on languages and compilers for parallel computing, Santa Clara, pp –
Hammond L, Willey M, Olukotun K () Data speculation support for a chip multiprocessor. In: International conference on architectural support for programming languages and operating systems, San Jose, pp –
Herlihy M, Shavit N () The art of multiprocessor programming. Morgan Kaufmann, London
Jefferson DR () Virtual time. ACM Trans Program Lang Syst ():–
Kazi IH, Lilja DJ () Coarse-grained thread pipelining: a speculative parallel execution model for shared-memory multiprocessors. IEEE Trans Parallel Distrib Syst ()
Kruskal C () Efficient parallel algorithms for graph problems. In: Proceedings of the international conference on parallel processing, University Park, pp –
Kuck DJ, Kuhn RH, Padua DA, Leasure B, Wolfe M () Dependence graphs and compiler optimizations. In: Proceedings of the ACM symposium on principles of programming languages, Williamsburg, pp –
Kulkarni M, Pingali K, Walter B, Ramanarayanan G, Bala K, Paul Chew L () Optimistic parallelism requires abstractions. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation, PLDI, ACM, New York, pp –
Thomson Leighton F () Introduction to parallel algorithms and architectures: arrays, trees, hypercubes. Morgan Kaufmann, London
Li Z () Array privatization for parallel execution of loops. In: Proceedings of the international symposium on computer architecture, Gold Coast, pp –
Maydan DE, Amarasinghe SP, Lam MS () Data dependence and data-flow analysis of arrays. In: Proceedings of the workshop on programming languages and compilers for parallel computing, New Haven
Oancea CE, Mycroft A, Harris T () A lightweight in-place implementation for software thread-level speculation. In: International symposium on parallelism in algorithms and architectures SPAA, ACM, Calgary, pp –
Padua DA, Wolfe MJ () Advanced compiler optimizations for supercomputers. Commun ACM :–
Papadimitriou S, Mowry TC () Exploring thread-level speculation in software: the effects of memory access tracking granularity. Technical report, CMU
Patel D, Rauchwerger L () Implementation issues of loop-level speculative run-time parallelization. In: Proceedings of the international conference on compiler construction (CC), Amsterdam. Lecture notes in computer science. Springer, Berlin
Rauchwerger L () Run-time parallelization: its time has come. Parallel Comput (–). Special issue on languages and compilers for parallel computing
Rauchwerger L, Amato N, Padua D () Run-time methods for parallelizing partially parallel loops. In: Proceedings of the ACM international conference on supercomputing, Barcelona, Spain, pp –
Rauchwerger L, Amato N, Padua D () A scalable method for run-time loop parallelization. Int J Parallel Prog ():–
Rauchwerger L, Padua DA () The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization. IEEE Trans Parallel Distrib Syst ():–
Rauchwerger L, Padua DA () Parallelizing WHILE loops for multiprocessor systems. In: Proceedings of the international parallel processing symposium, Santa Barbara
Rauchwerger L, Padua DA () The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization. In: Proceedings of the SIGPLAN conference on programming language design and implementation, La Jolla, pp –
Rundberg P, Stenstrom P () Low-cost thread-level data dependence speculation on multiprocessors. In: Workshop on multithreaded execution, architecture and compilation, Monterey
Rus S, Pennings M, Rauchwerger L () Sensitivity analysis for automatic parallelization on multi-cores. In: Proceedings of the ACM international conference on supercomputing (ICS), Seattle
Rus S, Hoeflinger J, Rauchwerger L () Hybrid analysis: static & dynamic memory reference analysis. Int J Parallel Prog ():–
Saltz J, Mirchandaney R () The preprocessed doacross loop. In: Schwetman HD (ed) Proceedings of the international conference on parallel processing, Software, vol II. CRC Press, Boca Raton, pp –
Saltz J, Mirchandaney R, Crowley K () The doconsider loop. In: Proceedings of the international conference on supercomputing, Irakleion, pp –
Saltz J, Mirchandaney R, Crowley K () Run-time parallelization and scheduling of loops. IEEE Trans Comput ():–
Schonberg E () On-the-fly detection of access anomalies. In: Proceedings of the SIGPLAN conference on programming language design and implementation, Portland, pp –
Shavit N, Touitou D () Software transactional memory. In: Proceedings of the fourteenth annual ACM symposium on principles of distributed computing, PODC, ACM, New York, pp –
Sohi GS, Breach SE, Vijayakumar TN () Multiscalar processors. In: International symposium on computer architecture, Santa Margherita
Steffan JG, Mowry TC () The potential for using thread-level data speculation to facilitate automatic parallelization. In: Proceedings of the international symposium on high-performance computer architecture, Las Vegas
Tu P, Padua D () Array privatization for shared and distributed memory machines. In: Proceedings of the workshop on languages, compilers, and run-time environments for distributed memory machines, Boulder
Tu P, Padua D () Automatic array privatization. In: Proceedings of the annual workshop on languages and compilers for parallel computing, Portland
Tu P, Padua D () Efficient building and placing of gating functions. In: Proceedings of the SIGPLAN conference on programming language design and implementation, La Jolla, pp –
Welc A, Jagannathan S, Hosking A () Safe futures for Java. In: International conference on object-oriented programming, systems, languages and applications OOPSLA, ACM, New York, pp –
Wolfe M () Optimizing compilers for supercomputers. MIT Press, Boston
Wolfe M () Doany: not just another parallel loop. In: Proceedings of the annual workshop on programming languages and compilers for parallel computing, New Haven. Lecture notes in computer science. Springer, Berlin
Yu H, Rauchwerger L () Run-time parallelization overhead reduction techniques. In: Proceedings of the international conference on compiler construction (CC), Berlin, Germany. Lecture notes in computer science. Springer, Heidelberg
Zhang Y, Rauchwerger L, Torrellas J () Hardware for speculative run-time parallelization in distributed shared-memory multiprocessors. In: Proceedings of the international symposium on high-performance computer architecture (HPCA), Las Vegas, pp –
Zhu C, Yew PC () A scheme to enforce data dependence on large multiprocessor systems. IEEE Trans Softw Eng ():–
Zilles C, Sohi G () Master/slave speculative parallelization. In: International symposium on microarchitecture Micro, IEEE, Los Alamitos, pp –
Zima H () Supercompilers for parallel and vector computers. ACM Press, New York
Speculative Run-Time Parallelization
Speculation, Thread-Level
Speculative Parallelization of Loops

Speculative Threading
Speculation, Thread-Level

Speculative Thread-Level Parallelization
Speculation, Thread-Level

Speedup
Metrics
SPIKE
Eric Polizzi
University of Massachusetts, Amherst, MA, USA
Definition
SPIKE is a polyalgorithm that uses many different strategies for solving large banded linear systems
in parallel. Existing parallel algorithms and software using direct methods for banded matrices are mostly based on LU factorizations. In contrast, SPIKE uses a novel decomposition method (i.e., the DS factorization) to balance communication overhead with arithmetic cost and achieve better scalability than other methods. The SPIKE algorithm is similar to a domain decomposition technique that allows performing independent calculations on each subdomain or partition of the linear system, while the interface problem leads to a reduced linear system of much smaller size than that of the original one. Direct, iterative, or approximate schemes can then be used to handle the reduced system in different ways depending on the characteristics of the linear system and the parallel computing platform.
Discussion
Introduction
Many science and engineering applications, particularly those involving finite element analysis, give rise to very large sparse linear systems. These systems can often be reordered to produce either banded systems or low-rank perturbations of banded systems in which the width of the band is but a small fraction of the size of the overall problem. In other instances, banded systems can act as effective preconditioners to general sparse systems, which are solved via iterative methods. Direct methods for solving linear systems AX = F are commonly based on the LU decomposition that represents a matrix A as a product of lower and upper triangular matrices, A = LU. Consequently, solving AX = F can be achieved by solutions of two triangular systems, LG = F and UX = G. A parallel LU decomposition for banded linear systems has also been proposed by Cleary and Dongarra in [] for the ScaLAPACK package []. The central idea behind the SPIKE algorithm is a different decomposition for banded linear systems, introduced by A. Sameh in the late 1970s [], which is ideally suited for parallel implementation as it naturally leads to lower communication cost. Several enhancements and variants of the SPIKE algorithm have since been proposed by Sameh and coauthors in [–]. In the case when A is a banded matrix as depicted in Fig. , SPIKE uses a direct partitioning in the context of parallel processing. SPIKE relies on the decomposition of a given banded matrix A into the product of a block-diagonal matrix D and another matrix S which has the structure of an identity matrix with some extra "spikes" (hence the name of the algorithm). This DS factorization procedure is illustrated in Fig. . Solving AX = F can then be accomplished in two steps:
1. Solution of the block-diagonal system DG = F. Because D consists of the decoupled systems of each of the diagonal blocks Aj, they can be solved in parallel without requiring any communication between the individual systems.
2. Solution of the system SX = G. This system has a wonderful characteristic: it is also decoupled to a large extent. Except for a reduced system (near the interface of each of the identity blocks), the rest are independent from one another. The natural way to tackle this system is to first solve the reduced system via some parallel algorithms that require inter-processor communications, followed by retrieval of the rest of the solution without requiring further inter-processor communications.

SPIKE. Fig. A banded matrix with a conceptual partition
SPIKE. Fig. SPIKE factorization where A = DS, S = D⁻¹A. The blocks in the block-diagonal matrix D are supposed non-singular
The SPIKE Algorithm: Basics
As illustrated in Fig. , a (N × N) banded matrix A can be partitioned into a block tridiagonal form {Cj, Aj, Bj}, where Aj is the (nj × nj) diagonal block j, Bj is the (ku × ku) right block, and Cj is the (kl × kl) left block. Using p partitions, it follows that nj is roughly equal to N/p. In order to ease the description of the SPIKE algorithm, but without loss of generality, the sizes of the off-diagonal blocks are both assumed equal to m (kl = ku = m). The size of the bandwidth is then defined by b = 2m + 1, where b << nj. Each partition j (j = 1, . . . , p) can be associated with one processor or one node, allowing multiple levels of parallelism. Using the DS factorization illustrated in Fig. , the obtained spike matrix S has a block tridiagonal form {Wj, Ij, Vj}, where Ij is the (nj × nj) identity matrix, and Vj and Wj are the (nj × m) right and left spikes. The spikes Vj and Wj are solutions of the following linear systems:

A_j V_j = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ B_j \end{bmatrix} \quad\text{and}\quad A_j W_j = \begin{bmatrix} C_j \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad ()

respectively for j = 1, . . . , p − 1 and j = 2, . . . , p.
Solving the system AX = F now consists of two steps:
(a) solve DG = F, ()
(b) solve SX = G. ()
The solution of the linear system DG = F in Step (a) yields the modified right-hand side G needed for Step (b). In case of assigning one partition to each processor, Step (a) is performed with perfect parallelism. To solve SX = G in Step (b), one should observe that the problem can be reduced further by solving a system of much smaller size, which consists of the m rows of S immediately above and below each partitioning line. Indeed, the spikes Vj and Wj can also be partitioned as follows

V_j = \begin{bmatrix} V_j^{(t)} \\ V_j' \\ V_j^{(b)} \end{bmatrix} \quad\text{and}\quad W_j = \begin{bmatrix} W_j^{(t)} \\ W_j' \\ W_j^{(b)} \end{bmatrix}, \qquad ()

where V_j^{(t)}, V_j', V_j^{(b)}, and W_j^{(t)}, W_j', W_j^{(b)} are the top m, the middle nj − 2m, and the bottom m rows of Vj and Wj, respectively. Here,

V_j^{(t)} = \begin{bmatrix} I_m & 0 \end{bmatrix} V_j; \quad V_j^{(b)} = \begin{bmatrix} 0 & I_m \end{bmatrix} V_j, \qquad ()

and

W_j^{(t)} = \begin{bmatrix} I_m & 0 \end{bmatrix} W_j; \quad W_j^{(b)} = \begin{bmatrix} 0 & I_m \end{bmatrix} W_j. \qquad ()
Similarly, if Xj and Gj are the jth partitions of X and G, it follows that

X_j = \begin{bmatrix} X_j^{(t)} \\ X_j' \\ X_j^{(b)} \end{bmatrix} \quad\text{and}\quad G_j = \begin{bmatrix} G_j^{(t)} \\ G_j' \\ G_j^{(b)} \end{bmatrix}.

It is then possible to extract from () a block tridiagonal reduced linear system of size 2(p − 1)m, which involves only the top and bottom elements of Vj, Wj, Xj, and Gj. As an example, the reduced system obtained for the case of four partitions (p = 4) is given by

\begin{bmatrix}
I_m & V_1^{(b)} & & & & \\
W_2^{(t)} & I_m & & V_2^{(t)} & & \\
W_2^{(b)} & & I_m & V_2^{(b)} & & \\
& & W_3^{(t)} & I_m & & V_3^{(t)} \\
& & W_3^{(b)} & & I_m & V_3^{(b)} \\
& & & & W_4^{(t)} & I_m
\end{bmatrix}
\begin{bmatrix} X_1^{(b)} \\ X_2^{(t)} \\ X_2^{(b)} \\ X_3^{(t)} \\ X_3^{(b)} \\ X_4^{(t)} \end{bmatrix}
=
\begin{bmatrix} G_1^{(b)} \\ G_2^{(t)} \\ G_2^{(b)} \\ G_3^{(t)} \\ G_3^{(b)} \\ G_4^{(t)} \end{bmatrix}. \qquad ()

Finally, once the solution of the reduced system is obtained, the global solution X can be reconstructed from X_k^{(b)} (k = 1, . . . , p − 1) and X_k^{(t)} (k = 2, . . . , p) either by computing

\begin{cases}
X_1' = G_1' - V_1' X_2^{(t)}, & \\
X_j' = G_j' - V_j' X_{j+1}^{(t)} - W_j' X_{j-1}^{(b)}, & j = 2, \ldots, p - 1 \\
X_p' = G_p' - W_p' X_{p-1}^{(b)}, &
\end{cases} \qquad ()

or by solving

\begin{cases}
A_1 X_1 = F_1 - \begin{bmatrix} 0 \\ I_m \end{bmatrix} B_1 X_2^{(t)}, & \\
A_j X_j = F_j - \begin{bmatrix} 0 \\ I_m \end{bmatrix} B_j X_{j+1}^{(t)} - \begin{bmatrix} I_m \\ 0 \end{bmatrix} C_j X_{j-1}^{(b)}, & j = 2, \ldots, p - 1 \\
A_p X_p = F_p - \begin{bmatrix} I_m \\ 0 \end{bmatrix} C_p X_{p-1}^{(b)}. &
\end{cases} \qquad ()

SPIKE: A Hybrid and Polyalgorithm
Multiple options are available for efficient parallel implementation of the SPIKE algorithm depending on the properties of the linear system as well as the architecture of the parallel platform. More specifically, the following stages of the SPIKE algorithm can be handled in several ways, resulting in a polyalgorithm:
1. Factorization of the diagonal blocks Aj. Depending on the sparsity pattern of the matrix and the size of the bandwidth, these diagonal blocks could be considered either as dense or as sparse within the band. For the dense banded case, a number of strategies based on the LU decomposition of each Aj can be applied here. This includes variants such as LU with pivoting, LU without any pivoting but with diagonal boosting, as well as a combination of LU and UL decompositions, either with or without pivoting. For the sparse banded case, it is common to use a sparse direct linear system solver to reorder and then factorize the diagonal blocks. However, solving the various linear systems for Aj can also be achieved using an iterative solver with a preconditioner. Finally, each partition in the decomposition can be associated with one or several processors (one node), enabling multilevel parallelism.
2. Computation of the spikes. If the spikes Vj and Wj are determined entirely, the reduced system () can be solved explicitly and equation () can be used to retrieve the entire solution. In contrast, if equation () is used to retrieve the solution, the spikes need not be computed entirely; only the top and bottom (m × m) blocks of Vj and Wj needed to form the reduced system are required. It should be noted that the determination of the top and bottom spikes is also not explicitly needed for computing the actions of the multiplications with W_j^{(t)}, W_j^{(b)}, V_j^{(t)}, and V_j^{(b)}. These can be realized "on-the-fly" using

\begin{bmatrix} I_m & 0 \end{bmatrix} A_j^{-1} \begin{bmatrix} I_m \\ 0 \end{bmatrix} C_j, \quad
\begin{bmatrix} 0 & I_m \end{bmatrix} A_j^{-1} \begin{bmatrix} I_m \\ 0 \end{bmatrix} C_j, \quad
\begin{bmatrix} I_m & 0 \end{bmatrix} A_j^{-1} \begin{bmatrix} 0 \\ I_m \end{bmatrix} B_j, \quad
\begin{bmatrix} 0 & I_m \end{bmatrix} A_j^{-1} \begin{bmatrix} 0 \\ I_m \end{bmatrix} B_j,

respectively.
3. Solution scheme for the reduced system. One of the earliest concerns with the SPIKE algorithm for a large number of partitions was to propose a reliable and efficient parallel strategy for solving the reduced
system (). Krylov subspace-based iterative methods have been the first candidates to fulfill this purpose, giving SPIKE its hybrid nature. These iterative methods are often used in conjunction with a block Jacobi preconditioner (i.e., the diagonal blocks of the reduced system) if the bottom of the Vj spikes and the top of the Wj spikes are computed explicitly. In turn, the matrix-vector multiplication operations of the iterative technique can be done explicitly or implicitly ("on-the-fly"). In order to enhance robustness and scalability for solving the reduced system, two new highly efficient direct methods have been introduced by Polizzi and Sameh in [, ]. These SPIKE schemes, which have been named the "truncated scheme" for handling diagonally dominant systems and the "recursive scheme" for non-diagonally dominant systems, are presented in the next sections. Here again, a number of different strategies exist for solving the reduced system. As mentioned above, and in order to minimize memory references, it is sometimes advantageous to factorize the diagonal blocks Aj using LU without any pivoting but adding a diagonal boosting if a "zero-pivot" is detected. Hence, A is not exactly the product DS and rather takes the form A = DS + R, where R represents the correction which, even if nonzero, is by design small in some sense. Outer iterations via Krylov subspace schemes or iterative refinement are then necessary to obtain sufficient accuracy, as SPIKE would act on M = DS (i.e., the approximate SPIKE decomposition for M is used as an effective preconditioner). Finally, a SPIKE-balance scheme has also been proposed by Golub, Sameh, and Sarin in [] for addressing the case where the diagonal blocks Aj are nearly singular (i.e., ill-conditioned), and when even the LU decomposition with partial pivoting is expected to fail.
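The outer iterations mentioned above can be sketched as a simple iterative refinement loop built on top of the approximate factorization M = DS; apply_A, solve_with_M, the tolerance, and the iteration cap are placeholders (a Krylov method preconditioned by M would be used in the same role).

#include <math.h>
#include <stdlib.h>

void apply_A(const double *x, double *y, size_t n);       /* y = A x (banded product)   */
void solve_with_M(const double *r, double *z, size_t n);  /* z ~ M^{-1} r via SPIKE     */

/* Iterative refinement: improve x until the residual of AX = F is small, using
   one approximate SPIKE solve (with M = DS) per iteration. */
void refine(double *x, const double *f, size_t n, double tol, int max_it)
{
    double *r = malloc(n * sizeof *r);
    double *z = malloc(n * sizeof *z);
    for (int it = 0; it < max_it; it++) {
        apply_A(x, r, n);
        double nrm = 0.0;
        for (size_t i = 0; i < n; i++) { r[i] = f[i] - r[i]; nrm += r[i] * r[i]; }
        if (sqrt(nrm) < tol) break;                    /* residual small enough      */
        solve_with_M(r, z, n);                         /* correction from M = DS     */
        for (size_t i = 0; i < n; i++) x[i] += z[i];
    }
    free(r);
    free(z);
}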
The Truncated SPIKE Scheme for Diagonally Dominant Systems
The truncated SPIKE scheme is an optimized version of the SPIKE algorithm with enhanced use of parallelism for handling diagonally dominant systems. These systems may arise from several science and engineering applications, and are defined if the degree of diagonal dominance, dd, of the matrix A is greater than 1, where dd is given by

dd = \min_i \frac{\lvert A_{i,i} \rvert}{\sum_{j \neq i} \lvert A_{i,j} \rvert}. \qquad ()

It is possible to show from equation () that the magnitude of the elements of the right spikes Vj decays from bottom to top, while the magnitude of the elements of the left spikes Wj decays from top to bottom [, ]. Since the size nj of Aj is much larger than the size m of the blocks Bj and Cj, the bottom blocks of the left spikes W_j^{(b)} and the top blocks of the right spikes V_j^{(t)} can be approximately set equal to zero. In fact, machine-zero accuracy is ensured to be reached either in the case of a pronounced decay (i.e., a high value of dd) or for a large ratio nj/m. Thus, it follows that the corresponding off-diagonal blocks of the reduced system () are equal to zero, and the solution of this new block-diagonal "truncated" reduced system can be obtained by solving p − 1 independent 2m × 2m linear systems in parallel:

\begin{bmatrix} I_m & V_j^{(b)} \\ W_{j+1}^{(t)} & I_m \end{bmatrix}
\begin{bmatrix} X_j^{(b)} \\ X_{j+1}^{(t)} \end{bmatrix}
=
\begin{bmatrix} G_j^{(b)} \\ G_{j+1}^{(t)} \end{bmatrix}. \qquad ()

These systems can be solved directly using a block-LU factorization, where the solution steps consist of the following: (a) form E = I_m − W_{j+1}^{(t)} V_j^{(b)}; (b) solve E X_{j+1}^{(t)} = G_{j+1}^{(t)} − W_{j+1}^{(t)} G_j^{(b)} to obtain X_{j+1}^{(t)}; (c) compute X_j^{(b)} = G_j^{(b)} − V_j^{(b)} X_{j+1}^{(t)}. Solving the reduced system via the truncated SPIKE algorithm for diagonally dominant systems then demonstrates linear scalability with the number of partitions. The truncated scheme is often associated with an outer iterative refinement step to increase the solution accuracy. Within the framework of the truncated scheme, two other major contributions have also been proposed for improving computing performance and scalability of the factorization stage: (a) a LU/UL strategy, and (b) a new unconventional partitioning scheme.
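For illustration, the block steps (a)-(c) above reduce to a few scalar operations when m = 1; the sketch below (a toy under that assumption, not the SPIKE library interface) solves one interface system in that special case, and for general m the same steps become small dense m × m factorizations and triangular solves.

/* Truncated reduced-system solve for one partition interface with m = 1,
   so that Im, Vj(b), and Wj+1(t) are scalars:
     (a) E   = 1 - w_t * v_b
     (b) x_t = (g_t - w_t * g_b) / E
     (c) x_b = g_b - v_b * x_t                                             */
void solve_interface(double v_b, double w_t, double g_b, double g_t,
                     double *x_b, double *x_t)
{
    double E = 1.0 - w_t * v_b;        /* step (a) */
    *x_t = (g_t - w_t * g_b) / E;      /* step (b) */
    *x_b = g_b - v_b * (*x_t);         /* step (c) */
}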
LU/UL Strategy
The truncated scheme facilitates different new options for the factorization step that make it possible to avoid the computation of the entire spikes.
SPIKE. Fig. The bottom of the spike Vj can be computed using only the bottom m × m blocks of L and U. Similarly, the top of the spike Wj may be obtained if one performs the UL-factorization
As illustrated in Fig. , computational solve times can be drastically reduced by using the LU factorization without pivoting on each diagonal block Aj. Obtaining the top block of Wj, however, would still require computing the entire spike with complete forward and backward sweeps. Another approach consists of also performing the UL-factorization of the block Aj without pivoting. Similar to the LU-factorization, this allows obtaining the top block of Wj involving only the top m × m blocks of the new factors U̇ and L̇. Numerical experiments indicate that the time consumed by this LU/UL strategy is much less than that taken by performing only one LU factorization per diagonal block and generating the entire left spikes.
Unconventional Partitioning Schemes
A new partitioning scheme can be introduced to avoid performing both the LU and UL factorizations for a given Aj (j = 2, . . . , p − 1). This scheme acts on a new parallel distribution of the system matrix, which considers fewer partitions than processors. Practically, if k represents an even number of processors (or nodes in the case of multilevel parallelism), the new number of partitions will be equal to p = (k + 2)/2. The new block matrices Aj, j = 1, . . . , p can be associated with the first p processors, while processors p + 1 to k hold another copy of the block matrices Aj, j = 2, . . . , p − 1. Figure illustrates the new partitioning of the matrix, right-hand side, and solution for the case k = 4 and p = 3. The diagonal blocks Aj associated with processors 1 to p − 1 are then factorized using LU without pivoting, while a UL factorization is used for the diagonal blocks associated with processors p to k. In the example of k = 4, Lj Uj ← Aj for j = 1, 2 (processors 1, 2), and U̇j L̇j ← Aj for j = 2, 3 (processors 4, 3). As described above using the LU/UL strategy, the V_j^{(b)} (j = 1, . . . , p − 1) can be obtained with minimal computational effort via the LU solve step on processors 1 to p − 1, while the W_j^{(t)} (j = 2, . . . , p) can be obtained in a similar way but now on different processors using the UL solve step. Using this new partitioning scheme, the size of the partitions does increase, but the number of arithmetic operations per partition decreases, as does the size of the truncated reduced system. This scheme achieves a better balance between the computational cost of solving the subproblems and the communication overhead. In addition to improving scalability for large numbers of processors, the scheme also addresses the "bottleneck" of the small number of processors case, as described below.
Speed-Up Performance on a Small Number of Processors
One of the main focuses in the development of parallel algorithms for solving linear systems is achieving linear scalability on a large number of processors. However, the emergence of multicore computing platforms in recent years has brought new emphasis on achieving a net speedup over the corresponding best sequential algorithms on a small number of processors/cores. Clearly, parallel algorithms often inherit extensive preprocessing stages with increased memory references or arithmetic, leading to performance losses on small numbers of cores. For parallel banded solvers, in particular, four to eight cores may usually be needed to solve linear systems as fast as the best corresponding sequential solver in LAPACK []. It is then important to note that the new partitioning scheme for the truncated SPIKE algorithm has also been designed to address this issue, offering a speedup of two from only two cores. Since the two-core (two-partition) case can take advantage of a single LU or UL factorization for A_1 and A_2, respectively, the effort to solve the reduced system becomes minimal (i.e., as compared to an LU decomposition of the overall system, the number of arithmetic operations is essentially divided by two in the SPIKE factorization and solve stages). When the number of cores increases, and without accounting for the communication costs (which are minimal for the truncated scheme), the
SPIKE. Fig. Illustration of the unconventional partitioning of the linear system for the truncated SPIKE algorithm in the case of 4 processors. A_1 is sent to processor 1, A_2 to processors 2 and 4, and A_3 to processor 3. [Diagram: block tridiagonal matrix A with diagonal blocks A_1, A_2, A_3 and coupling blocks B_j, C_j; the unknown X and right-hand side F are partitioned conformally into X_1, X_2, X_3 and F_1, F_2, F_3.]
speedup is ideally expected to equal the number of partitions: 2× on two cores, 3× on four cores, 5× on eight cores, etc. Thereafter, the SPIKE performance will approach linear scalability as the number of cores increases.
The Recursive SPIKE Scheme for Non-Diagonally Dominant Systems
For non-diagonally dominant systems and a large number of partitions, solving the reduced system () using a Krylov subspace iterative method, with or without preconditioner, may result in high interprocessor communication cost. Interestingly, the truncated scheme for the two-core (two-partition) case is also applicable to non-diagonally dominant systems, since the reduced system () then contains only one diagonal block. It should be noted, however, that this scheme may necessitate outer refinement steps, since the zero pivots handled by diagonal boosting in the LU and UL factorization stages are more likely to appear for non-diagonally dominant systems. For larger numbers of partitions, a new direct approach named the "recursive" scheme has been proposed for solving the reduced system and enhancing robustness, accuracy, and scalability. This recursive scheme consists of successive applications of the SPIKE algorithm to successively reduced systems, resulting in a better balance between the costs of computation and communication. The scheme assumes that the original number of (conventional) partitions is given by p = 2^d (d > 1). The bottom and top blocks of the V_j and W_j spikes are then computed explicitly to form the reduced system. In practice, a modified version of the reduced system is
preferred, which also includes the top block V_1^{(t)} and the bottom block W_p^{(b)}. For the case p = 4, the original reduced system () is now represented by the following "reduced spike matrix":

\begin{pmatrix}
I_m &     & V_1^{(t)} &     &     &     &     &     \\
    & I_m & V_1^{(b)} &     &     &     &     &     \\
    & W_2^{(t)} & I_m &   & V_2^{(t)} &  &     &     \\
    & W_2^{(b)} &     & I_m & V_2^{(b)} &  &     &     \\
    &     &     & W_3^{(t)} & I_m &     & V_3^{(t)} &     \\
    &     &     & W_3^{(b)} &     & I_m & V_3^{(b)} &     \\
    &     &     &     &     & W_4^{(t)} & I_m &     \\
    &     &     &     &     & W_4^{(b)} &     & I_m
\end{pmatrix}
\begin{pmatrix}
X_1^{(t)} \\ X_1^{(b)} \\ X_2^{(t)} \\ X_2^{(b)} \\ X_3^{(t)} \\ X_3^{(b)} \\ X_4^{(t)} \\ X_4^{(b)}
\end{pmatrix}
=
\begin{pmatrix}
G_1^{(t)} \\ G_1^{(b)} \\ G_2^{(t)} \\ G_2^{(b)} \\ G_3^{(t)} \\ G_3^{(b)} \\ G_4^{(t)} \\ G_4^{(b)}
\end{pmatrix}.   ()
This reduced spike system matrix contains p partitions, each with two diagonal block identities. The system can easily be redistributed in parallel using only p/2 partitions, which are factorized by SPIKE recursively until only two partitions remain. It can be shown [] that the two-partition case, leading to the 2m × 2m linear system presented in (), constitutes the basic computational kernel of the recursive SPIKE scheme.
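As an illustration of this 2m × 2m kernel, the sketch below solves the two-partition reduced system by block elimination, following the notation above; solve_dense is a hypothetical placeholder for a dense m × m solver (a LAPACK routine could be used instead), and a single right-hand side is assumed.

#include <stddef.h>

/* Hypothetical helper: solve the dense m x m system S x = b. */
void solve_dense(size_t m, double *S, const double *b, double *x);

/* Two-partition reduced system
 *   [ I    V1b ] [X1b]   [G1b]
 *   [ W2t  I   ] [X2t] = [G2t]
 * solved by block elimination:
 *   (I - W2t*V1b) X2t = G2t - W2t*G1b,   X1b = G1b - V1b*X2t.   */
void reduced_2x2(size_t m, const double *V1b, const double *W2t,
                 const double *G1b, const double *G2t,
                 double *X1b, double *X2t,
                 double *S /* m*m workspace */, double *rhs /* m workspace */)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < m; j++) {
            double s = (i == j) ? 1.0 : 0.0;       /* S = I - W2t*V1b */
            for (size_t k = 0; k < m; k++)
                s -= W2t[i*m + k] * V1b[k*m + j];
            S[i*m + j] = s;
        }
        double r = G2t[i];                          /* rhs = G2t - W2t*G1b */
        for (size_t k = 0; k < m; k++)
            r -= W2t[i*m + k] * G1b[k];
        rhs[i] = r;
    }
    solve_dense(m, S, rhs, X2t);

    for (size_t i = 0; i < m; i++) {                /* X1b = G1b - V1b*X2t */
        double s = G1b[i];
        for (size_t k = 0; k < m; k++)
            s -= V1b[i*m + k] * X2t[k];
        X1b[i] = s;
    }
}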
The SPIKE Solver: Current and Future Implementation
Since the publication of the first SPIKE algorithm in the late seventies [, ], many variations and new schemes have been implemented. In recent years, a comprehensive MPI-Fortran SPIKE package for distributed memory architectures has been developed by the author.
This implementation includes, in particular, all the different families of SPIKE algorithms: recursive, truncated, and on-the-fly schemes. These SPIKE solvers rely on a hierarchy of computational modules, starting with the data-locality-rich BLAS level-3, up to the blocked LAPACK [] algorithms for handling dense banded systems, or up to the direct sparse solver PARDISO [] for handling sparse banded systems, with SPIKE being on the outermost level of the hierarchy. The package also includes new primitives for banded matrices that make efficient use of BLAS level-3 routines. These include banded triangular solvers with multiple right-hand sides, banded matrix–matrix multiplications, and LU and UL factorizations with a diagonal boosting strategy. In addition, the large number of options/decision schemes available for SPIKE created the need for the automatic generation of a sophisticated runtime decision tree, "SPIKE-ADAPT," which has been developed by Intel. This adaptive layer indicates the most appropriate version of the SPIKE algorithm capable of achieving the highest performance for solving banded systems that are dense within the band. The relevant linear system parameters in this case are the system size, the number of nodes/processors to be used, the bandwidth of the linear system, and the degree of diagonal dominance. SPIKE and SPIKE-ADAPT have been regrouped into one package, named "Intel Adaptive Spike-Based Solver," which has been released to the public in June on the Intel whatif web site []. The SPIKE package also includes a SPIKE-PARDISO scheme for addressing banded linear systems with large sparse bandwidth while offering a basic distributed version of the current shared memory PARDISO package. The capabilities and domain applicability of the SPIKE-PARDISO scheme have recently been significantly enhanced by Manguoglu, Sameh, and Schenk in [] to address general sparse systems. In this approach, a weighted reordering strategy is used to extract efficient banded preconditioners that are solved via SPIKE-PARDISO, including new specific PARDISO features for computing the relevant bottom and top tips of the spikes. While the current parallel distributed SPIKE package offers HPC users a new and valuable tool for solving large-scale problems arising from many areas in science and engineering, the growing number of cores per compute node makes a purely distributed programming model (i.e., MPI) unattractive for many users. On the
other hand, the scalability of the LAPACK banded algorithms on a multicore node or SMP is first and foremost dependent on the threaded capabilities of the underlying BLAS routines. A new implementation of the SPIKE solver, recently initiated by the author, is concerned with a shared memory programming model (i.e., OpenMP) that can consistently match the LAPACK functions for solving banded systems. This SPIKE-OpenMP project is expected to offer highly efficient threaded alternatives for solving banded linear systems on current and emerging multicore architectures.
Related Entries
BLAS (Basic Linear Algebra Subprograms)
Collective Communication
Dense Linear System Solvers
Linear Algebra, Numerical
Load Balancing, Distributed Memory
Metrics
PARDISO
Preconditioners for Sparse Iterative Methods
ScaLAPACK
Bibliography Notes and Further Reading
As mentioned in the introduction, the main ideas of the SPIKE algorithm were introduced in the late 1970s [, ]; since then, many improvements and variations have been proposed [–, ]. In particular, the highly efficient truncated and recursive schemes for solving the reduced system presented here are discussed in more detail in [, ]. All the main SPIKE algorithm variations have been regrouped into a comprehensive MPI-based SPIKE solver package in [], where the associated SPIKE user guide contains more detailed information on the capabilities of the different SPIKE schemes and their domains of applicability.
Bibliography
1. Cleary A, Dongarra J () Implementation in ScaLAPACK of divide and conquer algorithms for banded and tridiagonal linear systems. University of Tennessee Computer Science Technical Report, UT-CS--
2. Blackford LS, Choi J, Cleary A, Dazevedo E, Demmel J, Dhillon I, et al () ScaLAPACK users guide. Society for Industrial and Appl. Math, Philadelphia
3. Sameh A () Numerical parallel algorithms: a survey. In: Kuck D, Lawrie D, Sameh A (eds) High speed computer and algorithm organization. Academic, New York, pp –
4. Sameh A, Kuck D () On stable parallel linear system solvers. J ACM :–
5. Sameh A () On two numerical algorithms for multiprocessors. In: Proceedings of NATO adv res workshop on high-speed comp. Series F: computer and systems sciences, vol . Springer, Berlin, pp –
6. Lawrie D, Sameh A () The computation and communication complexity of a parallel banded system solver. ACM Trans Math Software ():–
7. Dongarra J, Sameh A () On some parallel banded system solvers. Parallel Comput :–
8. Berry M, Sameh A () Multiprocessor schemes for solving block tridiagonal linear systems. Int J Supercomput Appl ():–
9. Sameh A, Sarin V () Hybrid parallel linear solvers. Int J Comput Fluid Dyn :–
10. Polizzi E, Sameh A () A parallel hybrid banded system solver: the SPIKE algorithm. Parallel Comput ():–
11. Polizzi E, Sameh A () SPIKE: a parallel environment for solving banded linear systems. Comput Fluids :–
12. Golub G, Sameh A, Sarin V () A parallel balance scheme for banded linear systems. Numer Linear Algebr Appl ():–
13. Demko S, Moss WF, Smith PW () Decay rates for inverses of band matrices. Math Comput ():–
14. Mikkelsen CCK, Manguoglu M () Analysis of the truncated SPIKE algorithm. SIAM J Matrix Anal Appl ():–
15. Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J, et al () LAPACK users guide, 3rd edn. Society for Industrial and Appl. Math, Philadelphia
16. Schenk O, Gärtner K () Solving unsymmetric sparse systems of linear equations with PARDISO. J Future Gener Comput Syst ():–
17. A distributed memory version of the SPIKE package can be obtained from http://software.intel.com/en-us/articles/inteladaptive-spike-based-solver/
18. Manguoglu M, Sameh A, Schenk O () PSPIKE: a parallel hybrid sparse linear system solver. Lecture notes in computer science, vol . Springer, Berlin, pp –
Spiral
Markus Püschel, Franz Franchetti, Yevgen Voronenko
ETH Zurich, Zurich, Switzerland
Carnegie Mellon University, Pittsburgh, PA, USA
Definition
Spiral is a program generation system (software that generates other software) for linear transforms and an increasing list of other mathematical functions. The goal of Spiral is to automate the development and porting of performance libraries. Linear transforms include the
discrete Fourier transform (DFT), discrete cosine transforms, convolution, and the discrete wavelet transform. The input to Spiral consists of a high-level mathematical algorithm specification and selected architectural and microarchitectural parameters. The output is performance-optimized code in a high-level language such as C, possibly augmented with vector intrinsics and threading instructions.
Discussion
Introduction
The advent of computers with multiple cores, SIMD (single-instruction multiple-data) vector instruction sets, and deep memory hierarchies has a dramatic effect on the development of high-performance software. The problem is particularly apparent for functions that perform mathematical computations, which form the core of most data or information processing applications. Namely, on a current workstation, the performance difference between a straightforward implementation of an optimal (minimizing operations count) algorithm and the fastest possible implementation is typically – times. As an example, consider Fig. , which shows the performance (in giga floating-point operations per second, Gflop/s) of four implementations of the discrete Fourier transform for varying input sizes on a quad-core Intel Core i7. Each one uses a fast algorithm with roughly the same operations count. Yet the difference between the slowest and the fastest is – times. The bottom line is the code from Numerical Recipes []. The best standard C code is about five times faster due to memory hierarchy optimizations and constant precomputation. Proper use of explicit vector intrinsic instructions yields another three times. Explicit threading for the four cores, properly done, yields another three times for large sizes. The plot shows that the compiler cannot perform these optimizations, as is true for most mathematical functions. The reason lies in both the compiler's lack of domain knowledge needed for the necessary transformations and the large set of optimization choices with uncertain outcome that the compiler cannot assess. Hence the optimization task falls to the programmer and requires considerable skill. Further, the optimizations are usually platform specific, and hence have to be repeated with every new generation of computers.
Spiral. Fig. Performance of different implementations of the discrete Fourier transform (DFT) and reason for the performance difference (From []). [Plot: DFT (single precision) on Intel Core i7 (4 cores), performance in Gflop/s vs. input size (16 to 1M); curves from bottom to top: Numerical Recipes, best scalar code (memory hierarchy: 5x), best vector code (vector instructions: 3x), best vector and parallel code (multiple threads: 3x).]
Spiral overcomes these problems by completely automating the implementation and optimization processes for the functions it supports. Complete automation means that Spiral produces source code for a given function given only a very high-level representation of the algorithms for this function and a high-level platform description. After algorithm and platform knowledge are inserted, Spiral can generate various types of code including for fixed and general input size, threaded or vectorized. The approach taken by Spiral is based on the following key principles: ●
Algorithm knowledge for a given mathematical function is represented in the form of breakdown rules in a domain-specific language. Each rule represents a divide-and-conquer algorithm. The language is based on mathematics, and is declarative and platform independent. These properties enable the mapping to various forms of parallelism from algorithm knowledge that is inserted only once. They also enable the derivation of the library structure for general input size implementations by computing the so-called recursion step closure. ● Platform knowledge is organized into paradigms. A paradigm is a feature of a platform that requires structural optimization and possibly source code extensions. Examples include shared memory parallelism and SIMD vector processing. Each paradigm
consists of a set of parameterized rewrite rules and base cases expressed in the same language as the algorithm knowledge. The base cases constitute a subset of the domain-specific language that maps well to a paradigm. The rewrite rules interact with the breakdown rules to produce algorithms that are base cases, which means they are structurally optimized for the considered paradigm. Examples of parameters include the SIMD vector length or the cacheline size. Paradigms are designed to be composable. ● Spiral uses empirical search to automatically explore choices in a feedback loop. This is done by generating candidate implementations and evaluating their performance. Even though theoretically unsatisfying, search enables further optimization for intricate microarchitectural details that may be unknown or are not well understood. In summary, Spiral integrates techniques from mathematics, programming languages, compilers, automatic performance tuning, and symbolic computation. The entire Spiral system combines aspects of a compiler, generative programming, and an expert system. The remainder of this section describes the framework underlying Spiral and the inner workings of the actual system. The presentation focuses on linear transforms; extensions of Spiral beyond transforms are briefly discussed in the end.
Algorithm Representation
Linear transforms. A linear transform is a function x ↦ Mx, where M is a fixed matrix, x is the input vector, and y = Mx the output vector. Different transforms correspond to different matrices M. For simplicity, M is referred to as the transform in the following. Most transforms M are square n × n, which implies that x and y are of length n. Most transforms exist for all n = 1, 2, . . . . Possibly the most well-known transform is the DFT, defined by the n × n matrix

DFT_n = [ω_n^{kℓ}]_{0 ≤ k, ℓ < n},   ω_n = e^{−2πi/n},   i = √−1.

Other examples include the discrete Hartley transform,

DHT_n = [cos(2πkℓ/n) + sin(2πkℓ/n)]_{0 ≤ k, ℓ < n},

and the discrete cosine transform (DCT) of type 2,

DCT-2_n = [cos(k(ℓ + 1/2)π/n)]_{0 ≤ k, ℓ < n}.

SPL expressions are built from basic matrices such as the identity I_n, the butterfly matrix

F_2 = \begin{bmatrix} 1 & 1 \\ 1 & −1 \end{bmatrix},

the stride permutation matrix L^n_k, defined for n = km by the underlying permutation

ℓ^n_k : im + j ↦ jk + i,   0 ≤ i < k,   0 ≤ j < m,   ()

and several others. Matrix operators include the matrix product, the direct sum

A ⊕ B = \begin{bmatrix} A & \\ & B \end{bmatrix},

and the tensor product

A ⊗ B = [a_{k,ℓ}B]_{0 ≤ k, ℓ < n}   for A = [a_{k,ℓ}]_{0 ≤ k, ℓ < n}.

Most important are the tensor products where A or B is the identity:

I_n ⊗ B = \begin{bmatrix} B & & \\ & ⋱ & \\ & & B \end{bmatrix},

and, for example,

\begin{bmatrix} a & b \\ c & d \end{bmatrix} ⊗ I_2 = \begin{bmatrix} aI_2 & bI_2 \\ cI_2 & dI_2 \end{bmatrix} = \begin{bmatrix} a & & b & \\ & a & & b \\ c & & d & \\ & c & & d \end{bmatrix}.
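The tensor product can be illustrated with a short routine that builds A ⊗ B explicitly for small dense matrices (an illustration written for this entry, not Spiral code):

#include <stdio.h>

/* C = A (x) B = [a_{k,l} * B], with A m x n, B p x q, C (m*p) x (n*q),
 * all dense and row-major. */
void kron(int m, int n, const double *A, int p, int q,
          const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int r = 0; r < p; r++)
                for (int s = 0; s < q; s++)
                    C[(i*p + r)*(n*q) + (j*q + s)] = A[i*n + j] * B[r*q + s];
}

int main(void)
{
    /* [a b; c d] (x) I_2 reproduces the 4 x 4 block pattern shown above. */
    double A[4] = {1, 2, 3, 4}, I2[4] = {1, 0, 0, 1}, C[16];
    kron(2, 2, A, 2, 2, I2, C);
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) printf("%4.0f", C[i*4 + j]);
        printf("\n");
    }
    return 0;
}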
A (partial) description of SPL in Backus–Naur form is provided in Table . Algorithms as SPL breakdown rules. Using SPL, the algorithm knowledge in Spiral is captured by breakdown rules. A breakdown rule represents a one-step divide-and-conquer algorithm of a transform. This means the transform is factorized into sparse matrices involving other, typically smaller, transforms.
Spiral. Table A subset of SPL in Backus–Naur form; n, k are positive integers, a_i are real or complex numbers

⟨spl⟩ ::= ⟨generic⟩ | ⟨basic⟩ | ⟨transform⟩
        | ⟨spl⟩ · · · ⟨spl⟩              (product)
        | ⟨spl⟩ ⊕ · · · ⊕ ⟨spl⟩          (direct sum)
        | ⟨spl⟩ ⊗ · · · ⊗ ⟨spl⟩          (tensor product)
        | . . .
⟨generic⟩ ::= diag(a_0, . . . , a_{n−1}) | . . .
⟨basic⟩ ::= I_n | L^n_k | F_2 | . . .
⟨transform⟩ ::= DFT_n | DHT_n | DCT-2_n | . . .
The most well-known example is the general-radix Cooley–Tukey fast Fourier transform (FFT):

DFT_n → (DFT_k ⊗ I_m) T^n_m (I_k ⊗ DFT_m) L^n_k,   n = km,   ()

where T^n_m is the diagonal matrix of twiddle factors. For n = 16 = 4 × 4, the factorization is visualized in Fig. together with the associated data-flow graph. The smaller DFT_4's are boxes of equal shades of gray. To terminate the recursion, base cases are needed. For example, for two-power sizes n, a size-two base case is sufficient:

DFT_2 → F_2.   ()
A few points are worth noting about this representation of transform algorithms:
● The representation () is point free, i.e., the input vector is not present.
● The representation () is declarative.
● Since the rule () is a matrix equation, it can be manipulated using matrix identities. For example, both sides can be inverted or transposed, to obtain an inverse or transposed transform algorithm.
● A breakdown rule may have degrees of freedom. An example is the choice of k in ().
● A rule like () does not specify how to compute the smaller transforms. This implies that rules have to be applied recursively until an algorithm is completely specified. Because of this and the availability of different rules for the same transform, there is a large set of choices. In other words, the relatively few existing rules yield a very large space of possible algorithms. This makes rules a very efficient representation of algorithm knowledge. For example, for n = 2^ℓ, () alone yields a number of different algorithms that grows exponentially with ℓ, all with roughly the same operations count.
Spiral contains about breakdown rules for about transforms, some of which are auxiliary. The most important rules for the DFT, without complete specification, are shown in Table . Note the occurrence of auxiliary transforms.
Spiral Program Generation: Overview The task performed by Spiral is to translate the algorithm knowledge (represented as in Table ) for a given transform into optimized source code (we assume C/C++) for a given platform. The exact approach for generating the code depends on the type of code that has to be generated. The most important distinctions are the following: ●
Fixed input size versus general input size: If the input size is known (e.g., “DFT of size ” as shown in Table a and b), the algorithm to be used and other decisions can be determined at program generation time and can be inlined. The result is a function containing only loops and basic blocks of straightline code. If the input size is not known, it becomes an additional input and the implementation becomes recursive (Table c). The actual algorithm, i.e., recursive computation, is now chosen at runtime once the input size is known. ● Straightline code versus loop code (fixed input size only): Straightline code (Table a) is only suitable for small sizes, but can be faster, due to reduced overhead and increased opportunities for algebraic simplifications. Loop code (Table b) requires additional optimizations that merge redundant loops. ● Scalar code versus parallel code: Code that is parallelized for SIMD vector extensions or multiple cores requires specific optimizations and the use of explicit vector intrinsics or threading directives.
Spiral. Fig. Cooley–Tukey FFT () for 16 = 4 × 4 as SPL rule, DFT_16 → (DFT_4 ⊗ I_4) T^16_4 (I_4 ⊗ DFT_4) L^16_4, and as (complex) data-flow graph (from right to left). Some lines are bold to emphasize the strided access of the DFT_4s (From [])
Spiral. Table A selection of breakdown rules representing algorithm knowledge for the DFT. rDFT is an auxiliary transform and has two parameters. RDFT is a version of the real DFT

DFT_n → (DFT_k ⊗ I_m) T^n_m (I_k ⊗ DFT_m) L^n_k,   n = km   (Cooley–Tukey FFT)
DFT_n → V_n^{−1} (DFT_k ⊗ I_m)(I_k ⊗ DFT_m) V_n,   n = km, gcd(k, m) = 1   (Prime-factor FFT)
DFT_n → W_n^{−1} (I_1 ⊕ DFT_{p−1}) E_n (I_1 ⊕ DFT_{p−1}) W_n,   n = p prime   (Rader FFT)
DFT_n → B′_n D_m DFT_m D′_m DFT_m D′′_m B_n,   m ≥ 2n   (Bluestein FFT)
DFT_n → P^⊤_{k,m} (DFT_m ⊕ (I_{k−1} ⊗_i C_m rDFT_{m, i/k})) (RDFT_k ⊗ I_m),   n = km
RDFT_n → (P^⊤_{k,m} ⊗ I_2) (RDFT_m ⊕ (I_{k−1} ⊗_i D_m rDFT_{m, i/k})) (RDFT_k ⊗ I_m),   n = km
rDFT_{n,u} → L^n_m (I_k ⊗_i rDFT_{m, (i+u)/k}) (rDFT_{k,u} ⊗ I_m),   n = km
Spiral. Table Code types

(a) Fixed input size, unrolled:

void dft_4(cpx *Y, cpx *X) {
  cpx s, t, t2, t3;
  t  = (X[0] + X[2]);
  t2 = (X[0] - X[2]);
  t3 = (X[1] + X[3]);
  s  = _I_*(X[1] - X[3]);
  Y[0] = (t + t3);
  Y[2] = (t - t3);
  Y[1] = (t2 + s);
  Y[3] = (t2 - s);
}

(b) Fixed input size, looped:

void dft_4(cpx *Y, cpx *X) {
  cpx T[4];
  cpx W[2] = {1, _I_};
  for (int i = 0; i <= 1; i++) {
    cpx w = W[i];
    T[2*i]   = (X[i] + X[i+2]);
    T[2*i+1] = w*(X[i] - X[i+2]);
  }
  for (int j = 0; j <= 1; j++) {
    Y[j]   = T[j] + T[j+2];
    Y[2+j] = T[j] - T[j+2];
  }
}

(c) General input size library, recursive:

struct dft : public Env {
  dft(int n);                      // constructor
  void compute(cpx *Y, cpx *X);
  int _rule, f, n;
  char *_dat;
  Env *ch1, *ch2;
};

void dft::compute(cpx *Y, cpx *X) {
  ch2->compute(Y, X, n, f, n, f);
  ch1->compute(Y, Y, n, f, n, n/f);
}

The program generation process is explained in the next four sections corresponding to four different code types of increasing difficulty. The order matches the historic development, since for each move to the next code type
at least one new idea had to be introduced. The types and main ideas (in parentheses) are
● Fixed input size straightline code (SPL, breakdown rules, feedback loop)
● Fixed input size loop code (Σ-SPL, loop merging)
● Fixed input size parallel code (paradigms, tagged rewriting)
● General input size code (recursion step closure, parameterization)
Spiral generates code for fixed input size transforms (first three bullets), as shown in Fig. . The input is the transform symbol (e.g., "DFT") and the size (e.g., "128"). The output is a C function that computes the transform (y = DFT_128 x in this case). Depending on the code type, not all blocks in Fig. may be used. The block diagram for the general input size code is shown later.
Fixed Input Size: Straightline Code
Given as input to Spiral is a transform symbol ("DFT") and the input size. The program generation does not need the parallelization and loop optimization blocks. Further, no Σ-SPL is needed, which means the SPL-to-Σ-SPL block and the Σ-SPL-to-code block are joined to one SPL-to-code block.
Algorithm generation. Spiral uses a rewrite system that recursively applies the breakdown rules (e.g., Table ) to generate a complete SPL algorithm for the transform. As mentioned before, there are many choices
due to the choice of rule and the degree of freedom in some rules (e.g., k in ()).
SPL to C code and optimization. The SPL expression is then compiled into actual C code using the internal SPL compiler, which recursively applies the translation rules sketched in Table . All loops are unrolled and code-level optimizations are applied. These include array scalarization, constant propagation, and algebraic simplification.
Performance evaluation. The runtime of the resulting code is measured and fed into the search block that controls the algorithm generation.
Search. The search drives a feedback loop that generates and evaluates different algorithms to find the fastest. Dynamic programming has proven to work best in many cases, but other techniques including evolutionary search or bandit-based Monte Carlo exploration have also been studied.
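The following is a much simplified sketch of such a dynamic-programming search over Cooley–Tukey splits for two-power sizes (illustration only: cost() is a hypothetical stand-in for timing generated code, and Spiral's actual search operates on SPL rule trees):

#include <stdio.h>

#define MAX_E 21
static double best_cost[MAX_E];     /* memoized best cost per size 2^e   */
static int    best_split[MAX_E];    /* chosen split: 2^e = 2^s * 2^(e-s) */

static double cost(int e, int s)    /* placeholder for a measured runtime */
{
    return (double)(1 << e) * e + s;
}

static double search(int e)
{
    if (best_cost[e] > 0.0) return best_cost[e];      /* already solved */
    if (e == 1) { best_cost[1] = 1.0; return 1.0; }   /* base case DFT_2 */
    double best = -1.0;
    for (int s = 1; s < e; s++) {   /* dynamic programming over radices */
        double c = search(s) + search(e - s) + cost(e, s);
        if (best < 0.0 || c < best) { best = c; best_split[e] = s; }
    }
    best_cost[e] = best;
    return best;
}

int main(void)
{
    double c = search(10);          /* DFT of size 1024 */
    printf("chosen radix for DFT_1024: 2^%d (model cost %.0f)\n",
           best_split[10], c);
    return 0;
}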
Fixed Input Size: Loop Code The approach to generating straightline code can also be used to generate loop code (Table yields loops), but the code will be inefficient. The problem: Loop merging. To illustrate the problem, consider the SPL expression (I ⊗ F )L . Transform (“DFT”) input size (“128”)
Platform knowledge (paradigms)
parallelization
loop optimizations
Algorithm generation algorithm (SPL) SPL→Σ-SPL loop merging algorithm ( Σ-SPL) Σ-SPL→C code Code optimizations
S Search
Algorithm knowledge (breakdown rules)
source code (C) Spiral (fixed input size)
Performance evaluation
C implementation: DFT_128(*y, *x) { … }
Spiral. Fig. Spiral program generator for fixed input size functions. For straightline code, no Σ-SPL is needed and SPL is translated directly into C code
Spiral. Table Translation of SPL to code. The subscript of A, B specifies the (square) matrix size. x[b:s:e] denotes (Matlab style) the subvector of x starting at b, ending at e, and extracted at stride s. D is a diagonal matrix, whose diagonal elements are stored in an array with the same name

SPL expression S        Pseudo code for y = Sx
A_n B_n                 t[0:1:n-1] = B(x[0:1:n-1]); y[0:1:n-1] = A(t[0:1:n-1]);
I_m ⊗ A_n               for (i=0; i<m; i++) y[i*n:1:i*n+n-1] = A(x[i*n:1:i*n+n-1]);
A_m ⊗ I_n               for (i=0; i<n; i++) y[i:n:i+n*(m-1)] = A(x[i:n:i+n*(m-1)]);
D_n                     for (i=0; i<n; i++) y[i] = D[i]*x[i];
L^{km}_k                for (i=0; i<k; i++) for (j=0; j<m; j++) y[i*m+j] = x[j*k+i];
F_2                     y[0] = x[0] + x[1]; y[1] = x[0] - x[1];
Spiral. Fig. The loop merging problem for (I_4 ⊗ F_2)L^8_4: (a) two stages; (b) one stage
The application of Table yields the code visualized in Fig. a:
// Input: double x[8], output: y[8]
double t[8];
for (int i=0; i<4; i++) {
  for (int j=0; j<2; j++) {
    t[i*2+j] = x[j*4+i];
  }
}
for (int j=0; j<4; j++) {
  y[2*j]   = t[2*j] + t[2*j+1];
  y[2*j+1] = t[2*j] - t[2*j+1];
}
This is known to be suboptimal since the permutation (first loop) can be fused with the subsequent computation loop, thus eliminating one pass through the data (Fig. b):

// Input: double x[8], output: y[8]
for (int j=0; j<4; j++) {
  y[2*j]   = x[j] + x[j+4];
  y[2*j+1] = x[j] - x[j+4];
}
This transformation cannot be expressed in SPL and, in the general case, is difficult to perform on C code. To solve this problem, Σ-SPL was developed, an extension
of SPL that can express loops. The loop merging is then performed by rewriting Σ-SPL expressions.
Σ-SPL. Σ-SPL adds four basic components to SPL:
1. Index mapping functions
2. Scalar functions
3. Parameterized matrices
4. The iterative sum ∑
These are defined next. An integer interval is denoted by I_n = {0, . . . , n − 1}, and an index mapping function f with domain I_n and range I_N is denoted by

f^{n→N} : I_n → I_N; i ↦ f(i).

An example is the stride function

h^{n→N}_{b,s} : I_n → I_N; i ↦ b + is,   s ∣ N.   ()

Permutations are written as f^{n→n} = f^n, such as the stride permutation in (). A scalar function f : I_n → C; i ↦ f(i) maps an integer interval to the domain of complex or real numbers, and is abbreviated as f^{n→C}. Scalar functions are used to describe diagonal matrices. Σ-SPL adds four types of parameterized matrices to SPL (gather, scatter, permutation, diagonal):

G(f^{n→N}), S(f^{n→N}), P(f^n), and diag(f^{n→C}).

Their translation into actual code (which also defines the matrices) is shown in Table . For example, G(h^{n→N}_{b,s}) gathers the length-n subvector that starts at b and is extracted at stride s, and S(h^{n→N}_{b,s}) = G(h^{n→N}_{b,s})^⊤ is the corresponding scatter. Finally, Σ-SPL adds the iterative matrix sum

∑_{i=0}^{k−1} A_i

to represent loops. The A_i are restricted such that no two A_i have a nonzero entry in the same row. The following example shows how ⊗ is converted into a sum. A is assumed to be n × n, and domain and range in the occurring stride functions are omitted for simplicity:

I_k ⊗ A = diag(A, . . . , A)
        = S(h_{0,1}) A G(h_{0,1}) + · · · + S(h_{(k−1)n,1}) A G(h_{(k−1)n,1})
        = ∑_{i=0}^{k−1} S(h_{in,1}) A G(h_{in,1}).
Spiral. Table Translation of Σ-SPL to code

Σ-SPL expression S       Code for y = Sx
G(f^{n→N})               for (i=0; i<n; i++) y[i] = x[f(i)];
S(f^{n→N})               for (i=0; i<n; i++) y[f(i)] = x[i];
P(f^n)                   for (i=0; i<n; i++) y[i] = x[f(i)];
diag(f^{n→C})            for (i=0; i<n; i++) y[i] = f(i)*x[i];
∑_{i=0}^{k−1} A_i        for (i=0; i<k; i++) { code for y = A_i x }
Intuitively, the conversion to Σ-SPL makes the loop structure of y = (I_k ⊗ A)x explicit. In each iteration i, G(⋅) and S(⋅) specify how to read and write a portion of the input and output, respectively, to be processed by A.
Loop merging using Σ-SPL and rewriting. Using Σ-SPL, the loop merging problem identified before in the example (I_4 ⊗ F_2)L^8_4 is solved by the loop optimization block in Fig. as follows:
(I_4 ⊗ F_2)L^8_4 → (∑_{i=0}^{3} S(h_{2i,1}) F_2 G(h_{2i,1})) P(ℓ^8_4)
               → ∑_{i=0}^{3} (S(h_{2i,1}) F_2 G(ℓ^8_4 ∘ h_{2i,1}))
               → ∑_{i=0}^{3} (S(h_{2i,1}) F_2 G(h_{i,4}))
The first step translates SPL into Σ-SPL. The second step performs the loop merging by composing the permutation ℓ with the index functions of the subsequent gathers. The third step simplifies the resulting index functions. After that, actual C code is generated using Table . Besides the added loop optimizations block in Fig. , the program generation for loop code operates iteratively exactly as for straightline code.
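The index-function composition performed in the second and third steps can be checked numerically; the small program below (an illustration written for this entry, not part of Spiral) verifies that composing ℓ^8_4 with h_{2i,1} indeed yields h_{i,4}:

#include <assert.h>
#include <stdio.h>

/* l^8_4 maps index 2i + j to 4j + i (n = 8, k = 4, m = 2). */
static int ell_8_4(int a)
{
    int i = a / 2, j = a % 2;
    return 4*j + i;
}

static int h(int b, int s, int t) { return b + s*t; }   /* h_{b,s}(t) */

int main(void)
{
    for (int i = 0; i < 4; i++)          /* loop iteration (block) i   */
        for (int t = 0; t < 2; t++)      /* element t within the block */
            assert(ell_8_4(h(2*i, 1, t)) == h(i, 4, t));
    printf("l^8_4 o h_{2i,1} = h_{i,4} verified for all i, t\n");
    return 0;
}

This is exactly why the merged code above reads x[j] and x[j+4] directly instead of first permuting the input.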
Fixed Input Size: Parallel Code
As was illustrated in Fig. , for compute functions, compilers usually fail to optimally (or at all) exploit the parallelism offered by a platform. Hence, the task falls
with the programmer, who has to leave the standard C programming model and insert explicit threading or OpenMP loops for shared memory parallelism and so-called intrinsics for vector instruction sets. However, doing so in a straightforward way does not necessarily yield good performance.
The problem: Algorithm structure. To illustrate the problem, consider a target platform with four cores that share a cache with a cache block size of two complex numbers. The first goal is to obtain parallel code with four threads for I_4 ⊗ F_2, visualized in Fig. a. The computation is data parallel; hence, the loop suggested in Table can be replaced, for example, by an OpenMP parallel loop. Note that each processor "owns" as working set exactly one cache block; hence, the parallelization will be efficient. Now consider again the SPL expression (I_4 ⊗ F_2)L^8_4, visualized in Fig. b. The computation is again data parallel, but the access pattern has changed such that always two processors access the same cache block. This produces false sharing, which triggers the cache coherency protocol and reduces performance. The problem is obviously the permutation L^8_4. Since the rules (e.g., those in Table ) contain many, and various, permutations, a straightforward mapping to parallel code will yield highly suboptimal performance. To solve this problem inside Spiral, another rewrite system is introduced to restructure algorithms before mapping to parallel code. The restructuring will be different for different forms of parallelism, called paradigms.
Paradigms and tagged rewriting. A paradigm in Spiral is a feature of the target platform that requires structural optimization. Typically, a paradigm is a form of parallelism. Examples include shared memory parallelism (SMP) and SIMD parallelism. A paradigm may be parameterized, for example, by the vector length ν for SIMD parallelism. In Spiral, a paradigm manifests itself by another rewrite system provided by the additional parallelism block in Fig. (and backend extensions in the Σ-SPL to C code block to produce the actual code). The goal of the new rewrite system is to structurally optimize a given SPL expression into a form that can be efficiently mapped to a given paradigm. The rewrite system is built from three main components: ●
● Tags encode the paradigm and relevant parameters. Examples include the tags "vec(ν)" for SIMD vector extensions and the tag "smp(p, μ)" for SMP. The meaning of the parameters is explained later.
● Base cases are SPL constructs that can be mapped well to a given paradigm. As illustrated above, one example is any I_p ⊗ A_n for p-way SMP.
● Tagged rewrite rules are mathematical identities that translate general SPL expressions toward base cases. An example is the rule (assuming p∣n)

A_m ⊗ I_n   [tagged smp(p, μ)]   →   L^{mn}_m (I_p ⊗ (I_{n/p} ⊗ A_m)) L^{mn}_n,

where the two stride permutation factors remain tagged smp(p, μ). The rule extracts the p-way parallel loop (base case) I_p ⊗ (I_{n/p} ⊗ A_m) from A_m ⊗ I_n. The stride permutations L^{mn}_m and L^{mn}_n are handled by further rewriting.
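A hand-written OpenMP sketch of the base case I_p ⊗ (I_{n/p} ⊗ A_m) may look as follows (illustration only; apply_A is a hypothetical kernel for A_m, and this is not code generated by Spiral):

#include <omp.h>

/* Each of the p threads applies A_m to n/p contiguous chunks of length m,
 * so every thread touches a contiguous block of data and no false sharing
 * occurs. */
void apply_A(int m, double *y, const double *x);   /* hypothetical A_m kernel */

void Ip_tensor_A(int p, int n, int m, double *y, const double *x)
{
    #pragma omp parallel for num_threads(p)
    for (int i = 0; i < p; i++)                /* one iteration per thread   */
        for (int j = 0; j < n/p; j++) {        /* I_{n/p} (x) A_m, sequential */
            int off = (i*(n/p) + j) * m;
            apply_A(m, y + off, x + off);
        }
}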
Spiral. Fig. Mapping SPL constructs to four threads: (a) I_4 ⊗ F_2, (b) (I_4 ⊗ F_2)L^8_4. Each thread computes one F_2. Both computations are data parallel, but (a) produces no false sharing, whereas (b) does
Example: SMP. For SMP, the tag smp(p, μ) contains the number of processors p and the cache block size μ. Base cases include I_p ⊗ A_n and P ⊗ I_μ, where P is any permutation. P ⊗ I_μ moves data in blocks of size μ; hence false sharing is avoided. From these, other base cases can be built recursively, as captured by the sketched grammar in Table . Some SMP rewrite rules are shown in Table . Note that the rewriting is not unique, and not every sequence of rules terminates. Once all tags disappear, the rewriting terminates.
Example: SIMD. For SIMD, the tag vec(ν) contains only the vector length ν. The most important base case is A_n ⊗ I_ν, which can be mapped to vector code by generating scalar code for A_n and replacing every operation by its corresponding ν-way vector operation. Other base cases include small stride permutations such as L^{ν²}_ν, L^{2ν}_2, and L^{2ν}_ν, which are generated automatically from the instruction set [].
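For instance, the base case F_2 ⊗ I_4 for ν = 4 can be written with SSE intrinsics as follows (a hand-written illustration, not Spiral output):

#include <xmmintrin.h>   /* SSE intrinsics, 4-way single precision (nu = 4) */

/* F_2 (x) I_4: scalar code for F_2 is y0 = x0 + x1, y1 = x0 - x1;
 * replacing each scalar operation by its 4-way counterpart yields
 * vector code operating on blocks of 4 floats. */
void f2_tensor_I4(float *y, const float *x)
{
    __m128 x0 = _mm_loadu_ps(x);        /* x[0..3] */
    __m128 x1 = _mm_loadu_ps(x + 4);    /* x[4..7] */
    _mm_storeu_ps(y,     _mm_add_ps(x0, x1));
    _mm_storeu_ps(y + 4, _mm_sub_ps(x0, x1));
}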
Spiral. Table smp(p, μ) base cases in Backus–Naur form; n is a positive integer, a_i are real or complex numbers

⟨smp⟩ ::= ⟨generic⟩ | ⟨basic⟩
        | ⟨smp⟩ · · · ⟨smp⟩            (product)
        | ⟨smp⟩ ⊕ · · · ⊕ ⟨smp⟩        (direct sum)
        | I_n ⊗ ⟨smp⟩                  (tensor product)
        | . . .
⟨generic⟩ ::= diag(a_0, . . . , a_{n−1}) | . . .
⟨basic⟩ ::= I_p ⊗ A_n | P ⊗ I_μ | . . .
Spiral. Table Examples of smp(p, μ) rewrite rules; expressions carrying the smp(p, μ) tag are rewritten further until only base cases remain

A B → A · B   (the tag distributes from the product to each factor)
A_m ⊗ I_n → (L^{mp}_m ⊗ I_{n/p}) (I_p ⊗ (A_m ⊗ I_{n/p})) (L^{mp}_p ⊗ I_{n/p})
I_m ⊗ A_n → I_p ⊗ (I_{m/p} ⊗ A_n)
L^{mn}_m → (I_p ⊗ L^{mn/p}_{m/p}) (L^{pn}_p ⊗ I_{m/p})   or   (L^{pm}_m ⊗ I_{n/p}) (I_p ⊗ L^{mn/p}_m)
(P ⊗ I_n) → (P ⊗ I_{n/μ}) ⊗ I_μ
Similar to Table , the entire set of vector base cases is specified by a grammar recursively built from the above special constructs.
Parallelization by rewriting. In Spiral, parallelization adds the new parallelization block in Fig. . The parallelization rules are applied interleaved with the breakdown rules to generate SPL algorithms that have the right structure for the desired paradigm. For example, for the DFT it may operate as follows (the smp(p, μ) tag first covers the entire right-hand side, then distributes over the factors, and disappears once a factor is a base case):

DFT_mn → ((DFT_m ⊗ I_n) T^{mn}_n (I_m ⊗ DFT_n) L^{mn}_m)
       → · · ·
       → (DFT_m ⊗ I_n) T^{mn}_n (I_m ⊗ DFT_n) L^{mn}_m
       → ((L^{mp}_m ⊗ I_{n/pμ}) ⊗ I_μ) (I_p ⊗ (DFT_m ⊗ I_{n/p})) ((L^{mp}_p ⊗ I_{n/pμ}) ⊗ I_μ) T^{mn}_n
         (I_p ⊗ (I_{m/p} ⊗ DFT_n)) (I_p ⊗ L^{mn/p}_{m/p}) ((L^{pn}_p ⊗ I_{m/pμ}) ⊗ I_μ)

First, Spiral applies the breakdown rule (). Then the parallelization rules transform the resulting SPL expression in several steps. Note how the final expression has only access patterns (permutations) of the form P ⊗ I_μ and all computations are in the form I_p ⊗ A (and the diagonal T^{mn}_n). The smaller DFTs can be expanded in different ways, for example, by rewriting for SIMD. Further choices are used for search. The remaining operation of Spiral including Σ-SPL conversion and search proceeds as before.
General Input Size
An implementation that can compute a transform for arbitrary input size is fundamentally different from one for fixed input size (compare Table b and c). If the input size n is fixed, for example, n = 4, the computation is

(x,y) -> dft_4(y,x)
and all decisions such as the choice of recursion until base cases are reached can be made at implementation time. In an equivalent implementation (called library) for general input size n, (n,x,y) -> dft(n,y,x)
the recursion is fixed only once the input size is known. Formally, the computation now becomes n -> ((x,y) -> dft(n,y,x))
which is an example of function currying. A C++ implementation is sketched in Table c, where the two steps would take the form

dft *f = new dft(n);   // initialization
f->compute(y, x);      // computation
The first step determines the recursion to be taken using search or heuristics, and precomputes the twiddle factors needed for the computation. The second step performs the actual computation. The underlying assumption is that the cost of the first step is amortized by a sufficient number of computations. This model is used by FFTW [] and the libraries generated by Spiral. To support the above model, the implementation needs recursive functions. The major problem is that the optimizations introduced before operate in nontrivial ways across function boundaries, thus creating more functions than expected. The challenge is to derive these functions automatically.
The problem: Loop merging across function boundaries. To illustrate the problem, consider the Cooley–Tukey FFT (). A direct recursive implementation would consist of four steps corresponding to the four matrix factors in (). Two of the steps would call smaller DFTs:

void dft(int n, cpx *y, cpx *x) {
  int k = choose_factor(n);
  int m = n/k;
  // t1 = L(n,k)*x
  cpx *t1 = Permute x with L(n,k);
  // t2 = (I_k tensor DFT_m)*t1
  for(int i=0; i < k; ++i)
    dft(m, t2 + m*i, t1 + m*i);
  // t3 = T^n_m*t2 (scale by the twiddle factors d[])
  for(int i=0; i < n; ++i)
    t3[i] = d[i]*t2[i];
  // y = (DFT_k tensor I_m)*t3
  for(int i=0; i < m; ++i)
    dft_stride(k, m, y + i, t3 + i);
}
Note how even this simple implementation is not self-contained. A new function dft_stride is needed that accesses the input in a stride and produces the output at the same stride (see the data flow in Fig. ). However, as explained before, loops should be merged where possible. For fixed size code, Spiral would merge the first loop with the second, and the third loop with the fourth, using Σ-SPL rewriting. The same can be done in the general size recursive implementation, but the merging crosses function boundaries:

void dft(int n, cpx *y, cpx *x) {
  int k = choose_factor(n);
  // t1 = (I_k tensor DFT_m)L(n,k)*x
  for(int i=0; i < k; ++i)
    dft_iostride(m, k, 1, t1 + m*i, x + m*i);
  // y = (DFT_k tensor I_m) T^n_m
  // diagonal entries of T are now
  // precomputed in precomp_f[]
  for(int i=0; i < m; ++i)
    dft_scaled(k, m, precomp_f[i], y + i, t1 + i);
}
// to be implemented
void dft_iostride(int n, int istride, int ostride, cpx *y, cpx *x);
void dft_scaled(int n, int stride, cpx *d, cpx *y, cpx *x);
Now there are two additional functions: dft_iostride reads at a stride and writes at a different stride, and dft_scaled first scales the input and then performs a DFT at a stride. So at least three functions are needed with different signatures. However, the two additional functions are also implemented recursively, possibly spawning new functions. Calling these functions recursion steps, the main challenge is to automatically derive the complete set of recursion steps needed, called the “recursion step closure.” Further, for each recursion step in the closure, the signature has to be derived. Recursion step closure by Σ-SPL rewriting. Spiral derives the recursion step closure using Σ-SPL and the same rewriting system that is used for loop merging.
For example, the two additional recursion steps in the optimized implementation above are automatically obtained from () as follows; recursion steps are marked by brackets:

DFT_n → ([DFT_{n/k}] ⊗ I_k) T^n_k (I_{n/k} ⊗ [DFT_k]) L^n_{n/k}
     → (∑_{i=0}^{k−1} S(h_{i,k}) [DFT_{n/k}] G(h_{i,k})) diag(f) (∑_{j=0}^{n/k−1} S(h_{jk,1}) [DFT_k] G(h_{jk,1})) P(ℓ^n_{n/k})
     → ∑_{i=0}^{k−1} S(h_{i,k}) [DFT_{n/k}] diag(f ∘ h_{i,k}) G(h_{i,k}) · ∑_{j=0}^{n/k−1} S(h_{jk,1}) [DFT_k] G(h_{j,n/k})
     → ∑_{i=0}^{k−1} [S(h_{i,k}) DFT_{n/k} diag(f ∘ h_{i,k}) G(h_{i,k})] · ∑_{j=0}^{n/k−1} [S(h_{jk,1}) DFT_k G(h_{j,n/k})]   ()
The first step applies the breakdown rule (). The second step converts to Σ-SPL. The third step performs loop merging as explained before. The fourth step expands the braces to include the context. The two expressions under the braces correspond to the two functions
dft_iostride and dft_scaled. The process is now repeated for the expression under the braces until closure is reached. In this example, only one additional function is needed, i.e., the recursion step closure consists of four mutually recursive functions. The derivation of the recursion steps also yields a Σ-SPL specification of the actual recursion, i.e., their implementation by a recursive function (e.g., () for DFTn ). For the best performance, the braces may be extended to also include the loop represented by the iterative sum. Moving the loop into the function enables better C/C++ compiler optimizations. If the implementation is vectorized or parallelized, the initial breakdown rules are first rewritten as explained before and then the closure is computed. The size of the closure is typically increased in this case. Program generation for general input size: Overview. The overall process is visualized in Fig. . The input to Spiral is now a (sufficient) set of breakdown rules for a given transform or transforms. The rules are parallelized if desired, using the appropriate paradigms; then the recursion step closure is computed, which also yields the actual recursions. The resulting recursion steps need base cases for termination. These are generated using the algorithm generation block from the fixed input size Spiral (Fig. )
Spiral. Fig. Spiral program generator for general input size libraries. [Diagram: algorithm knowledge (breakdown rules) and platform knowledge (paradigms) feed parallelization and the recursion step closure, producing recursion steps and recursions (Σ-SPL); base case generation (fixed input size Spiral) supplies base case algorithms (Σ-SPL); hot/cold partitioning and Σ-SPL→C code produce the C library.]
for a range of small sizes (e.g., n ≤ ) to improve performance. These, the recursion steps, and the recursions are fed into the final block to generate the final library. Among other functions, the block performs the hot/cold partitioning that determines which parameters in a recursion step are precomputed during initialization and which become parameters of the actual compute function. Finally, the actual code is generated (which now includes recursive functions) and integrated into a common infrastructure to obtain a complete library. Many details are omitted in this description and are provided in [, ].
Extensions A major question is whether the approach taken by Spiral can be extended beyond the domain of linear transforms, while maintaining both the basic principles outlined in the introduction and the ability to automatically perform the necessary transformations and reasoning. First progress in this direction was made in [] with the introduction of the operator language (OL). OL generalizes SPL by considering operators that may be nonlinear and may have more than one vector input or output. Important constructs such as the tensor product are generalized to operators. First results on program generation for functions such as radar imaging, Viterbi decoding, matrix multiplication, and the physical layer functions of wireless communication protocols have already been developed.
Related Entries
ATLAS (Automatically Tuned Linear Algebra Software)
FFT (Fast Fourier Transform)
FFTW
Bibliographic Notes and Further Reading Spiral is based on early ideas on using tensor products to map FFT algorithms to parallel supercomputers []. The first paper describing SPL and the SPL compiler is []. See also [] for basic block optimizations for transforms. The first complete basic Spiral system including SPL algorithm generation and search was presented in [], with a more extensive treatment in []
and probably the best overview paper [], which fully develops SPL for a variety of transforms. The path to complete automation in the transform domain continued with Σ-SPL and loop merging [], the introduction of rewriting systems for SIMD vectorization [, ] and base case generation [], SMP parallelization [], and distributed memory parallelization [, ]. The final step to generating general size, parallel, adaptive libraries was made in [, ]. The generated libraries are modeled after FFTW [], which is written by hand but uses generated basic blocks []. The most important extensions of Spiral are the following. Extensions to generate Verilog for fieldprogrammable gate-arrays (FPGAs) are presented in [, ]. Search techniques other than dynamic programming are developed in [, ]. The use of learning to avoid search was studied in [, ]. Finally, [, , ] make the first steps toward extending Spiral beyond the transform domain including the first OL description. The Spiral project website with more information and all publications is given in []. A good introduction to FFTs using tensor products is given in the books [, ]. A comprehensive overview of algorithms for Fourier/cosine/sine transforms is given in [, ]. A good introduction to mapping FFTs to multicore platforms is given in [].
Bibliography
1. Spiral project website. www.spiral.net
2. Bonelli A, Franchetti F, Lorenz J, Püschel M, Ueberhuber CW () Automatic performance optimization of the discrete Fourier transform on distributed memory computers. In: International symposium on parallel and distributed processing and application (ISPA), Lecture notes in computer science, vol . Springer, Berlin, pp –
3. Chellappa S, Franchetti F, Püschel M () High performance linear transform program generation for the Cell BE. In: Proceedings of the high performance embedded computing (HPEC), Lexington, – September
4. de Mesmay F, Chellappa S, Franchetti F, Püschel M () Computer generation of efficient software Viterbi decoders. In: International conference on high performance embedded architectures and compilers (HiPEAC), Lecture notes in computer science, vol . Springer, Berlin, pp –
5. de Mesmay F, Rimmel A, Voronenko Y, Püschel M () Bandit-based optimization on graphs with application to library performance tuning. In: International conference on machine learning (ICML), ACM international conference proceedings series, vol . ACM, New York, pp –
6. de Mesmay F, Voronenko Y, Püschel M () Offline library adaptation using automatically generated heuristics. In: International parallel and distributed processing symposium (IPDPS)
7. Franchetti F, de Mesmay F, McFarlin D, Püschel M () Operator language: a program generation framework for fast kernels. In: IFIP working conference on domain specific languages (DSL WC), Lecture notes in computer science, vol . Springer, Berlin, pp –
8. Franchetti F, Püschel M () A SIMD vectorizing compiler for digital signal processing algorithms. In: International parallel and distributed processing symposium (IPDPS), pp –
9. Franchetti F, Püschel M () Generating SIMD vectorized permutations. In: International conference on compiler construction (CC), Lecture notes in computer science, vol . Springer, Berlin, pp –
10. Franchetti F, Püschel M, Voronenko Y, Chellappa S, Moura JMF () Discrete Fourier transform on multicore. IEEE Signal Proc Mag ():–
11. Franchetti F, Voronenko Y, Püschel M () Formal loop merging for signal transforms. In: Programming languages design and implementation (PLDI). ACM, New York, pp –
12. Franchetti F, Voronenko Y, Püschel M () FFT program generation for shared memory: SMP and multicore. In: Supercomputing (SC). ACM, New York
13. Franchetti F, Voronenko Y, Püschel M () A rewriting system for the vectorization of signal transforms. In: High performance computing for computational science (VECPAR), Lecture notes in computer science, vol . Springer, Berlin, pp –
14. Frigo M () A fast Fourier transform compiler. In: Proceedings of the programming language design and implementation (PLDI). ACM, New York, pp –
15. Frigo M, Johnson SG () The design and implementation of FFTW. Proc IEEE ():–
16. Johnson J, Johnson RW, Rodriguez D, Tolimieri R () A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures. IEEE Trans Circ Sys :–
17. McFarlin D, Franchetti F, Moura JMF, Püschel M () High performance synthetic aperture radar image formation on commodity architectures. Proc SPIE :
18. Milder PA, Franchetti F, Hoe JC, Püschel M () Formal datapath representation and manipulation for implementing DSP transforms. In: Design automation conference (DAC). ACM, New York, pp –
19. Nordin G, Milder PA, Hoe JC, Püschel M () Automatic generation of customized discrete Fourier transform IPs. In: Design automation conference (DAC). ACM, New York, pp –
20. Press WH, Flannery BP, Teukolsky SA, Vetterling WT () Numerical recipes in C: the art of scientific computing, 2nd edn. Cambridge University Press, Cambridge
21. Püschel M, Moura JMF () Algebraic signal processing theory: Cooley-Tukey type algorithms for DCTs and DSTs. IEEE Trans Signal Proces ():–
22. Püschel M, Moura JMF, Johnson J, Padua D, Veloso M, Singer B, Xiong J, Franchetti F, Gacic A, Voronenko Y, Chen K, Johnson RW, Rizzolo N () SPIRAL: code generation for DSP transforms. Proc IEEE (Special Issue on Program Generation, Optimization, and Adaptation) ():–
23. Püschel M, Singer B, Veloso M, Moura JMF () Fast automatic generation of DSP algorithms. In: International conference on computational science (ICCS), Lecture notes in computer science, vol . Springer, Berlin, pp –
24. Püschel M, Singer B, Xiong J, Moura JMF, Johnson J, Padua D, Veloso M, Johnson RW () SPIRAL: a generator for platform-adapted libraries of signal processing algorithms. J High Perform Comput Appl ():–
25. Singer B, Veloso M () Learning to generate fast signal processing implementations. In: International conference on machine learning (ICML). Morgan Kaufmann, San Francisco, pp –
26. Singer B, Veloso M () Stochastic search for signal processing algorithm optimization. In: Supercomputing (SC). ACM, New York, p
27. Tolimieri R, An M, Lu C () Algorithms for discrete Fourier transforms and convolution, 2nd edn. Springer, Berlin
28. Van Loan C () Computational framework of the fast Fourier transform. SIAM, Philadelphia
29. Voronenko Y () Library generation for linear transforms. Ph.D. thesis, Electrical and Computer Engineering, Carnegie Mellon University
30. Voronenko Y, de Mesmay F, Püschel M () Computer generation of general size linear transform libraries. In: International symposium on code generation and optimization (CGO). IEEE Computer Society, Washington, DC, pp –
31. Voronenko Y, Püschel M () Algebraic signal processing theory: Cooley–Tukey type algorithms for real DFTs. IEEE Trans Signal Proces ():–
32. Xiong J, Johnson J, Johnson RW, Padua D () SPL: a language and compiler for DSP algorithms. In: Programming languages design and implementation (PLDI). ACM, New York, pp –
SPMD Computational Model
Frederica Darema
National Science Foundation, Arlington, VA, USA
Definition of the Subject The Single Program – Multiple Data (SPMD) parallel programming paradigm is premised on the concept that all processes participating in the execution of a program work cooperatively to execute that program, but at any given instance different processes may execute different instruction-streams, and act on different data
and on different sections in the program, and whereby these processes dynamically self-schedule themselves, according to the program workflow and through synchronization constructs embedded in the application program. SPMD programs comprise serial, parallel, and replicate sections.
Introduction
The Single Program–Multiple Data (SPMD) model [–] is premised on the concept that all processes participating in the (parallel) execution of a program work cooperatively to execute this program, but at any given instance, through synchronization constructs embedded in the application program, different processes may execute different instruction-streams, and act on different data and on different sections in the program; thus the name of the model: Single Program – Multiple Data. The model was proposed by the author in January 1984 [], as a means for expressing and enabling parallel execution of applications on highly parallel MIMD computational platforms. The term "single-program" was used to emphasize the parallel execution of a given program (i.e., concurrent execution of tasks in a given program), in distinction from the environments of that time, where OS-level concurrent tasks would (concurrently) execute different programs on the multiprocessors of that time (e.g., IBM ). The initial motivation for the SPMD model was to enable the expression of the (then) high degrees of parallelism supported by the IBM Research Parallel Processor Prototype (RP3) [] parallel computer system. The model was first implemented in the Environment for Parallel Execution (EPEX) [–] programming environment (one of the first general-purpose parallel programming environments, and of course the first to implement SPMD). This entry is an excerpt of a more comprehensive treatise on the SPMD [], which puts the motivation for the SPMD in the context of the landscape of the computer platforms and software support of the early to mid-1980s, and discusses the experiences, effectiveness, and impact of the SPMD in expediting the adoption of parallelism in that time frame and as it has influenced derivative programming environments in the intervening years. The SPMD model fostered a new approach to parallel programming, differing from the Fork & Join (or Master–Slave) model predominantly pursued by others in that time frame. In SPMD, all processes
that participate in the parallel execution of a program commence execution of the program, and each process is dynamically self-scheduled and selects work-tasks to execute (self-allocated work), based on parallelization directives embedded in the program and according to the program workflow. In the SPMD model, a process represents a separate instantiation of the program, and processes participating in the cooperative execution of a program (also referred to as parallel processes) execute distinct instruction streams in a coordinated way through these embedded parallelization directives (also referred to as synchronization directives). With respect to parallel execution, in general a program consists of sections that are executed by one process (serial sections) and sections that can be executed by multiple cooperating processes (i.e., parallel sections, that can be executed by several processes in a cooperative and concurrent manner, and replicate sections, where the computations are executed by every process; parallel sections may include one or more levels of nested parallel sections). Regardless of the section considered, the beginning and end of serial and of parallel sections (and nested parallel sections) are points of synchronization of the processes (synchronization points), that is, points of coordination and control of the execution path of the participating processes. The definition in [, ]: “all processes working together will execute the very same program,” uses the word “same” to emphasize concurrent execution of a given program by these (“multiple”) processes, and not to imply that all these processes execute identical instruction streams (as it has been interpreted by some). In expressing parallelism through the SPMD model, the flow of control is distinguished in two classes: the parallel flow of control is the flow of control followed by the processes while executing a serial or a parallel section in the program; the global flow of control is the flow of control followed by the processes as they step through the synchronization points of the program, according to the program workflow. In parallel execution with the SPMD model, the participating processes follow a different parallel flow of control, but all the processes follow the same global flow of control. Conceptually, the SPMD model has been from the outset a general model, enabling to express parallelism for concurrent execution of distinct instruction streams and allowing application-level parallelization control
and dynamic scheduling of work-tasks ready to execute (with process self-scheduling). The model is able to support general MIMD task-level parallelism, including nested parallelism, and is applicable to a range of parallel architectures (those considered at the time the model was proposed as well as parallel and distributed architectures that have subsequently appeared) in more efficient and general ways than the alternate parallel programming models proposed at the time, such as those based on Fork-and-Join, SIMD, and data-parallel and systolic-array approaches. In fact, SPMD allows combining with and implementing such models as parallel execution models invoked under the SPMD rubric. Applied to parallel architectures supporting shared memory (for example, RP), the SPMD is a global-view programming model. From its inception, SPMD allowed efficient implementations of expressed parallelism, through parallelization directives inserted in the initial serial application programs, with the ensuing “parallelized” program compiled by standard serial compilers. The parallelization directives enabled application-level dynamic scheduling and control of the parallel execution, with efficient runtime implementations requiring minimal interaction with the OS, avoiding (the heavy) OS-level synchronization overheads, and without requiring new, parallel OS services. These were important considerations in that time frame. The SPMD showed early on that it allowed the parallelization of non-trivial programs; many were production-level application programs, for example, from the areas of applied physics, aerospace, and design automation. Thus the SPMD expedited the adoption of parallelism in the mid-s, as the model demonstrated that it was not difficult to express parallelism, to map and execute such non-trivial applications on parallel machines, and thus to exploit parallelism flexibly, adaptively, and efficiently. SPMD enabled creating parallel versions of FORTRAN and C programs, and also enabled the parallel execution of such programs without the need for new parallel languages. To date, there are many environments that have implemented SPMD, primarily for science and engineering parallel computing, but also for commercial applications. The most widely used today is MPI [], originally for distributed multiprocessing or “message-passing” systems and later used also for shared memory systems; the predecessor of
MPI was PVM [], which appeared in the late-s as one of the first popular implementations of SPMD for “message-passing” systems. The SPMD model, through its implementation in MPI, is widely used for exploiting today’s heterogeneous, complex parallel and distributed computational platforms, including computational grids [], which embody many types of processors, multiple levels of memory hierarchy, and multiple levels of networks and communication, and which may employ a combination of SPMD as well as Fork-and-Join programming environments. Other influential programming environments also based on SPMD include OpenMP [] for shared-memory parallel programming in FORTRAN, C, and C++; Titanium [] for parallelization of Java programs; and Split-C [] and Unified Parallel C (UPC) [] for parallel execution of C programs.
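As a concrete illustration of the SPMD style as it appears in MPI, the minimal C sketch below shows a single program that every process executes; each process obtains its own rank and uses it to select the data and work it operates on. This is a hedged, hypothetical example: the problem, the work-partitioning scheme, and all variable names are assumptions made for illustration and are not drawn from EPEX or RP code.

    /* spmd_mpi.c -- illustrative only: every process runs this same program,
     * but ranks act on different portions of the work (hypothetical example). */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's identity       */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* total number of processes     */

        /* "Serial" work: done by one designated process (rank 0 here). */
        if (rank == 0)
            printf("initializing problem of size %d on %d processes\n", N, nprocs);
        MPI_Barrier(MPI_COMM_WORLD);            /* synchronization point         */

        /* "Parallel" work: each process handles its own subset of iterations. */
        long local_sum = 0, global_sum = 0;
        for (long i = rank; i < N; i += nprocs)
            local_sum += i;                     /* stand-in for real computation */

        /* Results combined; all processes pass through this same control point. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %ld\n", global_sum);

        MPI_Finalize();
        return 0;
    }

Built and launched, for example, with mpicc and mpirun, the very same executable is started on every process, in keeping with the single-program, multiple-data discipline described above.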
The SPMD Model In the SPMD model, a parallel program conceptually consists of serial, parallel (including nested parallel), and replicate sections. The program data are distinguished into application data and synchronization data; synchronization data are shared among the parallel processes, and application data are further distinguished into private and shared data. The following discusses how these parallelization aspects are treated in SPMD: ●
Serial sections are executed by one process (either a designated process or, more generally and typically, the first one to arrive at that section). The other processes that arrive at this serial section are directed, through the “serial-section synchronization directives,” to bypass its execution; there, they either wait at the end of the section for its execution to complete, or they may proceed to seek and execute available work-task(s) in subsequent section(s), if that is allowed by the program workflow dependencies (and with appropriate synchronizations imposed – soft and hard barriers, discussed later on in their EPEX implementation). ● Parallel sections are executed concurrently by processes that arrive at the beginning of such a section. These processes, through the “parallel-section synchronization directives,” dynamically self-schedule and are allocated the next available parallel work-task to execute. As each of these
processes arriving in a parallel section is self-assigned a work-task, and as each process completes its work-task, it can seek the next available work-task to execute in that parallel section; or, if all work-tasks in the parallel section have been allocated to processes, the remaining process(es) can proceed to the end of that parallel section, and either wait there till all the work of the parallel section is completed, or continue on to the next section of the program or other subsequent sections of the program as allowed by the program workflow dependencies (and with appropriate synchronizations imposed – soft and hard barriers, discussed later on in their EPEX implementation). Parallel loops in programs are the predominant form of a parallel section and thus major targets of parallel execution; however, SPMD also supports other forms of general task parallelism. Nested parallelism support in SPMD is discussed later in this section. ● Replicate sections of the program are sections allowed to be executed by all processes (this is the default mode of execution that was envisioned when the SPMD was first proposed). Such sections involve parts of the computation (typically a small portion of the total amount of computation) where allowing all processes to replicate the execution of the computation is more efficient than having one process execute the section (while the others wait) and then make the result of the computation available to the other processes (as shared data or as communicated messages). This is applicable, for example, where replicate execution avoids: the serialization overhead; the busy-waiting and polling of a synchronization semaphore by the other processes (with potential additional overhead due, for example, to contention on this semaphore, and also potential network contention); and the overhead of making the resulting data available to the other processes (e.g., in shared memory architectures, potential contention upon access of the resulting data placed in shared memory; or, in logically distributed, “message-passing” architectures, network contention upon broadcasting the results to the other processes through message passing). Especially for message-passing environments, such overheads can be rather high, and in such cases
the approach of replicate computations can be a more efficient alternative for certain portions of the computation. ● Shared and Private Data: These refer to application data and synchronization data. For parallel machine architectures which support shared as well as private memory, parallelism is expressed by considering two types of data: shared data and private data; each of the parallel processes has read and write capability on the shared data, while private data are exclusive to each process (and typically the private data of a process would reside in the local memory of the processor on which the process executes). Application shared data and synchronization data are declared as shared data. SPMD was originally proposed for architectures supporting a mix of shared and (logically local – private) memory, and the approach followed in the original SPMD implementation was for private data (to each parallel process) to be the default; this decision was made (by the author) on the basis that it is easier (especially for the user) to identify the application shared data and the synchronization data than to explicitly define the private data, as other approaches have done. For message-passing architectures, where the SPMD was also applied, the application shared data are partitioned into per-process private data, and “sharing” of such data is effected through “messages” exchanged between parallel processes, as needed during the course of the computation. It is the belief of this author that expressing parallelism through message-passing is more challenging than supporting parallelism through a “shared-memory plus private memory” architecture, and that this is also the case with the corresponding programming models for expressing parallelism (more on this in the later discussion on MPI). ● I/O: The SPMD model allows general flexibility in handling I/O. Any of the parallel processes executing a parallel program is allowed to perform I/O, and (as discussed later on in the context of the EPEX programming environment) all processes can have read/write access to any of the parallel program’s files. One can argue that the typical approach would be for I/O to be treated as a serial action, although that is not necessarily always so. For example, upon
commencing execution of a program, one process typically would be assigned to read the input data file and perform the program state initialization. However, there are many other instances in the program execution, for example, within parallel sections, where it may be more efficient to allow each process to read data from a file and, as needed, write data back into a file. In this case, for flexibility, a random-access file structure would be more desirable than sequential record files; it is then not necessary to attach a file to a given process, which is a possible but more restrictive approach. In the early implementations of the SPMD, where processes were implemented by heavy-weight mechanisms (such as VMs – Virtual Machines), to enable efficient parallel execution all processes were created at the beginning of the program execution, and this has been applied in most SPMD implementations since then. Note, however, that the SPMD model does not preclude in principle the case where additional processes can be created and participate in the program execution; this was allowed through EPEX-implemented directives, which let such additional processes skip the already executed portion of the entire computation. However, at the time SPMD was proposed, and within the systems then available for the implementation of the model, creating such processes and inheriting program state was expensive, unlike capabilities that were enabled later on, such as the creation of light-weight threads. Nesting in SPMD can be implemented by allowing enough processes to enter an (outer) parallel loop – for example, allocating several processes – a group of processes – to a given outer loop iteration, allowing only one of these processes to execute the outer loop portion (for example, with appropriate directives treating it as a “nested serial”), and then allowing all processes in this group to execute the inner loop in parallel. In fact, this author experimented with such approaches, as well as with a hybrid of SPMD combined with Fork-and-Join capabilities to implement nesting, but they were never implemented in a production way, due to the overheads of spawning VM/SP virtual processes, but also because in the mid-s timeframe the degrees of parallelism in the applications (e.g., the dimensionality of the outer parallel loops in a program) exceeded the numbers of processors in the commercial and in the
prototype systems of that time, so exploiting nesting was more of an intellectual endeavor than a practical need. With threading capabilities, other implementations can be used to support execution of nested parallel computations more elegantly, for example, by each process spawning threads to execute the inner iterations of the loop, as is done, for example, in OpenMP. The need for nested or multilevel threading capabilities, gang scheduling of threads, partially shared data, and active data-distribution notions – in the context of the multi-level parallelization hierarchies that will be encountered in emerging and future parallel architectures – is discussed later in this entry. The SPMD model does not restrict how many processes can run on each parallel processor. While often one process executes per processor, there are cases where multiple processes may be spawned to execute per processor, for example, to mask memory latency, shared memory contention, I/O, etc. SPMD also does not require a given process to be attached to a given processor; the typical approach, however, is to bind a process (or a thread) to a processor, to avoid the cost of migration and context switching and to better exploit the (local data) cache. Still, there are situations where, as long as there is no thrashing, a process may be moved, for example, to improve performance, or in other cases for fault tolerance or recovery [, ]. Other SPMD features are illustrated through the EPEX programming environment discussed next.
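The following C sketch illustrates how the section structure described above can be realized on a shared-memory machine: a serial section claimed by the first process to arrive, a self-scheduled parallel section, a replicate section executed by every process, and barriers as synchronization points. It is a hedged illustration, not EPEX code; the thread count, data sizes, and all names are assumptions made for the example.

    /* spmd_skeleton.c -- hypothetical sketch of SPMD sections on shared memory */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NPROCS 4
    #define N      1000

    static double shared_A[N];               /* application shared data        */
    static atomic_int serial_claim = 0;      /* synchronization data           */
    static atomic_int next_iter    = 0;      /* synchronization data           */
    static pthread_barrier_t barrier;        /* hard barrier                   */

    static void *spmd_main(void *arg) {
        int me = (int)(long)arg;             /* private data: process identity */

        /* Serial section: first process to arrive executes it, others bypass. */
        if (atomic_exchange(&serial_claim, 1) == 0)
            for (int i = 0; i < N; i++) shared_A[i] = i;
        pthread_barrier_wait(&barrier);      /* synchronization point          */

        /* Parallel section: processes self-schedule the next available iteration. */
        int i;
        while ((i = atomic_fetch_add(&next_iter, 1)) < N)
            shared_A[i] *= 2.0;              /* stand-in for a real work-task  */
        pthread_barrier_wait(&barrier);

        /* Replicate section: every process recomputes a small result privately,
         * which here is cheaper than serializing the work and sharing its result. */
        double checksum = 0.0;
        for (int j = 0; j < N; j++) checksum += shared_A[j];
        if (me == 0) printf("checksum = %f\n", checksum);
        return NULL;
    }

    int main(void) {
        pthread_t t[NPROCS];
        pthread_barrier_init(&barrier, NULL, NPROCS);
        for (long p = 0; p < NPROCS; p++) pthread_create(&t[p], NULL, spmd_main, (void *)p);
        for (long p = 0; p < NPROCS; p++) pthread_join(t[p], NULL);
        return 0;
    }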
The EPEX Programming Environment – Implementation of the SPMD Model The first implementation of SPMD was in the EPEX environment [–], initially implemented on IBM multiprocessors like the dual processor IBM/ and, later, the six-processor IBM/ vector multiprocessor (through MVS/XA []). The IBM/ through the VM/SP OS [] supported execution of multiple virtual machines (VMs), running in time-shared mode (on each node of the ). This capability was used to simulate parallel execution by a large number of processors working together on a single program, each simulated parallel processor and its local memory (correspondingly for each of the parallel processes) being simulated by a VM (a virtual machine). Although in
practice the hardware supported -way parallelism (and the hardware supported -way parallelism), the simulation environment created was used to simulate up to parallel processors; of course, in principle the environment could simulate even higher numbers of processors. To simulate execution on parallel platforms like RP, the VM/SP OS support for the Writable Shared Segments (VM/WSS) capability was used, which allowed a portion of the virtual machines’ memory to be established as shared across a set of VMs (in read and write modalities); this feature allowed simulation of the shared memory of a parallel system (like RP), while the private memory of each virtual machine simulated the local memory of an RP processor, and the private memory of the corresponding process. The VM/WSS was also used to emulate message-passing environments, by using the WSS-based (virtual) shared memory as a “mailbox” for messages, thus also allowing experimentation with, and demonstration of, the use of SPMD as a model supporting message-passing parallel computation. These capabilities were used to create a simulation platform for the EPEX programming environment, and were also used for the development of other simulation tools for analysis of execution on various parallel machines with shared-plus-local memory systems as well as on “message-passing” distributed systems. While over the last two decades most parallel systems have been used with MPI (a message-passing environment), it is worth elaborating on shared-plus-local memory systems because the emerging multicore-based architectures are expected to support such memory organizations within the likely set of memory hierarchies. Initially, parallelization of programs under EPEX was enabled by implementing the “serial,” “parallel,” and “barrier” parallelization directives through synchronization subroutine calls inserted in the program; the first EPEX implementation was applied to the parallelization of FORTRAN programs, with shared data declared explicitly through FORTRAN Shared COMMON statements. Shortly thereafter, the synchronization subroutine calls were replaced by corresponding parallelization macros, with the development of a preprocessor (a source-to-source translator from macros to subroutine calls). Denoting the parallelization directives through macros, and likewise the declaration of application shared data, allowed converting serial
programs to their parallel counterparts more easily and elegantly. Additional synchronization constructs were also expressed through other appropriate macros. All these parallelization macros inserted in the program were then expanded by the “EPEX Preprocessor” into the parallel program version, by inserting the SPMD parallelization subroutines (specifically referred to here is the EPEX FORTRAN Preprocessor source-to-source translator [–]; later a preprocessor for C was also developed []). The ensuing “parallelized program” was then compiled by a standard serial FORTRAN or C compiler. By ’–’ the SPMD-based EPEX system had been installed in ten IBM sites (including IBM Scientific Centers, such as those in Palo Alto, CA, USA and Rome, Italy), three national and international labs (LANL, ANL, and CERN), two industrial partners (Martin-Marietta and Grumman), and eight universities. The syntax and utilization of the initial EPEX implementation are presented in [–]. The macros implementing SPMD and the EPEX preprocessor are presented in detail in [, –], which document the preprocessor-based EPEX environment. Here, for reference, an illustrative sample of the set of EPEX macros is given with a brief description of their use: ●
The traditional FORTRAN DO-loop was designated through:
@DO [stmt#∣label] index = n, n, [n] [CHUNK = chunksize]
… loop body …
[stmt#] @ENDO [label] [WAIT ∣ NOWAIT]
Processes entering the @DO get self-assigned the next available loop-iteration(s), through synchronization routines utilizing the Fetch-and-Add (F&A) [] RP primitive (implemented in the simulated environment through the Compare&Swap instruction); processes that arrive after all the work in the loop has been allocated, or after all the work in the loop has been completed, proceed to the end of the section (@ENDO); there they may wait for the loop execution to be completed, or proceed to subsequent section(s), as allowed by the program workflow dependencies. The CHUNK option allows processes to get assigned more than one iteration at a time, thus decreasing the overhead of parallelism (a hedged C sketch of such chunked self-assignment is given after this list); the WAIT option requires processes to wait for the
execution of the preceding section (in this case the loop body) to be completed before continuing to the next section of the program; the NOWAIT option allows processes that reach the end of the section to continue execution of subsequent section(s), thus allowing multiple sections of the program (or subroutines) to be executed concurrently, as allowed by dependencies in the program workflow. ● The serial section is bounded by the @SERBEG and @SEREND macros. The default is for the first process to encounter the @SERBEG to execute the section, while the other processes proceed to @SEREND, with [WAIT ∣ NOWAIT] options similar to the ones for a parallel loop. The environment also allows designating a given process to execute a given serial section. ● The @WAITFOR macro can be inserted at any point of the program and designates that processes cannot go beyond that point until a logical condition specified by the argument of the macro is satisfied; for instance, this construct can be used for imposing various kinds of soft barriers – for example, for the processes to wait until execution of the preceding section(s) is completed.
●
The @BARRIER macro can be inserted at any point of the program and forces all processes participating in the execution of the program to arrive at that point of the program (hard barrier). ● The parallel processes are endowed with an identity [@MYNUM]; the MYNUM for each process can be a parameterized assignment. This property can be used in several ways: for example, for designating a given process to perform I/O to a file if that is desired, or to execute a given section (e.g., a serial section) that is designated to be executed by a given process (of course, in general this is not mandatory, as the default is for a section, and more generally for the next available work-task, to be executed on a “first-arrives-executes” basis). ● @SHARED [application shared data: parameters, arrays] allows the user to designate the application shared data; by default all other data of the application are assumed private to each parallel process; @SHARED [synchronization data] declarations are created automatically by the preprocessor and designate all the shared data used by the synchronization constructs. A schematic of an example of SPMD execution is shown in Fig. .
[Figure: schematic of SPMD execution showing process initialization, serial sections (the first process to arrive at a section executes it – others bypass), parallel sections (@DO/@forall, with processes self-assigning work as they arrive), concurrent parallel sections, and synchronization points (hard barriers and soft barriers, @barrier/no-wait).]
SPMD Computational Model. Fig. Schematic of an example of SPMD MIMD task-parallel execution
●
Handling of I/O: References [, ] discuss implementations of the ideas for parallel I/O presented in the previous section, which allowed I/O by multiple processes rather than necessarily serializing I/O. A file could be opened, written into, and closed, and then made available to other processes. This could be done at the end of a parallel section or at the end of an outermost iteration of the program, depending on the problem needs; such implementations might be enabled via a barrier at the end of the outermost loop, or, if no barrier was imposed, the file may be accessed (in read-write mode) by another process (which may need to spin-wait if the file had not been closed by the previous process). The @MYNUM designation allowed a given process to be tied to a file if that was desirable. The advent of newer file systems (such as GPFS [, ]) allows more flexible mechanisms for parallel I/O in SPMD programs, and parallel I/O is supported in programming environments like OpenMP. Furthermore, as mass storage technologies evolve from disk to solid-state and nano-devices, it is possible that parallel I/O could be implemented through methods other than file-based ones (perhaps something akin to “DBMS-like” I/O management).
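To make the @DO/CHUNK self-scheduling mechanism concrete, the following C sketch shows one way chunked self-assignment of loop iterations can be realized with an atomic fetch-and-add on a shared counter. It illustrates the mechanism described in the @DO item above; it is not EPEX source code, and the function and variable names are invented for the example.

    /* chunked_do.c -- hypothetical sketch of @DO-style chunked self-scheduling */
    #include <stdatomic.h>

    #define N     100000
    #define CHUNK 64                     /* analogous to the CHUNK = chunksize option */

    static atomic_long next = 0;         /* shared synchronization datum              */

    /* Each process/thread calls this; iterations are handed out CHUNK at a time. */
    void parallel_loop(double *a)
    {
        for (;;) {
            long lo = atomic_fetch_add(&next, CHUNK);   /* fetch-and-add primitive */
            if (lo >= N)                                 /* all work allocated:     */
                break;                                   /* proceed to "@ENDO"      */
            long hi = (lo + CHUNK < N) ? lo + CHUNK : N;
            for (long i = lo; i < hi; i++)
                a[i] = a[i] * 2.0 + 1.0;                 /* stand-in loop body      */
        }
        /* The WAIT option would correspond to a barrier here; NOWAIT would let the
         * caller continue to subsequent sections as workflow dependencies allow. */
    }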
Through the EPEX parallelization directives, the participating processes cooperatively executing the program step through the synchronization points in the program and are dynamically allocated the next work-task or execution path to take. In that sense all participating processes follow the same global flow of control in the program. In executing serial or parallel sections, each process is allocated the next available work-task, and thus processes may execute different instructions at any given time and act on different parts of the program (including potentially different procedures or subroutines), consistent with the workflow dependencies in the program. A given process may follow a different parallel flow of control during different passes of an outer iteration of the entire program. The synchronization routines provided in EPEX allowed executing outer iterations of the entire program (that is, repeated execution of the serial and parallel sections of the program) without necessarily requiring the programmer to introduce explicit blocking at the end of each outer iteration of the program.
The EPEX programming environment made it possible to parallelize a large number of applications, simulate their parallel execution, understand their characteristics, and assess the efficacy and effectiveness of parallelism. In the – time-span, over applications were parallelized with SPMD and executed in parallel under the EPEX programming environment. The set of these applications (mostly numeric, but also non-numeric) included: Fluid Dynamics (e.g., advanced turbulent-flow programs [], and shallow-water and heart blood-flow problems); Radiation Transport (Discrete Ordinates methods and Monte Carlo); Design Automation (chip placement and routing by Simulated Annealing [–], and fault simulation); Seismic, Reservoir Modeling, Weather Modeling, and Pollution Propagation; Physics Applications (Band Structure, Spin-Lattice, Shell Model); Applied Physics and Chemistry Applications (Molecular Dynamics); Epidemic Spread; Computer Graphics and Image Processing (e.g., ray-tracing); and Numerical and non-numeric Libraries (FFTs, Linear Solvers, Eigen-solvers, sorting), etc.
Advancing into the Future: Directions, Opportunities, Challenges, and Approaches Emerging Computational Platforms and Emerging Applications Systems. Increasingly, large-scale distributed systems are deploying as their building blocks high-performance multicore processors (in combination with special-purpose processor chips, like GPUs), as are the emerging petaflops and future exaflops platforms. In fact, it is also conceivable that this kind of processor heterogeneity will eventually be embedded in the multicore chip itself. Such systems, with thousands to hundreds of thousands of tightly coupled nodes, will be enabled as Grids-in-a-Box (GiBs). Computational platforms include both high-end systems and globally distributed, meta-computing, heterogeneous, networked, and adaptive platforms, ranging from assemblies of networked workstations to networked supercomputing clusters or combinations thereof, together with their associated peripherals such as storage and visualization systems. All these hardware platforms will potentially have not only multiple levels of processors, but also multiple levels of memory hierarchies (at the cache, main memory, and storage
levels), and multiple levels of interconnecting networks, with multiple levels of latencies (variable at inter-node and intra-node levels) and bandwidths (differing for different links, and differing based on traffic). Moving into the exascale domain, the challenges are augmented as we are faced with the prospect of addressing billion-way concurrency, in the presence of heterogeneity of processing units within a processing node, increasing degrees in the multiple levels of memory hierarchies, and multiple levels of interconnect hierarchies; these are the Grids-in-a-Box referred to earlier, with significantly added complexity because of the wider range of granularity, needing to expose concurrency from the fine grain to many more levels of coarser grain. The questions range from how to express parallelism and optimally map and execute applications on such platforms, to how to enable load balancing, and at what level (or levels) load balancing is applied. It becomes evident that static parallelization approaches are inadequate, and that one needs programming models that allow expressing parallelism so that it can be exploited dynamically, for load balancing and hiding latency, and for expressing dynamic flow control and synchronization, possibly at multiple levels. The ideas of dynamic runtime-compiler [] capabilities discussed below are becoming more imperative. Furthermore, new application paradigms (e.g., DDDAS – Dynamic Data Driven Applications Systems [, ]) have emerged over the last several years; these entail dynamic on-line integration, feedback, and control between the computations of complex application models and the measurement aspects of the application system, leading to SuperGrids [] of integrated computational and instrumentation platforms, as well as other sensory devices and control platforms. In the context of this entry on programming environments and programming models, the implication is that these environments and models will need to seamlessly support execution of applications encompassing dynamically integrated high-end computing with real-time data-acquisition and control components and requirements. New Programming Environments and Runtime-Support Technologies Such environments require technology approaches which break down traditional barriers in existing software components in the application development support and runtime layers, to deliver QoS in application execution. That is, together
with new programming models, new compiler technology is needed, such as the runtime-compiler system (RCS []), where part of the compiler becomes embedded in the runtime and where the compiler interacts with the system monitoring and resource managers, as well as with performance models of the underlying hardware and software, using such capabilities to optimize the mapping of the application onto the underlying complex platform(s). This runtime-compiler system is aware of the heterogeneity in the underlying architecture of the platforms, such as the multi-level hierarchy of processing nodes, memories, and interconnects, with differing architectures, memory organizations, and latencies, and will link to appropriately selected components (dynamic application composition) to generate consistent code at runtime. Representative examples of developing runtime-compiler capabilities are given in [] and []. Together with the “runtime-compiler” capabilities [], novel approaches and substantial enhancements in computational models are called for to actualize the distributed applications software, including user-provided assists to facilitate and enhance the runtime-compiler’s ability to analyze task and data dependencies in the application programs, resolve dependencies, and dynamically optimize the mapping across the complex set of processing nodes and memory structures of distributed platforms (such as Grids and GiBs, multicore and GPU-based), with multiple levels of processors and processing nodes, interconnects, and memory hierarchies. It is the thesis of this entry that the models needed should enable the RCS (runtime compiling system) to map applications without requiring detailed resource management specifications by the user and without requiring specification of data location (e.g., proximity – or PGAS). Rather, the programming models should incorporate advanced concepts such as “active data-distribution,” that is, the user specifies that certain data are candidates for active or runtime distribution, and the RCS determines at runtime how to map them and how to apply partial sharing and copy/move between memory hierarchies, through dynamic adaptive resource management, decoupled execution and data location/placement, memory consistency models, and multithreaded hierarchical concurrency. Such capabilities may be materialized through a hybrid combination of language and library models, development of OS-supported “hierarchical threading” capabilities, and partial sharing of data approaches.
In order to adequately support the future parallel systems, it is advocated here that the new software technologies need to adopt a more integrated view of the architectural layers and software components of a computing system (hardware/software co-design), consisting of the applications, the application support environments (languages, compilers, application libraries, linkers, run-time support, security, visualization, etc.), operating systems (scheduling, resource allocation and management, etc), computing platform architectures, processing nodes and network layers, and also support systems encompassing the computational and application measurement systems. Furthermore, such environments need to include approaches for engineering such systems, at the hardware, systems software, and at the applications levels, so that they execute with optimized efficiency with respect to runtime, quality of service, performance, power utilization, fault tolerance, and reliability. Such capabilities require robust software frameworks, encompassing the systems software layers and the application layers and cognizant of the underlying hardware platform resources.
Summary This entry has provided an overview of the SPMD model, its origin and its initial implementations, its effectiveness and impact in expediting the adoption of parallelism in the mid-s, and its use as a vehicle for exploiting parallel and distributed computational platforms over the intervening years. As we consider possible new parallel programming models for new and future computing platforms, as well as present and emerging compute- and data-intensive, dynamically integrated applications executing on such platforms, it is becoming imperative to advance programming models and programming environments by building from past experiences and opening new directions through synergistic approaches of models, advanced runtime-compiler methods, user assists, and multi-level threading OS services, and by ensuring that the approaches and technologies developed are created in the context of end-to-end software architectural frameworks.
Acknowledgment The IBM RP Project became the ground for my inspiration and work on the SPMD model; I’m forever grateful for being part of the RP team and will always value my collaborations with the RP team.
Bibliography . Darema-Rogers F () IBM Internal Communication, Jan . Darema-Rogers F, George D, Norton VA, Pfister G () A VM parallel environment. Proceedings of the IBM Kingston Parallel Processing Symposium, – Nov (IBM Confidential) . Darema-Rogers F, George DA, Norton VA, Pfister GF () A VM based parallel environment. IBM research report, RC . Darema-Rogers F, George DA, Norton VA, Pfister GF () Environment and system interface for VM/EPEX. IBM research report R, // . Darema-Rogers F, George DA, Norton VA, Pfister GF () Using a single-program-multiple-data model for parallel execution of scientific applications. SIAM Conference on Parallel Processing for Scientific Computing, November, , and IBM research report R, // . Darema F, George DA, Norton VA, Pfister GF () A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Comput :– (received April upon publication release by IBM - IBM Technical Disclosure Bulletin () February ) . Pfister G et al () The research parallel processor prototype (RP). Proceedings of the IBM Kingston Parallel Processing Symposium, – Nov (IBM Confidential); and Proceedings of the ICPP, August . Darema F Historical and future perspectives on the SPMD computational model. – Forthcoming Publication . MPI standard – draft released by the MPI Forum. http:// www.mcs.anl.gov/Projects/mpi/standard.html . PVM – Parallel Visrtual Machine. http://www.csm.ornl.gov/ pvm/pvm_home.html . Foster I, Kesselman C (eds) () The grid: blueprint for a new computing infrastructure. Morgan Kaufmann . OpenMP. http://openmp.org/wp/ . Yelick KA, Semenzato L, Pike G, Miyamoto C, Liblit B, Krishnamurthy A, Hilfinger PN, Graham SL, Gay D, Colella P, Aiken A () Concurrency: practice and experience. vol , No. –, September-November. An earlier version was presented at the Workshop on Java for High-Performance Network Computing, Palo Alto, CA, Feb . Culler DE, Arpaci-Dusseau AC, Goldstein SC, Krishnamurthy A, Lumetta S, Eicken T, Yelick KA () Parallel programming in Split-C. SC, pp – . El-Ghazawi T, Carlson W, Sterling T, Yelick KA () UPC: distributed shared-memory programming. Wiley, Hoboken . Bronevetsky G, Marques D, Pingali K, Stodghill P () Automated application-level checkpointing of MPI programs. ACM SIGPLAN ():– . Bronevetsky G, Marques D, Pingali K, Szwed P, Schulz M () Application-level checkpointing for shared memory programs. ACM Comp Ar ():– . George DA MVS/XA EPEX – Environment for parallel execution. IBM research report RC , // . VM/System Product (VM/SP). http://www-.ibm.com/ibm/ history/exhibits/mainframe/mainframe_PP.html (and MVS/XA also supported)
. Stone JM, Darema-Rogers F, Norton VA, Pfister GF Introduction to the VM/EPEX FORTRAN Preprocessor. IBM research report RC , // . Stone JM, Darema-Rogers F, Norton VA, Pfister GF The VM/EPEX FORTRAN Preprocessor Reference. IBM research report RC , // . Bolmarcich T, Darema-Rogers F Tutorial for EPEX/FORTRAN Program Parallelization and Execution. IBM research report RC, // . Whet-Ling C, Norton A () VM/EPEX C preprocessor user’s manual. Version .. Technical report RC , IBM T.J. Watson Research Center, Yorktown Heights, NY, October . Gottlieb A, Kruskal CP () Coordinating parallel processors: a partial unification. Computer Architecture News, pp –, October . Darema-Rogers F I/O capabilities in the VM/EPEX system. IBM research report, RC , // . Darema F () Applications environment for the IBM research parallel processor prototype (RP). IBM research report RC , //; and Proceedings of the International Conference on Supercomputing (ICS), (published by Springer, Athens, Greece, June . GPFS: http://www.almaden.ibm.com/StorageSystems/projects/ gpfs . Schmuck F, Haskin R () GPFS: a shared-disk file system for large computing clusters. (pdf). Proceedings of the FAST’ Conference on File and Storage Technologies. USENIX, Monterey, California, USA, pp –. ISBN ---. http:// www.usenix.org/events/fast/full_papers/schmuck/schmuck. pdf. Accessed Jan . ARCD: Pulliam TH Euler and thin layer navier stokes codes: ARCD, ARCD. Computational fluid dynamics, A workshop held at the University of Tennessee Space Institute, UTSI publ. E---, {other related application programs parallelized included: SIMPLE and HYDRO-} . Darema-Rogers F, Kirkpatrick S, Norton VA () Parallel techniques for chip placement by simulated annealing. Proceedings of the International Conference on Computer-Aided Design. pp – . Darema F, Kirkpatrick S, Norton VA () Simulated annealing on shared memory parallel systems. IBM Journal of R&D :– . Jayaraman R, Darema F Error analysis of parallel simulated annealing techniques. Proceedings of the ICCD’, Rye, NY, /-/ . Greening D, Darema F Rectangular spatial decomposition methods for parallel simulated annealing. IBM research report, RC, //, and in the Proceedings of the International Conference on Supercomputing ‘. Crete-Greece . Darema F () Report on cyberifrastructures of cyberapplications-systems & cyber-systems-software, submitted for external publication . DDDAS. www.cise.nsf.gov/dddas . DDDAS. www.dddas.org . Darema F () Grid computing and beyond: the context of dynamic data driven applications systems. Proc
IEEE (Special Issue on Grid Computing) ():– . Darema F () New software architecture for complex applications development and runtime support. Int J High-Perform Comput (Special Issue on Programming Environments, Clusters, and Computational Grids For Scientific Computing) () . GRADS project. http://www.hipersoft.rice.edu/grads/ . Adve VS, Sanders WH A compiler-enabled model- and measurement-driven adaptation environment for dependability and performance. – http://www.perform.csl.illinois.edu/projects/ newNSFNGS.html, LLVL compiler – http://llvlm.org
SSE AMD Opteron Processor Barcelona Intel Core Microarchitecture, x Processor Family Vector Extensions, Instruction-Set Architecture (ISA)
Stalemate Deadlocks
State Space Search Combinatorial Search
Stream Processing Stream Programming Languages
Stream Programming Languages Ryan Newton Intel Corporation, Hudson, MA, USA
Synonyms Complex event processing; Event stream processing; Stream processing
Definition Stream Programming Languages are specialized to the processing of data streams. A key feature of stream programming languages is that they structure programs not as a series of instructions, but as a graph of computation kernels that communicate via data streams. Stream programming languages are implicitly parallel and contain data, task, and pipeline parallelism. They offer some of the best examples of high-performance, fully automatic parallelization of implicitly parallel code.
Discussion Overview Stream processing is found in many areas of computer science. Accordingly, it has received distinct and somewhat divergent treatments. For example, most work on stream processing in the context of compilers, graphics, and computer architecture has focused on streams with predictable data-rates, such as in digital signal processing (DSP) applications. These can be modeled using synchronous dataflow, are deterministic (being a subclass of Kahn process networks), and are especially attractive as parallel programming models. Specialized languages for dataflow programming go at least as far back as SISAL (Streams and Iteration in a Single Assignment Language) in . Somewhat different are streaming databases that are concerned with asynchronous (unpredictable) streams of timestamped events. Typical tasks include searching for temporal patterns and relating streams to each other and to data in stored tables, which may be termed “complex event processing.” This entry considers a broad definition of stream processing, requiring only that kernel functions and data streams be the dominant means of computation and communication. This includes both highly restrictive models – statically known, fixed graph and data-rates – and less restrictive ones. More restrictions allow for better performance and more automatic parallelism, whereas less restrictive models handle a broader class of applications. Even this broad definition has necessarily blurry boundaries. For example, while stream-processing models typically allow only pure message passing communication, they can be extended to allow various forms of shared memory between kernel functions. General purpose programming languages equipped
with libraries for stream processing are likely to fall into this category, as are streaming databases that include shared, modifiable tables.
Example Programming Models This entry classifies stream programming designs along two major axes: Graph Construction Method and Messaging Discipline. The following three fictional programming models will be used to illustrate the design space: ●
StreamStatic: a language with a statically known, fixed graph of kernel functions, as well as statically known, fixed data-rates on edges. ● StreamDynamic: a general purpose language that can manipulate streams as first-class values, constructing and destroying graphs on the fly and allowing arbitrary access to shared memory. ● StreamDB: Kernel graphs that are changed dynamically through transactions. Includes shared memory in the form of modifiable tables. More restricted and with different functionality than StreamDynamic. Snippets of code corresponding to these three languages appear in Figs. –, respectively. Kernels, being nothing more than functions, are usually defined in a manner resembling function definitions. In StreamDynamic the same language is used for graph construction as for kernel definition, whereas StreamStatic employs a distinct notation for graph topologies. Only in StreamDB are the definitions of kernels nonobvious.
# metadata: pop 1 push 1
stream out AddAccum(stream in) {
  state { int s = 0; }
  execute() {
    s++;
    out.push(s + in.pop());
  }
}
// A static graph topology, in literal textual notation: IntIn -> AddAccum -> IntOutput;
Stream Programming Languages. Fig. A kernel function in StreamStatic might appear as in the above pseudocode, explicitly specifying the amount of data produced and consumed (pushed and popped) by the kernel during each invocation. This particular kernel carries a state that persists between invocations
-- Streams and tables coexist
CREATE TABLE Prices (Id string PRIMARY KEY, Price double);
CREATE INPUT STREAM PriceIncrs (Id string, Increase double);
-- A 'kernel' is constructed by building new streams from old
CREATE STREAM NormedPriceIncrs AS
  SELECT Id, Increase / 10.0 FROM PriceIncrs;
-- Table modified based on streaming data:
UPDATE Prices USING NormedPriceIncrs
  SET Price = Prices.Price + NormedPriceIncrs.Increase,
  WHERE Prices.Id == NormedPriceIncrs.Id;
Stream Programming Languages. Fig. A pseudocode example of StreamDB code
// A function that creates N copies of a kernel
fun pipeline(int n, Function fn, Stream S) {
  if (n==0) return S;
  else pipeline(n-1, fn, map(fn, S));
}
// Graph wiring is implicit in program execution:
pipeline(10, AddAccum, OrigStream)
Stream Programming Languages. Fig. A pseudocode example of StreamDynamic code
Whenever a new stream is defined from an old one with a SELECT statement, the expressions used in the SELECT statement form the body of a new kernel function. However, in streaming databases, much more functionality is built into the primitive operators, supporting, for example, a variety of windowing and grouping operations on streaming data. Messaging StreamStatic assumes that each kernel
specifies exactly how many inputs it consumes on each inbound edge and how many outputs it produces on each outbound edge. StreamDynamic instead uses asynchronous event streams and allows nondeterministic merge, which interleaves the elements of two streams in the real-time order in which they occur. StreamDB is similar to StreamDynamic, but timestamps all stream elements at their source and then maintains a deterministic semantics with respect to those explicit timestamps.
Methods for Graph Wiring Our three fictional languages use different mechanisms for constructing graphs. StreamStatic uses a literal textual encoding of the graph (Fig. ). StreamDB also uses a direct encoding of a graph in the text of a query,
but in the form of named streams with explicit dependencies (Fig. ). In contrast, in StreamDynamic the graph is implicit, resulting from applying operators such as map to stream objects (Fig. ). There are other common graph construction methods not represented in these fictional languages: first, GUI-based construction of graphs, as found in LabView[] and StreamBase[]; second, an API for adding edges in a general purpose language, along the lines of “connect(X,Y);”. Compared to implicit graph wiring via the manipulation of stream values (as seen in StreamDynamic or in real systems such as FlumeJava[]), APIs of this kind are more explicit about edges and usually more verbose. Finally, an advanced technique for constructing stream graphs is metaprogramming. Metaprogramming, or staged execution, simply refers to any program that generates another program. In the case of streaming, the most common situation is that a static kernel graph is desired, yet the programmer wishes to use loops and abstraction in the construction of (potentially complex) graphs. Therefore the need for
metaprogramming in streaming is analogous to hardware description languages, which also have a static target (logic gates) but need abstraction in the source language. While it would be possible to write a program to generate the textual graph descriptions used by, for example, StreamStatic, a much more disciplined form of metaprogramming can be provided by stream-processing DSLs (domain-specific languages), which can integrate the type-checking of the metalanguage and the target kernel language. Indeed, this is the approach taken by DSLs such as WaveScript [] and, to a lesser extent, StreamIt [].
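As an illustration of the explicit, API-based wiring style mentioned above (the “connect(X,Y);” approach), the C sketch below builds a small kernel graph by recording edges in a table. The Kernel and Edge types and the connect function are hypothetical and are not taken from any of the real systems cited in this entry.

    /* wiring_api.c -- hypothetical sketch of an edge-adding graph-construction API */
    #include <stdio.h>

    #define MAX_EDGES 64

    typedef struct Kernel { const char *name; } Kernel;
    typedef struct { const Kernel *src, *dst; } Edge;

    static Edge edges[MAX_EDGES];
    static int  n_edges = 0;

    /* Explicitly add one stream edge between two kernels. */
    static void connect(const Kernel *src, const Kernel *dst) {
        edges[n_edges++] = (Edge){ src, dst };
    }

    int main(void) {
        Kernel in  = { "IntIn" };
        Kernel acc = { "AddAccum" };
        Kernel out = { "IntOutput" };

        /* Equivalent to the literal topology "IntIn -> AddAccum -> IntOutput;" */
        connect(&in, &acc);
        connect(&acc, &out);

        for (int i = 0; i < n_edges; i++)
            printf("%s -> %s\n", edges[i].src->name, edges[i].dst->name);
        return 0;
    }

Compared with the implicit wiring of StreamDynamic, this style is more verbose but makes every edge of the graph explicit in the program text.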
Parallelism in Stream Programs Once constructed, a kernel graph exposes pipeline, task, and data parallelism. This terminology, used by the StreamIt authors [] among others, distinguishes between producer/consumer parallelism (pipeline) and kernels lacking that relationship, such as the siblings in a fork-join pattern (task parallelism). Data parallelism in this context refers specifically to the ability to execute a kernel simultaneously on multiple elements in a single stream. Data parallelism is not always possible with a stateful kernel function – equivalent to a loop-carried dependence. Stream programs expose abundant parallelism. It is the job of a scheduler or program optimizer to manage that parallelism and map it onto hardware. Fully automatic solutions to this problem remain difficult, but less difficult in a restrictive stream programming model than in many other programming models. The key advantage is that stream programs define independent kernels with local data access and explicit, predictable communication. This advantage enables stream programming models like StreamStatic to target a wide range of hardware architectures, including traditional cache-based CPUs (both multicore and vector parallelism), as well as GPUs, and architectures with software-controlled communication, such as the RAW tiled architecture [] or the Cell processor. The DSL StreamIt [] (which subsumes StreamStatic), for example, has targeted all these platforms as well as networked clusters of workstations.
Optimization and Scheduling An effective stream-processing implementation must accomplish three things: adjust granularity, choose between sources of parallelism, and place computations. Granularity here refers to how much data is processed in bulk, as well as whether or not kernels are combined (fused). Second, because stream programs include multiple kinds of parallelism, it is necessary to balance pipeline/task parallelism and data parallelism. Finally, placement and ordering present a variant of the traditional task-scheduling problem. Consider the two-kernel pipeline shown in Fig. . Even the simple composition of two kernels exposes several trade-offs. Three possible placements of kernels A and B onto four processors are shown in the figure. Placement (a) illustrates a common scenario: fusing kernels and leveraging data parallelism rather than pipeline parallelism. Yet choosing pipeline parallelism (placement (b)) could be preferable if, for example, A and B’s combined working sets exceed the cache capacity of a processor. (Note that in a steady state streams provide enough data to keep A and B simultaneously executing – software pipelining.) Finally, placement (c) illustrates another common scenario – a kernel carries state and cannot be parallelized. In this case, kernel B must be executed serially, so it must not be fused with A or it would serialize A. Note that there is a complicated interplay between batch size, kernel working set, and fusion/placement. One simple model for kernel working set assumes its size is linear in the size of its input x: that is, mx + k where k represents memory that must be read irrespective of batch size. A large k might, for example, suggest the selection of placement (b) in Fig. , because if k = 0, the batch size could be tuned to make placement (a) attractive. But a phase-ordering problem arises: Solving for either batch size or fusion/fission independently (rather than simultaneously) can sacrifice opportunities to achieve optimal placement. A related complication is that while the choices made by a stream-processing system could take place either statically or dynamically, granularity adjustment is more difficult to perform dynamically. For example, while dynamic work-stealing on a shared-memory computer can provide load balance and place kernels onto processors, it cannot deal with fine-grained kernels
[Figure: sample fusion, fission, and placement for the two-kernel pipeline A -> B, showing three placements onto four processors: (a) fused, with data parallelism; (b) pipeline before data parallelism; (c) B serial.]
Stream Programming Languages. Fig. Sample placements of two kernels, A and B, onto processors P1–P4
that perform only a few FLOPS per invocation. That said, as long as kernels are compiled to handle batches of data, the exact size of batches can be adjusted at runtime.
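As a worked illustration of the working-set model above (using invented numbers): suppose kernels A and B have working sets m_A x + k_A and m_B x + k_B, and the per-processor cache holds C bytes. Fusing A and B at batch size x is cache-friendly only when (m_A + m_B) x + k_A + k_B ≤ C. If, hypothetically, m_A = m_B = 100 bytes per element, k_A = k_B = 0, and C = 256 KB, then a batch of roughly 1,300 elements still fits, so the fused placement (a) remains attractive. If instead k_A + k_B is already close to C (say, two fixed 128 KB lookup tables), no choice of x makes the fused kernels fit, which favors the pipelined placement (b).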
Program Transformation If optimizations are performed statically by a compiler, they can take the form of source-to-source transformations. The most important optimizations fall into the following four categories: ●
High-level / Algebraic – domain-specific optimizations, for example, canceling an FFT and an inverse FFT (see Haskell [], WaveScript []) ● Batch-processing / Execution Scaling – choosing batch size ● Fusion – combining and inlining kernels ● Fission / Data Parallelism – choosing how many ways (N) to split a kernel
StreamStatic, for example, could follow the example of StreamIt and perform fusion, fission, and scaling as source-to-source program transformations, resulting in a number of kernels that matches the target number of processors and leaving only a one-to-one placement problem to be solved by a simulated annealing optimization process. To achieve load balance during this process, StreamIt happens to use static work-estimation for kernels, but profiling is another option. In fact, given more dynamic data-rates, as in StreamDynamic and StreamDB, profile-based optimization and auto-tuning techniques are a good way to enable some of the above
static program transformations. (WaveScript takes this approach [].) To our knowledge no systems attempt to apply high-level / algebraic optimizations dynamically.
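To make the Fusion and Fission transformations listed above concrete, the C sketch below shows, in spirit rather than as taken from StreamIt or any other system, how two per-element kernels can be fused into one loop and how a stateless kernel can be fissioned into N data-parallel copies by splitting its index range. All names and sizes are illustrative assumptions.

    /* fuse_fission.c -- hypothetical sketch of kernel fusion and fission */
    #define N      1024
    #define NSPLIT 4                       /* N-way fission of a stateless kernel */

    static double kernelA(double x) { return x * 2.0; }
    static double kernelB(double x) { return x + 1.0; }

    /* Fusion: the pipeline A -> B collapses into a single loop, so the
     * intermediate stream never leaves registers/cache. */
    void fused(const double *in, double *out) {
        for (int i = 0; i < N; i++)
            out[i] = kernelB(kernelA(in[i]));
    }

    /* Fission: worker p of NSPLIT processes a disjoint slice of the stream;
     * legal here because the fused kernel carries no state across elements. */
    void fissioned_worker(int p, const double *in, double *out) {
        int lo = p * (N / NSPLIT);
        int hi = (p == NSPLIT - 1) ? N : lo + (N / NSPLIT);
        for (int i = lo; i < hi; i++)
            out[i] = kernelB(kernelA(in[i]));
    }

A stateful kernel (like the accumulator in the earlier figure) could not be fissioned this way, since its loop-carried state would be split incorrectly across workers.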
Algorithms Dynamic systems along the lines of StreamDynamic tend to use work-stealing, as do most of today’s popular task-scheduling frameworks. Further, many streaming databases use heuristics for kernel/operator migration to achieve load balance in a distributed setting. On the other hand, streaming models with fixed data-rates permit static scheduling strategies and much work has focused there. Scheduling predictable stream programs bears some resemblance to the well-studied problem of mapping a coarse-grained task-graph onto a multiprocessor (surveyed in []). But there is a notable difference: stream programs run continuously and are usually scheduled for throughput (with some exceptions []). Likewise, traditional graph-partitioning methods can be applied, but they do not, in general, capture all the degrees of freedom present in stream programs, whose graphs are not fixed, but are subject to transformations such as data-parallel fission of kernels. An overview of scheduling work on stream processing specifically can be found in [], focusing on embedded and real-time uses. The ideal would be an algorithm that simultaneously optimizes all aspects of a streaming program. A recent paper [] proposed one such system that uses integer linear programming to simultaneously perform fission and processor placement for StreamIt programs.
Data Structures and Synchronization Static and dynamic scheduling approaches employ different data structures. Dynamic approaches targeting shared memory multiprocessors are likely to rely on concurrent data structures including queues. In contrast, a completely static schedule typically requires no synchronization on data structures. Rather it may repeat a fixed schedule, using a global barrier synchronization at the end of each iteration of the schedule.
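For illustration of the fully static case described above, the C sketch below repeats a fixed steady-state schedule and synchronizes only with one barrier per schedule iteration; the kernels, the schedule table, and the thread setup are hypothetical simplifications, not drawn from any particular system.

    /* static_schedule.c -- hypothetical sketch: a fixed steady-state schedule with
     * one barrier per iteration and no per-queue synchronization. */
    #include <pthread.h>
    #include <stdio.h>

    #define NPROCS 2
    #define CYCLES 5

    static void kA(void) { /* produce a batch */ }
    static void kB(void) { /* consume a batch */ }

    typedef void (*KernelFn)(void);
    /* Schedule computed offline: processor 0 fires A twice, processor 1 fires B twice. */
    static KernelFn schedule[NPROCS][2] = { { kA, kA }, { kB, kB } };

    static pthread_barrier_t cycle_barrier;

    static void *worker(void *arg) {
        int me = (int)(long)arg;
        for (int c = 0; c < CYCLES; c++) {        /* real stream programs loop forever */
            for (int s = 0; s < 2; s++)
                schedule[me][s]();                /* fire kernels in the fixed order   */
            pthread_barrier_wait(&cycle_barrier); /* end of one schedule iteration     */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NPROCS];
        pthread_barrier_init(&cycle_barrier, NULL, NPROCS);
        for (long p = 0; p < NPROCS; p++) pthread_create(&t[p], NULL, worker, (void *)p);
        for (long p = 0; p < NPROCS; p++) pthread_join(t[p], NULL);
        puts("done");
        return 0;
    }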
Relation to Other Parallel Programming Models
Functional Reactive Programming Like StreamDynamic, FRP enables general-purpose functional programs (usually in Haskell) to manipulate streams of values. FRP in particular deals not with discrete streams of elements, but with semantically continuous signals. Sampling of signals is performed by the runtime, or separately from the main definition of programs. Most FRP implementations are not high-performance, but parallel implementations and staged implementations that generate efficient code exist. Relation to Data-parallel Models With the increasing popularity of the MapReduce paradigm, there is a lot of interest in programming models for massively parallel manipulation of data. The FlumeJava library [] (built on Google’s MapReduce) bears a lot of resemblance to a stream-processing system. The user composes parallel operations on collections, and FlumeJava applies fusion transformations to pipelines of parallel operations. Similar systems like Cascade and Dryad explicitly construct dataflow graphs. These systems have a different application focus, target hardware, and scale than most of the work on stream processing. Further, they focus on batch processing of data in parallel, not on continuous execution. Relation to Fork-Join Shared-Memory Parallelism Fork-join parallelism in the form of parallel subroutine calls and parallel loops, as found in Cilk [] and OpenMP [], is attractive because of its relative ease of incorporation into legacy codes. In contrast, stream processing requires that a program be factored into kernels and that communication become explicit. Once a program is ported, however, it is perhaps easier to ensure that it is correct and deterministic. Stream-processing languages provide a complete programming model encompassing computation and communication. In contrast, fork-join models treat the issues of control-flow and work-decomposition, but do not directly address data decomposition or communication.
This entry has described a family of programming models that have already accomplished a lot. Unfortunately, the systems described above have seen little application to industrial stream-processing problems. Most stream-processing codes are written in general-purpose languages without any special support for stream processing. A major future challenge is to increase adoption of the body of techniques developed for stream processing. Libraries, rather than stream-programming languages, may have an advantage in this respect. Further, wide-spectrum stream-programming systems may prove desirable in the future: models that can reproduce the best results of more restrictive programming models, while also allowing graceful degradation into more general programming, rather than confining the user strictly.
Related Entries Cell Processor
Bibliography . http://www.streambase.com/ . Chambers C, Raniwala A, Perry F, Adams S, Henry RR, Bradshaw R, Weizenbaum N () Flumejava: easy, efficient data-parallel pipelines. In: PLDI ’: proceedings of the ACM SIGPLAN conference on programming language design and implementation. ACM, New York, pp – . Dagum L, Menon R () OpenMP: an industry standard API for shared memory programming. IEEE Comp Sci Eng ():– . Gordon MI, Thies W, Amarasinghe S () Exploiting coarsegrained task, data, and pipeline parallelism in stream programs. SIGOPS Oper Syst Rev ():– . Kudlur M, Mahlke S () Orchestrating the execution of stream programs on multicore platforms. In: PLDI ’ proceedings of
the ACM SIGPLAN conference on programming language design and implementation. ACM, New York, pp – Kwok Y-K, Ahmad I () Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Comput Surv ():– Leiserson CE () The cilk + + concurrency platform. In: DAC ’ proceedings of the th annual design automation conference. ACM, New York, pp – Liu H, Cheng E, Hudak P () Causal commutative arrows and their optimization. In: ICFP ’ proceedings of the th ACM SIGPLAN international conference on functional programming. ACM, New York, pp – Newton RR, Girod LD, Craig MB, Madden SR, Morrisett JG () Design and evaluation of a compiler for embedded stream programs. In: LCTES ’ proceedings of the ACM SIGPLAN-SIGBED conference on languages, compilers, and tools for embedded systems. ACM, New York, pp – Pillai PS, Mummert LS, Schlosser SW, Sukthankar R, Helfrich CJ () Slipstream: scalable low-latency interactive perception on streaming data. In: NOSSDAV ’ proceedings of the th international workshop on network and operating systems support for digital audio and video. ACM, New York, pp – Taylor MB, Lee W, Miller J, Wentzlaff D, Bratt I, Greenwald B, Hoffmann H, Johnson P, Kim J, Psota J, Saraf A, Shnidman N, Strumpen V, Frank M, Amarasinghe S, Agarwal A () Evaluation of the raw microprocessor: an exposed-wire-delay architecture for ilp and streams. In: ISCA ’ proceedings of the st annual international symposium on Computer architecture. IEEE Computer Society, Washington, DC, p Travis J, Kring J () LabVIEW for everyone: graphical programming made easy and fun, rd edn. Prentice Hall, Upper Saddle River Wiggers MH () Aperiodic multiprocessor scheduling for real-time stream processing applications. Ph.D. thesis, University of Twente, Enschede, The Netherlands
Strong Scaling Amdahl’s Law
Suffix Trees Amol Ghoting, Konstantin Makarychev IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
Synonyms Position tree
Definition The suffix tree is a data structure that stores all the suffixes of a given string in a compact tree-based structure. Its design allows for a particularly fast implementation of many important string operations.
Discussion Introduction The suffix tree is a fundamental data structure in string processing. It exposes the internal structure of a string in a way that facilitates the efficient implementation of a myriad of string operations. Examples of these operations include string matching (both exact and approximate), exact set matching, all-pairs suffix-prefix matching, finding repetitive structures, and finding the longest common substring across multiple strings []. Let A denote a set of characters. Let S = s_0, s_1, . . . , s_{n-1}, $, where s_i ∈ A and $ ∉ A, denote a $-terminated input string of length n + 1. The i-th suffix of S is the substring s_i, s_{i+1}, . . . , s_{n-1}, $. The suffix tree for S, denoted as T, stores all the suffixes of S in a tree structure. The tree has the following properties:
1. Paths from the root node to the leaf nodes have a one-to-one relationship with the suffixes of S. The terminal character $ is unique and ensures that no suffix is a proper prefix of any other suffix. Therefore, there are as many leaf nodes as there are suffixes.
2. Edges spell nonempty strings.
3. All internal nodes, except the root node, have at least two children. The edge for each child node begins with a character that is different from the starting character of its sibling nodes.
4. For an internal node v, let l(v) denote the substring obtained by traversing the path from the root node to v. For every internal node v with l(v) = xα, where x ∈ A and α ∈ A∗, there is a pointer known as a suffix link to an internal node u such that l(u) = α.
An instance of a suffix tree for the string S = ABCABC$ is presented in Fig. . Each edge in a suffix tree is represented using the start and end indices of the corresponding substring in S. Therefore, even though a suffix tree represents n suffixes (each with at most n characters) for a total of Ω(n²) characters, it only requires O(n) space [].
Suffix Trees. Fig. Suffix tree for S = ABCABC$ []. Internal nodes are represented using circles and leaf nodes are represented using rectangles. Each leaf node is labeled with the index of the suffix it represents. The dashed arrows represent the suffix links. Each edge is labeled with the substring it represents and its corresponding edge encoding
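The structure described above can be sketched directly. The following naive Python builder (quadratic-time, with suffix links omitted; it is not one of the linear-time algorithms discussed below) inserts each suffix into a compacted trie and stores every edge label as a (start, end) index pair into S:

class Node:
    def __init__(self):
        self.children = {}        # first edge character -> (start, end, child)
        self.suffix_index = None  # set at leaf nodes only

def insert_suffix(root, s, i):
    """Insert suffix s[i:] into a compacted trie whose edge labels are stored
    as (start, end) index pairs into s (end inclusive)."""
    n, node, j = len(s), root, i
    while j < n:
        c = s[j]
        if c not in node.children:          # no edge starts with this character
            leaf = Node()
            leaf.suffix_index = i
            node.children[c] = (j, n - 1, leaf)
            return
        start, end, child = node.children[c]
        k = 0                               # match the suffix against the edge label
        while k <= end - start and j + k < n and s[start + k] == s[j + k]:
            k += 1
        if k > end - start:                 # the whole edge label matched
            node, j = child, j + k
        else:                               # mismatch inside the edge: split it
            mid = Node()
            node.children[c] = (start, start + k - 1, mid)
            mid.children[s[start + k]] = (start + k, end, child)
            leaf = Node()
            leaf.suffix_index = i
            mid.children[s[j + k]] = (j + k, n - 1, leaf)
            return

def build_suffix_tree(s):
    """Naive O(n^2) construction: insert the suffixes one by one."""
    assert s.endswith("$")
    root = Node()
    for i in range(len(s)):
        insert_suffix(root, s, i)
    return root

def leaf_indices(node, out):
    if node.suffix_index is not None:
        out.append(node.suffix_index)
    for _, _, child in node.children.values():
        leaf_indices(child, out)
    return out

print(sorted(leaf_indices(build_suffix_tree("ABCABC$"), [])))  # [0, 1, 2, 3, 4, 5, 6]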
Applications Over the past few decades, the suffix tree has been used for a spectrum of tasks ranging from data clustering [] to data compression []. The quintessential usage of suffix trees is seen in the bioinformatics domain [, , , , , , ], where they are used to evaluate queries on biological sequence data sets effectively. A hallmark of suffix trees is that in many cases they allow one to process queries in time proportional to the size of the query rather than the size of the string.
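For instance, an exact-match query touches only as many characters as the pattern contains: walk the pattern down from the root and compare along edge labels. A small sketch, assuming the build_suffix_tree routine from the previous listing is in scope:

def contains(root, s, pattern):
    """Return True if pattern occurs in s, by walking it down the suffix tree."""
    node, j = root, 0
    while j < len(pattern):
        entry = node.children.get(pattern[j])
        if entry is None:
            return False
        start, end, child = entry
        for k in range(start, end + 1):   # compare along the edge label
            if j == len(pattern):
                return True
            if s[k] != pattern[j]:
                return False
            j += 1
        node = child
    return True

s = "ABCABC$"
root = build_suffix_tree(s)               # builder from the previous sketch
print(contains(root, s, "CAB"), contains(root, s, "CBA"))  # True False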
Suffix Tree Construction
Serial Suffix Tree Construction Algorithms
Algorithms due to Weiner [], McCreight [], and Ukkonen [] have shown that suffix trees can be built in linear space and time. These algorithms afford a linear-time construction by employing suffix links. Ukkonen's algorithm is more recent and popular because it is easier to implement than the other algorithms. It is an O(n), in-memory construction algorithm. The algorithm is based on the simple but elegant observation that the suffixes of a string S_i = s_1, s_2, . . . , s_i can be obtained from the suffixes of the string S_{i-1} = s_1, s_2, . . . , s_{i-1} by catenating the symbol s_i at the end of each suffix of S_{i-1} and by adding the empty suffix. The suffixes of the whole string S = S_n = s_1, s_2, . . . , s_n can then be obtained by first expanding the suffixes of S_1 into the suffixes of S_2, and so on, until the suffixes of S_n are obtained from the suffixes of S_{n-1}. This translates into a suffix tree construction that can be performed by iteratively expanding the leaves of a partially constructed suffix tree. Through the use of suffix links, which provide a mechanism for quickly locating suffixes, the suffix tree can be expanded by simply adding the (i + 1)-th character to the leaves of the suffix tree built on the previous i characters. The algorithm thus relies on suffix links to traverse through all of the sub-trees in the main tree, expanding the outer edges for each input character.
Parallel Suffix Tree Construction
Parallel suffix tree construction has been extensively studied in theoretical computer science. Apostolico et al. [], Sahinalp and Vishkin [], and Hariharan [] proposed theoretically efficient parallel algorithms for the problem based on the PRAM model. For example, Hariharan [] showed how to execute McCreight's [] algorithm concurrently on many processors. These algorithms, however, are designed for the case when the string and the suffix tree fit in main memory. Since the memory accesses of these algorithms exhibit poor locality of reference [], they are inefficient when either the string or the suffix tree does not fit in main memory. On many real-world data sets, this is often the case. The aforementioned problem has been theoretically resolved by Farach-Colton et al. []. Farach-Colton et al. [] proposed a divide-and-conquer algorithm that operates as follows. First, construct a suffix tree for the suffixes starting at odd positions. To do so, sort all pairs of characters at positions 2i − 1 and 2i and replace each pair with its index in the sorted list. This gives a string of length ⌈n/2⌉ with characters drawn from a bigger alphabet. Recursively construct the suffix tree for this string. The obtained tree is essentially the suffix tree for the suffixes starting at odd positions in the original string. Then, given the "odd" suffix tree, construct an "even" suffix tree of the suffixes starting at even positions. Finally, merge the trees. The details of the Farach-Colton et al. [] algorithms are very complex, and there have not been any successful implementations. However, the doubling approach has been successfully used in practice for constructing suffix arrays
(see [, ], and section Suffix Arrays). While the authors do not explicitly provide a parallel algorithm, they indicate that the sort and merge phases do lend themselves to a parallel implementation. While the aforementioned algorithms provide theoretically optimal performance, for most algorithms, memory accesses exhibit poor locality of reference. As a consequence, these algorithms are grossly inefficient when either the tree or the string does not fit in main memory. To tackle this problem, this past decade has seen several research efforts that target large suffix tree construction. Algorithms that have been developed in these efforts can be placed in two categories: ones that require the input string to fit in main memory and ones that do not have this requirement. Many of these efforts only provide and evaluate serial algorithms for suffix tree construction. However, by design, as will be discussed, they do indeed lend themselves to a parallel implementation.
Practical Parallel Algorithms for In-Core Strings
Hunt et al. [] presented the very first approach to efficiently build suffix trees that do not fit in main memory. The approach drops the use of suffix links in favor of better locality of reference. Typically, the suffix tree is an order of magnitude larger than the string being indexed. As a result, for large input strings, the suffix tree cannot be accommodated in main memory. The method first finds a set of prefixes so as to partition the suffix tree into sub-trees (each prefix corresponds to a sub-tree) that can be built in main memory. The number of times a prefix occurs in the string is used to bound the size of the suffix sub-tree. The approach iteratively increases the size of the prefix until a length is reached where all the suffix sub-trees will fit in memory. Next, for each of these prefixes, the approach builds the associated suffix sub-tree using a scan of the data set. Essentially, for each suffix with the prefix, during insertion, one finds a path in the partially constructed suffix sub-tree that shares the longest common prefix (lcp) with this suffix, and branches from this path when no more matching characters are found. The worst-case complexity of the approach is O(n²), but it exhibits O(n log n) average-case complexity. This algorithm can be parallelized by distributing the prefixes to the different processors in a round-robin fashion and having each processor build one or more suffix sub-trees of the
suffix tree. Kalyanaraman et al. [] presented a parallel generalized suffix tree construction algorithm that in some ways builds upon Hunt's approach to realize a scalable parallel suffix tree construction. A generalized suffix tree is a suffix tree for not one but a group of strings. While the algorithm was presented in the context of the genome assembly problem, the proposed approach is not tied to genome assembly and can be applied elsewhere. The approach first sorts all suffixes based on their w-length prefixes, where (like Hunt et al.'s approach) w is picked to ensure that the associated suffix sub-trees will fit in main memory. Suffixes with the same prefix are then assigned to a single bucket. The buckets are then partitioned across the processors such that load is approximately balanced. Each processor then builds a sub-tree of the final suffix tree using a depth-first tree-building approach. This is accomplished by sorting all the suffixes in the local bucket and then inserting them in sorted order. The end result is a distributed representation of the generalized suffix tree as a collection of sub-trees. Japp introduced top-compressed suffix trees [], which improve upon Hunt's approach by introducing a pre-processing stage to remove the need for repeated scans of the input sequence. Moreover, the author employed partitioning and optimized the insertion of suffixes into partitions using suffix links. The method is based on a new linear-time construction algorithm for "sparse" suffix trees, which are sub-trees of the whole suffix tree. The new data structures are called the paged suffix tree (PST) and the distributed suffix tree (DST), respectively. Both tackle the memory bottleneck by constructing sub-trees of the full suffix tree independently, and are designed for single-processor and distributed-memory parallel computing environments, respectively. The standard operations on suffix trees of biological importance are shown to be easily translatable to these new data structures.
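The prefix-partitioning idea shared by these approaches can be sketched as follows (reusing the Node and insert_suffix definitions from the earlier listing; the prefix length w is an illustrative choice, and the loop over buckets stands in for their distribution across processors):

from collections import defaultdict

def build_prefix_partitioned(s, w=2):
    """Group suffix start positions by their length-w prefix; every bucket's
    sub-tree depends only on its own suffixes, so the buckets could be handed
    to different processors (here they are simply built in a loop)."""
    buckets = defaultdict(list)
    for i in range(len(s)):
        buckets[s[i:i + w]].append(i)
    subtrees = {}
    for prefix, starts in buckets.items():
        root = Node()                      # independent sub-tree for this prefix
        for i in starts:
            insert_suffix(root, s, i)      # insertion routine from the earlier sketch
        subtrees[prefix] = root
    return subtrees

parts = build_prefix_partitioned("ABCABC$", w=2)
print(sorted(parts))   # ['$', 'AB', 'BC', 'C$', 'CA']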
Practical Parallel Algorithms for Out-of-Core Strings
The above-mentioned parallel algorithms scale well when the input string fits in main memory. Many real-world input strings (like the human genome), however, do not fit in main memory. Researchers have developed a variety of algorithms to handle this situation. The earlier solutions for this problem follow what is known as
the partition-and-merge methodology. The approach is illustrated in Fig. .
Suffix Trees. Fig. Illustration of the partition-and-merge approach
st-merge [], proposed by Tian et al., partitions the input string and constructs a suffix tree for each of these partitions in main memory. These suffix trees are then merged to create the final suffix tree. Merging two suffix trees involves finding matching suffixes in the two trees that share the longest common prefix, reusing this path in the new merged tree, and splitting this path when no more matching characters are found. Suffix link recovery can then be accomplished using a post-processing step. The approach can be parallelized as follows. During the partition phase, each processor can be assigned a block of the input string, for which it builds a suffix tree. During the merge phase, a processor can be assigned a pair of suffix trees that it is responsible for merging. One can realize a binary merge tree to build the final suffix tree. trellis [], due to Phoophakdee and Zaki, is similar in flavor but differs in the following regards. First, the approach finds a set of variable-length prefixes such that the corresponding suffix sub-trees will fit in main memory. Second, it partitions the input string and constructs a suffix tree for each partition in main memory (like st-merge) and stores the sub-trees for each prefix determined in the first step separately on disk. Finally, it merges all the sub-trees associated with each prefix to realize the final set of suffix sub-trees. By design, trellis ensures that each of the suffix sub-trees (built using the merge operation) will fit in main memory. trellis can be parallelized using an approach similar to the parallelization of st-merge. Ghoting and Makarychev [] studied the performance of the partition-and-merge approach for very large input strings and showed that the working set for these algorithms scales linearly with the size of the input string. This results in a lot of random disk I/O during the merge phase (the authors of trellis mention this as well []) when indexing strings that do not fit in main memory. One can even argue that for very large strings, the performance of the partition-and-merge approach will converge to that of Hunt's approach [], as merging sub-trees is tantamount to inserting suffixes into a partially constructed suffix tree. To address this challenge, Ghoting and Makarychev presented a serial algorithm, wavefront, and its parallelization, p-wavefront [, ]. p-wavefront is the first parallel algorithm that can build a suffix tree for an input string that does not fit in main memory. p-wavefront diverges from the partition-and-merge methodology for suffix tree construction. Leveraging the structure of suffix trees, the algorithm builds a suffix tree by simultaneously tiling accesses to both the input string and the partially constructed suffix tree. The steps of p-wavefront are presented in Fig. . First, an in-network string cache is built by distributing the input string across all processors in a round-robin fashion. The algorithm assumes that the input string can fit in the collective main memory. This step uses collective I/O to ensure efficient reading of the input string. Next, like trellis, a set of variable-length prefixes is found in parallel such that the associated suffix sub-trees fit in memory. These prefixes are then distributed across all processors in a round-robin fashion. The following prefix-location discovery phase finds the locations of the prefixes using a collective procedure: each processor is responsible for finding the locations of all prefixes (not just its own) in a partition of the input string, and these are collectively exchanged with the other processors such that each processor has the locations of its prefixes in the entire input string. Finally, the suffix sub-tree for each prefix is built in a tiled and iterative manner by processing a pair of blocks of the input string at a time. The end result is an algorithm that can index very large input strings and at the same time maintain a
bounded working set size and a fixed memory footprint. The proposed methodology was applied to the suffix link recovery process as well, realizing an end-to-end I/O-efficient solution.
Suffix Trees. Fig. Steps of p-wavefront
Suffix Arrays
Closely related to suffix trees is the suffix array []. The suffix array is the alphabetically sorted list of all suffixes of a given string S. For example, the suffix array of the string S = ABCABC$ is as follows:

Index | Suffix  | lcp
6     | $       | 0
3     | ABC$    | 0
0     | ABCABC$ | 3
4     | BC$     | 0
1     | BCABC$  | 2
5     | C$      | 0
2     | CABC$   | 1

The suffix array only contains pointers to the suffixes in the original string S. Thus, in the example above,
the suffix array is the first column of the table, that is, the array {6, 3, 0, 4, 1, 5, 2}. Given a suffix tree, one can obtain a suffix array in linear time by performing a depth-first search (DFS) (see Fig. ). Suffix arrays are often used instead of suffix trees. The main advantage of suffix arrays is that they have a more compact representation in memory. Besides the pointers to the suffixes, suffix arrays can also be augmented to contain the lengths of the longest common prefixes (lcps) between adjacent suffixes (see the third column in the example above). In that case, given a suffix array it is easy to reconstruct the suffix tree. Essentially, the longest common prefix corresponds to the lowest common ancestor (lca) in the tree. Relative to suffix trees, the main drawback is that query processing using suffix arrays can be more time consuming, as most queries are processed in time that is a function of the size of the input string rather than the size of the query (as was the case with suffix trees). Researchers have developed several linear-time algorithms for directly building suffix arrays without pre-building a suffix tree [, ]. Futamura et al. [] presented the first algorithm for parallel suffix array construction. The approach is similar to the one by Kalyanaraman et al. [] that was described in section Practical Parallel Algorithms for In-Core Strings. The approach first partitions the suffixes into buckets based on their w-length prefixes. These buckets are then distributed across the processors, where the suffixes in each bucket are sorted locally to obtain a portion of the suffix array. The final suffix array can then be realized by concatenating these distributed suffix arrays. Algorithms for constructing suffix arrays in external memory have been extensively studied and compared by Crauser and Ferragina [] and by Dementiev et al. []. Kärkkäinen et al. [] proposed the DC3 algorithm, which is based on an approach similar to the Farach et al. [] divide-and-conquer algorithm for suffix trees (see section Parallel Suffix Tree Construction). However, instead of dividing the suffixes in the input string into suffixes starting at odd and even positions, it divides them into three groups depending on the remainder of the suffix position modulo 3. Surprisingly, this change significantly simplifies the algorithm. A variant of DC3, called pDC3, has been implemented by Kulla and Sanders []. Their study as well as the study of Dementiev et al. [] indicate that DC3 and pDC3 are currently
the most efficient algorithms for constructing suffix arrays.
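For the running example, the suffix array and lcp values can be reproduced by simply sorting the suffixes; the following is a quadratic-work sketch rather than one of the linear-time constructions cited above:

def suffix_array_naive(s):
    """Suffix start positions, sorted by the suffixes they point to."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_lengths(s, sa):
    """lcp of each suffix with its predecessor in sorted order (0 for the first)."""
    def lcp(a, b):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k
    return [0] + [lcp(s[sa[r - 1]:], s[sa[r]:]) for r in range(1, len(sa))]

s = "ABCABC$"
sa = suffix_array_naive(s)
print(sa)                  # [6, 3, 0, 4, 1, 5, 2]
print(lcp_lengths(s, sa))  # [0, 0, 3, 0, 2, 0, 1]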
Related Entries
Bioinformatics
Genome Assembly

Bibliographic Notes and Further Reading
Research on parallel indexing for sequence/string data sets is still in its infancy. The primary reason for this is that until recently, sequence data sets were not growing at a rapid pace and most problems could be handled in main memory. However, over the past few years, with advances in sequencing technologies, sequence databases have reached gigantic proportions. For instance, aggressive DNA sequencing efforts have resulted in the GenBank sequence database surpassing the Gbp mark (one bp, or base pair, is one character in the sequence) [], with sequences from a very large number of organisms. Further complicating the issue is the fact that researchers not only need the ability to index a single large genome, but a group of large genomes. For example, consider the area of comparative genomics [], where one is interested in comparing different genomes, be they from the same or different species. Here researchers may be interested in comparing the genomes of individuals that are prone to a specific type of cancer to those that are not susceptible. In this case, we need to efficiently build a suffix tree for a group of large genomes, as and when needed. These trends suggest that parallel indexing technology for sequence data sets will be extremely important in the coming decade. Much of this entry focused on the parallel construction of suffix trees and suffix arrays. We would like to point the reader to Dan Gusfield's book [] for a larger overview of applications. There has not been much work on the parallelization of query processing using suffix trees and suffix arrays. Another important direction is the design of parallel cache-conscious algorithms for multicore processors so as to effectively utilize the memory hierarchy on modern processors. Tsirogiannis and Koudas looked at the problem of parallelizing the partition-and-merge approach on chip-multiprocessor architectures [] and developed a cache-conscious algorithm for suffix tree construction. Such directions will become increasingly important in the near future given the tendency of packing many cores on a single processor.
Bibliography . Apostolico A, Iliopoulos C, Landau G, Schieber B, Vishkin U () Parallel construction of a suffix tree with applications. Algorithmica (–):– . Bray N, Dubchak I, Pachter L () AVID: a global alignment program. Genome research ():– . Burrows M, Wheeler D () A block sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation. Palo Alto, California . Crauser A, Ferragina P () A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica ():– . Delcher A, Kasif S, Fleischmann R, Peterson J, White O, Salzberg S () Alignment of whole genomes. Nucleic Acids Res ():– . Delcher A, Phillippy A, Carlton J, Salzberg S () Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res () . Dementiev R, Kärkkäinen J, Mehnert J, Sanders P () Better external memory suffix array construction. J Exp Algorithmics (JEA) :– . Farach-Colton M, Ferragina P, Muthukrishnan S () On the sorting-complexity of suffix tree construction. J ACM (): – . Futamura N, Aluru S, Kurtz S () Parallel suffix sorting. In: Proceedings th international conference on advanced computing and communications. Citeseer, pp – . Ghoting A, Makarychev K () Indexing genomic sequences on the IBM Blue Gene. In: SC ’: proceedings of the conference on high performance computing networking, storage and analysis. ACM, New York, pp – . Ghoting A, Makarychev K () Serial and parallel methods for I/O efficient suffix tree construction. In: Proceedings of the ACM international conference on management of data. ACM, New York . Gusfield D () Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, Cambridge . Hariharan R () Optimal parallel suffix tree construction. In: Proceedings of the symposium on theory of computing. ACM, New York . Hunt E, Atkinson M, Irving R () A database index to large biological sequences. In: Proceedings of th international conference on very large databases. Morgan Kaufmann, San Francisco . Japp R () The top-compressed suffix tree: a disk resident index for large sequences. In: Proceedings of the bioinformatics workshop at the st annual british national conference on databases
. Kalyanaraman A, Emrich S, Schnable P, Aluru S () Assembling genomes on largescale parallel computers. J Parallel Distr Comput ():– . Kärkkäinen J, Sanders P, Burkhardt S () Linear work suffix array construction. J ACM ():– . Ko P, Aluru S () Space efficient linear time construction of suffix arrays. J Discret Algorithms (–):– . Kulla F, Sanders P () Scalable parallel suffix array construction. In: Recent advances in parallel virtual machine and message passing interface: th European PVM/MPI User’s Group Meeting, Bonn, Germany, – September, : proceedings. Springer, New York, p . Kurtz S, Choudhuri J, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R () Reputer: the manifold applications of repeat analysis on a genome scale. Nucleic Acids Res ():– . Kurtz S, Phillippy A, Delcher A, Smoot M, Shumway M, Antonescu C, Salzberg S () Versatile and open software for comparing large genomes. Genome Bio :(R) . Manber U, Myers G () Suffix arrays: a new method for on-line string searches. In: Proceedings of the first annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, pp – . McCreight E () A space-economical suffix tree construction algorithm. J ACM () . Meek C, Patel J, Kasetty S () Oasis: an online and accurate technique for localalignment searches on biological sequences. In: Proceedings of th international conference on very large databases . NCBI. Public collections of DNA and RNA sequence reach gigabases, . http://www.nlm.nih.gov/news/press_releases/ dna_rna__gig.html. . Phoophakdee B, Zaki M () Genome-scale disk-based suffix tree indexing. In: Proceedings of the ACM international conference on management of data. ACM, New York . Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR, Hariharan IK, Fortini ME, Li PW, Apweiler R, Fleischmann W, Cherry JM, Henikoff S, Skupski MP, Misra S, Ashburner M, Birney E, Boguski MS, Brody T, Brokstein P, Celniker SE, Chervitz SA, Coates D, Cravchik A, Gabrielian A, Galle RF, Gelbart WM, George RA, Goldstein LS, Gong F, Guan P, Harris NL, Hay BA, Hoskins RA, Li J, Li Z, Hynes RO, Jones SJ, Kuehl PM, Lemaitre B, Littleton JT, Morrison DK, Mungall C, O’Farrell PH, Pickeral OK, Shue C, Vosshall LB, Zhang J, Zhao Q, Zheng XH, Zhong F, Zhong W, Gibbs R, Venter JC, Adams MD, Lewis S () Comparative genomics of the eukaryotes. Science ():– . Sahinalp SC, Vishkin U () Symmetry breaking for suffix tree construction. In: STOC ’: proceedings of the twentysixth annual ACM symposium on Theory of computing ACM, New York, pp – . Tian Y, Tata S, Hankins R, Patel J () Practical methods for constructing suffix trees. VLDB J ():– . Tsirogiannis D, Koudas N () Suffix tree construction algorithms on modern hardware. In: EDBT ’: Proceedings of the
th international conference on extending database Technology. ACM, New York, pp – . Ukkonen E () Constructing suffix trees on-line in linear time. In: Proceedings of the IFIP th work computer congress on algorithms, software, architecture: information processing. North Holland Publishing Co., Amsterdam . Weiner P () Linear pattern matching algorithms. In: Proceedings of th annual symposium on switch and automata theory. IEEE Computer Society, Washington, DC . Zamir O, Etzioni O () Web document clustering: a feasibility demonstration. In: Proceedings of st international conference on research and development in information retrieval. ACM, New York
Superlinear Speedup Metrics
SuperLU Xiaoye Sherry Li, James Demmel, John Gilbert, Laura Grigori, Meiyue Shao Lawrence Berkeley National Laboratory, Berkeley, CA, USA University of California at Berkeley, Berkeley, CA, USA University of California, Santa Barbara, CA, USA Laboratoire de Recherche en Informatique, Université Paris-Sud, Paris, France Umeå University, Umeå, Sweden
Synonyms Sparse Gaussian elimination
Definition SuperLU is a general-purpose library for the solution of large, sparse, nonsymmetric systems of linear equations using direct methods. The routines perform LU decomposition with numerical pivoting and solve the triangular systems through forward and back substitution. Iterative refinement routines are provided for improved backward stability. Routines are also provided to equilibrate the system, to reorder the columns to preserve sparsity of the factored matrices, to estimate the condition number, to calculate the relative backward error, to estimate error bounds for the refined solutions,
and to perform threshold-based incomplete LU factorization (ILU), which can be used as a preconditioner for iterative solvers. The algorithms are carefully designed and implemented so that they achieve excellent performance on modern high-performance machines, including shared-memory and distributed-memory multiprocessors.
SuperLU. Table 1 SuperLU software status

           | Sequential SuperLU           | SuperLU_MT                   | SuperLU_DIST
Platform   | Serial                       | Shared-memory                | Distributed-memory
Language   | C (with Fortran interface)   | C + Pthreads or OpenMP       | C + MPI
Data type  | Real/complex, single/double  | Real/complex, single/double  | Real/complex, double

Discussion Introduction SuperLU consists of a collection of three related ANSI C subroutine libraries for solving sparse linear systems of equations AX = B. Here A is a square, nonsingular, n × n sparse matrix, and X and B are dense n × nrhs matrices, where nrhs is the number of right-hand sides and solution vectors. The LU factorization routines can handle non-square matrices. Matrix A need not be symmetric or definite; indeed, SuperLU is particularly appropriate for matrices with very unsymmetric structure. All three libraries use variations of Gaussian elimination optimized to take advantage of both sparsity and the computer architecture, in particular memory hierarchies (caches) and parallelism [, ]. All three libraries can be obtained from the following URL: http://crd.lbl.gov/∼xiaoye/SuperLU/
The three libraries within SuperLU are as follows:
● Sequential SuperLU (SuperLU) is designed for sequential processors with one or more levels of caches []. Routines for both complete and threshold-based incomplete LU factorizations are provided [].
● Multithreaded SuperLU (SuperLU_MT) is designed for shared-memory multiprocessors, such as multicore, and can effectively use – parallel processors on sufficiently large matrices in order to speed up the computation [].
● Distributed SuperLU (SuperLU_DIST) is designed for distributed-memory parallel machines, using MPI for interprocess communication. It can effectively use hundreds of parallel processors on sufficiently large matrices [].
Table 1 summarizes the current status of the software. All the routines are implemented in C, with parallel extensions using Pthreads or OpenMP for
shared-memory programming, or MPI for distributedmemory programming. Fortran interfaces are provided in all three libraries.
Overall Algorithm The kernel algorithm in SuperLU is sparse Gaussian elimination, which can be summarized as follows:
1. Compute a triangular factorization Pr Dr A Dc Pc = LU. Here Dr and Dc are diagonal matrices to equilibrate the system, and Pr and Pc are permutation matrices. Premultiplying A by Pr reorders the rows of A, and postmultiplying A by Pc reorders the columns of A. Pr and Pc are chosen to enhance sparsity, numerical stability, and parallelism. L is a unit lower triangular matrix (Lii = 1) and U is an upper triangular matrix. The factorization can also be applied to non-square matrices.
2. Solve AX = B by evaluating X = A⁻¹B = (Dr⁻¹Pr⁻¹LU Pc⁻¹Dc⁻¹)⁻¹B = Dc(Pc(U⁻¹(L⁻¹(Pr(Dr B))))). This is done efficiently by multiplying from right to left in the last expression: scale the rows of B by Dr. Multiplying Pr(Dr B) means permuting the rows of Dr B. Multiplying L⁻¹(Pr Dr B) means solving nrhs triangular systems of equations with matrix L by substitution. Similarly, multiplying U⁻¹(L⁻¹(Pr Dr B)) means solving triangular systems with U.
In addition to exact, or complete, factorization, SuperLU also contains routines to perform incomplete factorization (ILU), in which sparser and so cheaper approximations to L and U are computed, which can be
used as a general-purpose preconditioner in an iterative solver. The simplest implementation, used by the simple driver routines in SuperLU and SuperLU_MT, consists of the following steps:
Simple Driver (assumes Dr = Dc = I)
1. Choose Pc to order the columns of A to increase the sparsity of the computed L and U factors, and hopefully increase parallelism (for SuperLU_MT). Built-in choices are described later.
2. Compute the LU factorization of APc. SuperLU and SuperLU_MT can perform dynamic pivoting with row interchanges during factorization for numerical stability, computing Pr, L, and U at the same time.
3. Solve the system using Pr, Pc, L, and U as described above (where Dr = Dc = I).
The simple driver subroutines for double precision real data are called dgssv and pdgssv for SuperLU and SuperLU_MT, respectively. The letter d in the subroutine names means double precision real; other options are s for single precision real, c for single precision complex, and z for double precision complex. SuperLU_DIST does not include this simple driver. There is also an expert driver subroutine that can provide more accurate solutions, compute error bounds, and solve a sequence of related linear systems more economically. It is available in all three libraries.
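SciPy's sparse direct solver wraps the sequential SuperLU library, so the simple-driver workflow can be illustrated from Python. A minimal sketch (the small matrix is an arbitrary example):

import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# A small nonsymmetric sparse matrix in column-compressed (CSC) form.
A = csc_matrix(np.array([[4.0, 0.0, 1.0],
                         [0.0, 3.0, 2.0],
                         [1.0, 0.0, 5.0]]))
b = np.array([1.0, 2.0, 3.0])

lu = splu(A)                  # column ordering plus LU factorization with pivoting
x = lu.solve(b)               # triangular solves with the stored L and U factors
print(np.allclose(A @ x, b))  # True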
Expert Driver
1. Equilibrate the matrix A, that is, compute diagonal matrices Dr and Dc so that Â = Dr A Dc is "better conditioned" than A, that is, Â⁻¹ is less sensitive to perturbations in Â than A⁻¹ is to perturbations in A.
2. Preorder the rows of Â (SuperLU_DIST only), that is, replace Â by Pr Â, where Pr is a permutation matrix. This step is called "static pivoting," and is only done in the distributed-memory algorithm, which allows scaling to more processors.
3. Order the columns of Â to increase the sparsity of the computed L and U factors, and increase parallelism (for SuperLU_MT and SuperLU_DIST). In other words, replace Â by ÂPcᵀ in SuperLU and SuperLU_MT, or replace Â by Pc ÂPcᵀ in SuperLU_DIST, where Pc is a permutation matrix.
4. Compute the LU factorization of Â. SuperLU and SuperLU_MT can perform dynamic pivoting with row interchanges for numerical stability. In contrast, SuperLU_DIST uses the order computed by the preordering step (Step 2), and replaces tiny pivots by larger values for stability; this is corrected by the iterative refinement in Step 6.
5. Solve the system using the computed triangular factors.
6. Iteratively refine the solution, again using the computed triangular factors. This is equivalent to Newton's method.
7. Compute error bounds. Both forward and backward error bounds are computed.
The expert driver subroutines for double precision real data are called dgssvx, pdgssvx, and pdgssvx for SuperLU, SuperLU_MT, and SuperLU_DIST, respectively. The driver routines are composed of several lower-level computational routines for computing permutations, computing the LU factorization, solving triangular systems, and so on. For large matrices, the LU factorization step takes most of the time, although choosing Pc to order the columns can also be time consuming.
Common Features of the Three Libraries Supernodes in the Factors The factorization algorithms in all three libraries use unsymmetric supernodes [], which enable the use of higher level BLAS routines with higher flops-to-byte ratios, and so higher speed. A supernode is a range (r : s) of columns of L with the triangular block just below the diagonal being full, and the same nonzero structure below the triangular block. Matrix U is partitioned rowwise by the same supernodal boundaries. But due to the lack of symmetry, the nonzero pattern of U consists of dense column segments of different lengths. Sparse Matrix Data Structure The principal data structure for a matrix is SuperMatrix, defined as a C structure. This structure contains two levels of fields. The first level defines the three storage-independent properties of a matrix: mathematical type, data type, and storage type. The second level points to the actual storage used to store the compressed matrix. Specifically, matrix A is stored
in either column-compressed format (aka Harwell– Boeing format), or row-compressed format (i.e., AT stored in column-compressed format) []. Matrices B and X are stored as a single dense matrix of dimension n × nrhs in column-major order, with output X overwriting input B. In SuperLU_DIST, A and B can be either replicated or distributed across all processes. The factored matrices L and U are stored differently in SuperLU/SuperLU_MT and SuperLU_DIST, to be described later.
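The column-compressed layout can be inspected through SciPy, whose csc_matrix uses the same three-array representation (nonzero values, row indices, and column pointers); a small sketch using an arbitrary 3 × 3 example:

import numpy as np
from scipy.sparse import csc_matrix

A = csc_matrix(np.array([[4.0, 0.0, 1.0],
                         [0.0, 3.0, 2.0],
                         [1.0, 0.0, 5.0]]))
print(A.data)     # nonzero values, stored column by column: [4. 1. 3. 1. 2. 5.]
print(A.indices)  # row index of each stored value:          [0 2 1 0 1 2]
print(A.indptr)   # start of each column in data/indices:    [0 2 3 6]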
Options Input Argument The options argument is an input that controls the behaviour of the libraries. The user can tell the solvers how the linear systems should be solved based on some known characteristics of the system. For example, for diagonally dominant matrices, choosing the diagonal pivots ensures stability; there is no need for numerical pivoting (i.e., Pr can be an identity matrix). In another situation, where a sequence of matrices with the same sparsity pattern needs to be factorized, the column permutation Pc (and also the row permutation Pr, if the numerical values are similar) needs to be computed only once and reused thereafter. In these cases, the solvers' performance can be much improved over using the default parameter settings.
Performance-Tuning Parameters All three libraries depend on having an optimized BLAS library to achieve high performance [, ]. In particular, they depend on matrix–vector multiplication or matrix–matrix multiplication of relatively small dense matrices arising from the supernodal structure. The block size of these small dense matrices can be tuned to match the "sweet spot" of the BLAS performance on the underlying architecture. These parameters can be altered in the inquiry routine sp_ienv.
Example Programs In the source code distribution, the EXAMPLE/ directory contains several examples of how to use the driver routines, illustrating the following usages:
● Solve a system once
● Solve different systems with the same A, but different right-hand sides
● Solve different systems with the same sparsity pattern of A
● Solve different systems with the same sparsity pattern and similar numerical values of A
Except for the case of one-time solution, all the other examples can reuse some of the data structures obtained from a previous factorization, and hence save some time compared with factorizing A from scratch. The users can easily modify these examples to fit their needs.
Differences Between SuperLU/SuperLU_MT and SuperLU_DIST
Numerical Pivoting Both sequential SuperLU and SuperLU_MT use partial pivoting with a diagonal threshold. The row permutation Pr is determined dynamically during factorization. At the jth column, let amj be a largest entry in magnitude on or below the diagonal of the partially factored A: |amj| = max over i ≥ j of |aij|. Depending on a threshold u (0.0 ≤ u ≤ 1.0) selected by the user, the code may use the diagonal entry ajj as the pivot in column j as long as |ajj| ≥ u·|amj| and ajj ≠ 0, or else use amj. If the user sets u = 1.0, amj (or an equally large entry) will be used as the pivot; this corresponds to classical partial pivoting. If the user has ordered the matrix so that choosing diagonal pivots is particularly good for sparsity or parallelism, then smaller values of u tend to choose those diagonal pivots, at the risk of less numerical stability. Selecting u = 0.0 guarantees that the pivot on the diagonal will be chosen, unless it is zero. The code can also use a user-input Pr to choose pivots, as long as each pivot satisfies the threshold for each column. The backward error bound BERR measures how much stability is actually lost. It is hard to get satisfactory execution speed with partial pivoting on distributed-memory machines, because of the fine-grained communication and the dynamic data structures required. SuperLU_DIST uses a static pivoting strategy, in which Pr is chosen before factorization, based solely on the values of the original A, and remains fixed during factorization. A maximum weighted matching algorithm and the code MC64 developed by Duff and Koster [] is currently employed. The algorithm chooses Pr to maximize the product of the diagonal entries, and chooses Dr and Dc simultaneously so that each diagonal entry of Pr Dr A Dc is ±1 and each off-diagonal entry is bounded by 1 in magnitude. On the basis of empirical evidence, when this strategy is combined with diagonal scaling, setting very tiny pivots to larger values, and iterative refinement, the algorithm is as stable as partial pivoting for most matrices that have occurred in the actual applications. The detailed numerical experiments can be found in [].
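The per-column threshold test can be written out explicitly. The following sketch operates on a dense copy of the candidate column, purely for illustration; it is not the SuperLU implementation:

import numpy as np

def choose_pivot(col, j, u):
    """Pick the pivot row for column j under threshold partial pivoting.

    col holds the current (partially factored) column; candidate pivots are
    rows j..n-1. u = 1.0 reproduces classical partial pivoting, while
    u = 0.0 keeps the diagonal entry unless it is exactly zero."""
    m = j + int(np.argmax(np.abs(col[j:])))   # row of the largest candidate
    if col[j] != 0.0 and abs(col[j]) >= u * abs(col[m]):
        return j                              # diagonal entry passes the test
    return m

col = np.array([0.0, 0.0, 0.5, -3.0, 1.0])
print(choose_pivot(col, 2, 1.0))  # 3: classical partial pivoting takes the largest entry
print(choose_pivot(col, 2, 0.1))  # 2: the diagonal 0.5 satisfies 0.5 >= 0.1 * 3.0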
Sparsity-Preserving Reordering For unsymmetric factorizations, preordering for sparsity is less understood than that for Cholesky factorization. Many unsymmetric ordering methods use symmetric ordering techniques, either minimum degree or nested dissection, applied to a symmetrized matrix (e.g., AᵀA or Aᵀ + A). This attempts to minimize certain upper bounds on the actual fills. Which symmetrized matrix to use strongly depends on how the numerical pivoting is performed. In sequential SuperLU and SuperLU_MT, where partial pivoting is used, an AᵀA-based ordering algorithm is preferable. This is because the nonzero pattern of the Cholesky factor R in AᵀA = RᵀR is a superset of the nonzero patterns of Lᵀ and U in Pr A = LU, for any choice of row interchanges []. Therefore, a good symmetric ordering Pc on AᵀA that preserves the sparsity of R can be applied to the columns of A, forming APcᵀ, so that the LU factorization of APcᵀ is likely to be sparser than that of the original A. In SuperLU_DIST, an a priori row permutation Pr is computed to form Pr A. With fixed Pr, an (Aᵀ + A)-based ordering algorithm is preferable. This is because the symbolic Cholesky factor of Aᵀ + A is a much tighter upper bound on the structures of L and U than that of AᵀA when the pivots are chosen on the diagonal. Note that after Pc is chosen, a symmetric permutation Pc(Pr A)Pcᵀ is performed so that the diagonal entries of the permuted matrix remain the same as those in Pr A, and they are larger in magnitude than the off-diagonal entries. Now the final row permutation is Pc Pr. In all three libraries, the user can choose one of the following ordering methods by setting the options.ColPerm option:
● NATURAL: Natural ordering
● MMD_ATA: Multiple Minimum Degree [] applied to the structure of AᵀA
● MMD_AT_PLUS_A: Multiple Minimum Degree applied to the structure of Aᵀ + A
● METIS_ATA: MeTiS [] applied to the structure of AᵀA
● METIS_AT_PLUS_A: MeTiS applied to the structure of Aᵀ + A
● PARMETIS: ParMeTiS [] applied to the structure of Aᵀ + A
● COLAMD: Column Approximate Minimum Degree []
● MY_PERMC: Use a permutation Pc supplied by the user as input
COLAMD is designed particularly for unsymmetric matrices when partial pivoting is needed, and does not require explicitly forming AᵀA. It usually gives orderings comparable to MMD on AᵀA, and is faster. The purpose of the last option, MY_PERMC, is to be able to reap the results of active research in ordering methods. Recently, there has been much research on orderings based on graph partitioning. The user can invoke those ordering algorithms separately, and then input the ordering in the permutation vector for Pc. The user may apply them to the structures of AᵀA or Aᵀ + A. The routines getata() and at_plus_a() in the file get_perm_c.c can be used to form AᵀA or Aᵀ + A.
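With a recent SciPy (whose splu routine wraps sequential SuperLU and exposes a subset of these orderings through its permc_spec argument), the effect of the column ordering on fill can be observed directly; the random test matrix below is arbitrary:

from scipy.sparse import identity, random as sparse_random
from scipy.sparse.linalg import splu

# Arbitrary sparse test matrix, shifted by the identity so it is nonsingular.
A = (sparse_random(200, 200, density=0.02, format="csc", random_state=0)
     + identity(200, format="csc")).tocsc()

for ordering in ("NATURAL", "MMD_ATA", "MMD_AT_PLUS_A", "COLAMD"):
    lu = splu(A, permc_spec=ordering)
    fill = lu.L.nnz + lu.U.nnz   # nonzeros in the computed factors
    print(f"{ordering:15s} nnz(L) + nnz(U) = {fill}")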
Task Ordering The Gaussian elimination algorithm can be organized in different ways, such as left-looking (fan-in) or right-looking (fan-out). These variants are mathematically equivalent under the assumption that the floating-point operations are associative (approximately true), but they have very different memory access and communication patterns. The pseudo-code for the left-looking blocking algorithm is given in Algorithm 1.
Algorithm 1 Left-looking Gaussian elimination
for block K = 1 to N do
  (1) Compute U(1 : K − 1, K) (via a sequence of triangular solves)
  (2) Update A(K : N, K) ← A(K : N, K) − L(K : N, 1 : K − 1) · U(1 : K − 1, K) (via a sequence of calls to GEMM)
  (3) Factorize A(K : N, K) → L(K : N, K) (may involve pivoting)
end for
SuperLU and SuperLU_MT use the left-looking algorithm, which has the following advantages:
● In each step, the sparsity changes are restricted to the Kth block column instead of the entire trailing submatrix, which makes it relatively easy to accommodate the dynamic compressed data structures required by partial pivoting.
● There are more memory read operations than write operations in Algorithm 1. This is better for most modern cache-based computer architectures, because writes tend to be more expensive in order to maintain cache coherence.
The pseudo-code for the right-looking blocking algorithm is given in Algorithm 2.
Algorithm 2 Right-looking Gaussian elimination
for block K = 1 to N do
  (1) Factorize A(K : N, K) → L(K : N, K) (may involve pivoting)
  (2) Compute U(K, K + 1 : N) (via a sequence of triangular solves)
  (3) Update A(K + 1 : N, K + 1 : N) ← A(K + 1 : N, K + 1 : N) − L(K + 1 : N, K) · U(K, K + 1 : N) (via a sequence of calls to GEMM)
end for
SuperLU_DIST uses the right-looking algorithm, mainly for scalability considerations:
● The sparsity pattern and data structure can be determined before numerical factorization because of static pivoting.
● The right-looking algorithm fundamentally has more parallelism: at step (3) of Algorithm 2, all the GEMM updates to the trailing submatrix are independent and so can be done in parallel. On the other hand, each step of the left-looking algorithm involves operations that need to be carefully sequenced, which requires a sophisticated pipelining mechanism to exploit parallelism across multiple loop steps.
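The update structure of the right-looking variant (Algorithm 2) can be made concrete on a dense matrix. The following NumPy sketch omits pivoting, so it assumes a matrix whose leading blocks are safely factorizable (here enforced by a heavy diagonal); it illustrates the blocked updates and is not the SuperLU_DIST kernel:

import numpy as np

def right_looking_blocked_lu(A, nb):
    """Blocked LU without pivoting: returns a matrix whose strict lower
    triangle holds L (unit diagonal implied) and whose upper triangle holds U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # (1) factorize the panel A[k:n, k:e] column by column
        for j in range(k, e):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:e] -= np.outer(A[j + 1:, j], A[j, j + 1:e])
        if e < n:
            # (2) triangular solves for the block row of U
            L_kk = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L_kk, A[k:e, e:])
            # (3) GEMM update of the trailing submatrix; the block updates
            # are mutually independent and could proceed in parallel
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

rng = np.random.default_rng(1)
M = rng.standard_normal((8, 8)) + 10.0 * np.eye(8)  # heavy diagonal: no pivoting needed
F = right_looking_blocked_lu(M, nb=3)
L = np.tril(F, -1) + np.eye(8)
U = np.triu(F)
print(np.allclose(L @ U, M))  # True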
Parallelization and Performance SuperLU_MT is designed for parallel machines with a shared address space, so there is no need to partition the matrices. Matrices A, L, and U are stored in separate compressed formats. The parallel elimination uses an asynchronous and barrier-free scheduling algorithm to schedule two types of parallel tasks to achieve a high degree of concurrency. One such task is factorizing the independent panels in the disjoint subtrees of the column elimination tree []. Another task is updating a panel by previously computed supernodes. The scheduler facilitates the smooth transition between the two types of tasks, and maintains load balance dynamically. In symbolic factorization, a non-blocking algorithm is used to perform depth-first search and symmetric pruning in parallel. The code achieved over tenfold speedups on a number of earlier SMP machines with processors []. Recent evaluation shows that SuperLU_MT performs very well on current multithreaded, multicore machines; it achieved over -fold speedup on a core, thread Sun VictoriaFalls []. The design of SuperLU_DIST is drastically different from SuperLU/SuperLU_MT. Many design choices were driven by the need to scale to a large process count. The input sparse matrix A is divided by block rows, with each process holding one block row represented in a local row-compressed format. This format is user-friendly and is compatible with the input interfaces of much other distributed-memory sparse matrix software. The factored L and U matrices, on the other hand, are distributed by a two-dimensional block-cyclic layout using the supernodal structure for the block partition. This distribution ensures that most (if not all) processors are engaged in the right-looking update at each block elimination step, and also ensures that interprocess communication is restricted to a process row set or column set. The right-looking factorization uses elimination DAGs to identify task and data dependencies, and a one-step look-ahead scheme to overlap communication with computation. A distributed parallel symbolic factorization algorithm is designed so that there is no need to gather the entire graph on a single node, which largely increases memory scalability []. SuperLU_DIST has achieved - to -fold speedups with sufficiently large matrices, and over half a Teraflops factorization rate [].
Future Directions There are still many open problems in the development of high-performance algorithms and software for sparse direct methods. The current architecture trend shows that the Chip Multiprocessor (CMP) will be the basic building block for computer systems ranging from laptops to extreme high-end supercomputers. The core count per chip will quickly increase from four to eight today to tens in the near future. Given the limited per-chip memory and memory bandwidth, the standard parallelization procedure based on MPI would suffer from serious resource contention. It becomes essential to consider a hybrid model of parallelism at the algorithm level as well as the programming level. The quantitative multicore evaluation of SuperLU shows that the left-looking algorithm in SuperLU_MT consistently outperforms the right-looking algorithm in SuperLU_DIST on a single node of the recent CMP systems, mainly because the former incurs much less memory traffic []. One new design is to combine the two algorithms: partitioning the matrix into larger panels, performing left-looking intra-chip elimination and right-looking inter-chip elimination. A second new direction is to exploit the property of low numerical rank in many discretized operators so that a specialized Gaussian elimination algorithm can be designed. For example, it has been shown recently that semi-separable structures occur in the pivoting block and the corresponding Schur complement of each block factorization step. Thus, using the compressed, semi-separable representation throughout the entire factorization leads to an approximate factorization algorithm that has nearly linear time and space complexity [, ]. For many discretized elliptic PDEs, this approximation is sufficiently accurate and can be used as an optimal direct solver. For more general problems, the factorization can be used as an effective preconditioner. Because of its asymptotically lower data volume compared with conventional algorithms, the amount of memory-to-processor and inter-processor communication is smaller, making it more amenable to a scalable implementation. Another promising area is to extend the new communication-avoiding dense LU and QR factorizations to sparse factorizations []. In conventional algorithms, the panel factorization of a tall-skinny
submatrix requires a sequence of fine-grained message transfers, and often lies on the critical path of parallel execution. The new method employs a divide-and-conquer scheme for this phase, which has asymptotically less communication. This method should also work for sparse matrices. In particular, the new pivoting strategy for LU can replace the static pivoting currently used in SuperLU, leading to a more stable and scalable solver.
Related Entries
Chaco
LAPACK
METIS and ParMETIS
Mumps
PARDISO
PETSc (Portable, Extensible Toolkit for Scientific Computation)
Preconditioners for Sparse Iterative Methods
Reordering
ScaLAPACK
Sparse Direct Methods
Bibliography . Barrett R, Berry M, Chan TF, Demmel J, Donato J, Dongarra J, Eijkhout V, Pozo R, Romine C, van der Vorst H () Templates for the solution of linear systems: building blocks for the iterative methods. SIAM, Philadelphia, PA . Davis TA, Gilbert JR, Larimore S, Ng E () A column approximate minimum degree ordering algorithm. ACM Trans Math Softw ():– . Demmel JW, Eisenstat SC, Gilbert JR, Li XS, Liu JWH () A supernodal approach to sparse partial pivoting. SIAM J Matrix Anal Appl ():– . Demmel JW, Gilbert JR, Li XS () An asynchronous parallel supernodal algorithm for sparse gaussian elimination. SIAM J Matrix Anal Appl ():– . Demmel JW, Gilbert JR, Li XS () SuperLU users’ guide. technical report LBNL-, Lawrence Berkeley National Laboratory, September . http://crd.lbl.gov/~xiaoye/SuperLU/. Last update: September . Duff IS, Koster J () The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J Matrix Anal Appl ():– . BLAS Technical Forum () Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard I. Int J High Perform Comput Appl :–
. BLAS Technical Forum () Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard II. Int J High Perform Comput Appl :– . George A, Liu J, Ng E () A data structure for sparse QR and LU factorizations. SIAM J Sci Stat Comput :– . Gilbert JR, Ng E () Predicting structure in nonsymmetric sparse matrix factorizations. In: George A, Gilbert JR, Liu JWH (eds) Graph theory and sparse matrix computation. SpringerVerlag, New York, pp – . Grigori L, Demmel J, Xiang H () Communication-avoiding Gaussian elimination. In: Supercomputing , Austin, TX, November –, . . Grigori L, Demmel JW, Li XS () Parallel symbolic factorization for sparse LU with static pivoting. SIAM J Sci Comput ():– . Gu M, Li XS, Vassilevski P () Direction-preserving and schur-monotonic semi-separable approximations of symmetric positive definite matrices. SIAM J Matrix Anal Appl ():– . Karypis G, Kumar V () MeTiS – a software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices – version .. University of Minnesota, September . http://wwwusers.cs.umn.edu/ karypis/metis/. Accessed . Karypis G, Schloegel K, Kumar V () ParMeTiS: Parallel graph partitioning and sparse matrix ordering library – version .. University of Minnesota. http://www- users.cs.umn.edu/∼karypis/ metis/parmetis/. Accessed . Li XS () Evaluation of sparse factorization and triangular solution on multicore architectures. In: Proceedings of VECPAR th international meeting high performance computing for computational science, Toulouse, France, June –, . Li XS (Sept ) An overview of SuperLU: algorithms, implementation, and user interface. ACM Trans Math Softw (): – . Li XS () Sparse direct methods on high performance computers. University of California, Berkeley, CS Lecture Notes . Li XS, Demmel JW (June ) SuperLU DIST: a scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Trans Math Softw ():– . Li XS, Shao M () A supernodal approach to imcomplete LU factorization with partial pivoting. ACM Trans Math Softw () . Liu JWH () Modification of the minimum degree algorithm by multiple elimination. ACM Trans Math Softw :– . Xia J, Chandrasekaran S, Gu M, Li XS () Superfast multifrontal method for large structured linear systems of equations. SIAM J Matrix Anal Appl , ():–
Tiling
Superscalar Processors
Wen-mei Hwu
University of Illinois at Urbana-Champaign, Urbana, IL, USA
Synonyms Multiple-instruction issue; Out-of-order execution processors
Definition A superscalar processor is designed to achieve an execution rate of more than one instruction per clock cycle for a single sequential program.
Discussion Introduction Superscalar processor design typically refers to a set of techniques that allow the central processing unit (CPU) of a computer to achieve a throughput of more than one instruction per cycle while executing a single sequential program. While there is no universal agreement on the definition, superscalar design techniques typically include parallel instruction decoding, parallel register renaming, speculative execution, and out-of-order execution. These techniques are typically employed along with complementary design techniques such as pipelining, caching, branch prediction, and multi-core in modern microprocessor designs. A typical superscalar processor today is the Intel Core i7 processor based on the Nehalem microarchitecture. There are multiple processor cores in a Core i7 design, where each processor core is a superscalar processor. Each core performs parallel decoding of IA-32 (x86) instructions, performs parallel register renaming to map the x86 registers used by these instructions to a larger set of physical registers, performs speculative execution of instructions beyond conditional branch instructions and potential exception-causing instructions, and allows instructions to execute out of their program-specified order while maintaining the appearance of in-order completion. These techniques are accompanied and supported by pipelining, instruction caching, data caching, and branch prediction in the design of each processor core.
Instruction-Level Parallelism and Dependences Superscalar processor design assumes the existence of instruction-level parallelism, the phenomenon that multiple instructions can be executed independently of each other at each point in time. Instruction-level parallelism arises due to the fact that instructions in an execution phase of a program often read and write different data, so their executions do not affect each other. In Fig. , Instruction A is a memory load instruction that forms its address from the contents of register 1 (r1), accesses the data in the addressed location, and deposits the accessed data into register 2 (r2). Instruction B is a memory load instruction that forms its address by adding 4 to the contents of register 1 (r1), accesses the data in the addressed location, and deposits the accessed data into register 3 (r3). Instructions A and B can be executed independently of each other. Neither of their execution results is affected by the other. With sufficient execution resources, a superscalar processor can execute A and B together and achieve an execution rate of more than one instruction per clock cycle. Note that Instruction C cannot be executed independently of Instruction A or Instruction B. This is because Instruction C uses the data in register 2 (r2) and register 3 (r3) as its input. These data are loaded from memory by Instruction A and Instruction B. That is, the execution of Instruction C depends on that of Instruction A and Instruction B. Such dependences limit the amount of instruction-level parallelism that exists in a program. In general, data dependences arise in programs due to the way instructions read from and write into register and memory storage locations. Data dependences occur between instructions in three forms.
A: r2 ← Load MEM[r1+0]
B: r3 ← Load MEM[r1+4]
C: r2 ← r2 + r3
D: r3 ← Load MEM[r1+8]

Superscalar Processors. Fig. Code example for register access data dependences
We use the code example in Fig. to illustrate these forms of data dependences:
1. Data flow dependence: the destination register of Instruction A is the same as one of the source registers of Instruction C, and C follows A in the sequential program order. In this case, a subsequent instruction consumes the execution result of a previous instruction.
2. Data antidependence: one of the source operands of Instruction C is the same as the destination register of Instruction D, and D follows C in the sequential program order. In this case, Instruction C should receive the result of Instruction B. However, if Instruction D is executed too soon, C may receive the execution result of Instruction D, which is too new. That is, a subsequent instruction overwrites one of the source registers of a previous instruction.
3. Data output dependence: the destination register of Instruction B is the same as the destination register of Instruction D, and Instruction D follows B in the sequential program order. In this case, a subsequent instruction overwrites the destination register of a previous instruction. If D is executed too soon, its result could be overwritten by Instruction B, leaving a result that is too old in the destination register. Subsequent instructions would then use stale results.
A superscalar processor uses register renaming and out-of-order execution techniques to detect and enhance the amount of instruction-level parallelism between instructions so that it can execute multiple instructions per clock cycle. These techniques ensure that all instructions acquire the appropriate input values in the presence of all the parallel execution activities and data dependences. They also make sure that the output values of instructions are reflected correctly in the processor registers and memory locations.
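The three forms can be checked mechanically. The following self-contained C sketch is an illustration added here (it is not part of the original entry): it encodes the four instructions of the figure by their destination and source registers and reports every flow, anti-, and output dependence between an instruction and a later one.

#include <stdio.h>

/* Each instruction is described by its destination register and up to
   two source registers; 0 means "no source register". */
typedef struct { const char *name; int dst, src1, src2; } Instr;

int main(void) {
    /* The four instructions of the figure:
       A: r2 <- Load MEM[r1+0]   B: r3 <- Load MEM[r1+4]
       C: r2 <- r2 + r3          D: r3 <- Load MEM[r1+8]                   */
    Instr prog[] = {
        { "A", 2, 1, 0 },
        { "B", 3, 1, 0 },
        { "C", 2, 2, 3 },
        { "D", 3, 1, 0 },
    };
    int n = (int)(sizeof prog / sizeof prog[0]);

    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {   /* j follows i in program order */
            Instr a = prog[i], b = prog[j];
            if (a.dst == b.src1 || a.dst == b.src2)     /* read after write  */
                printf("flow dependence   %s -> %s on r%d\n", a.name, b.name, a.dst);
            if (b.dst == a.src1 || b.dst == a.src2)     /* write after read  */
                printf("antidependence    %s -> %s on r%d\n", a.name, b.name, b.dst);
            if (a.dst == b.dst)                         /* write after write */
                printf("output dependence %s -> %s on r%d\n", a.name, b.name, a.dst);
        }
    }
    return 0;
}

Running it reports, among others, the flow dependences of C on A and B, the antidependence between C and D on r3, and the output dependence between B and D on r3, matching the discussion above.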
Register Renaming Register renaming is a technique that eliminates register antidependences and output dependences in order to increase instruction-level parallelism. A register renaming mechanism provides a physical or implementation register file that is larger than the architectural register file. For example, the IA-32 (x86) architecture specifies eight general-purpose registers, whereas the register
renaming mechanism of a superscalar processor typically provides a much larger number of physical registers. At any time, each architectural register is mapped to one or more of the physical registers. By mapping an architectural register to multiple physical registers, one can eliminate apparent antidependences and output dependences between instructions. For example, a register renaming mechanism may map architectural register r3 to one physical register (call it prB) for the destination operand of Instruction B and the second source operand of Instruction C in Fig. . That is, Instruction B will deposit its result to prB and Instruction C will fetch its second input operand from prB. As long as the producer of the data and all the consumers of the data are redirected to the same physical register, the execution results will not be affected. Let us further assume that architectural register r3 is mapped to a different physical register (call it prD) for the destination operand of Instruction D. That is, Instruction D will deposit its execution result to prD. As long as all subsequent uses of the data produced by Instruction D are redirected to prD for their input, their execution results will remain the same. The reader should see that the antidependence between Instruction C and Instruction D has been eliminated by the register renaming mechanism. Since Instruction C will go to prB for its input whereas Instruction D will deposit its result into prD, Instruction D will no longer overwrite the input of Instruction C no matter how soon it is executed. As a result, we have eliminated a dependence constraint between Instruction C and Instruction D. The reader should also notice that the output dependence between Instruction B and Instruction D has been eliminated by the register renaming mechanism. Since Instruction B deposits its result into prB and Instruction D deposits its result into prD, B will no longer overwrite the result of D no matter how soon D is executed. As long as subsequent instructions that use architectural register r3 are also directed to prD, they will see the updated result rather than a stale result even though B could be executed after D. By eliminating the register antidependence between C and D and the output dependence between B and D, Instruction D can now be executed in parallel with Instructions A and B. This increases the level of
instruction-level parallelism. This is the reason why register renaming has become an essential mechanism for superscalar processor design.
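As a rough, hedged illustration of the idea (not the register alias table of any particular processor), the following C sketch renames every destination write in the figure's code to a fresh physical register through a simple alias table. The counts of architectural and physical registers and the pr numbering are arbitrary choices made for the example.

#include <stdio.h>

#define NUM_ARCH 8      /* assumed number of architectural registers        */
#define NUM_PHYS 32     /* assumed number of physical registers             */

typedef struct { const char *name; int dst, src1, src2; } Instr;

int main(void) {
    /* Same four instructions as in the figure; 0 means "no source".        */
    Instr prog[] = {
        { "A", 2, 1, 0 }, { "B", 3, 1, 0 },
        { "C", 2, 2, 3 }, { "D", 3, 1, 0 },
    };
    int n = (int)(sizeof prog / sizeof prog[0]);

    int rat[NUM_ARCH];              /* register alias table: arch -> phys   */
    int next_phys = 0;
    for (int r = 0; r < NUM_ARCH; r++)
        rat[r] = next_phys++;       /* initial one-to-one mapping           */

    for (int i = 0; i < n; i++) {
        /* Sources are renamed through the current mapping ...              */
        int ps1 = prog[i].src1 ? rat[prog[i].src1] : -1;
        int ps2 = prog[i].src2 ? rat[prog[i].src2] : -1;
        /* ... and every destination write receives a fresh physical reg.   */
        rat[prog[i].dst] = next_phys++;
        int pd = rat[prog[i].dst];

        printf("%s: r%d <- op(", prog[i].name, prog[i].dst);
        if (ps1 >= 0) printf("r%d", prog[i].src1);
        if (ps2 >= 0) printf(", r%d", prog[i].src2);
        printf(")   renamed to   pr%d <- op(", pd);
        if (ps1 >= 0) printf("pr%d", ps1);
        if (ps2 >= 0) printf(", pr%d", ps2);
        printf(")\n");
    }
    return 0;
}

In the output, B and D write different physical registers and C reads B's physical register, so the antidependence between C and D and the output dependence between B and D disappear, exactly as argued above.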
Speculative Execution A major limitation on the amount of instruction-level parallelism is uncertainty in the program execution sequence. There are two major sources of such uncertainty. The first is conditional and indirect branch instructions. Conditional branches are used to implement control constructs such as if-then-else statements and loops in high-level language programs. Depending on the condition values, the instructions to be executed after a conditional branch can be either from the next sequential location or from a target location specified by the branch offset. Indirect branches use the contents of a register or memory location as the address of the next instruction to execute. Indirect branches are used to implement procedure return statements, case statements, and table-based procedure calls in high-level language programs. While conditional and indirect branch instructions serve essential purposes in implementing high-level languages, they introduce uncertainties for execution. The classic approach to addressing this problem is branch prediction, where a mechanism is used to predict the next instructions to execute when conditional and indirect branches are encountered during program execution. However, prediction alone is not sufficient. One needs a means to speculatively execute the instructions on the predicted path of execution and recover from any incorrect predictions. The second source of uncertainty in program execution is exception conditions. Modern computers use exceptions to support the implementation of virtual memory management, memory protection, and rare execution condition handling. For example, Instruction A in Fig. may trigger a page fault in its execution and require operating system service to bring in its missing load data. The execution needs to be able to resume cleanly after the page fault is handled by the operating system. This requirement means that the instructions after a load instruction, or any instruction that can potentially cause exceptions, cannot change the execution state in a way that prevents the execution from restarting at the exception-causing instruction.
Like in the case of branch prediction, one can “predict” that the exception conditions do not occur and assume that the execution will simply continue down the current path. However, one needs a means to recover the state when an exception indeed occurs so that the execution can correctly restart from the exception-causing instruction. Speculative execution is a mechanism that allows processors to fetch and execute instructions down a predicted path and to recover if the prediction is incorrect. In general, these mechanisms use buffers to keep both the original state and the recent updates to the state. The updated state is used during the speculative execution. The original state is used if the processor needs to recover from an incorrect prediction. Since conditional and indirect branches occur frequently, one in every four to five instructions on average, the level of instruction level parallelism a superscalar processor can exploit would be extremely low without speculative execution. This is why speculative execution has become an essential mechanism in superscalar processor design. The most popular methods for recovering from incorrect branch predictions are reorder buffer (ROB) and checkpointed register file. The most popular method for recovering from exception causing instructions is reorder buffer. A retirement mechanism in speculative execution attempts to make the effects of instruction execution permanent. An instruction is eligible for retirement when all instructions before it have retired. An eligible instruction is then checked if it has caused any exceptions or incorrect branch prediction. If the instruction does not incur any exception or incorrect branch prediction, it can commit its execution result into an architectural state. Otherwise, the instruction triggers a recovery and the state of the processor is restored to a previous architectural state. The capacity of reorder buffers and checkpoint buffers defines the notion of instruction window, a collection of consecutive instructions that are actively processed by a superscalar processor. At any point in time, the instruction window starts with the oldest instruction that has not completed execution and ends with the youngest instruction that has started execution. The larger the instruction window, the more hardware is needed to keep track of the execution status
of all instructions in the window and the information needed to recover the processor state if anything goes wrong with the execution of the instructions in the window.
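The following is a minimal C sketch of in-order retirement from a reorder buffer, assuming a heavily simplified entry format (only done/exception/mispredict flags); real processors track far more state, so this only illustrates the commit-or-recover decision described above.

#include <stdio.h>
#include <stdbool.h>

#define ROB_SIZE 8

typedef struct {
    bool valid;        /* entry holds an in-flight instruction          */
    bool done;         /* instruction has finished executing            */
    bool exception;    /* instruction raised an exception               */
    bool mispredict;   /* instruction is a mispredicted branch          */
} RobEntry;

typedef struct {
    RobEntry e[ROB_SIZE];
    int head;          /* oldest instruction (next to retire)           */
} Rob;

/* Retire instructions in program order. Stop at the first instruction
   that is not finished, or that requires recovery.                     */
static void retire(Rob *rob) {
    while (rob->e[rob->head].valid && rob->e[rob->head].done) {
        RobEntry *x = &rob->e[rob->head];
        if (x->exception || x->mispredict) {
            /* Recovery: squash everything younger than this entry and
               restore the last committed architectural state.          */
            printf("entry %d triggers recovery\n", rob->head);
            for (int i = 0; i < ROB_SIZE; i++) rob->e[i].valid = false;
            return;
        }
        printf("entry %d commits its result to architectural state\n",
               rob->head);
        x->valid = false;
        rob->head = (rob->head + 1) % ROB_SIZE;
    }
}

int main(void) {
    Rob rob = {0};
    /* Three instructions allocated in order; the middle one completed
       with a branch misprediction.                                     */
    rob.e[0] = (RobEntry){ true, true,  false, false };
    rob.e[1] = (RobEntry){ true, true,  false, true  };
    rob.e[2] = (RobEntry){ true, false, false, false };
    retire(&rob);
    return 0;
}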
Out-of-Order Execution In a superscalar processor, instructions are fetched according to their sequential program order. However, this may not be the best order of execution. For example, in Fig. , Instruction D is fetched after Instruction C. However, Instruction C cannot execute until Instruction A and Instruction B complete their execution. On the other hand, with register renaming, Instruction D does not have any dependence on Instruction A, Instruction B, or Instruction C. Therefore, a superscalar processor "reorders" the execution of Instruction C and Instruction D so that Instruction D can produce results for its consumers as soon as possible. An out-of-order execution mechanism provides buffering for instructions that need to wait for their input data so that some of their subsequent instructions can proceed with execution. A popular method used in out-of-order execution is Tomasulo's algorithm, which was originally used in the IBM System/360 Model 91. The method maintains reservation stations, hardware buffers that allow the instructions to wait for their input operands.
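The C sketch below is a heavily simplified rendering of the reservation-station idea (it is not the IBM 360/91 logic): each entry waits on the tags of its producers, and in each simulated cycle every entry whose operands are ready "executes" and broadcasts its tag so that waiting entries wake up.

#include <stdio.h>
#include <stdbool.h>

#define READY (-1)          /* operand value is available                   */

typedef struct {
    const char *name;
    int qj, qk;             /* tags of producers still pending, or READY    */
    bool issued;
} RS;                       /* one reservation-station entry                */

int main(void) {
    /* Tags 0..3 identify the producing entries A..D. After renaming,
       C waits for A (tag 0) and B (tag 1); A, B, and D wait on nothing.    */
    RS rs[] = {
        { "A", READY, READY, false },
        { "B", READY, READY, false },
        { "C", 0,     1,     false },
        { "D", READY, READY, false },
    };
    int n = (int)(sizeof rs / sizeof rs[0]);

    for (int cycle = 1; cycle <= 4; cycle++) {
        int fired[4], nf = 0;                /* at most n entries per cycle  */
        /* Dispatch every entry whose operands are ready this cycle.        */
        for (int i = 0; i < n; i++) {
            if (!rs[i].issued && rs[i].qj == READY && rs[i].qk == READY) {
                rs[i].issued = true;
                fired[nf++] = i;
                printf("cycle %d: %s executes\n", cycle, rs[i].name);
            }
        }
        if (nf == 0) break;
        /* "Broadcast" the results: clear the matching operand tags.        */
        for (int f = 0; f < nf; f++)
            for (int i = 0; i < n; i++) {
                if (rs[i].qj == fired[f]) rs[i].qj = READY;
                if (rs[i].qk == fired[f]) rs[i].qk = READY;
            }
    }
    return 0;
}

In cycle 1, A, B, and D execute together; C executes in cycle 2 once the broadcast results of A and B arrive, illustrating the reordering of C and D described above.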
Brief Early History The most popular out-of-order execution mechanism in modern superscalar processors is Tomasulo's algorithm [], designed by Bob Tomasulo and used in the IBM System/360 Model 91 floating-point unit in 1967. The out-of-order execution mechanism was abandoned in later IBM machines, partly due to concern over the problems it introduces for virtual memory management. The Intel Pentium processor [] is an early superscalar design that fetches and executes multiple instructions at each clock cycle. It did not employ register renaming or speculative execution and was able to exploit only a very limited amount of instruction-level parallelism. Register renaming in superscalar processor design started with the Register Alias Table (RAT) by Patt et al. []. To this day, the register renaming structure used in Intel superscalar processors is still called the RAT.
Smith and Pleszkun proposed reorder buffers and history buffers for recovering from exceptions in highly pipelined processors []. Hwu et al. [] extended the concepts with checkpointing and incorporated these speculative execution mechanisms with Tomasulo's algorithm. Patt et al. [] proposed a comprehensive superscalar design that incorporates register renaming, speculative execution, and out-of-order execution, along with parallel instruction decode and branch prediction. This is the first comprehensive academic design of superscalar processors. Sohi and Vajapeyam [] proposed a unified reservation station design. These designs were later adopted and refined by Intel to create the Pentium Pro processor, the first commercially successful superscalar processor design []. Recent superscalar processors include processors in the MIPS R, Intel Pentium, IBM POWER, AMD Athlon, and ARM Cortex families. Interested readers should refer to Hennessy and Patterson [] for a more detailed treatment and more recent history of superscalar processor design.
Bibliography
1. Tomasulo R () An efficient algorithm for exploiting multiple arithmetic units. IBM J Res Dev ():–
2. Case B () Intel reveals Pentium implementation details. Microprocessor Report, Mar
3. Patt Y, Hwu W-M, Shebanow M () HPS, a new microarchitecture: rationale and introduction. In: Proceedings of the th annual workshop on microprogramming, Pacific Grove, pp –
4. Smith J, Pleszkun A () Implementation of precise interrupts in pipelined processors. In: Proceedings of the th international symposium on computer architecture, Boston
5. Hwu W-M, Patt Y () Checkpoint repair for out-of-order execution machines. In: Proceedings of the th international symposium on computer architecture, Pittsburgh
6. Sohi G, Vajapeyam S () Instruction issue logic for high-performance, interruptable pipelined processors. In: Proceedings of the th international symposium on computer architecture, New York
7. Colwell R () The Pentium chronicles – the people, passion, and politics behind Intel's landmark chips. Wiley-IEEE Computer Society, ISBN ----
8. Hennessy J, Patterson D () Computer architecture – a quantitative approach, th edn. Morgan Kaufmann, San Francisco, ISBN ----
SWARM: A Parallel Programming Framework for Multicore Processors
David A. Bader, Guojing Cong
Georgia Institute of Technology, Atlanta, GA, USA
IBM, Yorktown Heights, NY, USA
Definition SoftWare and Algorithms for Running on Multi-core (SWARM) is a portable open-source parallel library of basic primitives for programming multicore processors. SWARM is built on POSIX threads and allows the user to use either the already developed primitives or direct thread primitives. SWARM has constructs for parallelization, restricting control of threads, allocation and deallocation of shared memory, and communication primitives for synchronization, replication, and broadcast. Built on these techniques, it contains a higher-level library of multicore-optimized parallel algorithms for list ranking, comparison-based sorting, radix sort, and spanning tree. In addition, SWARM application example codes include efficient implementations for solving combinatorial problems such as minimum spanning tree [], graph decomposition [], breadth-first search [], tree contraction [], and maximum parsimony [].
Motivation For the last few decades, software performance has improved at an exponential rate, primarily driven by the rapid growth in processing power. However, performance improvements can no longer rely solely on Moore’s law. Fundamental physical limitations such as the size of the transistor and power constraints have now necessitated a radical change in commodity microprocessor architecture to multicore designs. Dual and quad-core processors from Intel [] and AMD [] are now ubiquitous in home computing. Also, several novel architectural ideas are being explored for high-end workstations and servers [, ]. Continued software performance improvements on such novel multicore systems now require the exploitation of concurrency at the algorithmic level. Automatic methods for detecting concurrency from sequential codes, for example
with parallelizing compilers, have had only limited success. SWARM was introduced to fully utilize multicore processors. On multicore processors, caching, memory bandwidth, and synchronization constructs have a considerable effect on performance. In addition to time complexity, it is important to consider these factors for algorithm analysis. SWARM assumes the multicore model that can be used to explain performance on systems such as Sun Niagara, Intel, and AMD multicore chips. Different models [] are required for modeling heterogeneous multicore systems such as the Cell architecture.
Model for Multicore Architectures Multicore systems have a number of processing cores integrated onto a single chip [, , , , ]. Typically, the processing cores have their own private L1 cache and share a common L2 cache [, ]. In such a design, the bandwidth between the L2 cache and main memory is shared by all the processing cores. Figure shows the simplified architectural model.
Multicore Model The multicore model (MCM) consists of p identical processing cores integrated onto a single chip. The
processing cores share an L2 cache of size C, and the memory bandwidth is σ.
1. Let T(i) denote the local time complexity of core i, for i = 1, . . . , p. Let T = max_i T(i).
2. Let B be the total number of blocks transferred between the L2 cache and the main memory. The requests may arise from any processing core.
3. Let L be the time required for synchronization between the cores. Let NS(i) be the total number of synchronizations required on core i, for i = 1, . . . , p. Let NS = max_i NS(i).
Then the complexity under the multicore model can be represented by the triple ⟨T, B ⋅ σ⁻¹, NS ⋅ L⟩. The complexity of an algorithm is represented by the dominant element of this triple.
The model proposed above is in many ways similar to the Helman-JáJá model for symmetric multiprocessor (SMP) systems [], with a few important differences. In the case of SMPs, each processor typically has a large L2 cache and dedicated bandwidth to main memory, whereas in multicore systems the shared memory bandwidth is an important consideration. Thus, SWARM explicitly models the cache hierarchy and counts the number of block transfers between the cache and main memory, in a manner similar to Aggarwal and Vitter's external memory model [].
SWARM: A Parallel Programming Framework for Multicore Processors. Fig. Architectural model for multicore systems
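As a small illustration of how the triple ⟨T, B ⋅ σ⁻¹, NS ⋅ L⟩ might be evaluated for a concrete algorithm, the following C sketch computes the three terms and picks the dominant one; the parameter values are hypothetical and not from the text.

#include <stdio.h>

/* Multicore-model cost of an algorithm: computation time T, memory term
   B/sigma, and synchronization term NS*L.                                  */
typedef struct { double T, mem, sync; } McmCost;

static McmCost mcm_cost(double T, double B, double sigma, double NS, double L) {
    McmCost c = { T, B / sigma, NS * L };
    return c;
}

static double dominant(McmCost c) {
    double d = c.T;
    if (c.mem  > d) d = c.mem;
    if (c.sync > d) d = c.sync;
    return d;
}

int main(void) {
    /* Hypothetical parameter values, for illustration only.                */
    McmCost c = mcm_cost(1.0e6, 2.0e5, 4.0, 100.0, 500.0);
    printf("<T, B/sigma, NS*L> = <%g, %g, %g>, dominant term = %g\n",
           c.T, c.mem, c.sync, dominant(c));
    return 0;
}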
SWARM targets three primary issues that affect performance on multicore systems:
1. Number of processing cores: Current systems have two to eight cores integrated on a single chip. Cores typically support features such as simultaneous multithreading (SMT) or hardware multithreading, which allow for greater parallelism and throughput. In future designs, many more cores can exist on a single chip.
2. Caching and memory bandwidth: Memory speeds have historically been increasing at a much slower rate than processor capacity []. Memory bandwidth and latency are important performance concerns for several scientific and engineering applications. Caching is known to drastically affect the efficiency of algorithms even on single-processor systems [, ]. In multicore systems, this will be even more important due to the added bandwidth constraints.
3. Synchronization: Implementing algorithms using multiple processing cores will require synchronization between the cores from time to time, which is an expensive operation in shared memory architectures.
Case Study: Merge Sort
Algorithm 1. In the first algorithm, the input array of length N is equally divided among the p processing cores so that each core gets N/p elements to sort. Once the sorting phase is completed, there are p sorted subarrays, each of length N/p. Thereafter the merge phase takes place: a p-way merge over the runs gives the sorted array. Each processor individually sorts its elements using some cache-friendly algorithm. This approach does not try to minimize the number of blocks transferred between the L2 cache and main memory.
Analysis. Since the p processors are all sorting their respective elements at the same time, the L2 cache is shared by all the cores during the sorting phase. Thus, if the size of the L2 cache is C, then effectively each core can use just a portion of the cache of size C/p. Assuming the input size is larger than the cache size, the cache misses will be p times those incurred if only a single core were sorting. The bandwidth between the cache and the shared main memory is also shared by all the p cores, and this may be a bottleneck. The time complexity of each processor is:
T_C(sort) = (N/p) ⋅ log(N/p)
During the merge phase:
T_C(merge) = N ⋅ log(p)
T_C(total) = (N/p) ⋅ log(N/p) + N ⋅ log(p)
Algorithm 2. This algorithm divides the given array of length N into blocks of size M, where M is less than C, the size of the L2 cache. Each of the N/M blocks is first sorted using all p cores; this is the sorting phase. When the sorting phase is completed, the array consists of N/M runs, each of length M. During the merge phase, p blocks are merged at a time. This process is repeated until a single sorted array is obtained. Thus, the merge phase is carried out log_p(N/M) times.
Analysis. This algorithm is very similar to the I/O-model merge sort []. Thus, this algorithm is optimal in terms of transfers between main memory and the L2 cache. However, it has a slightly higher computational complexity. The p cores sort a total of N/M blocks of size M. Assuming the use of a split-and-merge sort for sorting each block of M elements, the time per core during the sorting phase is:
T_C(sort) = (N/M) ⋅ [ (M/p) ⋅ log(M/p) + M ⋅ log(p) ]
          = (N/p) ⋅ log(M/p) + N ⋅ log(p)
During any merge phase, if blocks of size S are being merged p at a time, the complexity per core is (N/(Sp)) ⋅ Sp ⋅ log(p) = N ⋅ log(p). There are log_p(N/M) merge phases, thus:
T_C(merge) = N ⋅ log(p) ⋅ log_p(N/M)
T_C(total) = (N/p) ⋅ log(M/p) + N ⋅ log(p) ⋅ (1 + log_p(N/M))
Comparison. Algorithm 1 clearly has better time complexity than Algorithm 2. However, Algorithm 2 is optimal in terms of transfers between the L2 cache and shared
main memory. Algorithm analysis using this model captures computational complexity as well as memory performance.
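The two totals can also be compared numerically. The C sketch below simply evaluates the formulas derived above for illustrative values of N, p, and M (the values and the base-2 logarithms are assumptions made here, not taken from the text).

#include <stdio.h>
#include <math.h>       /* link with -lm */

/* Algorithm 1: T_C(total) = (N/p) log(N/p) + N log p                        */
static double alg1_total(double N, double p) {
    return (N / p) * log2(N / p) + N * log2(p);
}

/* Algorithm 2: T_C(total) = (N/p) log(M/p) + N log p (1 + log_p(N/M))       */
static double alg2_total(double N, double p, double M) {
    return (N / p) * log2(M / p)
         + N * log2(p) * (1.0 + log2(N / M) / log2(p));
}

int main(void) {
    double N = 1 << 26;   /* number of elements (assumed)                    */
    double p = 8.0;       /* number of cores (assumed)                       */
    double M = 1 << 20;   /* block size chosen to fit in the L2 cache        */
    printf("Algorithm 1: %.4g model operations per core\n", alg1_total(N, p));
    printf("Algorithm 2: %.4g model operations per core\n", alg2_total(N, p, M));
    return 0;
}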
Programming in SWARM A typical SWARM program is structured as follows:

int main (int argc, char **argv)
{
  SWARM_Init(&argc, &argv);

  /* sequential code */
  ....
  ....

  /* parallelize a routine using SWARM */
  SWARM_Run(routine);

  /* more sequential code */
  ....
  ....

  SWARM_Finalize();
}

In order to use the SWARM library, the programmer needs to make minimal modifications to existing sequential code. After identifying compute-intensive routines in the program, work can be assigned to each core using an efficient multicore algorithm. Independent operations such as those arising in functional parallelism or loop parallelism can typically be threaded. For functional parallelism, this means that each thread acts as a functional process for that function, and for loop parallelism, each thread computes its portion of the computation concurrently. Note that it might be necessary to apply loop transformations to reduce data dependencies between threads. SWARM contains efficient implementations of commonly used primitives in parallel programming.
Data parallel. The SWARM library contains several basic "pardo" directives for executing loops concurrently on one or more processing cores. Typically, this is useful when an independent operation is to be applied to every location in an array, for example element-wise addition of two arrays. Pardo implicitly partitions the loop among the cores without the need for coordinating overheads such as synchronization or communication between the cores. By default, pardo uses block partitioning of the loop assignment values to the threads, which typically results in better cache utilization due to
the array locations on the left-hand side of the assignment being owned by local caches more often than not. However, SWARM explicitly provides both block and cyclic partitioning interfaces for the pardo directive.

/* example: partitioning a "for" loop among the cores */
pardo(i, start, end, incr) {
  A[i] = B[i] + C[i];
}

Control. SWARM control primitives restrict which threads can participate in the context. For instance, the control may be given to a single thread on each core, all threads on one core, or a particular thread on a particular core.

THREADS:  total number of execution threads
MYTHREAD: the rank of a thread, from 0 to THREADS-1

/* example: execute code on thread MYTHREAD */
on_thread(MYTHREAD) {
  ....
  ....
}

/* example: execute code on one thread */
on_one_thread {
  ....
  ....
}

Memory management. SWARM provides two directives, SWARM_malloc and SWARM_free, that, respectively, dynamically allocate a shared structure and release this memory back to the heap.

/* example: allocate a shared array of size n */
A = (int*)SWARM_malloc(n*sizeof(int), TH);

/* example: free the array A */
SWARM_free(A);

Barrier. This construct provides a way to synchronize threads running on the different cores.

/* parallel code */
....
....
/* use the SWARM Barrier for synchronization */
SWARM_Barrier();
/* more parallel code */
....
....

Replicate. This primitive uniquely copies a data buffer for each core.
Scan (reduce). This performs a prefix (reduction) operation with a binary associative operator, such as addition, multiplication, maximum, minimum, bitwise-AND, and bitwise-OR. allreduce replicates the result from reduce for each core.

/* function signatures */
int SWARM_Reduce_i(int myval, reduce_t op, THREADED);
double SWARM_Reduce_d(double myval, reduce_t op, THREADED);
/* example: compute global sum, using partial local values from each core */
sum = SWARM_Reduce_d(mySum, SUM, TH);

Broadcast. This primitive supplies each processing core with the address of the shared buffer by replicating the memory address.

/* function signatures */
int SWARM_Bcast_i (int myval, THREADED);
int* SWARM_Bcast_ip (int* myval, THREADED);
char SWARM_Bcast_c (char myval, THREADED);

Apart from the primitives for computation and communication, the thread-safe parallel pseudorandom number generator SPRNG [] is integrated in SWARM.

Algorithm Design and Examples in SWARM The SWARM library contains a number of techniques to demonstrate key methods for programming on multicore processors.
● The prefix-sum algorithm is one of the most useful parallel primitives and is at the heart of several other primitives, such as array compaction, sorting, segmented prefix-sums, and broadcasting; it also provides a simple use of balanced binary trees (a sketch follows this list).
● Pointer-jumping (or path-doubling) iteratively halves distances in a list or graph. It is used in numerous parallel graph algorithms, and also as a sampling technique.
● Determining the root for each tree node in a rooted directed forest is a crucial step in handling equivalence classes – such as detecting whether or not two nodes belong to the same component; when the input is a linked list, this algorithm also solves the parallel prefix problem.
● An entire family of techniques of major importance in parallel algorithms is loosely termed divide-and-conquer – such techniques decompose the instance into smaller pieces, solve these pieces independently (typically through recursion), and then merge the resulting solutions into a solution to the original instance. These techniques are used in sorting, in almost any tree-based problem, in a number of computational geometry problems (finding the closest pair, computing the convex hull, etc.), and are also at the heart of fast transform methods such as the FFT. The pardo primitive in SWARM can be used for implementing such a strategy.
● A variation of the above theme is the partitioning strategy, in which one seeks to decompose the problem into independent subproblems – and thus avoid any significant work when recombining solutions; quicksort is a celebrated example, but numerous problems in computational geometry can be solved efficiently with this strategy (particularly problems involving the detection of a particular configuration in three- or higher-dimensional space).
● Another general technique for designing parallel algorithms is pipelining. In this approach, waves of concurrent (independent) work are employed to achieve optimality.
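As referenced in the first item above, here is a hedged, generic POSIX-threads sketch of a block-wise parallel prefix sum; it is not SWARM's own implementation, but it shows the three-phase structure (local scan, scan of block totals, offset add) that the pardo and barrier primitives would express. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 16

static int a[N];                    /* in place: a[i] becomes the prefix sum */
static int block_sum[NTHREADS];
static pthread_barrier_t barrier;

typedef struct { int id; } Arg;

static void *scan_worker(void *p) {
    int id = ((Arg *)p)->id;
    int lo = id * N / NTHREADS, hi = (id + 1) * N / NTHREADS;

    /* Phase 1: local (inclusive) prefix sum of this thread's block.        */
    for (int i = lo + 1; i < hi; i++) a[i] += a[i - 1];
    block_sum[id] = a[hi - 1];
    pthread_barrier_wait(&barrier);

    /* Phase 2: one thread scans the per-block totals.                      */
    if (id == 0)
        for (int t = 1; t < NTHREADS; t++) block_sum[t] += block_sum[t - 1];
    pthread_barrier_wait(&barrier);

    /* Phase 3: add the offset of all preceding blocks.                     */
    if (id > 0)
        for (int i = lo; i < hi; i++) a[i] += block_sum[id - 1];
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    Arg arg[NTHREADS];
    for (int i = 0; i < N; i++) a[i] = 1;        /* expect a[i] = i + 1      */

    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (int t = 0; t < NTHREADS; t++) {
        arg[t].id = t;
        pthread_create(&th[t], NULL, scan_worker, &arg[t]);
    }
    for (int t = 0; t < NTHREADS; t++) pthread_join(th[t], NULL);
    pthread_barrier_destroy(&barrier);

    for (int i = 0; i < N; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}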
Built on these techniques, SWARM contains a higher-level library of multicore-optimized parallel algorithms for list ranking, comparison-based sorting, radix sort, and spanning tree. In addition, SWARM application example codes include efficient implementations for solving combinatorial problems such as minimum spanning tree [], graph decomposition [], breadth-first search [], tree contraction [], and maximum parsimony [].
Related Entries
Cilk
OpenMP
Parallel Skeletons
History The SWARM programming framework is a descendant of the symmetric multiprocessor (SMP) node library component of SIMPLE [].
Bibliography
1. Aggarwal A, Vitter J () The input/output complexity of sorting and related problems. Commun ACM :–
2. AMD Multi-Core Products (), http://multicore.amd.com/en/Products/
3. Bader DA, Cong G () A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs). In: Proceedings of the international parallel and distributed processing symposium (IPDPS), Santa Fe, NM, April
4. Bader DA, JáJá J () SIMPLE: a methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs). J Parallel Distrib Comput ():–
5. Bader DA () SWARM: a parallel programming framework for multicore processors, https://sourceforge.net/projects/multicore-swarm
6. Bader DA, Agarwal V, Madduri K () On the design and analysis of irregular algorithms on the Cell processor: a case study of list ranking. In: Proceedings of the international parallel and distributed processing symposium (IPDPS), Long Beach, CA
7. Bader DA, Chandu V, Yan M () ExactMP: an efficient parallel exact solver for phylogenetic tree reconstruction using maximum parsimony. In: Proceedings of the th international conference on parallel processing (ICPP), Columbus, OH, August
8. Bader DA, Illendula AK, Moret BME, Weisse-Bernstein N () Using PRAM algorithms on a uniform-memory-access shared-memory architecture. In: Brodal GS, Frigioni D, Marchetti-Spaccamela A (eds) Proceedings of the th international workshop on algorithm engineering (WAE), volume of lecture notes in computer science. Springer-Verlag, Århus, Denmark, pp –
9. Bader DA, Madduri K () Designing multithreaded algorithms for breadth-first search and st-connectivity on the Cray MTA-2. In: Proceedings of the th international conference on parallel processing (ICPP), IEEE Computer Society, Columbus, OH, August
10. Bader DA, Sreshta S, Weisse-Bernstein N () Evaluating arithmetic expressions using tree contraction: a fast and scalable parallel implementation for symmetric multiprocessors (SMPs). In: Sahni S, Prasanna VK, Shukla U (eds) Proceedings of the th international conference on high performance computing (HiPC), volume of lecture notes in computer science. Springer-Verlag, Bangalore, India, December, pp –
11. Barroso LA, Gharachorloo K, McNamara R, Nowatzyk A, Qadeer S, Sano B, Smith S, Stets R, Verghese B () Piranha: a scalable architecture based on single-chip multi-processing. SIGARCH Comput Archit News ():–
12. Helman DR, JáJá J () Designing practical efficient algorithms for symmetric multiprocessors. In: Algorithm engineering and experimentation (ALENEX), volume of lecture notes in computer science, Springer-Verlag, Baltimore, MD, January, pp –
13. Multi-Core from Intel – Products and Platforms () http://www.intel.com/multi-core/products.htm
14. International Technology Roadmap for Semiconductors (), http://itrs.net, update
15. Kahle JA, Day MN, Hofstee HP, Johns CR, Maeurer TR, Shippy D () Introduction to the Cell multiprocessor. IBM J Res Dev (/):–
16. Kongetira P, Aingaran K, Olukotun K () Niagara: a -way multithreaded Sparc processor. IEEE Micro ():–
17. Ladner R, Fix JD, LaMarca A () The cache performance of traversals and random accesses. In: Proceedings of the th annual symposium on discrete algorithms (SODA), ACM-SIAM, Baltimore, MD, pp –
18. Ladner RE, Fortna R, Nguyen B-H () A comparison of cache aware and cache oblivious static search trees using program instrumentation. In: Fleischer R, Meineche-Schmidt E, Moret BME (eds) Experimental algorithms, volume of lecture notes in computer science, Springer-Verlag, Berlin Heidelberg, pp –
19. Mascagni M, Srinivasan A () Algorithm : SPRNG: a scalable library for pseudorandom number generation. ACM Trans Math Softw ():–
Switch Architecture
José Flich
Technical University of Valencia, Valencia, Spain
Synonyms Router architecture
Definition The switch architecture defines the internal organization and functionality of components in a switch. The switch is in charge of forwarding units of information from the input ports to the output ports.
Discussion High-performance computing systems, like clusters and massively parallel processors (MPPs), rely on the use of an efficient interconnection network. As the number of end nodes increases to thousands, or even larger sizes, the network becomes a key component since it must provide low latencies and high bandwidth at a moderate cost and power consumption. The basic components of a network are switches/routers, links, and network interfaces. The interconnection network efficiency largely depends on the switch design. Prior to defining and describing the switch architecture concept, it is worth differentiating between a router and a switch. Typically, the router is the basic component in a network that forwards messages from a set of input ports to a set of output ports. The router usually negotiates paths and computes the output ports messages need to take. Therefore, the router has some intelligence and adapts to the varying conditions of the network. Indeed, the router concept comes from Wide Area Networks (WANs), where the switching devices must be smart enough to adapt to varying topological conditions. In high-performance interconnection networks, typically found in cluster cabinets, connecting massively parallel processors, and nowadays even inside a chip, switching devices are also needed. However, differently from the WAN environment, although these devices have some intelligence and can make critical decisions, they have no capability to negotiate paths or to adapt to varying conditions. Indeed, the topology is usually expected not to change, so there is no need for such negotiation. This is the reason why these devices are also known as switches rather than routers. However, both terms are used with no clear differentiation by the community, and thus both are acceptable. This entry is related to switch architecture, although it can also be seen as router architecture. The switch architecture defines the internal organization and functionality of a switch or router. Basically, the switch has a set of input ports where data messages
come in and a set of output ports where data messages are delivered. The way the internal components are organized and interconnected is defined by the switch architecture. The switch architecture largely depends on the switching technique (see Switching Techniques) used; thus, we may find switch architectures implementing store-and-forward switching that differ greatly from switches implementing wormhole switching. Most of the current modern routers and switches used in high-performance networks implement cut-through switching, or some variant. This entry focuses on such architectures, mainly wormhole (WH) switches and virtual cut-through (VCT) switches. Although there are basic differences between them that affect the switch architecture, they share commonalities, as both rely on the same form of switching. In the next section, a canonical switch architecture including all the commonalities found in current switches is provided. Then, different components are reviewed and alternative switch organizations are described.
Canonical Switch Architecture Figure shows the organization of a basic switch architecture. The switch is made up of a set of identical input ports, connected to input physical channels. Each input port uses a link control module to adapt messages coming through the physical channel to the internal switch. Each message is then stored in a buffer at the input port. The buffer can be made of different queues, each one usually associated with a virtual channel. Each message is then routed (computing the appropriate output port to take) at the Routing control unit. Once the output port is computed, the message may cross the internal crossbar of the switch, thus reaching the output port. To do so, the message has to win access to the output port, since different messages may compete for the same port. In addition, the message has to compete to get access to an available virtual channel. Both are resolved by the arbitration unit. Once access is granted, the message crosses the crossbar and is written into the corresponding queue at the output port. Each message, again, needs to compete with other messages in order to win access to the physical channel. Prior to reaching the physical channel, the message passes through the link controller logic.
Switch Architecture. Fig. Basic switch architecture
As an alternative view, the resources found in the switch can be divided into two separate blocks: the data plane (or datapath) and the control plane. Basically, the data plane is the set of resources messages use while being forwarded (storage and movement) through the switch. The control plane is made of the resources used to make decisions at the switch, like the routing control unit and the arbitration unit.
Alternative Switch Architectures Taking this basic switch architecture as the starting point, several alternative architectures can be derived, some of them better suited for a particular switching mechanism. One of the key issues in the switch architecture is the location of the buffering resources. Regarding this, switch architectures can be classified as: Output-Queued (OQ) switch architecture. In this architecture buffers only exist at the output ports and thus no buffering exists at the input port. Thus, whenever a message arrives at the switch it must be sent directly to the appropriate output port. An N × N OQ switch requires only N memories, one per output
port. As packets are mapped to the memory associated with the requested output port, HOL blocking is totally eliminated (see the Congestion Management entry); thus, this organization achieves maximum switch efficiency. However, internal speedup is required to handle the worst-case scenario without dropping messages, allowing all the input ports to transmit packets to the same output port at the same time. In particular, output queues must either implement multiple write ports or use a higher clock frequency. A speedup of N is required in OQ switches in the worst case. Unfortunately, providing such internal speedup is not always viable. Input-Queued (IQ) switch architecture. Buffers only exist at input ports and not at output ports. In this architecture the required internal bandwidth does not increase with the number of ports and the switch can be designed with the same bandwidth as the link. However, a switch designed with this organization may face low performance due to contention/congestion at the output ports. It is well known that such switches achieve only 58.6% of maximum efficiency under uniformly distributed requests []. This is mainly
due to the HOL blocking problem. One solution to eliminate the HOL blocking issue in IQ switches is the use of N queues at every input port, mapping the incoming message to a queue associated with the requested output port. This technique is known as Virtual Output Queuing (VOQ) []. However, it increases the queue requirements quadratically with the number of ports. Therefore, as the number of ports increases, this solution becomes too expensive. Combined-Input-Output-Queued (CIOQ) switch architecture. With this organization, the contention/congestion problem found in IQ switches is alleviated or even eliminated, since messages can also be stored at the output side of the switch. In this architecture some moderate internal speedup is used, as the internal bandwidth is higher than the aggregate link bandwidth. A speedup of two is usually enough to compensate for the performance drop produced by HOL blocking. Speedup can be implemented by using internal datapaths with higher transmission frequencies or wider transmission paths. However, as the external link bandwidth increases, sustaining the speedup may become difficult. Buffered-Crossbar (BC) switch architecture. This organization uses a memory at every crossbar crosspoint. An input link is connected to N memories, each one connected to a different output port. By design, the BC organization implements internal speedup, as many inputs can forward a packet to the same output at the same time. Additionally, such a memory organization eliminates HOL blocking (every packet is mapped to the memory associated with the requested output port). As a consequence, the BC organization requires low-cost arbiters per output port. However, the problem with such an organization is that the number of memories increases quadratically with the number of ports (N²), thus limiting scalability. When focusing on the switching device (the crossbar), different alternatives exist. The solution used in the basic example is the most frequently implemented one, where a crossbar connecting every input to every possible output is used. The crossbar has inherent broadcast capability, as it is able to connect an input to every output (broadcast communication). There are, however, other solutions, like using a centralized buffer or using a bus. In addition, the connection of the input ports to the crossbar, and of the crossbar to the output ports, can be implemented in different ways. In the example
multiplexers and demultiplexers are used at both sides of the crossbar. This is done to reduce the crossbar complexity (that increases quadratically with the number of ports). Different alternative configurations arise when these devices are removed from one or both sides of the crossbar. In the case multiplexers are removed, the input speedup of the switch is increased, since different messages can be forwarded through the crossbar from the same input port. If demultiplexers are removed then the output speedup of the switch is increased, since different messages can be written to the same output port at the same time. Obviously, output buffers are required when implementing output speedup.
Input Buffer Organization An important component of the switch architecture is the buffer organization, especially when virtual channels are implemented. The way memory resources are assigned to queues will impact the performance of the switch. This issue is highly related to the flow control protocol (see Flow Control). Buffer partitioning can be designed in several ways. The first one is to combine all the buffers across the entire switch (a single memory). In that case, there is no need for a switching element. The benefit of this approach is the flexibility in dynamically allocating memory across the input ports. The problem, however, is the high memory bandwidth required. The second way to organize buffers is by partitioning them by physical input port and dynamically assigning them to virtual channels of the same input port. The third way is to partition the buffers by virtual channel, providing a separate memory for each virtual channel at each port. This, however, may lead to high cost and poor memory utilization. Another important aspect of buffer organization is the way flits are stored. Such memories require data structures to keep track of where flits are located, and to manage free slots. Two buffer organizations are mostly used: circular buffers and linked lists. Circular buffers are used when memory is statically assigned to a queue (virtual channel) and have low implementation overheads. However, a linked list is used when memory is assigned dynamically and has, in turn, higher implementation overhead.
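The following is a minimal C sketch of the statically assigned circular-buffer option (the depth and flit format are arbitrary assumptions): enqueue fails when the queue is full, which is exactly the point at which flow control must backpressure the upstream switch.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define VC_DEPTH 4                 /* flit slots per virtual channel        */

typedef struct { uint64_t payload; } Flit;

typedef struct {
    Flit slot[VC_DEPTH];
    int head, tail, count;         /* read index, write index, occupancy    */
} VcBuffer;

static bool vc_enqueue(VcBuffer *q, Flit f) {
    if (q->count == VC_DEPTH) return false;      /* backpressure upstream   */
    q->slot[q->tail] = f;
    q->tail = (q->tail + 1) % VC_DEPTH;
    q->count++;
    return true;
}

static bool vc_dequeue(VcBuffer *q, Flit *f) {
    if (q->count == 0) return false;             /* nothing to forward      */
    *f = q->slot[q->head];
    q->head = (q->head + 1) % VC_DEPTH;
    q->count--;
    return true;
}

int main(void) {
    VcBuffer q = {0};
    for (uint64_t i = 0; i < 6; i++)             /* the last two flits are  */
        if (!vc_enqueue(&q, (Flit){ i }))        /* refused: buffer is full */
            printf("flit %llu stalled (flow control)\n",
                   (unsigned long long)i);
    Flit f;
    while (vc_dequeue(&q, &f))
        printf("forwarded flit %llu\n", (unsigned long long)f.payload);
    return 0;
}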
Pipelined Organization It is common to design a switch as a pipelined datapath. Such designs allow for high clock frequencies of the switch and, thus, high throughput. On every cycle, a different flit (see the Flow Control entry) from the same message may be processed at each stage. Typically, five stages are conceived for the switch:
– Input Buffer (IB) stage, where the flit is stored at the input port of the switch.
– Routing computation (RC) stage, where the output port is computed for the message.
– Switch allocation (SA) stage, where access to the output port is granted among all the requesting messages. Selection of the virtual channel to use is also performed.
– Switch traversal (ST) stage, where the flit crosses the crossbar and reaches the output.
– Output Buffer (OB) stage, where the flit is stored at the output port of the switch.
In the ideal case (no contention experienced within the switch) the flit advances through the pipelined switch architecture as shown in Fig. . The figure shows the case for a switch implementing virtual cut-through switching, where arbitration is performed once per packet. The first flit is the header flit of the message. At the first cycle the header flit is stored at the input port (IB stage) in a virtual channel. The header includes the identifier of the virtual channel to use. At the second cycle, the header flit is processed at the RC stage so the output port for the message is computed. The output port identifier is associated with the input virtual channel, as this output port will be used by all the flits of the message. At the same cycle the second flit of the message (payload flit) is stored at the input port (IB stage). At
the third cycle the SA stage is executed to get access to the requested output port. A valid virtual channel for the message at the next switch is also requested. On success (there is an available virtual channel and the output port is granted) the header flit crosses the crossbar at the next cycle, followed in the following cycles by the rest of the payload flits. All the flits use the same virtual channel and output port. Notice also that resources (access to the output port) are granted per message, so flit multiplexing in the crossbar may not occur. Upon crossing the crossbar, the flits are stored at the output port in the corresponding queue (OB stage). The tail flit of a message receives a different treatment, since the connection of the input port to the output port will be broken. Notice that the RC and SA stages are performed only once per packet. A typical wormhole switch architecture differs in some stages. First, it is not common to have a buffer at the output port, thus using an IQ switch approach. Second, virtual channel allocation and port allocation are usually performed in separate stages. Therefore, a new stage appears, referred to as Virtual channel allocation (VA), used only for header flits, and the SA stage is only intended for requesting the output port. Also, the output port is usually granted flit by flit. Figure shows an example of a five-stage pipelined wormhole switch. The pipeline design may experience stalls. Stalls may happen due to different reasons (a virtual channel is not available in the SA stage, the RC stage does not find a free output port, . . .). In all these cases, the switch must be designed accordingly. Also, the flow control (see Flow Control) must be aware of the stalls and thus backpressure the previous switch to avoid buffer overflows. For a detailed analysis of pipelined switches and stall treatments, the reader is referred to [] (Chapter ).
Switch Architecture. Fig. Typical five-stage pipelined switch design for a virtual cut-through switch

Switch Architecture. Fig. Typical five-stage pipelined switch architecture for a wormhole switch
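The ideal-case schedule of the virtual cut-through figure can be reproduced with a few lines of C. The sketch below assumes one flit enters the switch per cycle and no contention, so the header occupies IB, RC, SA, ST, and OB in consecutive cycles, while each payload flit waits in the input buffer until it can follow through ST and OB one cycle behind its predecessor.

#include <stdio.h>

int main(void) {
    int nflits = 4;                    /* 1 header flit + 3 payload flits     */
    for (int f = 0; f < nflits; f++) {
        int ib = 1 + f;                /* one flit enters the switch per cycle */
        int st = 4 + f;                /* header crosses at cycle 4; payload   */
        int ob = 5 + f;                /*   flits follow it one per cycle      */
        if (f == 0)
            printf("header  flit: IB@%d RC@2 SA@3 ST@%d OB@%d\n", ib, st, ob);
        else
            printf("payload flit: IB@%d (waits in buffer) ST@%d OB@%d\n",
                   ib, st, ob);
    }
    return 0;
}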
This basic pipelined architecture can be enhanced with different techniques, most of them trying to reduce the number of stages, thus also reducing the delay of traversing a switch. Routers with three or even two stages can be found in the literature. Also, single-cycle switches are common in some environments, like networks-on-chip (NoCs) for low-end systems-on-chip (SoCs). Obviously, such a reduction in the number of stages should be achieved without incurring an excessive increase in cycle time. Two basic procedures exist to reduce the number of stages: the first one is using speculation and performing actions in parallel, and the second one is performing computations ahead of time to remove the operation from the critical path. Virtual channel allocation can be performed speculatively and in parallel with switch allocation in wormhole switching. Notice that this is done only for header flits. If both resources are obtained (the virtual channel is successfully assigned and the output port is granted) then the wormhole switch saves one cycle. If any of the two fails (or both fail) then the pipeline stalls and at the next cycle the operation is repeated. Figure shows the case. A further reduction in cycles would be to speculatively send the flit through the crossbar to the output port (ST) at the same time. To achieve this, internal speedup is required. Finally, the output port computation can be performed at the previous switch and stored in the head flit. Thus, when the flit reaches the next switch the output port to take is already computed and there is no need for the RC stage. In that case, the output port computation for the next switch can be done in parallel with the VA stage, thus further reducing the pipeline depth.
High-Radix Switch Architectures As identified in [], during the last decades the pin bandwidth of a switch has increased exponentially (from the Mb/s range of the Torus Routing Chip to the Tb/s range of the recent Velio chips). This is due to the increase in signaling speed and the increase in the number of signals. In this sense, high-radix switches with narrow channels are able to achieve lower packet latencies than low-radix switches with wide channels. The explanation is simple: with high-radix switches the hop count in the network decreases. Additional benefits of high-radix switches are lower cost and lower power consumption (as the total number of switches and links needed to build a network is reduced). Following this trend, there are some proposals for high-radix switch architectures [–]. However, designing high-radix switches presents major challenges. The most important one is to keep a high switch efficiency at an affordable cost. The cost of a high-radix switch largely depends on three key components: memory resources, arbiter logic, and internal connection logic. Depending on the location of memories in the switch, different switch organizations (memory and crossbar capabilities and their interconnects) have been used. In some of them, the number of memories increases quadratically with the number of ports. Also, arbiters and crossbars must cope with more candidates and connections, and for that reason become expensive. As an example, in on-chip networks the use of high-radix switches is not appropriate due to the increase in power consumption and the reduction in switch operating frequency [].
Switch Architecture. Fig. Four-stage pipelined wormhole switch design

Related Entries
Flow Control
Switching Techniques
Bibliographic Notes and Further Reading Two basic books exist for interconnection networks. Both describe in detail the concept of switch architecture. The first one, [] describes a wide range of routers, some with wormhole switching (Intel Teraflops router,
Cray T3D and T3E routers, the Reliable Router, and the SGI SPIDER), others with virtual cut-through switching (the Chaos router, the Arctic router, the R router, and the Alpha router), and others with circuit switching (the Intel iPSC Direct Connect Module). Also, the book describes the Myrinet switch. The other book [] provides as case studies the Alpha router and the IBM Colony router. In on-chip networks, building an efficient switch is critical. In addition to delay constraints, designing an on-chip switch also has power consumption and area limitations. In [], basic design rules for building an on-chip router are provided. Also, the literature is populated with many switch/router architectures for on-chip networks.
Bibliography
1. Dally W, Towles B () Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco, CA
2. Duato J, Yalamanchili S, Ni L () Interconnection networks: an engineering approach. Morgan Kaufmann, San Francisco, CA
3. Karol MJ et al () Input versus output queueing on a space-division packet switch. IEEE Trans Commun COM-():–
4. Tamir Y, Frazier GL () High-performance multi-queue buffers for VLSI communications switches. SIGARCH Comput Archit News ():–
5. Kim J, Dally WJ, Towles B, Gupta AK () Microarchitecture of a high-radix router. In: nd annual international symposium on computer architecture (ISCA), Madison, WI, pp –
6. Scott S, Abts D, Kim J, Dally WJ () The BlackWidow high-radix Clos network. In: Proceedings of the rd annual international symposium on computer architecture (ISCA), Washington, DC, June
7. Mora G, Flich J, Duato J, López P, Baydal E, Lysne O () Towards an efficient switch architecture for high-radix switches. In: Proceedings of ANCS, San Jose, CA
8. Pullini A, Angiolini F, Murali S, Atienza D, De Micheli G, Benini L () Bringing NoCs to 65 nm. IEEE Micro ():–
9. De Micheli G, Benini L () Networks on chips: technology and tools. Morgan Kaufmann, San Francisco, CA

Switched-Medium Network
Buses and Crossbars

Switching Techniques
Sudhakar Yalamanchili
Georgia Institute of Technology, Atlanta, GA, USA

Definition Switching techniques determine how messages are forwarded through the network. Specifically, these techniques determine how and when buffers and switch ports of individual routers are allocated and released and thereby the timing with which messages or message components can be forwarded to the next router on the destination path.

Discussion
Introduction
This section introduces basic switching techniques used within the routers of multiprocessor interconnection networks. Switching techniques determine when and how messages are forwarded through the network. These techniques determine the granularity and timing with which resources such as buffers and switch ports are requested and released and consequently determine the blocking behavior of routing protocols that utilize them in different network topologies. As a result, they are key determinants of the deadlock properties of routing protocols. Further, their relationship to flow control protocols and traffic characteristics significantly impact the latency and bandwidth characteristics of the network.
A Generic Router Model Switching techniques are understood in the context of routers used in multiprocessor interconnection networks. A simple generic router architecture is illustrated in Fig. . This router microarchitecture presents to messages a four-stage pipeline composed of the following stages:
● Input Buffering (IB): Message data is received into the input buffer.
● Route Computation (RC) and Switch Allocation (SA): Based on the message destination, a switch output port is computed, requested, and allocated.
● Switch Traversal (ST): Message data traverses the switch to the output buffer.
● Link Traversal (LT): The message data traverses the link to the next router.

Switching Techniques. Fig. A generic router model

The end-to-end latency experienced by a message depends on how the switching techniques interact with this pipeline. This generic router architecture has been enhanced over the years with virtual channels [], speculative operation [, ], flexible arbiters, effective channel and port allocators [], deeper pipelines, and a host of buffer management and implementation optimizations (e.g., see [, ]). The result has been a range of router designs with different processing pipelines. For the purposes of this discussion, it is assumed that all of the pipeline stages take the same time: one cycle. This is adequate to define, distinguish, and compare the properties of basic switching techniques. The following section addresses some basic concepts governing the operation and implementation of switching techniques based on the generic router model shown in Fig. . The remainder of the section is devoted to a detailed presentation of alternative switching techniques.
Basic Concepts The performance and behavior of switching techniques are enabled by the low-level flow control protocols used for the synchronized transfer of data between routers. Flow control determines the granularity with which data is moved through the network and consequently when routing decisions can be made, when switching
operations can be initiated, and how (at what granularity) data is transferred. Flow control is the synchronized transfer of a unit of data between a sender and a receiver and ensures the availability of sufficient buffering at the receiver to avoid the loss of data. Selection of the unit of data transfer is based on a few simple concepts. A message is partitioned into fixed-length packets. Packets are individually routable units of data and are comprised of control bits packaged as the header and data bits packaged as the body. Packets are typically terminated with some bits for error detection, such as a checksum. The header contains destination information used by the routers to select the onward path through the network. The packet as a whole is partitioned into fixed-size units corresponding to the unit of transfer across a link or across a router, each referred to as a flow control digit or flit []. A flit becomes the unit of buffer management for transmission across a physical channel. Many schemes have been developed over the years for flow control, including on-off, credit-based, sliding window, and flit reservation. The physical transfer of a flit across a link may in fact rely on synchronized transfers of smaller units of information. For example, when the flit size is larger than the physical channel width, the transfer of a flit across the physical link requires the synchronized transfer of several channel-width quantities, each referred to as a physical digit or phit [], using phit-level flow control. In contrast to flits, which represent units of buffer management, phits correspond to quantities reflecting a specific physical link implementation. Both flit-level and phit-level flow control are atomic in the sense that in the absence of
errors, all transfers will complete successfully and are not interleaved with other transfers. While phit sizes between chips or boards tend to be small, phit sizes on-chip can be much larger. Typical message sizes range from small control messages, to cache-line-sized transfers, to much larger messages in message-passing parallel architectures. The preceding hierarchy of units is illustrated in Fig. along with an example of a packet format [, ]. The actual content of a packet header will depend on the specifics of an implementation, such as the routing protocol (e.g., destination address), flow control strategy (e.g., credits), use of virtual channels (e.g., virtual channel ID), and fault tolerance strategy (e.g., acknowledgements). Switching techniques determine when messages are forwarded through the network and differ in the relative timing of flow control operations, route computation, and data transfer. High-performance switching techniques seek as much overlap as possible between these operations, while reliable communication protocols may seek less concurrency between these operations in the interest of efficient error recovery. For example, one can
wait until the entire packet is received before a request is made for the output port of the router’s switch (packet switching). Alternatively, the request can be made as soon as all flits corresponding to the header are received (cut-through switching) but before the rest of the packet has been received. The remainder of this section will focus on the presentation and discussion of switching techniques under the assumption of a flit-level flow control protocol and a supporting phit-level flow control protocol. It is common to have the flit size the same as the phit size.
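As a concrete illustration of this hierarchy of units, the short sketch below partitions a packet into flits and a flit into phits. The sizes used (a 64-byte packet, 8-byte flits, 16-bit phits) are assumptions chosen only for the example; they are not values given in this entry.

/* Assumed, illustrative sizes; not values taken from the text. */
#include <stdio.h>

#define PACKET_BYTES 64   /* packet: independently routable unit            */
#define FLIT_BYTES    8   /* flit: unit of flow control / buffer management */
#define PHIT_BITS    16   /* phit: physical channel width                   */

int main(void) {
    int flits_per_packet = (PACKET_BYTES + FLIT_BYTES - 1) / FLIT_BYTES;
    int phits_per_flit   = (FLIT_BYTES * 8 + PHIT_BITS - 1) / PHIT_BITS;

    printf("flits per packet: %d\n", flits_per_packet);  /* 8 */
    printf("phits per flit  : %d\n", phits_per_flit);    /* 4 */
    return 0;
}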
Switching Techniques. Fig. Basic concepts and an example packet format

Basic Switching Techniques As one might expect, switching techniques have their roots in traditional digital communication and have evolved over the years to address the unique requirements of multiprocessor interconnection networks. For the purposes of comparison, the no-load latency is computed for an L-bit message. The phit size and flit size are assumed to be equivalent and equal to the physical data channel width of W bits, which is also the width of the internal datapath of the router. The routing header is assumed to be one flit; thus the message size is L + W bits. A router can make a routing decision in one cycle, and a flit can traverse a switch or link in one cycle, as described with respect to Fig. .
Circuit Switching Circuit switching evolved from early implementations in telephone switching networks. In circuit switching, a physical path from the source network interface to the destination network interface is reserved prior to the transmission of the data. The source injects a routing packet commonly called a probe into the network. The probe is routed to the destination as specified by the routing protocol, reserving the physical channels and switch ports along the way. An acknowledgement is returned to the source node to confirm path reservation. Figure b shows several circuits that have been set up and one in the process of being set up. While circuits B, C, and D have been set up, circuit A is blocked from being set up by circuit B. The base no-load latency of a circuit-switched message is determined by the sum of the time to set up a path and the time to transmit data. For a circuit that traverses D routers to the destination and operates at B Hz, the no-load message latency can be represented as follows.

t_circuit = t_setup + t_data
t_setup = D [t_RC + 2(t_IB + t_ST + t_LT)]
t_data = (1/B) ⌈L/W⌉

Switching Techniques. Fig. Circuit switching. (a) Time-space utilization across two routers, (b) an example

The notation in the expression follows the generic router model shown in Fig. , and the expression does not include the time to inject the probe into the router at the source (a link traversal) or the time to inject the acknowledgement into the router at the destination. The subscripts correspond to the pipeline stages illustrated in Fig. . For example, t_RC is the time taken to compute the output port to be traversed by the message at a router. This computation is carried out in the same cycle as the process of requesting and allocating a switch port, and therefore the switch allocation (SA) time is hidden and does not appear in the expression. The terms t_IB, t_ST, and t_LT represent the times for input buffering, switch traversal, and link traversal,
respectively, experienced by the probe and the acknowledgement. The factor of 2 in the expression for path setup reflects the forward progress of the probe and the return progress of the acknowledgement, assuming the acknowledgement traverses the same path in the reverse direction. Note that the acknowledgements do not have to be routed and therefore do not experience any routing delays at intermediate routers. A time-space diagram in Fig. a depicts the setup and operation of a circuit that crosses two routers. Nominally, the routing probe contains the destination address and additional control bits used by the routing protocol and is buffered at intermediate routers, where it is processed to reserve links and set router switch settings. On reaching the destination, an acknowledgement packet is transmitted back to the source. The hardware circuit has been set up and data can be transmitted at the full speed of the circuit. Flow control for data transfer is exercised end-to-end across the circuit, and the flow control bandwidth (rate of signaling) should be at least as fast as the transmission speed to avoid slowing down the hardware circuit. When transmission is complete, a few control bits traversing the circuit from the source release link and switch resources along the circuit. These bits may take the form of a small packet or, with suitable channel design and message encoding, may be transmitted as part of the last few bits of the message data. Routing and data transmission functions are disjoint operations where all switches on the source-destination path are set prior to the transmission of any data. Circuit switching is generally advantageous when messages are infrequent and long compared to the size of the routing probe. It is also advantageous when inter-node traffic exhibits a high degree of temporal locality. After a path has been set up, subsequent messages to the same destination can be transmitted without incurring the latency of path set-up times. The inter-node throughput is also maximized since, once a path has been set up, messages to the destination are not routed and do not block in the network. The disadvantages are the same as those typically associated with most reservation-based protocols. When links and switch ports are reserved for a duration, they prevent other traffic from making progress if it needs to use any of the resources reserved by the established circuit. In particular, a probe can be blocked during circuit setup while
waiting for another circuit to be torn down. The links reserved by the probe up to that point can similarly prevent other circuits from being established. Wire delays place a practical limit on the speed of circuit switching as a function of system size, motivating the development of techniques to mitigate or eliminate end-to-end wire delay dependencies. One set of techniques evolved around pipelining multiple bits on the wire, essentially using a long-latency wire as a deeply pipelined transmission medium. Such techniques have been referred to as wave pipelining or wave switching [–]. These techniques maximize wire bandwidth while reducing sensitivity to distance. However, when used in anything other than bit-serial channels, pragmatic constraints of signal skew and stable operation across voltage and temperature remain challenges to widespread adoption. Alternatively, both the wire length issue and blocking are mitigated by combining circuit switching concepts with the use of virtual channels. Virtual channel buffers at each router are reserved by the routing probe, setting up a pipelined virtual circuit from source to destination in a manner referred to as pipelined circuit switching []. This approach increases physical link utilization by enabling sharing across circuits, although the available bandwidth to each circuit is now reduced. This approach was also used in the Intel iWarp chip, which was designed to support systolic communication through message pathways: long-lived communication paths []. Rather than set up and remove network paths each time data are to be communicated, paths through the network persist for long periods of time. Special messages called pathway begin markers are used to reserve virtual channels (referred to as logical channels in iWarp) and set up interprocessor communication paths. On completion of the computation, the paths are explicitly removed by other control messages.
Packet Switching In circuit switching, routing and data transfer operations are separated. All switches in the path to the destination are set a priori, and all data is transmitted in a burst at the full bandwidth of the circuit. In packet switching, the message data is partitioned into fixed-size packets. The first few bytes of a packet contain routing and control information and are collectively
referred to as the packet header while the data is contained in the packet body. Each packet can now be independently routed through the network. Flow control between routers is at the level of a complete packet. A packet cannot be forwarded unless buffer space for a complete packet is available at the next router. A packet is then transferred in its entirety across a link to the input buffer of the next router. Only then is the header information extracted and used by the routing and control unit to determine the candidate output port. The switch can now be set to enable the packet to be transferred to the output port and then stored at the next router before being forwarded. This switching technique is also known as store-and-forward (SAF) switching. There is no overlap between flow control, routing operations, and data transfer. Consequently, the
end-to-end latency of a packet is proportional to the distance between the source and destination nodes. An example of packet switching is illustrated in Fig. b (links to local processors are omitted for brevity). Note how packets A, B, and C are in the process of being transferred across a link. Each input buffer is partially full pending receipt of the remainder of the packet before being forwarded. The no-load latency for a packet transmission across D routers can be modeled as follows.

t_packet = D {t_RC + (t_IB + t_ST + t_LT) ⌈(L + W)/W⌉}
This expression does not include the time to inject the packet into the network at the source. The expression follows the router model in Fig. where entire packets must traverse the link or switch at each router.
Switching Techniques. Fig. Packet switching. (a) Time-space utilization across three links, (b) an example
Therefore end-to-end latency is proportional to the distance between the source and destination. Packet switching evolved from early implementations in data networks. Relative to circuit switching, packet switching is advantageous when the average message size is small. Since packets only hold resources as they are being used, this switching technique can achieve high link utilization and network throughput. Packets are also amenable to local techniques for error detection and recovery since all data and its associated routing information are encapsulated as a single locally available unit. However, the overhead per data bit is higher – each packet must invest in a header reducing energy efficiency as well as the proportion of physical bandwidth that is accessible to actual data transfer. If a message is partitioned into multiple packets and adaptive routing is employed, packets may arrive at the destination out of order necessitating investments in reordering mechanisms. Packet-switched routers have also been designed with dynamically allocated centralized queues rather than keeping the messages buffered at the router input and output resulting in both cost and power advantages.
Virtual Cut-Through (VCT) Switching Virtual cut-through (VCT) switching is an optimization for packet switching where, in the absence of congestion, packet transfer is pipelined. Like the preceding techniques, VCT has its genesis in packet data networks []. Flow control is still at the packet level. However, packet transfer is overlapped with flow control and routing operations as follows. Routing can begin as soon as the header bytes of a packet have arrived at the input buffer and before the rest of the packet has been received. In the absence of congestion, switch allocation and switch traversal can proceed, and the forwarding of the packet through the switch as well as flow control requests to the next router can begin. Thus, packet transfer can be pipelined through multiple routers. For example, once the first few bytes containing the packet header have been received, routing decisions and switch allocation can be initiated. If the switch output port is available, then the router can begin forwarding bytes to the output port before the remainder of the packet has arrived, and can cut through to the next router. In the absence of blocking in the network, the
latency for the header to arrive at the destination network interface is proportional to the distance between the source and destination. Thereafter, a phit can exit the network every cycle. If the header is blocked on a busy output channel at an intermediate router, the packet is buffered at the router, a consequence of the fact that flow control is at the level of a packet. Thus, at high network loads, VCT switching behaves like packet switching. An example of VCT at work is illustrated in Fig. b. Packet A is blocked by packet B. Note that packet A has enough buffer space to be fully buffered at the local router. Packet B can be seen to be spread across multiple routers as it is pipelined through the network. The no-load latency of a message that successfully cuts through D intermediate routers is captured in the following expression.

t_VCT = D(t_IB + t_RC + t_ST + t_LT) + t_IB ⌈L/W⌉

This expression does not include the time to inject the packet into the network at the source. The first term in the equation is much smaller than the second term (recall that each value of delay is one pipeline cycle). Therefore, under low-load conditions, the message latency is approximately proportional to the packet size rather than the distance between source and destination. However, when a packet is blocked, the packet is buffered at the router. Thus, at high network loads, the behavior of VCT approximates that of packet switching. At low loads, the performance improves considerably, approaching that of wormhole switching, which is described next.
Switching Techniques. Fig. Virtual cut-through switching. (a) Time-space utilization across three links, (b) an example

Wormhole Switching The need to buffer complete packets within a router can make it difficult to construct small, compact, and fast routers. In early multiprocessor machines, interconnection networks employed local node memory as storage for buffering blocked packets. This ejection and re-injection of packets incurred significant latency penalties. It was desirable to keep packets in the network. However, there was insufficient buffering within individual routers. Wormhole switching evolved as a small-buffer optimization of virtual cut-through where packets were pipelined through routers. The buffers in each router had enough storage for several flits. When a packet header blocks, the message occupies buffers in several routers. For example, consider
the message pattern shown in Fig. b, with routers having single-flit buffers. Message A is blocked by message B and occupies buffers across multiple routers, leading to secondary blocking across multiple routers. The time-space diagram illustrates how packets are pipelined across multiple routers, significantly reducing the sensitivity of message latency to distance. When it was introduced [, ], the pipelined behavior of wormhole switching led to relatively large reductions in message latency at low loads. Further gains derived from not having to eject messages from the network for storage. The use of small buffers also
has two physical consequences. First, smaller buffers lead to lower access latency and a shorter pipeline stage time (see Fig. ). The smaller pipeline delay enables higher clock rates and consequently high-bandwidth routers, for example, the multi-GHz operation of today's Intel TeraFlops router []. Second, the smaller buffers also reduce static energy consumption, which is particularly important in the context of on-chip routers. However, as the offered communication load increases, messages block in place, occupying buffers across multiple routers and the links between them. This leads to secondary blocking of messages that share any of these links, which
in turn propagates congestion further. The disadvantage of wormhole switching is that blocked messages hold physical channel resources. Routing information is only associated with a few header flits. Data flits have no routing information associated with them. Consequently, when a packet is blocked in place, packet transmission cannot be interleaved over a physical link without additional support (such as virtual channels – see Section on Virtual Channels) and physical channels cannot be shared. The result is the rapid onset of saturation as offered load increases. Virtual channel flow control was introduced to alleviate this problem. The key deadlock issue is that a single message produces dependencies between buffers across multiple routers. Routing protocols must be designed
to ensure that such dependencies are not composed across multiple messages to produce deadlocked configurations of messages. Deadlock freedom requirements for deterministic routing protocols are described in [], while proofs for adaptive routing protocols are described in [–]. The base latency of a wormhole-switched message crossing D routers, with the flit size equal to the phit size, can be computed as follows.

t_wormhole = D(t_IB + t_RC + t_ST + t_LT) + t_IB ⌈L/W⌉

Switching Techniques. Fig. Wormhole switching. (a) Time-space utilization across three links, (b) an example

This expression does not include the time to inject flits into the network at the source. After the first flit arrives at the destination, each successive flit is delivered
in successive clock cycles. Thus, for message sizes that are large relative to the distance between sender and destination, the no-load latency is approximately a function of the message size rather than distance. Several optimizations have been proposed to further improve the performance of wormhole switching. One example is flit reservation flow control, which was introduced to improve buffer turn-around time in wormhole switching []. Deeply pipelined, high-speed routers can lead to low buffer occupancy as a consequence of propagation delays of flits over the link and the latency in receiving and processing credits before the buffer can be reused. Flit reservation is a technique that combines some elements of circuit switching (a priori reservations) to improve performance. A control flit advances ahead of the data flits of a message to reserve buffer and channel resources. Note that the router pipeline for data flits is much shorter (they do not experience routing and switch allocation delays). As a result, reservations and transfers can be overlapped and buffer occupancy is significantly increased. Another example that combines the advantages of wormhole switching and packet switching is buffered wormhole switching (BWS), which was proposed and used in IBM’s Power Parallel SP systems [, ]. In the absence of blocking, messages are routed through the network using wormhole switching. When messages block, chunks of flits are constructed at the input port of a switch and buffered in a dynamically allocated centralized router memory, freeing up the input port for use by other messages. Subsequently, buffered chunks are transferred to an output port where they are converted to a flit stream for transmission across the physical channel. BWS differs from wormhole switching in that flits are not buffered in place. Rather, flits are aggregated and buffered in a local memory within the switch, and in this respect BWS is similar to packet switching. The no-load latency of a message routed using BWS is identical to that of wormhole-switched messages.
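The four no-load latency expressions given above can be compared directly by evaluating them for a common set of parameters. The sketch below does exactly that; the formulas are the ones in this entry (with all stage times equal to one cycle, so that t_data for circuit switching is ⌈L/W⌉ cycles), while the particular values of D, L, and W are assumptions chosen only for illustration.

/* No-load latency (in cycles) for the four basic switching techniques,
 * using the expressions in the text: one cycle per pipeline stage,
 * D routers traversed, L-bit message, W-bit channel (flit = phit = W).
 * The parameter values below are illustrative assumptions. */
#include <stdio.h>

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

int main(void) {
    int D = 8, L = 1024, W = 32;              /* assumed example values */
    int tIB = 1, tRC = 1, tST = 1, tLT = 1;   /* unit-cycle stages      */

    int t_setup    = D * (tRC + 2 * (tIB + tST + tLT));
    int t_data     = ceil_div(L, W);          /* (1/B)*ceil(L/W), in cycles */
    int t_circuit  = t_setup + t_data;

    int t_packet   = D * (tRC + (tIB + tST + tLT) * ceil_div(L + W, W));
    int t_vct      = D * (tIB + tRC + tST + tLT) + tIB * ceil_div(L, W);
    int t_wormhole = D * (tIB + tRC + tST + tLT) + tIB * ceil_div(L, W);

    printf("circuit  : %d cycles\n", t_circuit);
    printf("packet   : %d cycles\n", t_packet);
    printf("VCT      : %d cycles\n", t_vct);
    printf("wormhole : %d cycles\n", t_wormhole);
    return 0;
}

As the expressions imply, the no-load latencies of VCT and wormhole switching coincide; the techniques differ in how they behave under blocking.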
Virtual Channels An important interconnection architecture function is the use of virtual channel flow control []. Each unidirectional virtual channel across a physical link is realized by an independently managed pair of message buffers. Multiple virtual channels are multiplexed across the physical link increasing link utilization
and network throughput. Importantly, routing constraints on the use of virtual channels are commonly used to ensure deadlock freedom. The use of virtual channels decouples message flows from the physical links and their use is orthogonal to the operation of switching techniques. Each switching technique is now employed to regulate the flow of packet data within a virtual channel while constraints on virtual channel usage may govern routing decisions at intermediate routers. The microarchitecture pipeline of the routers now includes an additional stage for virtual channel allocation. Virtual channels have been found to be particularly useful for optimizing the performance of wormhole-switched routers ameliorating the consequences of blocking and thus broadening the scope of application of wormhole switching. A simple example of the operation of virtual channel flow control is illustrated in Fig. . A modified router architecture reflecting the use of virtual channels is shown in Fig. . The route computation operation now returns a set of virtual channels that are candidates for forwarding the message (note that the router might be employing adaptive routing). A typical router pipeline is now extended by an extra stage as shown – virtual channel allocation. This is implemented prior to requesting an output port of the switch extending the router pipeline of Fig. by one stage.
A Comparison of Switching Techniques Switching techniques have fundamental differences in their ability to utilize network bandwidth. In packet switching and VCT switching, messages are partitioned into fixed length packets each with its own header. Consequently, the overhead per transmitted data byte of a message is a fixed function of the message length. Wormhole switching supports variable sized messages and consequently overhead per data byte decreases with message size. However, as the network load increases when wormhole-switched messages block, they hold links and resources across multiple routers, wasting physical bandwidth, propagating congestion, and saturating the network at a fraction of the peak. The use of virtual channels in wormhole switching decouples the physical channel from blocked messages improving network utilization but increases the flow control latency across the physical link as well as the complexity of the channel controllers and intra-router switching.
Switching Techniques. Fig. Virtual channel flow control

Switching Techniques. Fig. The generic router with virtual channels
In VCT switching and packet switching, packets are fully buffered at each router, and therefore traffic consumes network bandwidth in proportion to network load at the expense of an increased amount of in-network buffering. The latency behavior of packets using different switching techniques exhibits distinct behaviors. At low loads, the pipelined behavior of wormhole switching produces superior latency characteristics. However, saturation occurs at lower loads, and the variance in packet latency is higher and accentuated with variance in packet length. The latency behavior of packets under packet switching tends to be more predictable, since messages are fully buffered at each node. VCT packets will operate like wormhole switching at low loads and approximate packet switching at high loads, where blocking will force packets to be buffered at an intermediate node. Consequently, approaches to provide Quality of Service (QoS) guarantees typically utilize VCT or packet switching. Attempting to control QoS when
blocked messages are spread across multiple nodes is by comparison much more difficult. Reliability schemes are shaped by the switching techniques. Alternative topologies and routing algorithms affect the probability of encountering a failed component. The switching technique affects feasible detection and recovery algorithms. For example, packet switching is naturally suited to link level error detection and retransmission since each packet is an independently routable unit. For the same reason, packets may be adaptively routed around faulty regions of the network. However, when messages are pipelined over several links, error recovery and control becomes complicated. Recall that data flits have no routing information. Thus, errors that occur within a message that is spread across multiple nodes can lead to buffers and channel resources that are indefinitely occupied (e.g., link transceiver failures) and can lead to deadlocked message configurations. Thus, link level recovery must be accompanied by some higher
level recovery protocols that typically operate end-to-end. Finally, it can be observed that the switching techniques exert a considerable influence on the architecture of the router and, as a result, on network performance. For example, flit-level flow control enabled pipelined message transfers as well as the use of small buffers. The combination resulted in higher flow control signaling speeds and small, compact router pipeline stages that could be clocked at higher speeds. The use of wormhole switching precluded the need for larger (slower) buffers or costly use of local storage at a node: the message could remain in the network. This is a critical design point if one considers that the link bandwidth can often exceed the memory bandwidth. Such architectural advances have amplified the performance gained via clock speed advances over several technology generations. For example, consider the difference in performance between the Cosmic Cube network [], which operated at megahertz clock rates and produced message latencies approaching hundreds of microseconds to milliseconds, and the most recent TeraFlops chip from Intel, which operates at gigahertz rates and produces latencies on the order of nanoseconds. While performance has increased by several orders of magnitude, clock speeds have increased by far less. Much of this performance differential can be attributed to switching techniques and the associated microarchitecture innovations that accompany their implementation.
Related Entries
Collective Communication, Network Support for
Congestion Management
Flow Control
Interconnection Networks
Networks, Fault-Tolerant
Routing (Including Deadlock Avoidance)
Bibliographic Notes and Further Reading Related topics such as flow control and deadlock freedom are intimately related to switching techniques. A combined coverage of fundamental architectural, theoretical, and system concepts and a distillation of key concepts can also be found in two texts [, ]. More advanced treatments of these and related topics can
be found in papers in most major systems and computer architecture conferences with the preceding texts contributing references to many seminal papers in the field.
Acknowledgments Assistance and feedback from Mitchelle Rasquinha, Dhruv Choudhary, and Jeffrey Young are gratefully acknowledged.
Bibliography
1. Dally WJ () Virtual-channel flow control. IEEE Trans Parallel Distrib Syst ():–
2. Peh L-S, Dally WJ () A delay model for router microarchitectures. IEEE Micro :–
3. Peh L-S, Dally WJ () A delay model and speculative architecture for pipelined routers. In: Proceedings of the international symposium on high-performance computer architecture, Nuevo Leone
4. Dally WJ, Towles B () Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco
5. Choi Y, Pinkston TM () Evaluation of queue designs for true fully adaptive routers. J Parallel Distrib Comput ():–
6. Mullins R, West A, Moore S () Low-latency virtual-channel routers for on-chip networks. In: Proceedings of the annual international symposium on computer architecture, München
7. Dally WJ, Seitz CL () Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans Comput ():–
8. Duato J, Yalamanchili S, Ni L () Interconnection networks: an engineering approach. Morgan Kaufmann, San Francisco
9. Flynn M () Computer architecture: pipelined and parallel processor design. Jones & Bartlett, Boston, pp –
10. Duato J et al () A high performance router architecture for interconnection networks. In: Proceedings of the international conference on parallel processing, Bloomington, vol I, August, pp –
11. Scott SL, Goodman JR () The impact of pipelined channels on k-ary n-cube networks. IEEE Trans Parallel Distrib Syst ():–
12. Gaughan PT et al () Distributed, deadlock-free routing in faulty, pipelined, direct interconnection networks. IEEE Trans Comput ():–
13. Borkar S et al () iWarp: an integrated solution to high-speed parallel computing. In: Proceedings of Supercomputing, Orlando, November, pp –
14. Kermani P, Kleinrock L () Virtual cut-through: a new computer communication switching technique. Comput Networks ():–
15. Dally WJ, Seitz CL () The torus routing chip. J Distrib Comput ():–
16. Hoskote Y, Vangal S, Singh A, Borkar N, Borkar S () A GHz mesh interconnect for a teraflops processor. IEEE Micro ():–
17. Duato J () A new theory of deadlock-free adaptive routing in wormhole networks. IEEE Trans Parallel Distrib Syst ():–
18. Duato J () A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks. IEEE Trans Parallel Distrib Syst ():–
19. Duato J () A necessary and sufficient condition for deadlock-free routing in cut-through and store-and-forward networks. IEEE Trans Parallel Distrib Syst ():–
20. Peh L-S, Dally WJ () Flit reservation flow control. In: Proceedings of the international symposium on high-performance computer architecture, Toulouse, France, January, pp –
21. Stunkel CB et al () Architecture and implementation of Vulcan. In: Proceedings of the international parallel processing symposium, Cancun, Mexico, pp –
22. Stunkel CB et al () The SP high-performance switch. In: Proceedings of the scalable high performance computing conference, Knoxville, pp –
23. Seitz C () The cosmic cube. Commun ACM ():–
Symmetric Multiprocessors
Shared-Memory Multiprocessors
Synchronization
Michael L. Scott, University of Rochester, Rochester, NY, USA
Synonyms Fences; Multiprocessor synchronization; Mutual exclusion; Process synchronization
Definition Synchronization is the use of language or library mechanisms to constrain the ordering (interleaving) of instructions performed by separate threads, to preclude orderings that lead to incorrect or undesired results.
Discussion In a parallel program, the instructions of any given thread appear to occur in sequential order (at least from that thread’s point of view), but if the threads run independently, their sequences of instructions may interleave arbitrarily, and many of the possible interleavings
may produce incorrect results. As a trivial example, consider a global counter incremented by multiple threads. Each thread loads the counter into a register, increments the register, and writes the updated value back to memory. If two threads load the same value before either stores it back, updates may be lost:

c == 0
Thread 1:           Thread 2:
r1 := c
                    r1 := c
++r1
                    ++r1
c := r1
                    c := r1
c == 1

Synchronization serves to preclude invalid thread interleavings. It is commonly divided into the subtasks of atomicity and condition synchronization. Atomicity ensures that a given sequence of instructions, typically performed by a single thread, appears to all other threads as if it had executed indivisibly – not interleaved with anything else. In the example above, one would typically specify that the load-increment-store instruction sequence should execute atomically. Condition synchronization forces a thread to wait, before performing an operation on shared data, until some desired precondition is true. In the example above, one might want to wait until all threads had performed their increments before reading the final count. While it is tempting to suspect that condition synchronization subsumes atomicity (make the precondition be that no other thread is currently executing a conflicting operation), atomicity is in fact considerably harder, because it requires consensus among all competing threads: they must all agree as to which will proceed and which will wait. Put another way, condition synchronization delays a thread until some locally observable condition is seen to be true; atomicity is a property of the system as a whole. Like many aspects of parallel computing, synchronization looks different in shared-memory and message-passing systems. In the latter, synchronization is generally subsumed in the message-passing methods; in a shared-memory system, it typically employs a separate set of methods.
Shared-memory implementations of synchronization can be categorized as busy-wait (spinning), or scheduler-based. The former actively consume processor cycles until the running thread is able to proceed. The latter deschedule the current thread, allowing the processor to be used by other threads, with the expectation that future activity by one of those threads will make the original thread runnable again. Because it avoids the cost of two context switches, busy-wait synchronization is typically faster than scheduler-based synchronization when the expected wait time is short and when the processor is not needed for other purposes. Schedulerbased synchronization is typically faster when expected wait times are long; it is necessary when the number of threads exceeds the number of processors (else quantum-long delays or even deadlock can occur). In the typical implementation, busy-wait synchronization is built on top of whatever hardware instructions execute atomically. Scheduler-based synchronization, in turn, is built on top of busy-wait synchronization, which is used to protect the scheduler’s own data structures (see entries on Scheduling Algorithms and on Processes, Tasks, and Threads).
Hardware Primitives In the earliest multiprocessors, load and store were the only memory-access instructions guaranteed to be atomic, and busy-wait synchronization was implemented using these. Modern machines provide a variety of atomic read-modify-write (RMW) instructions, which serve to update a memory location atomically. These significantly simplify the implementation of synchronization. Common RMW instructions include:

Test-and-set(l) sets the Boolean variable at location l to true, and returns the previous value.

Swap(l, v) stores the value v to location l and returns the previous value.

Atomic-ϕ(l, v) replaces the value o at location l with ϕ(o, v) for some simple arithmetic function ϕ (add, sub, and, etc.).

Fetch-and-ϕ(l, v) is like atomic-ϕ, but also returns the previous value.

Compare-and-swap(l, o, n) inspects the value v at location l, and if it is equal to o, replaces it with n.
In either case, it returns the previous value, from which one can deduce whether the replacement occurred.

Load-linked(l) and store-conditional(l, v): the first of these returns the value at location l and “remembers” l. The second stores v to l if l has not been modified by any other processor since a previous load-linked by the current processor.

These instructions differ in their expressive power. Herlihy has shown [] that compare-and-swap (CAS) and load-linked/store-conditional (LL/SC) are universal primitives, meaning, informally, that they can be used to construct a non-blocking implementation of any other RMW operation. The following code provides a simple implementation of fetch-and-ϕ using CAS.

val old := *l;
loop
    val new := phi(old);
    val found := CAS(l, old, new);
    if (old == found) break;
    old := found;

If the equality test in this code fails, it must be because some other thread successfully modified *l. The system as a whole has made forward progress, but the current thread must try again. As discussed in the entry on Non-blocking Algorithms, this simple implementation is lock-free but not wait-free. There are stronger (but slower and more complex) non-blocking implementations in which each thread is guaranteed to make forward progress in a bounded number of its own instructions.
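The same lock-free pattern can be expressed with standard C11 atomics; the sketch below (ours, with ϕ specialized to addition) mirrors the pseudocode above, using atomic_compare_exchange_weak, which conveniently refreshes the expected value when the CAS fails.

/* A C11 sketch of fetch-and-phi built from compare-and-swap (CAS).
 * Here phi is addition; the loop mirrors the pseudocode above. */
#include <stdatomic.h>

int fetch_and_add(atomic_int *l, int v) {
    int old = atomic_load(l);
    while (!atomic_compare_exchange_weak(l, &old, old + v)) {
        /* CAS failed: another thread changed *l; 'old' now holds the
         * value actually observed, so simply retry with that value. */
    }
    return old;   /* previous value, as fetch-and-phi requires */
}

Like the pseudocode, this loop is lock-free but not wait-free: an individual thread may retry indefinitely if other threads keep winning the race.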
NB: In any distributed system, and in most modern
shared memory systems, instructions executed by a given thread are not, in general, guaranteed to be seen in sequential order by other threads, and instructions of any two threads are not, in general, guaranteed to be seen in the same order by all of their peers. Modern processors typically provide so-called fence or barrier instructions (not to be confused with the barriers discussed under Condition Synchronization below) that force previous instructions of the current thread to be seen by other threads before subsequent instructions of the current thread. Implementations of synchronization
methods typically include sufficient fences that if synchronization method s1 in thread t1 occurs before synchronization method s2 in thread t2, then all instructions that precede s1 in t1 will appear in t2 to have occurred before any of its own instructions that follow s2. For more information, see the entry on Memory Models. The remainder of the discussion here assumes that memory is sequentially consistent, that is, that instructions appear to interleave in some global total order that is consistent with program order in every thread.
Atomicity A multi-instruction operation is said to be atomic if it appears to occur “all at once” from every other thread’s point of view. In a sequentially consistent system, this means that the program behaves as if the instructions of the atomic operation were contiguous in the global instruction interleaving. More specifically, in any system, intermediate states of the atomic operation should never be visible to other threads, and actions of other threads should never become visible to a given thread in the middle of one of its own atomic operations. The most straightforward way to implement atomicity is with a mutual-exclusion (mutex) lock – an abstract object that can be held by at most one thread at a time. In standard usage, a thread invokes the acquire method of the lock when it wishes to begin an atomic operation and the release method when it is done. Acquire waits (by spinning or rescheduling) until it is safe for the operation to proceed. The code between the acquire and release (the body of the atomic operation) is known as a critical section. Critical sections that conflict with one another (typically, that access some common location, with at least one section writing that location) must be protected by the same lock. Programming discipline commonly ensures this property by associating data with locks. A thread must then acquire locks for all the data accessed in a critical section. It may do so all at once, at the beginning of the critical section, or it may do so incrementally, as the need for data is encountered. Considerable care may be required to ensure that locks are acquired in the same order by all critical sections, to avoid deadlock. All locks are typically held until the end of the critical section. This two-phase locking (all acquires occur
before any releases) ensures that the global set of critical section executions remains serializable.
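As a minimal illustration, the lost-update example from the beginning of this entry can be made atomic with a mutex lock; the sketch below uses POSIX threads, and the variable and function names are ours.

/* Making the load-increment-store sequence atomic with a mutex lock.
 * Minimal POSIX-threads sketch; error checking omitted for brevity. */
#include <pthread.h>

static long c = 0;                                    /* shared counter */
static pthread_mutex_t c_lock = PTHREAD_MUTEX_INITIALIZER;

void increment(void) {
    pthread_mutex_lock(&c_lock);    /* acquire: begin critical section */
    c = c + 1;                      /* load, increment, store          */
    pthread_mutex_unlock(&c_lock);  /* release: end critical section   */
}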
Relaxations of Mutual Exclusion So-called reader–writer locks increase concurrency by observing that it is safe for more than one thread to read a location concurrently, so long as no thread is modifying that location. Each critical section is classified as either a reader or a writer of the data associated with a given lock. The reader_acquire method waits until there is no concurrent writer of the lock; the writer_acquire method waits until there is no concurrent reader or writer. In a standard reader–writer lock, a thread must know, when it first reads a location, whether it will ever need to write that location in the current critical section. In some contexts it may be possible to relax this restriction. The Linux kernel, for example, provides a sequence lock mechanism that allows a reader to abort its peers and upgrade to writer status. Programmers are required to follow a restrictive programming discipline that makes critical sections “restartable,” and checks, before any write or “dangerous” read, to see whether a peer’s upgrade has necessitated a restart. For data structures that are almost always read, and very occasionally written, several operating system kernels provide some variant of a mechanism known as RCU (originally an abbreviation for read-copy update). RCU divides execution into so-called epochs. A writer creates a new copy of any data structure it needs to update. It replaces the old copy with the new, typically using a single CAS instruction. It then waits until the end of the current epoch to be sure that all readers that might have been using the old copy have completed their critical sections (at which point it can reclaim the old copy, or perform other actions that depend on the visibility of the update). The advantage of RCU, in comparison to locks, is that it imposes zero overhead in the read-only case. For more general-purpose use, transactional memory (TM) allows arbitrary operations to be executed atomically, with an underlying implementation based on speculation and rollback. Originally proposed [] as a hardware assist for lock-free data structures – sort of a multi-word generalization of LL/SC – TM has seen a flurry of activity in recent years, and several hardware
and software implementations are now widely available. Each keeps track of the memory locations accessed by transactions (would-be atomic operations). When two concurrent transactions are seen to conflict, at most one is allowed to commit; the others abort, “roll back,” and try again, using a fully automated, transparent analogue of the programming discipline required by sequence locks. For further details, see the separate entry on TM.
Fairness Because they sometimes force multiple threads to wait, synchronization mechanisms inevitably raise issues of fairness. When a lock is released by the current holder, which waiting thread should be allowed to acquire it? In a system with reader–writer locks, should a thread be allowed to join a group of already-active readers when writers are already waiting? When transactions conflict in a TM system, which should be permitted to proceed, and which should wait or abort? Many answers are possible. The choice among conflicting threads may be arbitrary, random, first-come-first-served (FIFO), or based on some other notion of priority. From the point of view of an individual thread, the resulting behavior may range from potential starvation (no progress guarantees) to some sort of proportional share of system run time. Between these extremes, a thread may be guaranteed to run eventually if it is continuously ready, or if it is ready infinitely often. Even given the possibility of starvation, the system as a whole may be livelock-free (guaranteed to make forward progress) as a result of algorithmic guarantees or pseudo-random heuristics. (Actual livelock is generally considered unacceptable.) Any starvation-free system is clearly livelock-free.
Simple Busy-Wait Locks Several early locking algorithms were based on only loads and stores, but these are mainly of historical interest today. All required Ω(tn) space for t threads and n locks, and ω(1) (more-than-constant) time to arbitrate among threads competing for a given lock. In modern usage, the simplest constant-space, busy-wait mutual exclusion lock is the test-and-set (TAS)
lock, in which a thread acquires the lock by using a test-and-set instruction to change a Boolean flag from false to true. Unfortunately, spinning by waiting threads tends to induce extreme contention for the lock location, tying up bus and memory resources needed for productive work. On a cache-coherent machine, better performance can be achieved with a “test-and-test-and-set” (TATAS) lock, which reduces contention by using ordinary load instructions to spin on a value in the local cache so long as the lock remains held:

type lock = Boolean;

proc acquire(lock *l):
    while (test-and-set(l))
        while (*l) /* spin */ ;

proc release(lock *l):
    *l := false;

This lock works well on small machines (up to, say, four processors). Which waiting thread acquires a TATAS lock at release time depends on vagaries of the hardware, and is essentially arbitrary. Strict FIFO ordering can be achieved with a ticket lock, which uses fetch-and-increment (FAI) and a pair of counters for constant space and constant (per-thread) time. To acquire the lock, a thread atomically performs an FAI on the “next available” counter and waits for the “now serving” counter to equal the value returned. To release the lock, a thread increments its own ticket, and stores the result to the “now serving” counter. While arguably fairer than a TATAS lock, the ticket lock is more prone to performance anomalies on a multiprogrammed system: if any waiting thread is preempted, all threads behind it in line will be delayed until it is scheduled back in.
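A ticket lock of the kind just described might be sketched with C11 atomics as follows; this is an illustration of the idea rather than a tuned implementation, and the type and function names are ours.

/* Sketch of a ticket lock using fetch-and-increment (C11 atomics).
 * FIFO ordering: each thread spins until its ticket is being served. */
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* "next available" counter */
    atomic_uint now_serving;   /* "now serving" counter    */
} ticket_lock;

void ticket_acquire(ticket_lock *l) {
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);   /* FAI */
    while (atomic_load(&l->now_serving) != my)
        ;                                                 /* spin */
}

void ticket_release(ticket_lock *l) {
    /* only the holder writes now_serving, so a plain increment suffices */
    atomic_store(&l->now_serving, atomic_load(&l->now_serving) + 1);
}

The backoff techniques discussed in the next section can be added inside the spin loop to reduce contention.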
Scalable Busy-Wait Locks On a machine with more than a handful of processors, TATAS and ticket locks scale poorly, with time per critical section growing linearly with the number of waiting threads. Anderson [] showed that exponential backoff (reminiscent of the Ethernet contention-control algorithm) could substantially improve the performance of TATAS locks. Mellor-Crummey and Scott [] showed similar results for linear backoff in ticket locks (where
a thread can easily deduce its distance from the head of the line). To eliminate contention entirely, waiting threads can be linked into an explicit queue, with each thread spinning on a separate location that will be modified when the thread ahead of it in line completes its critical section. Mellor-Crummey and Scott showed how to implement such queues in total space O(t + n) for t threads and n locks; their MCS lock is widely used in large-scale systems. Craig [] and, independently, Landin and Hagersten [] developed an alternative CLH lock that links the queue in the opposite direction and performs slightly faster on some cache-coherent machines. Auslander et al. developed a variant of the MCS lock that is API-compatible with traditional TATAS locks []. Kontothanassis et al. [] and He et al. [] developed variants of the MCS and CLH locks that avoid performance anomalies due to preemption of threads waiting in line.
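The queue-based approach can be sketched roughly as follows with C11 atomics; this is a simplified illustration of the MCS idea (each thread supplies its own queue node and spins only on that node), not a drop-in replacement for the published algorithm, and the names are ours.

/* Simplified MCS queue-lock sketch (C11 atomics). Each thread passes its
 * own qnode; waiting threads spin on a flag in their own node. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    _Atomic(struct qnode *) next;
    atomic_bool             locked;
} qnode;

typedef struct {
    _Atomic(qnode *) tail;   /* NULL when the lock is free */
} mcs_lock;

void mcs_acquire(mcs_lock *l, qnode *me) {
    atomic_store(&me->next, (qnode *)NULL);
    qnode *pred = atomic_exchange(&l->tail, me);     /* swap: join the queue    */
    if (pred != NULL) {                              /* lock held: must wait    */
        atomic_store(&me->locked, true);
        atomic_store(&pred->next, me);               /* link behind predecessor */
        while (atomic_load(&me->locked))
            ;                                        /* spin on own node        */
    }
}

void mcs_release(mcs_lock *l, qnode *me) {
    qnode *succ = atomic_load(&me->next);
    if (succ == NULL) {
        qnode *expected = me;
        /* no known successor: try to reset the tail, freeing the lock */
        if (atomic_compare_exchange_strong(&l->tail, &expected, (qnode *)NULL))
            return;
        /* a successor is arriving; wait for it to link itself in */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);              /* hand the lock over      */
}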
Scheduler-Based Locks A busy-wait lock wastes processor resources when expected wait times are long. It may also cause performance anomalies or deadlock in a multiprogrammed system. The simplest solution is to yield the processor in the body of the spin loop, effectively moving the current thread to the end of the scheduler’s ready list and allowing other threads to run. More commonly, scheduler-based locks are designed to deschedule the waiting thread, moving it (atomically) from the ready list to a separate queue associated with the lock. The release method then moves one waiting thread from the lock queue to the ready list. To minimize overhead when waiting times are short, implementations of scheduler-based synchronization commonly spin for a small, bounded amount of time before invoking the scheduler and yielding the processor. This strategy is often known as spin-then-wait.
Condition Synchronization It is tempting to assume that busy-wait condition synchronization can be implemented trivially with a Boolean flag: a waiting thread spins until the flag is true; a thread that satisfies the condition sets the flag to true. On most modern machines, however, additional fence
instructions are required both in the satisfying thread, to ensure that its prior writes are visible to other threads, and in the waiting thread, to ensure that its subsequent reads do not occur until after the spin completes. And even on a sequentially consistent machine, special steps are required to ensure that the compiler does not violate the programmer’s expectations by reordering instructions within threads. In some programming languages and systems, a variable may be made suitable for condition synchronization by labeling it volatile (or, in C++0x, atomic<>). The compiler will insert appropriate fences at reads and writes of volatile variables, and will refrain from reordering them with respect to other instructions. Some other systems provide special event objects, with methods to set and await them. Semaphores and monitors, described in the following two subsections, can be used for both mutual exclusion and condition synchronization. In systems with dynamically varying concurrency, the fork and join methods used to create threads and to verify their completion can be considered a form of condition synchronization. (These are, in fact, the principal form of synchronization in systems like Cilk and OpenMP.)
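In C11/C++11 terms, the needed fences are implied by release and acquire operations on an atomic flag; the following sketch of busy-wait condition synchronization is ours, not code from this entry.

/* Busy-wait condition synchronization on a flag, with the fences implied
 * by release/acquire atomic operations (C11). Sketch only. */
#include <stdatomic.h>
#include <stdbool.h>

static int data;                       /* written before the flag is set */
static atomic_bool ready = false;

void producer(void) {
    data = 42;                                                  /* prior write */
    atomic_store_explicit(&ready, true, memory_order_release);  /* publish     */
}

int consumer(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                       /* spin        */
    return data;     /* guaranteed to observe the producer's prior write */
}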
Barriers One form of condition synchronization is particularly common in data-parallel applications, where threads iterate together through a potentially large number of algorithmic phases. A synchronization barrier, used to separate phases, guarantees that no thread continues to phase n + 1 until all threads have finished phase n. In most (though not all) implementations, the barrier provides a single method, composed internally of an arrival phase that counts the number of threads that have reached the barrier (typically via a log-depth tree) and a departure phase in which permission to continue is broadcast back to all threads. In a so-called fuzzy barrier [], these arrival and departure phases may be separate methods. In between, a thread may perform any instructions that neither depend on the arrival of other threads nor are required by other threads prior to their departure. Such instructions can serve to “smooth out” phase-by-phase imbalances in the work assigned
to different threads, thereby reducing overall wait time. Wait time may also be reduced by an adaptive barrier [, ], which completes the arrival phase in constant time after the arrival of the final thread. Unfortunately, when t threads arrive more or less simultaneously, no barrier implementation using ordinary loads, stores, and RMW instructions can complete the arrival phase in less than Ω(log t) time. Given the importance of barriers in scientific applications, some supercomputers have provided special near-constant-time hardware barriers. In some cases the same hardware has supported a fast eureka method, in which one thread can announce an event to all others in constant time.
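The arrival/departure structure described above can be sketched with a centralized sense-reversing barrier; this minimal Java illustration (with invented names) counts arrivals and broadcasts departure by flipping a shared flag, and stands in for, rather than reproduces, the tree-based and adaptive barriers cited here.

import java.util.concurrent.atomic.AtomicInteger;

public class SenseBarrier {
    private final int parties;
    private final AtomicInteger count;
    private volatile boolean sense = false;
    private final ThreadLocal<Boolean> mySense = ThreadLocal.withInitial(() -> true);

    public SenseBarrier(int parties) {
        this.parties = parties;
        this.count = new AtomicInteger(parties);
    }

    public void await() {
        boolean s = mySense.get();
        if (count.decrementAndGet() == 0) {   // arrival phase: last thread to arrive
            count.set(parties);               // reset the counter for the next phase
            sense = s;                        // departure phase: release everyone
        } else {
            while (sense != s) {              // wait for the departure broadcast
                Thread.onSpinWait();
            }
        }
        mySense.set(!s);                      // reverse sense for the next episode
    }
}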
Semaphores First proposed by Dijkstra in [] and still widely used today, semaphores support both mutual exclusion and condition synchronization. A general semaphore is a nonnegative counter with an initial value and two methods, known as V and P. The V method increases the value of the semaphore by one. The P method waits for the value to be positive and then decreases it by one. A binary semaphore has values restricted to zero and one (it is customarily initialized to one), and serves as a mutual exclusion lock. The P method acquires the lock; the V method releases the lock. Programming discipline is required to ensure that P and V methods occur in matching pairs. The typical implementation of semaphores pairs the counter with a queue of waiting threads. The V method checks to see whether the counter is currently zero. If so, it checks to see whether any threads are waiting in the queue and, if there are, moves one of them to the ready list. If the counter is already positive (in which case the queue is guaranteed to be empty) or if the counter is zero but the queue is empty, V simply increments the counter. The P method also checks to see whether the counter is zero. If so, it places the current thread on the queue and calls the scheduler to yield the processor. Otherwise it decrements the counter. General semaphores can be used to represent resources of which there is a limited number, but more than one. Examples include I/O devices, communication channels, or free or full slots in a fixed-length buffer. Most operating systems provide semaphores as part of the kernel API.
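To illustrate the resource-counting use of general semaphores mentioned above, the following sketch (with invented names) uses java.util.concurrent.Semaphore to count the free and full slots of a fixed-length buffer, plus a binary semaphore protecting the queue itself.

import java.util.ArrayDeque;
import java.util.concurrent.Semaphore;

public class BoundedBuffer<T> {
    private final ArrayDeque<T> items = new ArrayDeque<>();
    private final Semaphore freeSlots;                     // general semaphore: empty slots
    private final Semaphore fullSlots = new Semaphore(0);  // general semaphore: filled slots
    private final Semaphore mutex = new Semaphore(1);      // binary semaphore: mutual exclusion

    public BoundedBuffer(int capacity) {
        freeSlots = new Semaphore(capacity);
    }

    public void put(T item) throws InterruptedException {
        freeSlots.acquire();        // P: wait for a free slot
        mutex.acquire();            // P: acquire the lock
        items.addLast(item);
        mutex.release();            // V: release the lock
        fullSlots.release();        // V: announce a filled slot
    }

    public T take() throws InterruptedException {
        fullSlots.acquire();
        mutex.acquire();
        T item = items.removeFirst();
        mutex.release();
        freeSlots.release();
        return item;
    }
}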
Monitors While semaphores remain the most widely used scheduler-based shared-memory synchronization mechanism, they suffer from several limitations. In particular, the association between a binary semaphore (mutex lock) and the data it protects is solely a matter of convention, as is the paired usage of P and V methods. Early experience with semaphores, combined with the development of language-level abstraction mechanisms in the s, led several developers to suggest building higher-level synchronization abstractions into programming languages. These efforts culminated in the definition of monitors [], variants of which appear in many languages and systems. A monitor is a data abstraction (a module or class) with an implicit mutex lock and an optional set of condition variables. Each entry (method) of the monitor automatically acquires and releases the mutex lock; entry invocations thus exclude one another in time. Programmers typically devise, for each monitor, a program-specific invariant that captures the mutual consistency of the monitor’s state (data members – fields). The invariant is assumed to be true at the beginning of each entry invocation, and must be true again at the end. Condition variables support a pair of methods superficially analogous to P and V; in Hoare’s original formulation, these were known as wait and signal. Unlike P and V, these methods are memory-less: a signal invocation is a no-op if no thread is currently waiting. For each reason that a thread might need to wait within a monitor, the programmer declares a separate condition variable. When it waits on a condition, the thread releases exclusion on the monitor. The programmer must thus ensure that the invariant is true immediately prior to every wait invocation.
Semantic Details The details of monitors vary significantly from one language to another. The most significant issues, discussed in the paragraphs below, are commonly known as the nested monitor problem and the modeling of signals as hints vs. absolutes. More minor issues include language syntax, alternative names for signal and wait, the modeling of condition variables in the type system, and the prioritization of threads waiting for conditions or for access to the mutex lock.
The nested monitor problem arises when an entry of one monitor invokes an entry of another monitor, and the second entry waits on a condition variable. Should the wait method release exclusion on the outer monitor? If it does, there is no guarantee that the outer monitor will be available again when execution is ready to resume in the inner call. If it does not, the programmer must take care to ensure that the thread that will perform the matching signal invocation does not need to go through the outer monitor in order to reach the inner one. A variety of solutions to this problem have been proposed; the most common is to leave the outer monitor locked. Signal methods in Hoare’s original formulation were defined to transfer monitor exclusion directly from the signaler to the waiter, with no intervening execution. The purpose of this convention was to guarantee that the condition represented by the signal was still true when the waiter resumed. Unfortunately, the convention often has the side effect of inducing extra context switches, and requires that the monitor invariant be true immediately prior to every signal invocation. Most modern monitor variants follow the lead of Mesa [] in declaring that a signal is merely a hint, and that a waiting process must double-check the condition before continuing execution. In effect, code that would be written if (!condition) cond_var.wait(); in a Hoare monitor is written while (!condition) cond_var.wait(); in a Mesa monitor. To make it easier to write programs in which a condition variable “covers” a set of possible conditions (particularly when signals are hints), many monitor variants provide a signal-all or broadcast method that awakens all threads waiting on a condition, rather than only one.
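A minimal Mesa-style monitor can be sketched in Java, whose synchronized methods and wait/notifyAll play the roles of the implicit mutex and condition operations described above; the single-slot buffer and its invariant are invented for the example. Note the while loops: because a notification is only a hint, the condition is re-checked before continuing.

public class OneSlotBuffer {
    private Integer slot = null;    // invariant: the slot is either empty (null) or full

    public synchronized void put(int value) throws InterruptedException {
        while (slot != null) {      // hint, not absolute: re-check after waking
            wait();
        }
        slot = value;
        notifyAll();                // "signal all" waiting consumers and producers
    }

    public synchronized int take() throws InterruptedException {
        while (slot == null) {
            wait();
        }
        int value = slot;
        slot = null;
        notifyAll();
        return value;
    }
}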
Message Passing In a system in which threads interact by exchanging messages, rather than by sharing variables, synchronization is generally implicit in the send and receive methods. A receive method typically blocks until an appropriate message is available (a matching send has
been performed). Blocking semantics for send methods vary from one system to another:
● Asynchronous send – In some systems, a sender continues execution immediately after invoking a send method, and the underlying system takes responsibility for delivering the message. While often desirable, this behavior complicates the delivery of failure notifications, and may be limited by finite buffering capacity.
● Synchronous send – In other systems – notably those based on Hoare’s Communicating Sequential Processes (CSP) [] – a sender waits until its message has been received.
● Remote-invocation send – In yet other systems, a send method has both ingoing and outgoing parameters; the sender waits until a reply is received from its peer.
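The first two styles can be approximated with standard Java queues, as in the following sketch; the channel names are invented, a bounded LinkedBlockingQueue stands in for asynchronous (buffered) send, and a SynchronousQueue makes put() block until the matching take(), as in a synchronous send. Remote-invocation send is not shown.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;

public class SendSemantics {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> asyncChannel = new LinkedBlockingQueue<>(16);  // buffered
        BlockingQueue<String> syncChannel = new SynchronousQueue<>();        // unbuffered

        asyncChannel.put("hello");     // asynchronous send: returns once the message is buffered

        Thread receiver = new Thread(() -> {
            try {
                System.out.println(syncChannel.take());   // matching receive
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        receiver.start();
        syncChannel.put("world");      // synchronous send: blocks until the receiver takes it
        receiver.join();

        System.out.println(asyncChannel.take());
    }
}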
Distributed Locking Libraries, languages, and applications commonly implement higher-level distributed locks or transactions on top of message passing. The most common lock implementation is analogous to the MCS lock: acquire requests are sent to a lock manager thread. If the lock is available, the manager responds directly; otherwise it forwards the request to the last thread currently waiting in line. The release method sends a message to the manager or, if a forwarded request has already been received, to the next thread in line for the lock. Races in which the manager forwards a request at the same time the last lock holder sends it a release are trivially resolved by statically choosing one of the two (perhaps the lock holder) to inform the next thread in line. Distributed transaction systems are substantially more complex.
Rendezvous and Remote Procedure Call In some systems, a message must be received explicitly by an already existing thread. In other systems, a thread is created by the underlying system to handle each arriving message. Either of these options – explicit or implicit receipt – can be paired with any of the three send options described above. The combination of remote-invocation send with implicit receipt is often called remote procedure call (RPC). The combination of remote-invocation send with explicit receipt is known
as rendezvous. Interestingly, if all shared data is encapsulated in monitors, one can model – or implement – each monitor with a manager thread that executes entry calls one at a time. Each such call then constitutes a rendezvous between the sender and the monitor.
Related Entries
Actors
Cache Coherence
Concurrent Collections Programming Model
Deadlocks
Memory Models
Monitors, Axiomatic Verification of
Non-Blocking Algorithms
Path Expressions
Processes, Tasks, and Threads
Race Conditions
Scheduling Algorithms
Shared-Memory Multiprocessors
Transactions, Nested
Bibliographic Notes The study of synchronization began in earnest with Dijkstra’s “Cooperating Sequential Processes” monograph of []. Andrews and Schneider provide an excellent survey of synchronization mechanisms circa []. Mellor-Crummey and Scott describe and compare a variety of busy-wait spin locks and barriers, and introduce the MCS lock []. More extensive coverage of synchronization can be found in Chapter of Scott’s programming languages text [], or in the recent texts of Herlihy and Shavit [] and Taubenfeld [].
Bibliography
Anderson TE (Jan ) The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans Parallel Distr Sys ():–
Andrews GR, Schneider FB (Mar ) Concepts and notations for concurrent programming. ACM Comput Surv ():–
Auslander MA, Edelsohn DJ, Krieger OY, Rosenburg BS, Wisniewski RW () Enhancement to the MCS lock for increased functionality and improved programmability. U.S. patent application, submitted Oct
Craig TS (Feb ) Building FIFO and priority-queueing spin locks from atomic swap. Technical Report, University of Washington Computer Science Department
Dijkstra EW (Sept ) Cooperating sequential processes. Technical report, Technological University, Eindhoven, The Netherlands. Reprinted in Genuys F (ed) Programming Languages, Academic Press, New York, pp –. Also available at www.cs.utexas.edu/users/EWD/transcriptions/EWDxx/EWD.html
Gupta R (Apr ) The fuzzy barrier: a mechanism for high speed synchronization of processors. Proceedings of the rd International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, MA, pp –
Gupta R, Hill CR (June ) A scalable implementation of barrier synchronization using an adaptive combining tree. Int J Parallel Progr ():–
He B, Scherer III WN, Scott ML (Dec ) Preemption adaptivity in time-published queue-based spin locks. Proceedings of the International Conference on High Performance Computing, Goa, India
Herlihy MP (Jan ) Wait-free synchronization. ACM Trans Progr Lang Syst ():–
Herlihy MP, Moss JEB () Transactional memory: architectural support for lock-free data structures. Proceedings of the th International Symposium on Computer Architecture, San Diego, CA, May, pp –
Herlihy MP, Shavit N () The Art of Multiprocessor Programming. Morgan Kaufmann, Burlington, MA
Hoare CAR (Oct ) Monitors: an operating system structuring concept. Commun ACM ():–
Hoare CAR (Aug ) Communicating sequential processes. Commun ACM ():–
Kontothanassis LI, Wisniewski R, Scott ML (Feb ) Scheduler-conscious synchronization. ACM Trans Comput Sys ():–
Lampson BW, Redell DD (Feb ) Experience with processes and monitors in Mesa. Commun ACM ():–
Magnussen P, Landin A, Hagersten E (Apr ) Queue locks on cache coherent multiprocessors. Proceedings of the th International Parallel Processing Symposium, Cancun, Mexico, pp –
Mellor-Crummey JM, Scott ML (Feb ) Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans Comput Syst ():–
Scott ML () Programming Language Pragmatics, rd edn. Morgan Kaufmann, Burlington, MA
Scott ML, Mellor-Crummey JM (Aug ) Fast, contention-free combining tree barriers. Int J Parallel Progr ():–
Taubenfeld G () Synchronization Algorithms and Concurrent Programming. Prentice Hall, Upper Saddle River
System Integration Terrestrial Ecosystem Carbon Modeling
System on Chip (SoC) SoC (System on Chip) VLSI Computation
Systems Biology, Network Inference in
Jaroslaw Zola and Srinivas Aluru
Iowa State University, Ames, IA, USA
Indian Institute of Technology Bombay, Mumbai, India
Synonyms Gene networks reconstruction; Gene networks reverse-engineering
Definition Inference of gene regulatory networks, also called reverse-engineering of gene regulatory networks, is a process of characterizing, either qualitatively or quantitatively, regulatory mechanisms in a cell or an organism from observed expression data.
Discussion Introduction Biological processes in every living organism are governed by complex interactions between thousands of genes, gene products, and other molecules. Genes that are encoded in the DNA are transcribed and translated to form multiple copies of gene products including proteins and various types of RNAs. These gene products coordinate to execute cellular processes – sometimes by forming supramolecular complexes (e.g., ribosome), or by acting in a concerted fashion, e.g., in biochemical or metabolic pathways. They also regulate the expression of genes, often through binding to cis-regulatory sequences upstream of the coding region of the genes, to calibrate gene expression depending on the endogenous and exogenous stimuli carried by, e.g., small molecules. Gene regulatory networks are conceptual representations of interactions between genes in a cell or an organism. They are depicted as graphs with vertices corresponding to genes and edges representing regulatory interactions between genes (see Fig. ). Overall,
gene regulatory networks are mathematical models to explain the observed gene expression levels. Network inference, or network reconstruction, is the process of identifying the underlying network from multiple observations of gene expressions (outputs of the network). To infer a gene network, one relies on experimental data from high-throughput technologies such as microarrays, quantitative polymerase chain reaction, or short-read sequencing, which measure a snapshot of all gene expression levels under a particular condition or in a time series.
Information Theoretic Approaches Consider a set of n genes {g1, g2, . . . , gn}, where for each gene a set of m expression measurements is given. One can represent the expression of gene i (gi) as a random variable Xi ∈ X, X = {X1, . . . , Xn}, with marginal probability pXi derived from some unknown joint probability characterizing the entire system. This random variable is described by observations {xi,1, . . . , xi,m}, where xi,j corresponds to the expression level of gi under condition j. The vector ⟨xi,1, xi,2, . . . , xi,m⟩ is called the profile of gi. Given a profile matrix Yn×m, Y[i, j] = xi,j, one can formulate the network inference problem as that of finding a model that best explains the data in Y. The problem formulated this way can be approached using a variety of methods, including Bayesian networks [] and Gaussian graphical models []; one class of methods that has been widely adopted uses the concept of mutual information. These methods [] operate under the assumption that correlation of expression implies coregulation, and proceed in two main phases: first, significant dependencies (connections between two genes) or independencies (lack of connections) are determined by means of computing mutual information for every pair of genes; then, identification and removal of indirect interactions (e.g., when two genes are coregulated by a third) is performed. Mutual information is arguably the best measure of correlation between two random variables, and is defined based on entropy H in the following way: I(Xi; Xj) = H(Xi) + H(Xj) − H(Xi, Xj), where entropy H is given by: H(X) = − ∑ pX(x) log pX(x),
Systems Biology, Network Inference in. Fig. Example gene regulatory network. Nodes represent genes, “T”-edges denote regulation in which a source gene represses expression of the target gene. Arrow-edges denote regulation in which a source gene induces expression of the target gene
where pX denotes the probability distribution of X and the sum is replaced by an integral if X is continuous. Mutual information is a symmetric, nonnegative function, and is equal to zero if and only if the two random variables are independent. Application of mutual information to gene network inference poses two significant challenges. The first is that the probability distribution of the random variable describing a gene is unknown and has to be estimated from the expression profile. Consequently, gene comparison becomes more difficult because even independent expression profiles can result in mutual information greater than zero (owing to sampling and estimation errors). This in turn requires some mechanism to decide if a given mutual information estimate is statistically significant. The second challenge is due to the fact that a typical
genome-level network covers thousands of genes, and hence “all-pairs” comparison adds considerably to computational requirements. In practice, several mutual information estimators are available that offer different precision to complexity ratios (e.g., Gaussian kernel estimator, B-spline estimator), and complex statistical techniques are employed to decide if observed mutual information implies dependency. Although all information theoretic approaches for reverse-engineering depend on “all-pairs” mutual information kernel executed in the first stage, they differ in how they identify indirect interactions in the second stage. For example, in relevance networks [] the second stage is omitted, in ARACNe [] and TINGe [, ] the Data Processing Inequality concept is used, while CLR algorithm [] depends on estimates of a likelihood
of obtained mutual information values. Some other methods extend mutual information into conditional mutual information or augment it with feature selection techniques.
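As a simplified, concrete illustration of the quantities defined above, the following sketch estimates mutual information from two expression profiles that have already been discretized into b bins; it is a plain plug-in (histogram) estimator with invented names, whereas the tools discussed here use smoother B-spline or Gaussian kernel estimators and add a statistical significance test.

public class MutualInformation {
    // x and y hold bin indices in [0, b) for the same m conditions.
    public static double estimate(int[] x, int[] y, int b) {
        int m = x.length;
        double[] px = new double[b];
        double[] py = new double[b];
        double[][] pxy = new double[b][b];
        for (int j = 0; j < m; j++) {               // empirical marginal and joint distributions
            px[x[j]] += 1.0 / m;
            py[y[j]] += 1.0 / m;
            pxy[x[j]][y[j]] += 1.0 / m;
        }
        double mi = 0.0;                            // I(X;Y) = H(X) + H(Y) - H(X,Y)
        for (int u = 0; u < b; u++) {
            for (int v = 0; v < b; v++) {
                if (pxy[u][v] > 0) {
                    mi += pxy[u][v] * Math.log(pxy[u][v] / (px[u] * py[v]));
                }
            }
        }
        return mi;                                  // in nats; divide by Math.log(2) for bits
    }
}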
Parallel Information Theoretic Approach Reverse engineering of regulatory networks using mutual information is compute and memory intensive, especially if whole-genome (i.e., covering all genes of an organism) networks are considered. Memory consumption arises from the Θ(nm) size of the input data, and from the Θ(n²) dense initial network generated in the first phase of the reconstruction algorithm. Adding to this is the complexity of mutual information estimators, which for the Gaussian kernel estimator, for instance, is O(m²). Taking into account that the number of genes typically considered is in the thousands, and at the same time genome-level inference requires that the number of observations m is large, the problem becomes prohibitive for sequential computers. In [], Zola et al. proposed a parallel information theory-based inference method that efficiently exploits multiple levels of parallelism inherent to mutual information computations and uses a generalized scheme for pairwise computation scheduling. The method has been implemented in the MPI-based software package called TINGe (Tool Inferring Networks of Genes), along with a version that supports the use of Cell accelerators. The algorithm proceeds in three stages. In the first stage, input expression profiles are rank-transformed and mutual information is computed for each of the (n choose 2) pairs of genes, and for q randomly chosen permutations per pair. Rank transformation substitutes a gene expression profile with a permutation of ⟨1, . . . , m⟩ by replacing a gene expression with its rank among all gene expressions within the same expression profile. It has been shown that mutual information is invariant under this transformation []. By applying it the algorithm can reduce the total number of mutual information estimations between gene expression vectors and their random permutations []. In the second phase, the threshold value above which mutual information is considered to signify dependence is computed, and edges below this threshold are discarded. The threshold is computed by finding the element with rank (1 − ε) ⋅ q ⋅ (n choose 2) among the q ⋅ (n choose 2) values contributed by the permutations generated in the earlier stage, where ε specifies the desired statistical significance of the corresponding permutation test. Finally, in the third stage the data processing inequality is applied, with the consequence that if gi interacts with gk via some other gene gj then I(Xi; Xk) ≤ min(I(Xi; Xj), I(Xj; Xk)). The algorithm represents the gene network using the standard adjacency matrix Dn×n. The input and output data are distributed row-wise among p processors. Each processor stores up to ⌈n/p⌉ consecutive rows of matrix Y and the same number of consecutive rows of matrix D. Matrix D is then partitioned into p × p blocks of submatrices, which are computed in ⌈(p + 1)/2⌉ iterations, where in iteration i the processor with rank j computes submatrix Dj,(j+i) mod p. To implement the second stage, a simple reduction operation is used to find the threshold value, followed by pruning of matrix D. Finally, removal of indirect interactions is performed based on streaming of matrix D in p − 1 communication rounds, where in iteration i only processors with ranks lower than p − i participate in communication and computation. In their method, Zola et al. use a B-spline mutual information estimator which is implemented to take advantage of the SIMD extensions of modern processors. However, any mutual information estimator could be used. Furthermore, they report a modification of the first stage of the algorithm that enables execution on Cell heterogeneous processors. The method has been used to reconstruct a ,-gene network of the model plant Arabidopsis thaliana from , microarray experiments in minutes on a ,-core IBM Blue Gene/L, and in h and min on a -node QS Cell blade cluster.
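A toy sketch (not the TINGe code) of the block schedule just described: D is viewed as p × p blocks, and in iteration i the processor of rank j is responsible for block D[j][(j + i) mod p]; printing the schedule for a small p shows how ⌈(p + 1)/2⌉ iterations cover the symmetric matrix.

public class BlockSchedule {
    public static void main(String[] args) {
        int p = 4;                                   // assumed number of processors
        int iterations = (p + 2) / 2;                // ceil((p + 1) / 2)
        for (int i = 0; i < iterations; i++) {
            StringBuilder line = new StringBuilder("iteration " + i + ":");
            for (int j = 0; j < p; j++) {            // every rank works in every iteration
                line.append("  rank ").append(j)
                    .append(" -> D[").append(j).append("][").append((j + i) % p).append("]");
            }
            System.out.println(line);
        }
    }
}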
Approaches Based on Bayesian Networks Bayesian networks are a class of graphical models that represent probabilistic relationships among random variables of a given domain. Formally, a Bayesian network is a pair (N, P), where P is the joint probability distribution and N is a directed acyclic graph, with vertices representing random variables and edges corresponding to “parent – child” relationship between variables, which encodes the Markov assumption that a node is conditionally independent from its non-descendants, given its parents in N. Under this assumption one can
represent the joint probability as a product of conditional probabilities: P(X1, . . . , Xn) = ∏i P(Xi ∣ πi),
where πi is the set of parents of Xi in N. Given a set of realizations (observations) of the random variables, one can learn a structure of the Bayesian network that best fits the observed data. Bayesian networks have been widely employed for reverse-engineering of gene regulatory networks [, , ]. Following the same formalization as for the information theoretic approaches described above, the problem of gene network inference becomes that of learning the structure of the corresponding Bayesian network. Nodes of the network are random variables assigned to genes, X = {X1, . . . , Xn}, expression profiles are realizations of those variables, and a Bayesian network learned from such data represents a gene regulatory network, where πi is interpreted as the set of regulators of gene gi. In order to learn the structure of a Bayesian network, a statistically motivated scoring function that evaluates the posterior probability of a network given the input data is typically assumed: Score(N) = log P(N ∣ Y). To find the optimal network efficiently, such a function should be decomposable into individual score contributions s(Xi, πi), i.e.: Score(N) = ∑i s(Xi, πi).
A major difficulty in Bayesian network structure learning is the super-exponential growth of the search space in the number of random variables – for a set of n variables there exist on the order of n! ⋅ 2^(n(n−1)/2) / (r ⋅ z^n) possible directed acyclic graphs, where r and z are constants (r ≈ . and z ≈ .).
Parallel Exact Structure Learning Even assuming that the cost of evaluating scoring function is negligible, exhaustive enumeration of all possible network structures remains prohibitive. Although heuristics have been proposed to tackle the problem, e.g., based on simulated annealing, oftentimes reconstructing the optimal network is advantageous as it enables more meaningful conclusions. Nikolova et al. [] proposed an elegant parallel exact algorithm for learning Bayesian networks that builds on top of earlier sequential methods []. In this
approach a network is represented as a permutation of nodes where each node is preceded by its parents. The algorithm identifies the optimal ordering, and a corresponding optimal network, using a dynamic programming approach that operates on the lattice formed on the power set of X by the partial order “set inclusion.” The lattice is organized into n + 1 levels, where level l ∈ [0, n] contains all subsets of size l, and a node at level l has l incoming and n − l outgoing edges. At each node of the lattice, (n − l) evaluations of individual scores s have to be performed as part of the dynamic programming search, and the results next have to be communicated along the outgoing edges. The key component of the approach by Nikolova et al. is the observation that the dynamic programming lattice forms an n-dimensional hypercube that can be decomposed on p = 2^k processors into 2^(n−k) k-dimensional hypercubes, each mapping onto the p processors. These hypercubes can be processed in a pipelined fashion that provides a work-optimal algorithm. The reported method has been used to reverse-engineer regulatory networks using synthetic data with up to genes and microarray observations, and applying the Minimum Description Length principle [] as a scoring function. To reconstruct a network for the largest data set it took h and min on , processors of an IBM BlueGene/L.
Approaches Based on Differential Equations Under several simplifying assumptions the dynamics of a gene’s expression can be modeled as a function of abundance of all other genes and the rate of degradation: x˙i = fi (x) − λ(xi ), where fi is called input function of gene i, xi is expression of gene i, vector x represents expression levels of all genes, and λ describes the rate of degradation. One can further assume that input functions are linear and in such cases the dynamics of the entire system can be represented as: x˙ = A ⋅ x, where An×n is a matrix describing influences of genes on each other (including rates of degradation), i.e., A[i, j] represents the influence of gene gj on gi . Consequently
by solving such system of equations one would obtain the underlying gene network represented by matrix A. Unfortunately finding the solution requires a large number of measurements of x and x˙ since otherwise the system is greatly underdetermined. At the same time obtaining representative measurements is experimentally very challenging and sometimes infeasible. To overcome this limitation, and to enable linearization of systems with nonlinear gene input functions, Gardner et al. [] proposed an approach in which m perturbation experiments (i.e., experiments in which expression of selected genes is affected in a controlled way) are performed, and resulting expression profiles are used to write the following system of differential equations: Y˙ = AY + U, which at steady state gives: AY = −U. Here Yn×m is a matrix describing expression of all genes in all experiments, i.e., Y[i, j] describes expression of gene i under perturbation j, and matrix Un×m represents the effect of perturbations on every gene in all m experiments. Because in the majority of cases m < n and the resulting system remains underdetermined, Gardner et al. assumed that each gene can have at most k regulatory inputs (which is biologically plausible), and then applied multiple regression for every possible combination of k regulators, choosing the one that best fits the data to approximate A.
Parallelization of Multiple Regression Algorithms The approach of Gardner et al., named Network Identification by Multiple Regression, is computationally prohibitive for networks with more than a few dozen genes, as it is infeasible to consider all (n choose k) combinations of regulators, especially since for each gene and each combination of regulators the expression âi = −ui Z^T (Z Z^T)^(−1) must be evaluated to find the combination for which âi minimizes the sum of squared errors with respect to the observed data. Here, ui represents the row of matrix U for gene i, and Z consists
of k rows selected from Y that correspond to the k selected regulator genes. In [], Gregoretti et al. describe a parallel implementation of the algorithm that replaces the exact enumeration of all combinations of regulators with a greedy search heuristic that starts with the d best candidate regulators, which are next iteratively combined with other genes to form the final set of regulators. Furthermore, they observe that it is possible to obtain âi by solving the system of linear equations S ai = −r, where S is the symmetric submatrix of Y Y^T with the k rows and columns that correspond to the considered regulators, and r is the i-th row of Y^T. Because S is positive definite, a Cholesky factorization can be used to efficiently solve this system. In practice, the implementation uses the PPSV routine, which has multiple parallel implementations. Finally, searching for regulators of a gene can be performed independently for every gene. Consequently, in the parallel version each of the p processors is assigned at most ⌈n/p⌉ genes for which it executes the search heuristic. The main limitation of the multiple regression algorithm is that it requires perturbation data as an input. Because in many cases such data is not available, Gregoretti et al. used synthetic data consisting of , genes with a single perturbation experiment for each gene. This data has been analyzed on a cluster with nodes, each with a dual-core Itanium processor, and with a Quadrics ELAN interconnect, in approximately h and min.
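For illustration, the per-candidate solve described above can be sketched as a dense Cholesky factorization of S followed by two triangular solves; this is a hypothetical sequential stand-in for the parallel PPSV call, with names invented for the example.

public class RegulatorSolve {
    // Solve S a = -r for a symmetric positive-definite k x k matrix S.
    public static double[] solve(double[][] S, double[] r) {
        int k = S.length;
        double[][] L = new double[k][k];
        for (int i = 0; i < k; i++) {                // Cholesky factorization S = L L^T
            for (int j = 0; j <= i; j++) {
                double sum = S[i][j];
                for (int t = 0; t < j; t++) {
                    sum -= L[i][t] * L[j][t];
                }
                L[i][j] = (i == j) ? Math.sqrt(sum) : sum / L[j][j];
            }
        }
        double[] y = new double[k];                  // forward substitution: L y = -r
        for (int i = 0; i < k; i++) {
            double sum = -r[i];
            for (int t = 0; t < i; t++) {
                sum -= L[i][t] * y[t];
            }
            y[i] = sum / L[i][i];
        }
        double[] a = new double[k];                  // backward substitution: L^T a = y
        for (int i = k - 1; i >= 0; i--) {
            double sum = y[i];
            for (int t = i + 1; t < k; t++) {
                sum -= L[t][i] * a[t];
            }
            a[i] = sum / L[i][i];
        }
        return a;
    }
}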
Future Directions The problem of reverse engineering gene regulatory networks is one of many in the broad area of computational systems biology, and parallel processing has only recently attracted the attention of systems biology researchers. Together with the rapid progress in high-throughput biological technologies, one can expect the accumulation of massive and diverse data, which will enable more complex and realistic models of regulation. Most likely these models will evolve so as to enable in silico simulation of biological systems, which is one of the goals of the emerging field of synthetic biology. Consequently, parallel processing in its various flavors, ranging from accelerators and multicore
processors to clusters, grids, and clouds, will be necessary to tackle the resulting computational complexity.
Bibliographic Notes and Further Reading The textbook by Alon [] and articles by Kitano [, ] and Murali and Aluru [] provide a good general introduction to systems biology. A brief overview of existing approaches and challenges in gene networks inference can be found in [, ]. In [] Friedman gives an introduction to using graphical models for network inference, and Meyer et al. review information theoretic approaches in []. The “Dialogue for Reverse Engineering Assessments and Methods” project [] provides a robust set of benchmark data that can be used to assess the quality of inference methods, and it is a good source of information about developments in the area of networks reconstruction.
Bibliography
Alon U () An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman & Hall/CRC, Boca Raton
Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D () How to infer gene networks from expression profiles. Mol Syst Biol :
Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A () Reverse engineering of regulatory networks in human B cells. Nat Genet ():–
Butte AJ, Kohane IS () Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In Pacific Symposium on Biocomputing, pp –
Cover TM, Thomas JA () Elements of Information Theory, nd edn. Wiley, New York
Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS () Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol ():e
Friedman N () Inferring cellular networks using probabilistic graphical models. Science :–
Friedman N, Linial M, Nachman I, Pe’er D () Using Bayesian networks to analyze expression data. J Comput Biol :–
Gardner TS, di Bernardo D, Lorenz D, Collins JJ () Inferring genetic networks and identifying compound mode of action via expression profiling. Science ():–
Gregoretti F, Belcastro V, di Bernardo D, Oliva G () A parallel implementation of the network identification by multiple regression (NIR) algorithm to reverse-engineer regulatory gene networks. PLoS One ():e
Grunwald PD () The Minimum Description Length Principle. MIT Press, Cambridge
Kitano H () Computational systems biology. Nature ():–
Kitano H () Systems biology: a brief overview. Science ():–
Margolin A, Califano A () Theory and limitations of genetic network inference from microarray data. Ann N Y Acad Sci :–
Meyer PE, Kontos K, Lafitte F, Bontempi G () Information-theoretic inference of large transcriptional regulatory networks. EURASIP J Bioinform Syst Biol :
Murali TM, Aluru S () Algorithms and Theory of Computation Handbook, chapter Computational Systems Biology. Chapman & Hall/CRC, Boca Raton
Nikolova O, Zola J, Aluru S () A parallel algorithm for exact Bayesian network inference. In IEEE Proceedings of the International Conference on High Performance Computing (HiPC ), pp –
Ott S, Imoto S, Miyano S () Finding optimal models for small gene networks. In Pacific Symposium on Biocomputing, pp –
Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G, Stolovitzky G () Towards a rigorous assessment of systems biology models: the DREAM challenges. PLoS One ():e
Schafer J, Strimmer K () An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics ():–
Yu J, Smith V, Wang PP, Hartemink AJ, Jarvis ED () Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics ():–
Zola J, Aluru M, Aluru S () Parallel information theory based construction of gene regulatory networks. In Proceedings of the International Conference on High Performance Computing (HiPC ). LNCS, vol , pp –
Zola J, Aluru M, Sarje A, Aluru S () Parallel information-theory-based construction of genome-wide gene regulatory networks. IEEE Trans Parallel Distributed Syst ():–
Systolic Architecture Systolic Arrays
Systolic Arrays
James R. Reinders
Intel Corporation, Hillsboro, OR, USA
Synonyms Instruction systolic arrays; Processor arrays; Systolic architecture; Wavefront arrays
Definition A Systolic Array is a collection of processing elements, called cells, that implements an algorithm by rhythmically computing and transmitting data from cell to cell using only local communication. Cells of a Systolic Array are arranged and connected in a regular pattern with a design that emphasizes a balance between computational and communicational capabilities. Systolic Arrays have proven particularly effective for real-time applications in signal and image processing. Systolic Algorithms are algorithms specifically designed to make effective use of Systolic Arrays. Systolic Algorithms have also been shown to make particularly efficient use of many general-purpose parallel computers. The name Systolic Arrays derives from an analogy with the regular pumping of blood by the heart. Systolic, in medical terms, refers to the phase of blood circulation in which the pumping chambers of the heart, ventricles, are contracting forcefully and therefore blood pressure is at its highest.
Discussion Background: Motivated by Emergence of VLSI Systolic Arrays, first described in by H. T. Kung and Charles E. Leiserson, were originally described as a systematic approach to take advantage of rapid advances in VLSI technology and coping with difficulties present in designing VLSI systems. The simplicity and regularity of Systolic Arrays lead to a cheaper VLSI implementation as well as higher chip density. Systolic Arrays were originally proposed for VLSI implementation of some matrix operations that were shown to have efficient solutions thereafter known as Systolic Algorithms. Systolic Algorithms support high degrees of concurrency while requiring only simple, regular communication and control, which in turn allows for efficient implementation in hardware. Early VLSI offered high degrees of integration but no speed advantage over the decade-old TTL technology. While tens of thousands of gates could be integrated on a single chip, it appeared that computational speed would only come from the concurrent use of many processing elements. In ,
H. T. Kung wrote, “Since the technological trend clearly indicates a diminishing growth rate for component speed, any major improvement in computation speed must come from the concurrent use of many processing elements.” Furthermore, large-scale designs were pushing the limits of the methodologies of the day. It was observed at the time that the general practice of ad hoc designs for VLSI systems was not contributing to sufficient accumulation of experiences, and errors were often repeated. By providing general guidelines, the concept of a Systolic Array, a general methodology emerged for mapping high-level computations into hardware structures. Cost-effectiveness emerged as a chief concern with special-purpose systems; costs need to be low enough to justify construction of any device with limited applicability. The cost of special-purpose systems can be greatly reduced if a limited number of simple substructures or building blocks, such as a cell, can be reused repeatedly for the overall design. The VLSI implications of Systolic Array designs proved to be substantial. Spatial locality divides the time to design a chip by the degree of regularity. Locality avoids the need for long and therefore high capacitive wires, temporal regularity and synchrony reduces control issues, pipelinability yields performance, I/O closeness holds down I/O bandwidth, and modularity allows for parameterized designs. Special-purpose systems pushed the limits of technology in order to achieve the performance needed to justify their cost. Systolic Arrays embraced VLSI circuit technology to achieve high performance via parallelism, honoring the scarcity of power and resistive delay by using a communication topology devoid of long inter-processor wires. Such communication topologies require chip area that is only linear in the number of processors. Systolic Arrays also embraced the design economics of special-purpose processors by reusing a limited number of cell designs. Systolic Arrays proved especially well suited for processing data from sensor devices as an attached processor. Stringent time requirements of real-time signal processing and large-scale scientific computation strongly favor special-purpose devices that can be built in a cost-effective and reliable fashion because they deliver the performance needed.
Concept Unlike general-purpose processors, a Systolic Array is characterized by its regular data flow. Typically, two or more data streams flow through a Systolic Array at various speeds and in various directions. The crux of the Systolic Array approach is that once a stream of data is formed, it can be used effectively by each processing element it passes. A higher computational throughput is therefore achieved as compared with a general-purpose processor, whose computational speed may be limited by the I/O bandwidth. When a complex algorithm can be decomposed into fine-grained, regular operations, each operation will then be simpler to implement. A Systolic Array consists of a set of interconnected cells, each capable of performing at least simple computational operations, and communicating only with cells in close proximity. Simple, regular communication and control structures offer substantial advantages over complicated ones in both design and implementation. Many shapes for an “array” are possible, and have been proposed including triangular, but simple two-dimensional meshes, or tori, have dominated as Systolic Arrays have trended to become more general because of the flexibility and simplicity of meshes and tori (Fig. ). Systolic Arrays are specifically designed to address the communication requirements of parallel computer systems by placing an emphasis on strong connections between computation and communication in order to achieve both balance and scaling. Systolic Arrays directly address the importance of the communication system for scalable parallel systems by providing direct paths between the communication system and the computational units. Systolic Algorithms address the need for “balanced algorithms” to best utilize parallelism. In a Systolic Array, data flows from the computer memory in a rhythmic fashion, passing through many
processing cells before it returns to memory, much as blood circulates into and out of the heart. The system works like an assembly line where many people work on the same automobile at different times and many cars are assembled simultaneously. The network for the flow of data can offer different degrees of parallelism, and data flow itself may proceed at different speeds and in multiple directions. Traditional pipelined systems flow only results, whereas a Systolic Array flow includes inputs and partial results. The essential characteristic of a Systolic Array is an emphasis on balance between computational and communicational capabilities, and the scalability of practical parallel systems. The features of Systolic Arrays, in pursuit of this emphasis on balance and scalability, are generally as follows:
● Spatial regularity and locality: the variety of processing cells is limited, and connections are limited to nearby processors. Cells are not connected via shared busses, which would not scale due to contention. There is neither global broadcasting nor global memory.
● Temporal regularity and synchrony: each cell acts as a finite state transducer. Cells do not need to execute the same program.
● Pipelinability: a design of N cells will exhibit a linear speedup O(N).
● I/O closeness: only cells on the boundaries of the array have access to “the outside world” to perform I/O.
● Modularity/scaling: a larger array can handle a larger instance of a problem than the smaller version of which it is an extension.
Replacing a processing cell with an array of cells offers a higher computation throughput without increasing memory bandwidth.
Systolic Arrays. Fig. Simple linear systolic array configuration (memory feeding a chain of cells, Cell 0 through Cell N, with the connection to the outside world at the boundary)
This realization of parallelism offers the advantages of both increased performance from increased concurrency and increased designer productivity from component reuse. Synchronous global clocking is not a requirement of Systolic Arrays despite it being a property of many early implementations. Systolic Arrays are distinguished by their pursuit of both balance and scaling with a design emphasis on strong connections between computation and communication in order to achieve both.
Importance of Interconnect Design A central issue for every parallel system is how the computational nodes of a parallel computer communicate with other nodes. There are problems that have been dubbed “embarrassingly parallel,” where little or no communication is necessary between the multiple processors performing subsets of the problem. For those applications, the computation speed alone will determine the speed of execution on a parallel system. There are numerous important problems that are not embarrassingly parallel, in which communication plays a critical role in determining the effective speed of execution as well as the degree to which the problem can scale to utilize parallelism. The exploration of “balanced algorithms” looked to find algorithms that scale without bounds. Such algorithms are marked by a constant ratio of computation to communication steps. These algorithms became known as “Systolic Algorithms.” While Systolic Algorithms can be mapped to a wide range of hardware, it was found that these algorithms tended to rely on finer and finer-grained communication as machine sizes increased. To efficiently deal with this fine-grained parallelism, communication with little to no overhead is needed. Systolic Arrays accomplish this goal by providing a method to directly couple computation to communication. In such machines, the design will seek to match the communicational capabilities to the computational capabilities of the machine. Systolic Arrays are one approach to addressing the communication requirements of parallel systems. Systolic Arrays acknowledge the importance of the communication system for scalable parallel systems and provide direct paths between the communication system and the computational units. The key concepts
behind a Systolic Array are the balance between computational and communicational capabilities, and the scalability of practical parallel systems. A balanced design minimizes inefficiencies as measured by underutilization of portions of a system due to stalls and bottlenecks. A scalable design allows for expanding performance by increasing the number of computational nodes in a system. Regularity, an often-noted characteristic of a Systolic Array, is a consequence of the scalability goal and not itself part of the definition of a Systolic Array. Another advantage of regularity is that the system can be scaled by repeating a common subsystem design, a benefit to designers that has reinforced use of regularity as a way to accomplish scalability. The Systolic Array approach initially led to very rigidly synchronous hardware designs that, while elegant, proved overly constraining for many programming solutions. While balance is desirable, the tight coupling of systolic systems is not always desirable. Designs of more programmable Systolic Arrays worked to preserve the benefits of systolic communication while generalizing the framework to allow for more application diversity. With tight coupling, any stall in communication will stall computation and vice versa. For many applications, a looser coupling that allowed coupling at a level of data blocks instead of individual data elements was advantageous. Looser coupling also leads to increased ability to overlap communication and computation in practice. As a result, the idea of Systolic Arrays evolved from an academic concept to realization in microprocessors such as the CMU/Intel iWarp.
Variations Increasing the degree of independence of individual processors in an array adds complexity for flexibility and affects the efficiency and performance of an array. One design consideration is whether individual processors have a local control store or a design where instructions were delivered via the processor interconnects to be executed upon arrival. The latter was sometimes referred to as an Instruction Systolic Array (ISA). Synchronous broadcasting of instructions to an array, as opposed to the flow of instructions via the interconnect, would be inconsistent with the goals of Systolic Array designs because it would introduce long paths and the associated delays. Processors with individual control
stores have been referred to as programmable Systolic Arrays, and therefore incur some overhead for the initial loading of the control stores. As a design methodology for VLSI, silicon implementations were generally hard coded to a large degree once a design was determined. The resulting array performed a fixed function and offered no ability to be loaded with a new algorithm for other functions. Such designs could be optimized to offer the most efficient implementations with the least flexibility for future changes by customizing the cells and the connection networks. More flexible designs offered more design reuse, and thereby offered the opportunity to amortize development costs over multiple uses in exchange for some loss of efficiency, increase in silicon size and generally some loss of array performance. The most flexible implementations of Systolic Arrays were machines developed principally for academic and research purposes to host the exploration and demonstration of Systolic Algorithms. The Warp and iWarp machines, developed under the guidance of H. T. Kung at Carnegie Mellon University, were such machines.
Systolic Algorithms Systolic Arrays have garnered substantial attention in the development of Systolic Algorithms, which have proven to be efficient at creating solutions suited to the
demands of real-time systems in terms of reliability and the ability to meet stringent time constraints. Development of error detection and fault tolerance in Systolic Algorithms has also received substantial attention. Systolic Algorithms are among the most efficient class of algorithms for massively parallel implementation. A systolic algorithm can be thought of as having two major parts: a cell program and a data flow specification. The cell program defines the local operations of each processing element, while the data flow describes the communication network and its use (Fig. ). One nice property of a Systolic Algorithm is that each processor communicates only with a few other processors. It is thus suitable for implementation on a cluster of computers in which we seek to avoid costly global communication operations. Systolic Algorithms require preservation of data ordering within a stream but do not require that streams of data proceed in lockstep as they would have in the earliest hardware implementations of Systolic Arrays. A Systolic Algorithm will make multiple uses of each input data item, make extensive use of concurrency, rely on only a few simple cell types, and utilize simple and regular data and control flows. Bottlenecks to computational speedups are often caused by limited system memory bandwidths, known as von Neumann bottlenecks, rather than limited processing capabilities (Fig. ).
Systolic Arrays. Fig. Systolic array implementation for a convolution product: four cells, preloaded with weights w3, w2, w1, w0 from left to right, receive the x stream (x0, x1, x2, x3, . . . , interleaved with blanks) together with a y stream initialized to zero; each cell computes xout = xin and yout = yin + wlocal · xin, so that the array emits yi = w0 xi + w1 xi+1 + w2 xi+2 + w3 xi+3, one result every two cycles
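The data flow summarized in the figure can be simulated cycle by cycle, as in the following sketch with assumed weights and inputs (it is an illustration of the technique, not code from any machine described here): x values, interleaved with blanks, flow left to right, zero-initialized partial sums y flow right to left, each cell computes yout = yin + wlocal · xin and forwards x unchanged, and a finished yi leaves the leftmost cell every two cycles.

public class SystolicConvolution {
    public static void main(String[] args) {
        double[] w = {1, 2, 3, 4};                 // assumed taps w0..w3 (cell c holds w[k-1-c])
        double[] x = {1, 1, 2, 3, 5, 8, 13, 21};   // assumed input samples
        int k = w.length;
        double[] xin = new double[k];              // x value arriving at each cell this cycle
        double[] yin = new double[k];              // y value arriving at each cell this cycle
        for (int t = 0; t < 2 * x.length + 2 * k; t++) {
            double[] xout = new double[k];
            double[] yout = new double[k];
            for (int c = 0; c < k; c++) {          // every cell computes in every cycle
                xout[c] = xin[c];                  // x_out = x_in
                yout[c] = yin[c] + w[k - 1 - c] * xin[c];   // y_out = y_in + w_local * x_in
            }
            if (t >= 2 * k - 1 && (t - (2 * k - 1)) % 2 == 0) {
                int i = (t - (2 * k - 1)) / 2;     // real results leave cell 0 every two cycles
                if (i <= x.length - k) {
                    System.out.println("y" + i + " = " + yout[0]);
                }
            }
            for (int c = k - 1; c > 0; c--) {      // x stream shifts one cell to the right
                xin[c] = xout[c - 1];
            }
            xin[0] = (t % 2 == 0 && t / 2 < x.length) ? x[t / 2] : 0.0;  // blanks between samples
            for (int c = 0; c < k - 1; c++) {      // y stream shifts one cell to the left
                yin[c] = yout[c + 1];
            }
            yin[k - 1] = 0.0;                      // fresh zero-valued partial sums enter on the right
        }
    }
}

For the assumed values the first line printed is y0 = 21, matching w0·x0 + w1·x1 + w2·x2 + w3·x3 = 1 + 2 + 6 + 12.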
Systolic Arrays. Fig. Systolic array implementation for a -point FFT: eleven cells (Cell 0 through Cell 10) alternate radix-4 butterfly computation with radix-4 data reordering stages (reordering over 256, 64, 16, 4, and 1 elements)
Systolic Algorithms have been shown to have wide applicability. They utilize a computational model for a wide range of parallel processing structures not limited to the initially targeted special-purpose needs, and which offer efficient operations on parallel computers in general, not just special-purpose designs. For instance, linear algebra algorithms and FFT algorithms are computationally demanding, especially when high throughput rates are expected, and they display a high degree of regularity. These algorithms are ideal candidates for parallel implementation via Systolic Algorithms.
Systolic Arrays. Fig. -D convolution systolic system cell architecture (latched inputs feeding a multiplier and adder, with a local register, producing the output)
Systolic Array Machines
The Colossus Mark II, built in , has been cited as the first digital computer known to have used a technique similar to Systolic Arrays. In , H. T. Kung and Charles E. Leiserson published their concept of Systolic Arrays. They made systematic use of ideas that were already present in older architectural paradigms in order to respond to the new challenges posed by VLSI technology. This sparked much interest and many designs for Systolic Arrays. While there have been many designs, only a handful have been actually built and put into use. Here is a brief review of key systems in chronological order. The earliest Systolic Arrays were introduced as a straightforward way to create scalable, balanced, systems for special-purpose computations, by directly embodying a systolic algorithm in the hardware design. These early Systolic Arrays were very limited in applications because the algorithm was expressed in efficient and special-purpose designs. Changing an algorithm implemented as a Systolic Array may be difficult or impossible without redesigning the hardware, particularly the communication interconnects. More general processors used in parallel have been shown to be able to perform a broad range of applications, but are often too large, too slow, or too expensive due to inefficiencies imposed by communication overhead.
ESL Systolic Processor
A Systolic Array to compute convolutions and other signal processing computations was designed and implemented at ESL and operational by . Seven nodes fit on a board measuring cm by cm using discrete high-performance components, with each node capable of ten million operations per second. Up to five boards could be linked together in a system, which was attached to a VAX / host machine. Accessing the capabilities of the Systolic Array from a high-level language was a key innovation of this Systolic Array. A Fortran program viewed these capabilities as a function to call with information on the kernel to execute, the input data, and where to store the output data. The ESL systolic processor was able to greatly expand the application domain for which it was suited by including local memory to hold more kernel elements and a control unit to create addresses to access data from memory. This systolic system could perform -D and -D convolutions, matrix multiplication, and Fourier and cosine transforms (Fig. ).
Systolic Convolution Chip The systolic convolution chip was created at CMU in to solve finite-sized two-dimensional convolutions. All nodes in the system performed the same operation while data flowed through the system in a completely regular synchronous manner (Fig. ).
NOSC Systolic Array Test Bed A programmable Systolic Array, formed using standard off-the-shelf Intel microprocessors, was built by the Naval Ocean Systems Center (NOSC) from to . The microprocessor operated as the control unit, while a separate arithmetic unit on the board consumed and output data. The arithmetic processor was directly connected to the communication ports so as to tightly link computations and communications. The NOSC test bed contained nodes arranged in an × grid. Each node fit on a × -cm board. Each node in the system was connected to five other nodes.
Systolic Arrays. Fig.  NOSC test bed cell architecture (controller: Intel 8051 microprocessor)

Systolic Arrays. Fig.  ESL systolic cell architecture
topology allowed direct implementation of several key Systolic Algorithms, such as matrix multiplication by a hexagonal array. All programming was done in assembly language. Node performance was about  KFLOPS, for a total performance of  MFLOPS for the system. The machine was a test bed for software development, and the design could not extend beyond  nodes. NOSC research into Systolic Arrays spanned a number of machines from  to , including the Systolic Array Processor (SAP), the Systolic Linear Algebra Parallel Processor (SLAPP), the Video Analysis Transputer Array (VATA), and the High-Speed Systolic Array Processor (HISSAP) test bed. Subsequent algorithm development moved to iWarp and on to general-purpose machines (Fig. ).
Programmable Systolic Chip (PSC)
A programmable Systolic Array chip designed at CMU led to a functional nine-node system in . The architecture of the PSC was similar to that of the NOSC Systolic Array but was organized around three buses instead of one, and cells had three input and output ports to attach to other cells. All programming was in assembly language. The machine was used for low-level image processing computations (Fig. ).
Geometric Arithmetic Processor (GAPP)
The NCR GAPP programmable Systolic Array is a mesh-connected array of single-bit cells, each of which communicates directly with its neighbors to the North, East, South, and West. It was first implemented as a medium-scale integration (MSI) breadboard in  for a  ×  cell array. GAPP I was a PLA-based  ×  cell chip, GAPP II was a  ×  cell chip in  µm CMOS, and finally a version in two-micron CMOS clocked at  MHz, with  MB/s input and  MB/s output bandwidth, was sold as the NCRCG and could perform  million eight-bit additions per second. The use of one-bit data paths and one-bit registers minimizes the size of a single cell; local memory on a node was only  bits. Long instruction words were broadcast to control the machine in pure lockstep SIMD. Programming was originally done in STOIC, a variant of Forth, but the later development of an "Ada-like" language compiler eased programming and led to important libraries of commonly used functions. VAX, IBM PC-AT, and Sun workstation host support existed. Military and space systems leveraged GAPP for image processing and built the largest arrays of processing elements of their generation ( processing cells in one system reported in ) (Fig. ).
Systolic Arrays. Fig.  PSC cell architecture

Systolic Arrays. Fig.  Geometric arithmetic cell architecture

Warp and iWarp
The Warp project (–) was a series of increasingly general-purpose programmable Systolic Array systems and related software, created by Carnegie Mellon University (CMU) and developed in conjunction with industrial partners G.E., Honeywell, and Intel, with funding from the US Defense Advanced Research Projects Agency (DARPA) (see Warp and iWarp). Warp was a highly programmable Systolic Array computer with a linear array of ten or more cells, each capable of performing ten million single-precision floating-point operations per second (10 MFLOPS). A ten-cell machine had a peak performance of 100 MFLOPS. The iWarp machines doubled this performance, delivering 20 MFLOPS single-precision per cell and supporting double-precision floating point at half that rate.
iWarp was based on a full-custom VLSI component integrating a  transistor LIW microprocessor, a network interface, and a switching node into one single chip of  ×  cm of silicon. The processor dissipated up to  watts and was packaged in a ceramic pin grid array with  pins. Intel marketed the iWarp with the tag line "Building Blocks for GigaFLOPs." The standard iWarp machine configuration arranged iWarp nodes in an m × n torus; all iWarp machines included the "back edges" and, therefore, were tori. Warp and iWarp were programmed using high-level languages and domain-specific program generators. About  Warp machines were built, and more than  iWarp processors were manufactured. Warp and iWarp handled a large number of image and signal processing algorithms (Figs.  and ).
Future Directions
Faced with relatively little growth in transistor speeds coupled with ever-expanding transistor densities in the late s and early s, the use of transistors for many forms of parallelism, from multiple cells to LIW, was urgently explored. Investigations in Systolic Arrays gave rise to fruitful exploration of Systolic Algorithms. Transistor speeds did finally explode, which helped divert interest from parallelism, including Systolic Arrays and Systolic Algorithms. By , clock speed gains had once again dramatically slowed, while transistor densities continued growing in accordance with Moore's Law, much as they had at the dawn of VLSI. Hardware parallelism, this time in the form of multicore processors, and parallel programming have
once again found widespread interest. Single-processor designs have become the cells in multicore and many-core processor designs, and interconnect debates are being revisited. Interconnection of many processor cores faces the same challenges Systolic Arrays worked to solve the first time the industry found transistor density on the rise without transistor performance moving nearly as quickly, and Systolic Algorithms proved their usefulness through programmability.

Systolic Arrays. Fig.  PC-Warp cell architecture

Systolic Arrays. Fig.  iWarp cell architecture

Related Entries
VLIW Processors
Warp and iWarp
Bibliographic Notes and Further Reading
Application-specific solutions, like the early Systolic Arrays, continue today in the annual IEEE International Conference on Application-specific Systems, Architectures and Processors. This conference traces its origins back to the International Workshop on Systolic Arrays, first organized in . It later developed into the International Conference on Application-Specific
Array Processors. Under its current title, it was organized for the first time in Chicago, USA, in . Systolic Arrays were first described in  by H. T. Kung and Charles E. Leiserson [, ]. VLSI designers took serious note of the concept and the design principles [, ]. Digital systems operating in a synchronous fashion under a central clock predated the description of Systolic Arrays and VLSI by many years; the earliest such machine has been cited as the Colossus [], built in . Additional reading about actual realizations of Systolic Arrays is available [, –].
Bibliography
Cloud EL () Frontiers of massively parallel computation. In: Proceedings of the 2nd symposium on the frontiers of massively parallel computation, Fairfax, VA, pp –
Cragon HG () From fish to colossus: how the German Lorenz cipher was broken at Bletchley Park. Cragon Books, Dallas, ISBN ---
Frank GA, Greenawalt EM, Kulkarni AV () A systolic processor for signal processing. In: Proceedings of AFIPS ', June –. ACM, New York, pp –
Fisher AL, Kung HT, Monier LM, Dohi Y () Architecture of the PSC: a programmable systolic chip. In: Proceedings of the th annual international symposium on computer architecture (ISCA '), Stockholm, Sweden. ACM, New York, pp –
Gross T, O'Hallaron DR () iWarp: anatomy of a parallel computing system. MIT Press, Cambridge, MA
Kung HT, Leiserson CE () Systolic arrays (for VLSI). In: Duff IS, Stewart GW (eds) Sparse matrix proceedings. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, pp –
Kung HT, Song SW () A systolic 2-D convolution chip. In: Proceedings of the IEEE computer society workshop on computer architecture for pattern analysis and image database management, – Nov, Hot Springs, Virginia, pp –
Kung HT () Why systolic architectures? IEEE Comput ():–
Kung SY () VLSI array processors. Prentice Hall, Upper Saddle River
Mead C, Conway L () Highly concurrent systems. In: Introduction to VLSI systems. Addison-Wesley series in computer science, Addison-Wesley, Menlo Park, CA, pp –
Tirpak FM Jr () Software development on the high-speed systolic array processor (HISSAP): lessons learned. Technical report. Naval Ocean Systems Center, San Diego, CA
Task Graph Scheduling
Yves Robert
Ecole Normale Supérieure de Lyon, France
Synonyms
DAG scheduling; Workflow scheduling
Definition
Task Graph Scheduling is the activity that consists in mapping a task graph onto a target platform. The task graph represents the application: Nodes denote computational tasks, and edges model precedence constraints between tasks. For each task, an assignment (choose the processor that will execute the task) and a schedule (decide when to start the execution) are determined. The goal is to obtain an efficient execution of the application, which translates into optimizing some objective function, most usually the total execution time.
Discussion
Introduction
Task Graph Scheduling is the activity that consists in mapping a task graph onto a target platform. The task graph is given as input to the scheduler. Hence, scheduling algorithms are completely independent of the models and methods used to derive task graphs. However, it is insightful to start with a discussion of how these task graphs are constructed. Consider an application that is decomposed into a set of computational entities, called tasks. These tasks are linked by precedence constraints. For instance, if some task T produces some data that is used (read) by another task T′, then the execution of T′ cannot start before the completion of T. It is therefore natural to represent the application as a task graph: The task graph is a DAG (Directed Acyclic Graph), whose nodes are the
tasks and whose edges are the precedence constraints between tasks. The decomposition of the application into tasks is given to the scheduler as input. Note that the task graph may be directly provided by the user, but it can also be determined by some parallelizing compiler from the application program. Consider the following algorithm to solve the linear system Ax = b, where A is an n × n nonsingular lower triangular matrix and b is a vector with n components:

for i = 1 to n do
    Task Ti,i : xi ← bi / ai,i
    for j = i + 1 to n do
        Task Ti,j : bj ← bj − aj,i × xi
    end
end

For a given value of i, 1 ⩽ i ⩽ n, each task Ti,∗ represents some computations executed during the ith iteration of the external loop. The computation of xi is performed first (task Ti,i). Then, each component bj of vector b such that j > i is updated (task Ti,j). In the original sequential program, there is a total precedence order between tasks. Write T <seq T′ if task T is executed before task T′ in the sequential code. Then:

T1,1 <seq T1,2 <seq T1,3 <seq ⋯ <seq T1,n <seq T2,2 <seq T2,3 <seq ⋯ <seq Tn,n .

However, there are independent tasks that can be executed in parallel. Intuitively, independent tasks are tasks whose execution orders can be interchanged without modifying the result of the program execution. A necessary condition for tasks to be independent is that they do not update the same variable. They can read the same value, but they cannot write into the same memory location (otherwise there would be a race condition and the result would be nondeterministic). For instance,
tasks T1,2 and T1,3 both read x1 but modify distinct components of b, hence they are independent. This notion of independence can be expressed more formally. Each task T has an input set In(T) (read values) and an output set Out(T) (written values). In the example, In(Ti,i) = {bi, ai,i} and Out(Ti,i) = {xi}. For j > i, In(Ti,j) = {bj, aj,i, xi} and Out(Ti,j) = {bj}. Two tasks T and T′ are not independent (they are said to be dependent) if they share some written variable:
T and T′ are dependent ⇔ In(T) ∩ Out(T′) ≠ ∅, or Out(T) ∩ In(T′) ≠ ∅, or Out(T) ∩ Out(T′) ≠ ∅.

For instance, tasks T1,1 and T1,2 are not independent because Out(T1,1) ∩ In(T1,2) = {x1}; therefore T1,1 and T1,2 are dependent. Similarly, Out(T1,3) ∩ Out(T2,3) = {b3}, and hence T1,3 and T2,3 are not independent. Given the dependence relation, a partial order ≺ can be extracted from the total order <seq induced by the sequential execution of the program. If two tasks T and T′ are dependent, they are ordered according to the sequential execution: T ≺ T′ if T and T′ are dependent and T <seq T′. The precedence relation ≺ represents the dependences that must be satisfied to preserve the semantics of the original program; if T ≺ T′, then T was executed before T′ in the sequential code, and it has to be executed before T′ even if there are infinitely many resources, because T and T′ share a written variable. In terms of order relations, ≺ is defined more accurately as the transitive closure of the intersection of the dependence relation and <seq, and it captures the intrinsic sequentiality of the original program. Note that the transitive closure is needed to track dependence chains. In the example, T1,1 and T1,2 are dependent, and T1,2 and T2,2 are dependent, hence there is a path of dependences from T1,1 to T2,2, even though T1,1 and T2,2 are not directly dependent. A directed graph is drawn to represent the dependence constraints that need to be enforced. The vertices of the graph denote the tasks, while the edges express the dependence constraints. An edge e : T → T′ in the graph means that the execution of T′ must begin only after the end of the execution of T, whatever the number of available processors. Transitivity edges are not drawn, as they represent redundant information; only predecessor edges are shown. T is a predecessor of T′ if
Task Graph Scheduling. Fig.  Task graph for the triangular system (n = 6)
T ≺ T′ and if there is no task T′′ in between, that is, such that T ≺ T′′ and T′′ ≺ T′. In the example, the predecessor relationships are as follows (see Fig. ):
● Ti,i ≺ Ti,j for 1 ⩽ i < j ⩽ n (the computation of xi must be done before updating bj at step i of the outer loop).
● Ti,j ≺ Ti+1,j for 1 ⩽ i < j ⩽ n (updating bj at step i of the outer loop is done before reading it at step i + 1).
In summary, this example shows how an application program can be decomposed into a task graph, either manually by the user, or with the help of a parallelizing compiler.
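The construction can be made concrete with a short sketch (illustrative Python; the task and variable names are hypothetical and simply mirror the notation above). It builds the tasks of the triangular solver, their In/Out sets, and the predecessor edges just listed, and checks that every edge indeed connects two dependent tasks:

def triangular_task_graph(n):
    tasks = {}
    for i in range(1, n + 1):
        tasks[(i, i)] = {"In": {f"b{i}", f"a{i},{i}"}, "Out": {f"x{i}"}}
        for j in range(i + 1, n + 1):
            tasks[(i, j)] = {"In": {f"b{j}", f"a{j},{i}", f"x{i}"},
                             "Out": {f"b{j}"}}

    def dependent(T, U):
        # Two tasks are dependent when they share a written variable.
        return (tasks[T]["In"] & tasks[U]["Out"]
                or tasks[T]["Out"] & tasks[U]["In"]
                or tasks[T]["Out"] & tasks[U]["Out"])

    # Predecessor edges: Ti,i -> Ti,j and Ti,j -> Ti+1,j for 1 <= i < j <= n.
    edges = [((i, i), (i, j)) for i in range(1, n + 1)
             for j in range(i + 1, n + 1)]
    edges += [((i, j), (i + 1, j)) for i in range(1, n)
              for j in range(i + 1, n + 1)]
    assert all(dependent(u, v) for u, v in edges)
    return tasks, edges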
Fundamental Results
Traditional scheduling assumes that the target platform is a set of p identical processors, and that no communication cost is paid. Fundamental results are presented in this section, but only two proofs are provided, that of Theorem , an easy result on the efficiency of a schedule, and that of Theorem , Graham's bound on list scheduling.
Definitions
Definition  A task graph is a directed acyclic vertex-weighted graph G = (V, E, w), where:
● The set V of vertices represents the tasks (note that V is finite).
● The set E of edges represents precedence constraints between tasks: e = (u, v) ∈ E if and only if u ≺ v.
● The weight function w : V → N∗ gives the weight (or duration) of each task. Task weights are assumed to be positive integers.
For the triangular system (Fig. ), it can be assumed that all tasks have equal weight: w(Ti,j) = 1 for 1 ⩽ i ⩽ j ⩽ n. On the contrary, a division could be considered more costly than a multiply-add, leading to a larger weight for the diagonal tasks Ti,i. A schedule σ of a task graph is a function that assigns a start time to each task.
Definition  A schedule of a task graph G = (V, E, w) is a function σ : V → N∗ such that σ(u) + w(u) ⩽ σ(v) whenever e = (u, v) ∈ E.
In other words, a schedule must preserve the dependence constraints induced by the precedence relation ≺ and embodied by the edges of the dependence graph; if u ≺ v, then the execution of u begins at time σ(u) and requires w(u) units of time, and the execution of v at time σ(v) must start after the end of the execution of u. Obviously, if there were a cycle in the task graph, no schedule could exist, hence the restriction to acyclic graphs (DAGs). There are other constraints that must be met by schedules, namely, resource constraints. When there is an infinite number of processors (in fact, when there are as many processors as tasks), the problem is with unlimited processors, and is denoted Pb(∞). When there is only a fixed number p of available processors, the problem is with limited processors, and is denoted Pb(p). In the latter case, an allocation function alloc : V → P is required, where P = {1, . . . , p} denotes the set of available processors. This function assigns a target processor to each task. The resource constraints simply specify that no
processor can be allocated more than one task at the same time. This translates into the following condition:

alloc(T) = alloc(T′) ⇒ σ(T) + w(T) ⩽ σ(T′) or σ(T′) + w(T′) ⩽ σ(T).
This condition expresses the fact that if two tasks T and T′ are allocated to the same processor, then their executions cannot overlap in time.
Definition  Let G = (V, E, w) be a task graph.
1. Let σ be a schedule for G. Assume σ uses at most p processors (let p = ∞ if the processors are unlimited). The makespan MS(σ, p) of σ is its total execution time:
MS(σ, p) = maxv∈V {σ(v) + w(v)} − minv∈V {σ(v)}.
2. Pb(p) is the problem of determining a schedule σ of minimal makespan MS(σ, p) assuming p processors (let p = ∞ if the processors are unlimited). Let MSopt(p) be the value of the makespan of an optimal schedule with p processors:
MSopt(p) = minσ MS(σ, p).
If the first task is scheduled at time 0, which is a common assumption, the expression of the makespan reduces to MS(σ, p) = maxv∈V {σ(v) + w(v)}. Weights extend to paths in G as usual; if Φ = (T1, T2, . . . , Tn) denotes a path in G, then w(Φ) = ∑i=1,...,n w(Ti). Because schedules respect dependences, the following easy bound on the makespan is readily obtained:
Proposition  Let G = (V, E, w) be a task graph and σ a schedule for G with p processors. Then, MS(σ, p) ⩾ w(Φ) for all paths Φ in G.
The last definition introduces the notions of speedup and efficiency for schedules.
Definition  Let G = (V, E, w) be a task graph and σ a schedule for G with p processors:
1. The speedup is the ratio s(σ, p) = Seq/MS(σ, p), where Seq = ∑v∈V w(v) is the sum of all task weights (Seq is the optimal execution time MSopt(1) of a schedule with a single processor).
2. The efficiency is the ratio e(σ, p) = s(σ, p)/p = Seq/(p × MS(σ, p)).
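These definitions translate directly into code. The sketch below (illustrative; a schedule is assumed to be given as a dictionary of start times and the weights as a dictionary indexed by the same task names) computes the makespan, speedup, and efficiency:

def makespan(sigma, w):
    return max(sigma[v] + w[v] for v in sigma) - min(sigma.values())

def speedup(sigma, w):
    seq = sum(w.values())             # Seq = MSopt(1)
    return seq / makespan(sigma, w)

def efficiency(sigma, w, p):
    return speedup(sigma, w) / p      # always between 0 and 1

# Example: two independent unit-weight tasks started together on 2 processors.
w = {"T1": 1, "T2": 1}
sigma = {"T1": 0, "T2": 0}
print(makespan(sigma, w), speedup(sigma, w), efficiency(sigma, w, 2))  # 1 2.0 1.0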
Task Graph Scheduling. Fig. Active and idle processors during execution
Theorem  Let G = (V, E, w) be a task graph. For any schedule σ with p processors, 0 ⩽ e(σ, p) ⩽ 1.
Proof Consider the execution of σ as illustrated in Fig.  (this is a fictitious example, not related to the triangular system example). At any time during the execution, some processors are active, and some are idle. At the end, all tasks have been processed. Let Idle denote the cumulated idle time of the p processors during the whole execution. Because Seq is the sum of all task weights, the quantity Seq + Idle is equal to the area of the rectangle in Fig. , that is, the product of the number of processors by the makespan of the schedule: Seq + Idle = p × MS(σ, p). Hence, e(σ, p) = Seq/(p × MS(σ, p)) ⩽ 1.
Solving Pb(∞)
Let G = (V, E, w) be a given task graph and assume unlimited processors. Remember that a schedule σ for G is said to be optimal if its makespan MS(σ, ∞) is minimal, that is, if MS(σ, ∞) = MSopt(∞).
Definition  Let G = (V, E, w) be a task graph.
1. For v ∈ V, PRED(v) denotes the set of all immediate predecessors of v, and SUCC(v) the set of all its immediate successors.
2. v ∈ V is an entry (top) vertex if and only if PRED(v) = ∅.
3. v ∈ V is an exit (bottom) vertex if and only if SUCC(v) = ∅.
4. For v ∈ V, the top level tl(v) is the largest weight of a path from an entry vertex to v, excluding the weight of v.
5. For v ∈ V, the bottom level bl(v) is the largest weight of a path from v to an exit vertex, including the weight of v.
In the example of the triangular system, there is a single entry vertex, T1,1, and a single exit vertex, Tn,n. The top level of T1,1 is 0, and tl(T1,2) = tl(T1,1) + w(T1,1) = 1. The top level of T2,3 is tl(T2,3) = max{w(T1,1) + w(T1,2) + w(T2,2), w(T1,1) + w(T1,3)} = 3, because there are two paths from the entry vertex to T2,3. The top level of a vertex can be computed by a traversal of the DAG; the top level of an entry vertex is 0, while the top level of a non-entry vertex v is tl(v) = max{tl(u) + w(u); u ∈ PRED(v)}. Similarly, bl(v) = max{bl(u); u ∈ SUCC(v)} + w(v) (and bl(v) = w(v) for an exit vertex v). The top level of a vertex is the earliest possible time at which it can be executed, while its bottom level represents a lower bound on the remaining execution time once its execution starts. This can be stated more formally as follows.
Theorem  Let G = (V, E, w) be a task graph and define σfree as follows: ∀v ∈ V, σfree(v) = tl(v). Then, σfree is an optimal schedule for G.
From Theorem :
MSopt(∞) = MS(σfree, ∞) = maxv∈V {tl(v) + w(v)}.
Hence, MSopt(∞) is simply the maximal weight of a path in the graph. Note that σfree is not the only optimal schedule.
Corollary  Let G = (V, E, w) be a directed acyclic graph. Pb(∞) can be solved in time O(∣V∣ + ∣E∣).
Going back to the triangular system (Fig. ), because all tasks have weight 1, the weight of a path is equal to its length plus 1. The longest path is T1,1 → T1,2 → T2,2 → ⋯ → Tn−1,n−1 → Tn−1,n → Tn,n, whose weight is 2n − 1. Not as many processors as tasks are needed to achieve execution within 2n − 1 time units. For example, only n − 1 processors can be used. Let 1 ⩽ i ⩽ n; at time 2i − 2, processor P1 starts the execution of task Ti,i, while at time 2i − 1, the first n − i processors P1, P2, . . . , Pn−i execute the tasks Ti,j, i + 1 ⩽ j ⩽ n.
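The computation of the top levels and of σfree by a graph traversal can be sketched as follows (illustrative code; the task graph is assumed to be given as a dictionary of weights and a list of precedence edges):

from collections import defaultdict, deque

def asap_schedule(w, edges):
    succ = defaultdict(list)
    indeg = {v: 0 for v in w}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    tl = {v: 0 for v in w}
    ready = deque(v for v in w if indeg[v] == 0)     # entry vertices
    while ready:
        u = ready.popleft()
        for v in succ[u]:
            tl[v] = max(tl[v], tl[u] + w[u])         # tl(v) = max tl(u) + w(u)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    ms_opt_inf = max(tl[v] + w[v] for v in w)        # MSopt(infinity)
    return tl, ms_opt_inf

Scheduling each task v at time tl(v) gives σfree, and the returned value is the maximal weight of a path in the graph.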
NP-completeness of Pb(p)
Definition  The decision problem Dec(p) associated with Pb(p) is as follows. Given a task graph G = (V, E, w), a number of processors p ⩾ 1, and an execution bound K ∈ N∗, does there exist a schedule σ for G using at most p processors, such that MS(σ, p) ⩽ K? The restriction of Dec(p) to independent tasks (no dependence, that is, when E = ∅) is denoted Indep-tasks(p). In both problems, p is arbitrary (it is part of the problem instance). When p is fixed a priori, say p = 2, the problems are denoted Dec(2) and Indep-tasks(2). Well-known complexity results are summarized in the following theorem.
Theorem 
● Indep-tasks(2) is NP-complete but can be solved by a pseudo-polynomial algorithm. Moreover, ∀ε > 0, Indep-tasks(2) admits a (1 + ε)-approximation whose complexity is polynomial in 1/ε.
● Indep-tasks(p) is NP-complete in the strong sense.
● Dec(2) (and hence Dec(p)) is NP-complete in the strong sense.
List Scheduling Heuristics
Because Pb(p) is NP-complete, heuristics are used to schedule task graphs with limited processors. The most natural idea is to use greedy strategies: At each instant, try to schedule as many tasks as possible onto available processors. Such strategies, which decide never to deliberately keep a processor idle, are called list scheduling algorithms. Of course, there are different possible strategies to decide which tasks are given priority in the (frequent) case where there are more free tasks than available processors. But a key result due to Graham [] is that any list algorithm can be shown to achieve at most twice the optimal makespan.
Definition  Let G = (V, E, w) be a task graph and let σ be a schedule for G. A task v ∈ V is free at time t (noted v ∈ FREE(σ, t)) if and only if its execution has not yet started (σ(v) ⩾ t) but all its predecessors have been executed (∀u ∈ PRED(v), σ(u) + w(u) ⩽ t). A list schedule is a schedule such that no processor is deliberately left idle; at each time t, if ∣FREE(σ, t)∣ = r ⩾ 1 and q processors are available, then min(r, q) free tasks start executing.
Theorem  Let G = (V, E, w) be a task graph and assume there are p available processors. Let σ be any list schedule of G. Let MSopt(p) be the makespan of an optimal schedule. Then,
MS(σ, p) ⩽ (2 − 1/p) MSopt(p).
It is important to point out that Theorem  holds for any list schedule, regardless of the strategy used to choose among free tasks when there are more free tasks than available processors.
Lemma  There exists a dependence path Φ in G whose weight w(Φ) satisfies Idle ⩽ (p − 1) × w(Φ), where Idle is the cumulated idle time of the p processors during the whole execution of the list schedule.
Proof The ancestors of a task are defined as its predecessors, the predecessors of its predecessors, and so on. Let Ti1 be a task whose execution terminates at the end of the schedule: σ(Ti1) + w(Ti1) = MS(σ, p).
Let t1 be the largest time smaller than σ(Ti1) such that there exists an idle processor during the time interval [t1, t1 + 1] (let t1 = 0 if no such time exists).
Why is this processor idle? Because σ is a list schedule, no task is free at t1; otherwise the idle processor would start executing a free task. Therefore, there must be a task Ti2 that is an ancestor of Ti1 and that is being executed at time t1; otherwise Ti1 would have been started at time t1 by the idle processor. Because of the definition of t1, it is known that all processors are active between the end of the execution of Ti2 and the beginning of the execution of Ti1. Then, start the construction again from Ti2 so as to obtain a task Ti3 such that all processors are active between the end of Ti3 and the beginning of Ti2. Iterating the process, one ends up with r tasks Tir, Tir−1, . . . , Ti1 that belong to a dependence path Φ of G and such that all processors are active except perhaps during their execution. In other words, the idleness of some processors can only occur during the execution of these r tasks, during which at least one processor is active (the one that executes the task). Hence, Idle ⩽ (p − 1) × ∑j=1,...,r w(Tij) ⩽ (p − 1) × w(Φ).
Proof Going back to the proof of Theorem , recall that p × MS(σ, p) = Idle + Seq, where Seq = ∑v∈V w(v) is the sequential time, that is, the sum of all task weights (see Fig. ). Now take the dependence path Φ constructed in Lemma : w(Φ) ⩽ MSopt(p), because the makespan of any schedule is greater than the weight of all dependence paths in G (simply because dependence constraints are met). Furthermore, Seq ⩽ p × MSopt(p) (with equality only if all p processors are active all the time). Putting this together:
p × MS(σ, p) = Idle + Seq ⩽ (p − 1)w(Φ) + Seq ⩽ (p − 1)MSopt(p) + pMSopt(p) = (2p − 1)MSopt(p),
which proves the theorem.
Fundamentally, Theorem  says that any list schedule is within 50% of the optimum. Therefore, list scheduling is guaranteed to achieve at least half the best possible performance, regardless of the strategy used to choose among free tasks.
Proposition  Let MSlist(p) be the shortest possible makespan produced by a list scheduling algorithm. The bound
MSlist(p) ⩽ ((2p − 1)/p) MSopt(p)
is tight.
Note that implementing a list scheduling algorithm is not difficult, but it is somewhat lengthy to describe in full detail; see Casanova et al. [].

Critical Path Scheduling
A widely used list scheduling technique is critical path scheduling. The selection criterion for free tasks is based on the value of their bottom level. Intuitively, the larger the bottom level, the more "urgent" the task. The critical path of a task is defined as its bottom level and is used to assign priority levels to tasks. Critical path scheduling is list scheduling where the priority level of a task is given by the value of its critical path. Ties are broken arbitrarily. Consider the task graph shown in Fig. . There are eight tasks, whose weights and critical paths are listed in Table . Assume there are p = 3 available processors and let Q be the priority queue of free tasks. At t = , Q is initialized as Q = (T , T , T ). These three tasks are executed. At t = , T is added to the queue: Q = (T ). There is one processor available, which starts the execution of T . At t = , the four successors of T are added to the queue: Q = (T , T , T , T ). Note that ties have been broken arbitrarily (using task indices in this case). The available processor picks the first task T in Q. Following this scheme, the execution goes on up to t = , as summarized in Fig. .
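A compact sketch of such a critical-path list scheduler is given below (illustrative code only: it computes bottom levels and, at each event time, starts the free tasks with the largest bottom levels on idle processors; task identifiers are assumed to be comparable values such as strings, and ties are broken arbitrarily):

import heapq

def bottom_levels(w, edges):
    succ = {v: [] for v in w}
    for u, v in edges:
        succ[u].append(v)
    bl = {}
    def rec(v):
        if v not in bl:
            bl[v] = w[v] + max((rec(s) for s in succ[v]), default=0)
        return bl[v]
    for v in w:
        rec(v)
    return bl

def critical_path_schedule(w, edges, p):
    bl = bottom_levels(w, edges)
    pred = {v: set() for v in w}
    succ = {v: [] for v in w}
    for u, v in edges:
        pred[v].add(u)
        succ[u].append(v)
    sigma, finish = {}, {}
    proc_free = [0] * p                 # time at which each processor becomes idle
    running = []                        # heap of (completion time, task)
    free = [v for v in w if not pred[v]]
    t = 0
    while len(finish) < len(w):
        free.sort(key=lambda v: -bl[v])       # largest critical path first
        for q in range(p):
            if proc_free[q] <= t and free:
                v = free.pop(0)
                sigma[v] = t
                proc_free[q] = t + w[v]
                heapq.heappush(running, (proc_free[q], v))
        t, v = heapq.heappop(running)         # advance to the next completion
        finish[v] = t
        for s in succ[v]:
            if all(u in finish for u in pred[s]) and s not in sigma:
                free.append(s)
    return sigma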
Task Graph Scheduling. Fig.  A small example

Task Graph Scheduling. Table  Weights and critical paths of the tasks T1, . . . , T8 of the task graph in Fig. 
Task Graph Scheduling. Fig. Critical path schedule for the example in Fig.
Task Graph Scheduling. Fig.  Optimal schedule for the example in Fig. 

Note that it is possible to schedule the graph in only  time units, as shown in Fig. . The trick is to deliberately leave a processor idle at time t = ; although it has the highest critical path, T can be delayed by two time units. T and T are given preference to achieve a better load balance between processors. The schedule shown in Fig.  is optimal, because Seq = , so that three processors require at least ⌈ ⌉ =  time units. This small example illustrates the difficulty of scheduling with a limited number of processors.

Taking Communication Costs into Account
The Macro-Dataflow Model
Communication costs were introduced in the scheduling literature some thirty years ago. Because the performance of network communication is difficult to model in a way that is both precise and conducive to understanding the performance of algorithms, the vast majority of results hold for a very simple model, which is as follows. The target platform consists of p identical processors forming a fully connected clique. All interconnection links have the same bandwidth. If a task T communicates data to a successor task T′, the cost is modeled as

cost(T, T′) = 0 if alloc(T) = alloc(T′), and cost(T, T′) = c(T, T′) otherwise,

where alloc(T) denotes the processor that executes task T, and c(T, T′) is defined by the application specification. The time for communication between two tasks running on the same processor is negligible. This so-called macro-dataflow model makes two main assumptions: (i) communication can occur as soon as data are available; and (ii) there is no contention for network links. Assumption (i) is reasonable, as communication can overlap with (independent) computations in most modern computers. Assumption (ii) is much more questionable. Indeed, there is no physical device capable of sending, say,  messages to  distinct processors at the same speed as if there were a single message. In the worst case, it would take  times longer (serializing all messages). In the best case, the output bandwidth of the network card of the sender would be a limiting factor. In other words, assumption (ii) amounts to assuming infinite network resources. Nevertheless, this assumption is omnipresent in the traditional scheduling literature.
Definition  A communication task graph (or commTG) is a directed acyclic graph G = (V, E, w, c), where vertices represent tasks and edges represent precedence constraints. The computation weight function is w : V → N∗ and the communication cost function is c : E → N∗. A schedule σ must preserve dependences, which is written as

∀e = (T, T′) ∈ E:  σ(T) + w(T) ⩽ σ(T′) if alloc(T) = alloc(T′), and σ(T) + w(T) + c(T, T′) ⩽ σ(T′) otherwise.

The expression of the resource constraints is the same as in the no-communication case.
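Under this model, the earliest possible completion time induced by a given allocation can be evaluated by charging c(T, T′) only on edges whose endpoints are mapped to different processors, as in the following sketch (illustrative code; it ignores contention among tasks mapped to the same processor, so it is exact only when no two tasks share a processor, as in the fully parallel allocation discussed below):

def macro_dataflow_longest_path(w, c, alloc):
    # w: task -> weight;  c: (T, T') -> communication cost of edge T -> T';
    # alloc: task -> processor.
    pred = {v: [] for v in w}
    for (u, v) in c:
        pred[v].append(u)
    finish = {}
    def rec(v):
        if v not in finish:
            start = 0
            for u in pred[v]:
                comm = c[(u, v)] if alloc[u] != alloc[v] else 0
                start = max(start, rec(u) + comm)
            finish[v] = start + w[v]
        return finish[v]
    return max(rec(v) for v in w)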
Complexity and List Heuristics with Communications
Including communication costs in the model makes everything difficult, including solving Pb(∞). The intuitive reason is that a trade-off must be found between allocating tasks to many processors (hence balancing the load but communicating intensively) or to few processors (leading to less communication but also less parallelism). Here is a small example, borrowed from []. Consider the commTG in Fig. . Task weights are indicated close to the tasks within parentheses, and communication costs are shown along the edges, underlined. For the sake of this example, two non-integer communication costs are used: c(T , T ) = c(T , T ) = 1.5. Of course, every weight w and cost c could be scaled to have only integer values. Observe the following:
● On the one hand, if all tasks are assigned to the same processor, the makespan will be equal to the sum of all task weights.
● On the other hand, with unlimited processors (no more than seven processors are needed, because there are seven tasks), each task can be assigned to a different processor. Then, the makespan of the ASAP schedule is equal to .
To see this, it is important to point out that once the allocation of tasks to processors is given, the makespan is computed easily: For each edge e : T → T′, add a virtual node of weight c(T, T′) if the edge links two different processors (alloc(T) ≠ alloc(T′)), and do nothing
otherwise. Then, consider the new graph as a DAG (without communications) and traverse it to compute the length of the longest path. Here, because all tasks are allocated to different processors, a virtual node is added on each edge. The longest path is T → T → T , whose length is w(T ) + c(T , T ) + w(T ) + c(T , T ) + w(T ) = . There is a difficult trade-off between executing tasks in parallel (hence on several distinct processors) and minimizing communication costs. In the example, it turns out that the best solution is to use two processors, according to the schedule in Fig. , whose makespan is equal to . Using more processors does not always lead to a shorter execution time. Note that the dependence constraints are satisfied in Fig. . For example, T can start at time  on processor P because this processor executes T , hence there is no need to pay the communication cost c(T , T ). By contrast, T is executed on processor P , hence it cannot be started before time  even though P is idle: σ(T ) + w(T ) + c(T , T ) =  +  +  = . With unlimited processors, the optimization problem becomes difficult: Pb(∞) is NP-complete in the strong sense. Even the problem in which all task weights and communication costs have the same (unit) value, the so-called UET-UCT problem (unit execution time, unit communication time), is NP-hard []. With limited processors, list heuristics can be extended to take communication costs into account, but Graham's bound no longer holds. For instance, the Modified Critical Path (MCP) algorithm proceeds as follows. First, bottom levels are computed using a pessimistic evaluation of the longest path, accounting for each potential communication (this corresponds to the allocation with a different processor per task).
Task Graph Scheduling. Fig.  An example commTG
Task Graph Scheduling. Fig. An optimal schedule for the example
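A sketch of this pessimistic bottom-level computation (illustrative code, reusing the commTG representation of task weights w and edge costs c from the sketch above) is:

def pessimistic_bottom_levels(w, c):
    # Every edge is assumed to pay its communication cost, as if each task
    # ran on its own processor.
    succ = {v: [] for v in w}
    for (u, s) in c:
        succ[u].append(s)
    bl = {}
    def rec(v):
        if v not in bl:
            bl[v] = w[v] + max((c[(v, s)] + rec(s) for s in succ[v]),
                               default=0)
        return bl[v]
    for v in w:
        rec(v)
    return bl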
These bottom levels are used to determine the priority of free tasks. Then each free task is assigned to the processor that allows its earliest execution, given previous task allocation decisions. It is important to explain further what "previous task allocation decisions" means. Free tasks from the queue are processed one after the other. At any moment, it is known which processors are available and which ones are busy. Moreover, for the busy processors, it is known when they will finish computing their currently allocated tasks. Hence, it is always possible to select the processor that can begin executing the task soonest. It may well be the case that a currently busy processor is selected.

Extension to Heterogeneous Platforms
This section explains how to extend list scheduling techniques to heterogeneous platforms, that is, to platforms that consist of processors with different speeds and interconnection links with different bandwidths. Key differences with the homogeneous case are outlined. Given a commTG with n tasks T1, . . . , Tn, the goal is to schedule it on a platform with p heterogeneous processors P1, . . . , Pp. There are now many parameters to instantiate:

Computation costs: The execution cost of Ti on Pq is modeled as wiq. Therefore, an n × p matrix of values is needed to specify all computation costs. This matrix comes directly from the specific scheduling problem at hand. However, when attempting to evaluate competing scheduling heuristics over a large number of synthetic scenarios, one must generate this matrix. One can distinguish two approaches. In the first approach, one generates a consistent (or uniform) matrix with wiq = wi × γq, where wi represents the number of operations required by Ti and γq is the inverse of the speed of Pq (in operations per second). With this definition, the relative speed of the processors does not depend on the particular task they execute. If instead some processors are faster than other processors for some tasks, but slower for other tasks, one speaks of an inconsistent (or nonuniform) matrix. This corresponds to the case in which some processors are specialized for some tasks (e.g., specialized hardware or software).

Communication costs: Just as processors have different speeds, communication links may have different bandwidths. However, while the speed of a processor may depend upon the nature of the computation it performs, the bandwidth of a link does not depend on the nature of the bytes it transmits. It is therefore natural to assume consistent (or uniform) links. If there is a dependence eij : Ti → Tj, and if Ti is executed on Pq and Tj is executed on Pr, then the communication time is modeled as

comm(i, j, q, r) = data(i, j) × vqr,

where data(i, j) is the data volume associated with eij and vqr is the communication time for a unit-size message from Pq to Pr (i.e., the inverse of the bandwidth). As in the homogeneous case, let vqr = 0 if q = r, that is, if both tasks are assigned to the same processor. If one wishes to generate synthetic scenarios to evaluate competing scheduling heuristics, one then must generate two matrices: one of size n × n for data and one of size p × p for vqr.

The main list scheduling principle is unchanged. As before, the priority of each task needs to be computed, so as to decide which one to execute first when there are more free tasks than available processors. The most natural idea is to compute averages of computation and communication times, and use these to compute priority levels exactly as in the homogeneous case:
● wi = (∑q=1,...,p wiq)/p, the average execution time of Ti;
● commij = data(i, j) × (∑1⩽q,r⩽p, q≠r vqr)/(p(p − 1)), the average communication cost for edge eij : Ti → Tj.
The last (but important) modification concerns the way in which tasks are assigned to processors: Instead of assigning the current task to the processor that will start its execution first (given all already taken decisions), one should assign it to the processor that will complete its execution first (given all already taken decisions). Both choices are equivalent with homogeneous processors, but intuitively the latter is likely to be more efficient in the heterogeneous case. Altogether, this leads to the list heuristic called HEFT, for Heterogeneous Earliest Finish Time [].
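A sketch of these two ingredients, averaged costs for ranking and earliest-finish-time processor selection, is given below (illustrative code; wiq is stored as an n × p list of lists, data and v are the two matrices just described, p ⩾ 2, and ready_time[q] is assumed to be the time at which the task's input data become available on Pq):

def average_costs(w, data, v):
    n, p = len(w), len(w[0])
    w_avg = [sum(w[i]) / p for i in range(n)]                # average execution times
    v_avg = (sum(v[q][r] for q in range(p) for r in range(p) if q != r)
             / (p * (p - 1)))                                # average unit comm time
    comm_avg = {e: vol * v_avg for e, vol in data.items()}   # average edge costs
    return w_avg, comm_avg

def earliest_finish_processor(task, ready_time, w, proc_free):
    # HEFT assigns the task to the processor that minimizes its finish time,
    # not the one on which it can start first.
    return min(range(len(proc_free)),
               key=lambda q: max(ready_time[q], proc_free[q]) + w[task][q])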
Workflow Scheduling
This section discusses workflow scheduling, that is, the problem of scheduling a (large) collection of identical task graphs rather than a single one. The main idea is to pipeline the execution of successive instances. Think
of a sequence of video images that must be processed in a pipelined fashion: Each image enters the platform and follows the same processing chain, and a new image can enter the system while previous ones are still being executed. This section is intended to give a flavor of the optimization problems to be solved in such a context, and it restricts itself to simpler problem instances. Consider "chains," that is, applications structured as a sequence of stages. Each stage corresponds to a different computational task. The application must process a large number of data sets, each of which must go through all stages. Each stage has its own communication and computation requirements: It reads an input from the previous stage, processes the data, and outputs a result to the next stage. Initial data are input to the first stage, and final results are obtained as the output from the last stage. The pipeline operates in synchronous mode: After some initialization delay, a new task is completed every period. The period is defined as the longest "cycle-time" needed to operate a stage, and it is the inverse of the throughput that can be achieved. For simplicity, it is assumed that each stage is assigned to a single processor, which is in charge of processing all instances (all data sets) for that stage. Each pipeline stage can be viewed as a sequential task that may write some global data structure, to disk or to memory, for each processed data set. In this case, tasks must always be processed in a sequential order within a stage. Moreover, due to possible local updates, each stage must be mapped onto a single processor. For a given stage, one cannot process half of the tasks on one processor and the remaining half on another without maintaining global information, which might be costly and difficult to implement. In other words, a processor that is assigned a stage will execute the operations required by this stage (input, computation, and output) for all the tasks fed into the pipeline. Of course, other assumptions are possible: Some stages could be replicated, or even data-parallelized. The reader is referred to the bibliographical notes at the end of the entry for such extensions.
Objective Functions
An important metric for parallel applications that consist of many individual computations is the throughput. The throughput measures the aggregate rate of data processing; it is the rate at which data sets can enter
the system. Equivalently, the inverse of the throughput, defined as the period, is the time interval required between the beginning of the execution of two consecutive data sets. The period minimization problem can be stated informally as follows: Which stage should be assigned to which processor so that the largest period of a processor is kept minimal? Another important metric is derived from makespan minimization, but it must be adapted. With a large number of data sets, the total execution time is less relevant, but the execution time for each data set remains important, in particular for real-time applications. One speaks of latency rather than of makespan, in order to avoid confusion. The latency is the time elapsed between the beginning and the end of the execution of a given data set, hence it measures the response time of the system to process the data set entirely. Minimizing the latency is antagonistic to maximizing the throughput. In fact, assigning all application stages to the fastest processor (thus working in a fully sequential way) would suppress all communications and accelerate computations, thereby minimizing the latency, but achieving a very bad throughput. Conversely, mapping each stage to a different processor is likely to decrease the period, hence increase the throughput (working in a fully pipelined manner), but the resulting latency will be high, because all interstage communications must be accounted for in this latter mapping. Trade-offs will have to be found between these criteria. How should several objective functions be handled? In traditional approaches, one would form a linear combination of the different objectives and treat the result as the new objective to optimize for. But it is not natural for the user to maximize a fixed linear combination of T and L, where T is the throughput and L the latency. Instead, one is more likely to fix a throughput T and to search for the best latency that can be achieved while enforcing T. A single criterion is optimized, under the condition that a threshold is enforced on the other one.
Period and Latency
Consider a pipeline with n stages Sk, 1 ⩽ k ⩽ n, as illustrated in Fig. . Tasks are fed into the pipeline and processed from stage to stage, until they exit the pipeline after the last stage. The kth stage Sk receives an input from the previous stage, of size bk−1, performs a number wk of operations, and outputs data of size bk to the
Task Graph Scheduling
b0
S1 w1
b1
S2
...
bk−1
w2
Sk
bk
...
Sn
wk
T
bn
wn
Task Graph Scheduling. Fig. The application pipeline
next stage. The first stage S1 receives an initial input of size b0, while the last stage Sn returns a final result of size bn. The target platform is a clique with p processors Pu, 1 ⩽ u ⩽ p, that are fully interconnected (see Fig. ). There is a bidirectional link linku,v : Pu ↔ Pv with bandwidth Bu,v between each pair of processors Pu and Pv. The literature often enforces more realistic communication models for workflow scheduling than for Task Graph Scheduling. For the sake of simplicity, a very strict model is enforced here: A given processor can be involved in a single communication at any time unit, either a send or a receive. Note that independent communications between distinct processor pairs can take place simultaneously. Finally, there is no overlap between communications and computations, so all the operations of a given processor are fully sequentialized. The speed of processor Pu is denoted as Wu, and it takes X/Wu time units for Pu to execute X operations. It takes X/Bu,v time units to send (respectively, receive) a message of size X to (respectively, from) Pv. The mapping problem consists in assigning application stages to processors. For one-to-one mappings, it is required that each stage Sk of the application pipeline be mapped onto a distinct processor Palloc(k) (which is possible only if n ⩽ p). The function alloc associates a processor index with each stage index. For convenience, two fictitious stages S0 and Sn+1 are created, with S0 assigned to Pin and Sn+1 to Pout. What is the period of Palloc(k), that is, the minimum delay between the processing of two consecutive tasks? To answer this question, one needs to know to which processors the previous and next stages are assigned. Let t = alloc(k − 1), u = alloc(k), and v = alloc(k + 1). Pu needs bk−1/Bt,u time units to receive the input data from Pt, wk/Wu time units to process it, and bk/Bu,v time units to send the result to Pv, hence a cycle-time of bk−1/Bt,u + wk/Wu + bk/Bu,v time units for Pu. These three steps are serialized (see Fig.  for an illustration). The period achieved with the mapping is the maximum of the cycle-times of the processors, and it corresponds to the rate at which the pipeline can be activated.
Task Graph Scheduling. Fig. The target platform
In this simple instance, the optimization problem can be stated as follows: Determine a one-to-one allocation function alloc : [1, n] → [1, p] (augmented with alloc(0) = in and alloc(n + 1) = out) such that

Tperiod = max1⩽k⩽n { bk−1/Balloc(k−1),alloc(k) + wk/Walloc(k) + bk/Balloc(k),alloc(k+1) }
is minimized. A natural extension is interval mappings, in which each participating processor is assigned an interval of consecutive stages. Note that when p < n, interval mappings are mandatory. Intuitively, assigning several consecutive stages to the same processor will increase its computational load, but will also decrease communication. The best interval mapping may turn out to be a one-to-one mapping, or instead it may utilize only a very small number of fast computing processors interconnected by high-speed links. The optimization problem associated with interval mappings is formally expressed as follows. The intervals realize a partition of the original set of stages S1 to Sn. One searches for a partition of [1, . . . , n] into m intervals Ij = [dj, ej] such that dj ⩽ ej for 1 ⩽ j ⩽ m, d1 = 1, dj+1 = ej + 1 for 1 ⩽ j ⩽ m − 1, and em = n. Recall that the function alloc : [1, n] → [1, p] associates a processor index with each stage index. In a one-to-one mapping, this function was a one-to-one
Task Graph Scheduling. Fig. An example of one-to-one mapping with three stages and processors. Each processor periodically receives input data from its predecessor (☀), performs some computation (∎), and outputs data to its successor (△). Note that these operations are shifted in time from one processor to another. The cycle-time of P and P is while that of P is , hence Tperiod =
assignment. In an interval mapping, for 1 ⩽ j ⩽ m, the whole interval Ij is mapped onto the same processor Palloc(dj), that is, for dj ⩽ i ⩽ ej, alloc(i) = alloc(dj). Also, two intervals cannot be mapped to the same processor, that is, for 1 ⩽ j, j′ ⩽ m, j ≠ j′, alloc(dj) ≠ alloc(dj′). The period is expressed as

Tperiod = max1⩽j⩽m { bdj−1/Balloc(dj−1),alloc(dj) + (∑i=dj,...,ej wi)/Walloc(dj) + bej/Balloc(dj),alloc(ej+1) }

Note that alloc(dj − 1) = alloc(ej−1) = alloc(dj−1) for j > 1, and d1 − 1 = 0. Also, ej + 1 = dj+1 for j < m, and em + 1 = n + 1. It is still assumed that alloc(0) = in and alloc(n + 1) = out. The optimization problem is then to determine the mapping that minimizes Tperiod, over all possible partitions into intervals and over all mappings of these intervals to the processors. The latency of an interval mapping is computed as follows. Each data set traverses all stages, but only communications between two stages mapped on the same processor take zero time units. Overall, the latency is expressed as

Tlatency = ∑1⩽j⩽m { bdj−1/Balloc(dj−1),alloc(dj) + (∑i=dj,...,ej wi)/Walloc(dj) } + bn/Balloc(n),alloc(n+1)
The latency of a one-to-one mapping obeys the same formula (with the restriction that each interval has length 1). Just as for the period, there are two minimization problems for the latency, with one-to-one and with interval mappings.
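Both formulas are straightforward to evaluate for a candidate mapping. The following sketch (illustrative code; stages are 1-indexed, w[0] is an unused placeholder, b[0] is the initial input size, and alloc maps stage indices 0, . . . , n + 1 to processor identifiers, including the fictitious input and output processors, for which W and B must also contain entries) returns the period and latency of an interval mapping:

def period_and_latency(intervals, alloc, w, b, W, B):
    # intervals: list of (d_j, e_j); w = [0, w1, ..., wn]; b = [b0, ..., bn];
    # W[u]: speed of processor u; B[u][v]: bandwidth between u and v.
    n = len(w) - 1
    period, latency = 0.0, 0.0
    for (d, e) in intervals:
        t_in = b[d - 1] / B[alloc[d - 1]][alloc[d]]
        t_cpu = sum(w[d:e + 1]) / W[alloc[d]]
        t_out = b[e] / B[alloc[d]][alloc[e + 1]]
        period = max(period, t_in + t_cpu + t_out)
        latency += t_in + t_cpu
    latency += b[n] / B[alloc[n]][alloc[n + 1]]
    return period, latency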
It goes beyond the scope of this entry to assess the complexity of these period/latency optimization problems, and of their bi-criteria counterparts. The aim was to provide the reader with a quick primer on workflow scheduling, an activity that borrows several concepts from Task Graph Scheduling, while using more realistic platform models, and different objective functions.
Related Entries
Loop Pipelining
Modulo Scheduling and Loop Pipelining
Scheduling Algorithms
Recommended Reading
Without communication costs, pioneering work includes the book by Coffman []. The book by El-Rewini et al. [] and the IEEE compilation of papers [] provide additional material. On the theoretical side, Appendix A of Garey and Johnson [] provides a list of NP-complete scheduling problems. Also, the book by Brucker [] offers a comprehensive overview of many complexity results. The literature with communication costs is more recent. See the survey paper by Chrétienne and Picouleau []. See also the book by Darte et al. [], where many heuristics are surveyed. The book by Sinnen [] provides a thorough discussion on communication models. In particular, it describes several extensions for modeling and accounting for communication contention. Workflow scheduling is quite a hot topic with the advent of large-scale computing platforms. A few representative papers are [, , , ].
Modern scheduling encompasses a wide spectrum of techniques: divisible load scheduling, cyclic scheduling, steady-state scheduling, online scheduling, job scheduling, and so on. A comprehensive survey is available in the book []. See also the handbook []. Most of the material presented in this entry is excerpted from the book by Casanova et al. [].
Bibliography
Benoit A, Robert Y () Mapping pipeline skeletons onto heterogeneous platforms. J Parallel Distr Comput ():–
Brucker P () Scheduling algorithms. Springer, New York
Casanova H, Legrand A, Robert Y () Parallel algorithms. Chapman & Hall/CRC Press, Beaumont, TX
Chrétienne P, Picouleau C () Scheduling with communication delays: a survey. In: Chrétienne P, Coffman EG Jr, Lenstra JK, Liu Z (eds) Scheduling theory and its applications. Wiley, Hoboken, NJ, pp –
Coffman EG () Computer and job-shop scheduling theory. Wiley, Hoboken, NJ
Darte A, Robert Y, Vivien F () Scheduling and automatic parallelization. Birkhäuser, Boston
El-Rewini H, Lewis TG, Ali HH () Task scheduling in parallel and distributed systems. Prentice Hall, Englewood Cliffs
Garey MR, Johnson DS () Computers and intractability, a guide to the theory of NP-completeness. WH Freeman and Company, New York
Gerasoulis A, Yang T () A comparison of clustering heuristics for scheduling DAGs on multiprocessors. J Parallel Distr Comput ():–
Graham RL () Bounds for certain multiprocessor anomalies. Bell Syst Tech J :–
Hary SL, Ozguner F () Precedence-constrained task allocation onto point-to-point networks for pipelined execution. IEEE Trans Parallel Distr Syst ():–
Leung JY-T (ed) () Handbook of scheduling: algorithms, models, and performance analysis. Chapman and Hall/CRC Press, Boca Raton
Picouleau C () Task scheduling with interprocessor communication delays. Discrete App Math (–):–
Robert Y, Vivien F (eds) () Introduction to scheduling. Chapman and Hall/CRC Press, Boca Raton
Shirazi BA, Hurson AR, Kavi KM () Scheduling and load balancing in parallel and distributed systems. IEEE Computer Science Press, San Diego
Sinnen O () Task scheduling for parallel systems. Wiley, Hoboken
Spencer M, Ferreira R, Beynon M, Kurc T, Catalyurek U, Sussman A, Saltz J () Executing multiple pipelined data analysis operations in the grid. In: Proceedings of the ACM/IEEE supercomputing conference. ACM Press, Los Alamitos
Subhlok J, Vondran G () Optimal mapping of sequences of data parallel tasks. In: Proceedings of the th ACM SIGPLAN symposium on principles and practice of parallel programming. ACM Press, San Diego, pp –
Topcuoglu H, Hariri S, Wu MY () Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans Parallel Distr Syst ():–
Task Mapping, Topology Aware Topology Aware Task Mapping
Tasks Processes, Tasks, and Threads
TAU
Sameer Shende, Allen D. Malony, Alan Morris, Wyatt Spear, Scott Biersdorff
University of Oregon, Eugene, OR, USA
Synonyms
TAU performance system; Tuning and analysis utilities
Definition
The TAU Performance System is an integrated suite of tools for instrumentation, measurement, and analysis of parallel programs, with a particular focus on large-scale, high-performance computing (HPC) platforms. TAU's objectives are to provide a flexible and interoperable framework for performance tool research and development, and a robust, portable, and scalable set of technologies for performance evaluation on high-end computer systems.
Discussion
Introduction
Scalable parallel systems have always evolved together with the tools used to observe, understand, and optimize their performance. Next-generation parallel computing environments are guided to a significant degree
by what is known about application performance on current machines and how performance factors might be influenced by technological innovations. State-of-the-art performance tools play an important role in helping to understand application performance, diagnose performance problems, and guide tuning decisions on modern parallel platforms. However, performance tool technology must also respond to the growing complexity of next-generation parallel systems in order to help deliver the promises of high-end computing (HEC). The TAU project began in the early 1990s with the goal of creating a performance instrumentation, measurement, and analysis framework that could produce robust, portable, and scalable performance tools for use in all parallel programs and systems over several technology generations. Today, the TAU Performance System is a ubiquitous performance tools suite for shared-memory and message-passing parallel applications written in multiple programming languages (e.g., C, C++, Fortran, OpenMP, Java, Python, UPC, Chapel) that can scale to the largest parallel machines available.
TAU Design

TAU belongs to a class of performance systems based on the approach of direct performance observation, wherein execution actions of performance interest are exposed as events to the performance system through direct insertion of instrumentation in the application, library, or
system code, at locations where the actions arise. In general, the actions reflect an occurrence of some execution state, most commonly as a result of a code location being reached (e.g., entry in a subroutine). However, it could also include a change in data. The key point is that the observation mechanism is direct. Generated events are made visible to the performance system in this way and contain implicit meta information as to their associated action. Thus, for any performance experiment using direct observation, the performance events of interest must be decided and necessary instrumentation done for their generation. Performance measurements are made of the events during execution and saved for analysis. Knowledge of the events is used to process the performance data and interpret the analysis results. The TAU framework architecture, shown in Fig. , separates the functional concerns of a direct performance observation approach into three primary layers – instrumentation, measurement, and analysis. Each layer uses multiple modules which can be configured in a flexible manner under user control. This design makes it possible for TAU to target alternative models of parallel computation, from shared-memory multi-threading to distributed memory message passing to mixed-mode parallelism []. TAU defines an abstract computation model for parallel systems that captures general architecture and software execution features and can be mapped to existing complex system types [].
TAU. Fig. TAU architecture: instrumentation and measurement (left), analysis (right)
TAU’s design has proven to be robust, sound, and highly adaptable to generations of parallel systems. The framework architecture has allowed new components to be added that have extended the capabilities of the TAU toolkit. This is especially true in areas concerning kernel-level performance integration [, ], performance monitoring [, , ], performance data mining [], and GPU performance measurement [].
TAU Instrumentation The role of the instrumentation layer in direct performance observation is to insert code (a.k.a. probes) to make performance events visible to the measurement layer. Performance events can be defined and instrumentation inserted in a program at several levels of the program transformation process. In fact, it is important to realize that a complete performance view may require contribution of event information across code levels []. For these reasons, TAU supports several instrumentation mechanisms based on the code type and transformation level: source (manual, preprocessor, library interposition), binary/dynamic, interpreter, and virtual machine. There are multiple factors that affect the choice of what level to instrument, including accessibility, flexibility, portability, concern for intrusion, and functionality. It is not a question of what level is “correct” because there are trade-offs for each and different events are visible at different levels. TAU is distinguished by its broad support for different instrumentation methods and their use together. TAU supports two general classes of events for instrumentation using any method: atomic events and interval events. An atomic event denotes a single action. Instrumentation is inserted at a point in the program code to expose an atomic action, and the measurement system obtains performance data associated with the action where and when it occurs. An interval event is a pair of events: begin and end. Instrumentation is inserted at two points in the code, and the measurement system uses data obtained from each event to determine performance for the interval between them (e.g., the time spent in a subroutine from entry (beginning of the interval) to exit (end of the interval)). In addition to the two general events classes, TAU allows events to be selectively enabled / disabled for any instrumentation method.
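To make the two event classes concrete, the sketch below shows how a C routine might be instrumented with TAU's measurement API, using one interval event (a timer around the routine) and one atomic event (a sampled value). The macro names (TAU_PROFILE_TIMER, TAU_PROFILE_START/STOP, TAU_REGISTER_EVENT, TAU_EVENT) follow TAU's commonly documented C interface, but their exact arguments can differ between TAU releases, so this should be read as an illustrative sketch rather than authoritative usage.

/* Sketch of TAU source instrumentation in C: one interval event (a timer
 * around a routine) and one atomic event (a value sampled at a point).
 * Assumes TAU.h from a TAU installation; macro details may differ by version. */
#include <TAU.h>
#include <stdlib.h>

static void compute(double *a, int n)
{
  /* Interval event: begin/end pair measuring the time spent in this routine. */
  TAU_PROFILE_TIMER(t, "compute", "(double*, int)", TAU_USER);
  TAU_PROFILE_START(t);

  /* Atomic event: records a single value each time the action occurs. */
  TAU_REGISTER_EVENT(bytes, "bytes touched by compute");
  TAU_EVENT(bytes, (double)(n * sizeof(double)));

  for (int i = 0; i < n; i++)
    a[i] = a[i] * 2.0 + 1.0;

  TAU_PROFILE_STOP(t);
}

int main(int argc, char **argv)
{
  TAU_PROFILE_INIT(argc, argv);   /* initialize the measurement system */
  TAU_PROFILE_SET_NODE(0);        /* single-process example            */
  double a[1024] = {0.0};
  compute(a, 1024);
  return 0;
}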
TAU Measurement The TAU performance measurement system is a highly robust, scalable infrastructure portable to all HPC platforms. As shown in Fig. , TAU supports the two dominant methods of measurement for direct performance observation – parallel profiling and tracing – with rich access to performance data through portable timing facilities, integration with hardware performance counters, and user-level information. The choice of measurement method and performance data is made independently of the instrumentation decisions. This allows multiple performance experiments to be conducted to gain different performance views for the same set of events. TAU also provides unique support for novel performance mapping [], runtime monitoring [, , ], and kernel-level measurements [, ]. TAU’s measurement system has two core capabilities. First, the event management handles the registration and encoding of events as they are created. New events are represented in an event table by instantiating a new event record, recording the event name, and linking in storage allocated for the event performance data. The event table is used for all atomic and interval events regardless of their complexity. Event type and context information are encoded in the event names. The TAU event-management system hashes and maps these names to determine if an event has already occurred or needs to be created. Events are managed for every thread of execution in the application. Second, a runtime representation, called the event callstack, captures the nesting relationship of interval performance events. It is a powerful runtime measurement abstraction for managing the TAU performance state for use in both profiling and tracing. In particular, the event callstack is key for managing execution context, allowing TAU to associate this context to the events being measured. Parallel profiling in TAU characterizes the behavior of every application thread in terms of its aggregate performance metrics. For interval events, TAU computes exclusive and inclusive metrics for each event. The TAU profiling system supports several profiling variants. The standard type of profiling is called flat profiling, which shows the exclusive and inclusive performance of each event but provides no other performance information about events occurring when an interval is active (i.e., nested events). In contrast, TAU’sevent path profiling can capture performance data with respect to event nesting
relationships. It is also interesting to observe performance data relative to an execution state. The structural, logical, and numerical aspects of a computation can be thought of as representing different execution phases. TAU supports an interface to create (phase events) and to mark their entry and exit. Internally in the TAU measurement system, when a phase, P, is entered, all subsequent performance will be measured with respect to P until it exits. When phase profiles are recorded, a separate parallel profile is generated for each phase. TAU implements robust, portable, and scalable parallel tracing support to log events in time-ordered tuples containing a time stamp, a location (e.g., node, thread), an identifier that specifies the type of event, eventspecific information, and other performance-related data (e.g., hardware counters). All performance events are available for tracing. TAU will produce a trace for every thread of execution in its modern trace format as well as in OTF [] and EPILOG [] formats. TAU also provides mechanisms for online and hierarchical trace merging [].
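The distinction between inclusive and exclusive metrics, and the role of the event callstack, can be illustrated with a small self-contained sketch. The code below is not TAU internals; it only mimics the bookkeeping idea: each interval event pushes a frame, and on exit the time spent in nested events is subtracted from the parent's exclusive time.

/* Conceptual sketch of exclusive vs. inclusive timing with an event callstack.
 * This mimics the bookkeeping idea described in the text; it is not TAU code. */
#include <stdio.h>
#include <time.h>

#define MAX_DEPTH 64

typedef struct { const char *name; double start, child; } Frame;
static Frame stack[MAX_DEPTH];
static int top = -1;

static double now(void) { return (double)clock() / CLOCKS_PER_SEC; }

static void event_enter(const char *name)
{
  stack[++top] = (Frame){ name, now(), 0.0 };
}

static void event_exit(void)
{
  Frame f = stack[top--];
  double inclusive = now() - f.start;           /* whole interval, children included */
  double exclusive = inclusive - f.child;       /* time not spent in nested events   */
  if (top >= 0) stack[top].child += inclusive;  /* charge the parent's child time    */
  printf("%-10s inclusive=%.6fs exclusive=%.6fs\n", f.name, inclusive, exclusive);
}

static void inner(void) { event_enter("inner"); for (volatile long i = 0; i < 10000000; i++) ; event_exit(); }
static void outer(void) { event_enter("outer"); inner(); inner(); event_exit(); }

int main(void) { outer(); return 0; }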
TAU Analysis As the complexity of measuring parallel performance increases, the burden falls on analysis and visualization tools to interpret the performance information. As shown in Fig. , TAU includes sophisticated tools for parallel profile analysis and performance data mining. In addition, TAU leverages advanced trace analysis technologies from the performance tool community, primarily the Vampir [] and Scalasca [] tools. The following focuses on the features of the TAU profiling tools. TAU’s parallel profile analysis environment consists of a framework for managing parallel profile data, PerfDMF [], and TAU’s parallel profile analysis tool, ParaProf []. The complete environment is implemented entirely in Java. The performance data management framework (PerfDMF) in TAU provides a common foundation for parsing, storing, and querying parallel profiles from multiple performance experiments. It builds on robust SQL relational database engines and must be able to handle both large-scale performance profiles, consisting of many events and threads of execution, as well as many profiles from
multiple performance experiments. To facilitate performance analysis development, the PerfDMF architecture includes a well-documented data-management API to abstract query and analysis operation into a more programmatic, non-SQL form. TAU’s parallel profile analysis tool, ParaProf [], is capable of processing the richness of parallel profile information produced by the measurement system, both in terms of the profile types (flat, callpath, phase, snapshots) as well as scale. ParaProf provides the users with a highly graphical tool for viewing parallel profile data with respect to different viewing scopes and presentation methods. Profile data can be input directly from a PerfDMF database and multiple profiles can be analyzed simultaneously. ParaProf can show parallel profile information in the form of bargraphs, callgraphs, scalable histograms, and cumulative plots. ParaProf is also capable of integrating multiple performance profiles for the same performance experiment but using different performance metrics for each. ParaProf uses scalable histogram and three-dimensional displays for larger datasets. To provide more sophisticated performance analysis capabilities, we developed support for parallel performance data mining in TAU. PerfExplorer [, ] is a framework for performance data mining motivated by our interest in automatic parallel performance analysis and by our concern for extensible and reusable performance tool technology. PerfExplorer is built on PerfDMF and targets large-scale performance analysis for single experiments on thousands of processors and for multiple experiments from parametric studies. PerfExplorer uses techniques such as clustering and dimensionality reduction to manage large-scale data complexity.
Summary

The TAU Performance System has undergone several incarnations in pursuit of its primary objectives of flexibility, portability, integration, interoperability, and scalability. The outcome is a robust technology suite that has significant coverage of the performance problem solving landscape for high-end computing. TAU follows a direct performance observation methodology since it is based on the observation of effects directly associated with the program's execution, allowing performance
data to be interpreted in the context of the computation. Hard issues of instrumentation scope and measurement intrusion have to be addressed, but these have been aggressively pursued and the technology enhanced in several ways during TAU’s lifetime. TAU is still evolving, and new capabilities are being added to the tools suite. Support for whole-system performance analysis, model-based optimization using performance expectations and knowledge-based data mining, and heterogeneous performance measurement are being pursued.
Related Entries

Metrics
Performance Analysis Tools
Bibliography . Bell R, Malony A, Shende S () A portable, extensible, and scalable tool for parallel performance profile analysis. In: European conference on parallel computing (EuroPar ), Klagenfurt . Brunst H, Kranzlmüller D, Nagel WE () Tools for scalable parallel program analysis – Vampir NG and DeWiz. In: Distributed and parallel systems, cluster and grid computing, vol . Springer, New York . Brunst H, Nagel W, Malony A () A distributed performance analysis architecture for clusters. In: IEEE international conference on cluster computing (Cluster ), pp –. IEEE Computer Society, Los Alamitos . Huck KA, Malony AD () Perfexplorer: a performance data mining framework for large-scale parallel computing. In: High performance networking and computing conference (SC’). IEEE Computer Society, Los Alamitos . Huck K, Malony A, Bell R, Morris A () Design and implementation of a parallel performance data management framework. In: International conference on parallel processing (ICPP ). IEEE Computer Society, Los Alamitos . Huck K, Malony A, Shende S, Morris A () Knowledge support and automation for performance analysis with PerfExplorer .. J Sci Program (–):– (Special issue on large-scale programming tools and environments) . Knüpfer A, Brendel R, Brunst H, Mix H, Nagel WE () Introducing the Open Trace Format (OTF). In: International conference on computational science (ICCS ). Lecture notes in computer science, vol . Springer, Berlin, pp – . Mayanglambam S, Malony A, Sottile M () Performance measurement of applications with GPU acceleration using CUDA. In: Parallel computing (ParCo), Lyon . Mohr B, Wolf F () KOJAK – a tool set for automatic performance analysis of parallel applications. In: European conference on parallel computing (EuroPar ). Lecture notes in computer science, vol . Springer, Berlin, pp –
. Nataraj A, Sottile M, Morris A, Malony AD, Shende S () TAUoverSupermon: low-overhead online parallel performance monitoring. In: European conference on parallel computing (EuroPar ), Rennes . Nataraj A, Morris A, Malony AD, Sottile M, Beckman P () The ghost in the machine: observing the effects of kernel operation on parallel application performance. In: High performance networking and computing conference (SC’), Reno . Nataraj A, Malony A, Morris A, Arnold D, Miller B () In search of sweet-spots in parallel performance monitoring. In: IEEE international conference on cluster computing (Cluster ), Tsukuba . Nataraj A, Malony A, Morris A, Arnold D, Miller B () TAUoverMRNet (ToM): a framework for scalable parallel performance monitoring. In: International workshop on scalable tools for high-end computing (STHEC ’), Kos . Nataraj A, Malony AD, Shende S, Morris A () Integrated parallel performance views. Clust Comput ():– . Shende S () The role of instrumentation and mapping in performance measurement. Ph.D. thesis, University of Oregon . Shende S, Malony A () The TAU parallel performance system. Int J Supercomput Appl High Speed Comput (, Summer):– (ACTS collection special issue) . Shende S, Malony AD, Cuny J, Lindlan K, Beckman P, Karmesin S () Portable profiling and tracing for parallel scientific applications using C++. In: SIGMETRICS symposium on parallel and distributed tools, SPDT’, Welches, pp – . Wolf F et al () Usage of the SCALASCA toolset for scalable performance analysis of large-scale parallel applications. In: Proceedings of the second HLRS parallel tools workshop, Stuttgart. Lecture notes in computer science. Springer, Berlin
TAU Performance System

TAU
TBB (Intel Threading Building Blocks)

Intel Threading Building Blocks (TBB)
Tensilica

Green Flash: Climate Machine (LBNL)
Tera MTA

Burton Smith
Microsoft Corporation, Redmond, WA, USA
Synonyms

Cray MTA; Cray XMT; Horizon
Definition

The MTA (for Multi-Threaded Architecture) is a highly multithreaded scalar shared-memory multiprocessor architecture developed by Tera Computer Company (renamed Cray Inc. in ) in Seattle, Washington. Work began in at The Institute for Defense Analyses Center for Computing Sciences on a closely related predecessor (Horizon), and development of both hardware and software was continuing at Cray Inc. as of .
Discussion Introduction The Tera MTA [] is in many respects a direct descendant of the Denelcor HEP computer []. Like the HEP, the MTA is a scalar shared-memory system equipped with full/empty bits at every -bit memory location and multiple protection domains to permit multiprogramming within a processor. However, the MTA introduced a few innovations including VLIW instructions without any register set partitioning, additional ILP via dependence data encoded in each instruction, twophase blocking synchronization, unlimited data breakpoints, speculative loads, division and square root to full accuracy using iterative methods, operating system entry via procedure calls, traps that never change privilege, and no interrupts at all. Software developed for the MTA introduced its share of novel ideas as well, including a user-mode runtime responsible for synchronization and work scheduling, negotiated resource management between the user-mode runtime and the operating system, an operating system that returns control to the user-mode runtime when the call blocks, a compiler for Fortran and C++ that parallelizes and restructures a wide variety of loops including those whose inter-iteration dependences require a parallel prefix computation, and dynamic scheduling of loop nests having mixed rectangular, triangular, and skyline loop bounds.
Beginnings While spending the summer of as an intern at Denelcor, UC Berkeley graduate student Stephen W. Melvin invented a scheme for organizing a register file in a fine-grain multi-threaded processor to let VLIW instructions enjoy multiple register accesses per instruction while preserving a flat register address space within each hardware thread. The idea was simple: organize the register file into multiple banks with each bank containing all of the registers for an associated subset of the hardware threads; let each issued instruction use multiple cycles to read and write its associated bank as many times as necessary to implement the instruction; and have the instruction issue logic refrain from issuing from threads associated with currently busy banks. From this beginning, an architectural proposal emerged that was whimsically referred to as “Vulture” and was later known as the “HEP array processor”. Denelcor envisioned heterogeneous systems with second-generation HEP processors sharing memory with processors based on this new VLIW idea. Denelcor filed for Chapter bankruptcy in mid whereupon its CTO, Burton J. Smith, joined the Supercomputing Research Center (now called the Center for Computing Sciences) of the Institute for Defense Analyses (IDA) in Maryland. His plan was to further evaluate the merits of the multithreaded VLIW ideas. It had already become clear that code generation and optimization for such a processor was only slightly more difficult than for the HEP but the resulting performance was potentially much higher. The design that resulted from collaborations on this topic at IDA over the years – was known as Horizon and was described in a series of papers presented at Supercomputing’ [, ]. Horizon instructions were bits wide and typically contained three operations: a memory reference (M) operation, an arithmetic or logic (A) operation, and a control (C) operation which did branches but could also do some arithmetic and logic. As many as ten five-bit register references might appear in any single instruction. The memory reference semantics were derived from those of the HEP but added data trap bits to implement data break points (watchpoints) and possibly other things. The A operations included fused multiplyadds for both integer and floating point arithmetic and a rich variety of operations on vectors and matrices of bits. Branches were encoded compactly as an opcode, an eight-bit condition mask, a two-bit condition code
number specifying one of the four most recently generated three-bit condition codes, and two bits naming a branch target register that had been preloaded with the branch address. Most A- and C-operations emitted a condition code as an optional side-effect. Horizon increased ILP beyond three with an idea known as explicit-dependence lookahead, replacing the usual register reservation scheme. Intel would later employ a kindred concept for encoding dependence information within instructions in the Itanium architecture, calling it explicitly parallel instruction computing (EPIC). The Horizon version included in every instruction an explicit -bit unsigned integer, the lookahead that bounded from below the number of instructions separating this instruction from later ones that might depend on it. Subsequent instructions within the bound could overlap with the current instruction. To implement this scheme, every hardware thread was equipped with a three-bit flag counter, incremented when the thread is selected to issue its next instruction, and an array of eight three-bit lock counters. On instruction issue, the lock counter subscripted by the lookahead plus the flag (mod ) is incremented; when the instruction fully retires, that same lock is decremented. A thread is permitted to issue its next instruction when the lock subscripted by its flag is zero. Instruction instances were thus treated very much like operations in a static data flow machine. As a further refinement, branches were available to terminate lookahead along the unlikely control flow direction, potentially increasing ILP along the likelier one.
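The following small simulation illustrates the flag/lock-counter issue rule described above. It is a conceptual sketch only: the counter widths, the ordering of the flag update, and the demonstration program are assumptions chosen to reproduce the described semantics, not a description of the actual Horizon hardware.

/* Conceptual simulation of Horizon-style explicit-dependence lookahead.
 * Each issued instruction increments lock[(flag + lookahead) mod 8]; when it
 * retires it decrements that same lock.  A thread may issue its next
 * instruction only while the lock selected by its flag is zero. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
  unsigned flag;        /* 3-bit flag counter, bumped on each issue           */
  unsigned lock[8];     /* 3-bit lock counters, one per flag value            */
} ThreadState;

static bool can_issue(const ThreadState *t)
{
  return t->lock[t->flag & 7] == 0;
}

/* Returns the lock slot the instruction must decrement when it retires. */
static unsigned issue(ThreadState *t, unsigned lookahead)
{
  t->flag = (t->flag + 1) & 7;               /* selected to issue: bump flag  */
  unsigned slot = (t->flag + lookahead) & 7; /* first later instruction that  */
  t->lock[slot]++;                           /* may depend on this one        */
  return slot;
}

static void retire(ThreadState *t, unsigned slot) { t->lock[slot]--; }

int main(void)
{
  ThreadState t = {0, {0}};
  unsigned s1 = issue(&t, 3);   /* instr 1: next 3 instructions may overlap   */
  printf("issue instr 2? %s\n", can_issue(&t) ? "yes" : "no");   /* yes       */
  unsigned s2 = issue(&t, 0);   /* instr 2: instr 3 may depend on it          */
  printf("issue instr 3? %s\n", can_issue(&t) ? "yes" : "no");   /* no        */
  retire(&t, s2);               /* instr 2 retires                            */
  printf("issue instr 3? %s\n", can_issue(&t) ? "yes" : "no");   /* yes       */
  retire(&t, s1);
  return 0;
}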
Tera

Early in , Smith and a colleague, James E. Rottsolk, decided to start a company to build general-purpose parallel computer systems based on the Horizon concepts. They named the new company Tera, acquired initial funding from private investors and from the Defense Advanced Research Projects Agency (DARPA), and began to search for a suitable home. In August , Seattle, Washington was chosen and Tera began recruiting engineers. The University of Washington helped the company in its early days.
Languages

The Tera language strategy [] was to add directives and pragmas to Fortran and C++ to guide compiler
loop parallelization and provide performance and compatibility with existing vector processors. A consistency model strongly resembling release consistency was adopted based on acquire and release synchronization points to let the compiler cache values in registers and restructure code aggressively. Basically, memory references could not move backward over acquires or forward over releases. Only the built-in synchronization based on full/empty bits was permitted at first; later, volatile variable references were made legal (but deprecated) for synchronization. A future statement borrowed from Multilisp [] was introduced in both Fortran and C++ to support task parallelism as well as divide-and-conquer data parallelism. It uses the full/empty bits to synchronize completion of the future with its invoker. The body of the future appears in-line and can reference variables in its enclosing environment.
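The full/empty synchronization that the future statement relies on can be emulated in ordinary C, which may help clarify its semantics. The sketch below uses a flag, a mutex, and a condition variable; the function names readfe and writeef only echo the conventional MTA names, and nothing here reflects the hardware implementation.

/* Conceptual emulation of full/empty-bit synchronization (the semantics the
 * MTA provides in hardware) using a flag, a mutex, and a condition variable.
 * readfe: wait until full, read, set empty.  writeef: wait until empty, write,
 * set full.  A "future" can synchronize with its invoker through such a cell. */
#include <pthread.h>
#include <stdio.h>

typedef struct {
  double value;
  int full;                 /* the emulated full/empty bit */
  pthread_mutex_t m;
  pthread_cond_t c;
} SyncCell;

static void cell_init(SyncCell *s)
{
  s->full = 0;
  pthread_mutex_init(&s->m, NULL);
  pthread_cond_init(&s->c, NULL);
}

static void writeef(SyncCell *s, double v)   /* write when empty, leave full  */
{
  pthread_mutex_lock(&s->m);
  while (s->full) pthread_cond_wait(&s->c, &s->m);
  s->value = v; s->full = 1;
  pthread_cond_broadcast(&s->c);
  pthread_mutex_unlock(&s->m);
}

static double readfe(SyncCell *s)            /* read when full, leave empty   */
{
  pthread_mutex_lock(&s->m);
  while (!s->full) pthread_cond_wait(&s->c, &s->m);
  double v = s->value; s->full = 0;
  pthread_cond_broadcast(&s->c);
  pthread_mutex_unlock(&s->m);
  return v;
}

static SyncCell result;

static void *future_body(void *arg)          /* plays the role of a future    */
{
  (void)arg;
  writeef(&result, 42.0);                    /* completion fills the cell     */
  return NULL;
}

int main(void)
{
  pthread_t t;
  cell_init(&result);
  pthread_create(&t, NULL, future_body, NULL);
  printf("future returned %g\n", readfe(&result));  /* invoker blocks until full */
  pthread_join(&t, NULL);
  return 0;
}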
Compiler Optimization

The MTA compilers can automatically parallelize a variety of loops []. Consider the example below:

void sort(int *src, int *dst, int nitems, int nvals)
{
  int i, j, t1[nvals], t2[nvals];
  for (i = 0; i < nvals; i++) { t1[i] = 0; }
  for (i = 0; i < nitems; i++) { t1[src[i]]++; }               // atomic update
  t2[0] = 0;
  for (j = 1; j < nvals; j++) { t2[j] = t2[j-1] + t1[j-1]; }   // parallel prefix
  for (i = 0; i < nitems; i++) { dst[t2[src[i]]++] = src[i]; } // atomic update
}
To achieve as high a level of ILP as the instruction set can afford, software pipelining [] is exploited by distributing loops based on estimates of register pressure and then unrolling and packing to obtain a good schedule. Experiments at both IDA and Tera led to a modification of the lookahead scheme to overlap only memory references. Software pipelining was further enabled by implementing speculative loads. A poison bit is associated with every register in the thread, and memory protection violations can optionally be made to poison the destination register instead of raising an exception. Any instruction that attempts to use a poisoned value will trap instead. The speculative load feature allows prefetching in a software pipeline without having to “unpeel” final iterations to prevent accesses beyond mapped memory. In any case the instruction can be reattempted if the cause of the protection violation is remediable. Nests of parallelizable “for” loops may vary in iterations per loop, sometimes dynamically, making them hard to execute efficiently. The MTA compiler schedules these nests as a whole, even when inner loops have bounds that depend on outer iteration variables. First, code is generated to compute the total number of (inner loop) iterations of the whole nest. Functions are then generated to compute the iteration number of each loop from the total iteration count. Finally, code is generated to dynamically schedule the loop nest by having each worker thread acquire a “chunk” of total iterations using a variant of guided self-scheduling [], reconstruct the iteration variable bindings, and then jump into the loop nest to iterate until the chunk is consumed.
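A minimal sketch of this whole-nest scheduling idea is given below for a triangular loop nest: the total iteration count is computed up front, workers grab chunks from a shared counter, and the (i, j) indices are reconstructed from the linear iteration number. The chunking here is fixed-size for simplicity, whereas the MTA compiler uses a guided self-scheduling variant, and all names are invented for the illustration.

/* Conceptual sketch of scheduling a triangular loop nest as a whole:
 *   for (i = 0; i < n; i++) for (j = 0; j <= i; j++) body(i, j);
 * The total iteration count is computed up front, worker threads grab fixed
 * chunks from a shared counter, and (i, j) are reconstructed from the linear
 * iteration number. */
#include <stdatomic.h>
#include <math.h>
#include <stdio.h>

#define N     1000
#define CHUNK 256

static atomic_long next_iter;                 /* shared iteration counter     */
static long total;                            /* total iterations in the nest */

static void unflatten(long k, long *i, long *j)
{
  /* k = i*(i+1)/2 + j with 0 <= j <= i; invert with the quadratic formula. */
  *i = (long)((sqrt(8.0 * (double)k + 1.0) - 1.0) / 2.0);
  while (*i * (*i + 1) / 2 > k) (*i)--;            /* guard against rounding  */
  while ((*i + 1) * (*i + 2) / 2 <= k) (*i)++;
  *j = k - *i * (*i + 1) / 2;
}

static double body(long i, long j) { return (double)i * (double)j; }

static double worker(void)                    /* each thread runs this        */
{
  double local = 0.0;
  for (;;) {
    long start = atomic_fetch_add(&next_iter, CHUNK);
    if (start >= total) break;
    long end = start + CHUNK < total ? start + CHUNK : total;
    for (long k = start; k < end; k++) {
      long i, j;
      unflatten(k, &i, &j);
      local += body(i, j);
    }
  }
  return local;
}

int main(void)
{
  total = (long)N * (N + 1) / 2;
  atomic_store(&next_iter, 0);
  printf("one worker's sum: %g\n", worker()); /* several threads would call worker() */
  return 0;
}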
User-Level Runtime A user-level runtime environment was developed to help implement the language features and schedule finegrain parallel tasks. The Horizon architecture was modified to make full/empty synchronization operations lock the location and generate a user-level trap after a programmable limit on the number of synchronization retry attempts is exceeded. The runtime trap handler saves the state of the blocked task, initializes a queue of blocked tasks containing this one as its first element, and places a pointer to this new queue in the still-locked full/empty location. It then sets a trap bit and unlocks the location so that subsequent references of any kind will trap immediately. In this way the tasks that arrive later
either block, joining the queue of waiting tasks right away, or dequeue a waiting task immediately. The retry limit is set to match the time needed to save and later restore the task state, making this scheme within a factor of two of optimal. When memory references are frequent, the retry rate is throttled to be much less than that for new memory references, and if the processor is not starved for hardware threads the polling cost becomes almost negligible and the retry limit can be increased substantially. When formerly blocked tasks are unblocked, they are enqueued in a pool of runnable tasks. Since these tasks are equipped with a stack and may have additional memory associated with them, they are run in preference to tasks associated with future statements that have not yet run. Still, the number of blocked tasks can be substantial. To reduce memory waste, stacks are organized as linked lists of fixed-size blocks. Automatic arrays and other things that must be contiguously allocated are stored in the heap. Interprocedural analysis is used to avoid most stack bounds checks. Another modification to the Horizon architecture comprised instructions to allocate multiple hardware threads. Each protection domain has an operating system-imposed limit on the number of hardware threads in the domain and a reservation which can be increased (or decreased) by one of the new instructions []. One of them reserves a variable number, from zero up to a maximum specified as an argument, depending on availability. The other instruction either reserves the requested amount or none at all. In either case, the number of additional hardware threads actually allocated is returned in a register so a loop can be used to initiate them. The primary motivation for this reservation capability was to accommodate rapidly varying quantities of parallelism found in short parallel loops. Hardware threads can be materialized and put to work quickly when such opportunities are encountered.
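The retry-limit policy can be sketched in portable C as a two-phase wait: poll a synchronization variable a bounded number of times, then fall back to blocking. The sketch below uses pthreads and an arbitrary retry limit; on the MTA the retry and the user-level trap are provided by the hardware and the runtime, so this is only an analogy.

/* Conceptual sketch of two-phase blocking synchronization: retry (poll) up to
 * a limit chosen to match the cost of saving/restoring a blocked task, then
 * fall back to blocking. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define RETRY_LIMIT 1000   /* roughly the cost of one save/restore, in polls  */

typedef struct {
  atomic_int full;
  pthread_mutex_t m;
  pthread_cond_t c;
} Gate;

static void gate_init(Gate *g)
{
  atomic_store(&g->full, 0);
  pthread_mutex_init(&g->m, NULL);
  pthread_cond_init(&g->c, NULL);
}

static void gate_wait(Gate *g)
{
  /* Phase 1: bounded polling ("retry"), cheap if the producer is close.      */
  for (int i = 0; i < RETRY_LIMIT; i++)
    if (atomic_load(&g->full)) return;

  /* Phase 2: give up and block (the analogue of saving the task state and
   * joining the queue attached to the locked location).                      */
  pthread_mutex_lock(&g->m);
  while (!atomic_load(&g->full))
    pthread_cond_wait(&g->c, &g->m);
  pthread_mutex_unlock(&g->m);
}

static void gate_signal(Gate *g)
{
  pthread_mutex_lock(&g->m);
  atomic_store(&g->full, 1);
  pthread_cond_broadcast(&g->c);
  pthread_mutex_unlock(&g->m);
}

int main(void)
{
  Gate g;
  gate_init(&g);
  gate_signal(&g);   /* already full: the waiter returns in the polling phase */
  gate_wait(&g);
  puts("passed the gate");
  return 0;
}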
Operating System

Since the MTA's operating system (OS) plays no role in user-level thread synchronization and allocates but does not micromanage the dynamic quantity of hardware threads, the usual OS invocation machinery (trapping) was rejected in favor of procedure calls. A protection ring-crossing instruction guarantees only valid OS
entry points are called. The operating system can allocate its own stack space when and if necessary. If an OS call ultimately blocks, the hardware thread is returned to user level via a runtime entry point that associates the continuation of the original user computation with a “cookie” supplied by the OS. When the original OS call completes, the appropriate cookie is passed to the user runtime so it can unblock the user-level continuation. As a result, the operating system executes in parallel with the user-level program. This scheme was invented independently [] at the University of Washington and is referred to as scheduler activations. When an illegal operation is attempted, a trap to the user-level trap handler occurs. If OS intervention is required, the trap handler calls the OS to service the trap. To evict a process from a protection domain, all of its hardware threads are made to trap in response to an OS-generated signal. The OS also has the ability to kill all hardware threads in a protection domain.
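The cookie/continuation handshake can be illustrated with a small single-threaded sketch: the runtime records a continuation and hands out a cookie when an OS call would block, and resumes the continuation when the cookie is returned. The data structures and function names below are invented for the illustration and are not the MTA runtime interface.

/* Conceptual sketch of the cookie/continuation handshake described above:
 * a blocking OS call returns control to the user-level runtime together with
 * a cookie; when the operation later completes, the OS passes the cookie back
 * and the runtime resumes the associated user-level continuation. */
#include <stdio.h>

#define MAX_PENDING 16

typedef void (*Continuation)(void *arg);

typedef struct { int used; Continuation k; void *arg; } Pending;
static Pending pending[MAX_PENDING];

/* User runtime: remember the continuation, hand out a cookie. */
static int runtime_block(Continuation k, void *arg)
{
  for (int cookie = 0; cookie < MAX_PENDING; cookie++)
    if (!pending[cookie].used) {
      pending[cookie] = (Pending){1, k, arg};
      return cookie;
    }
  return -1;
}

/* Called when the OS reports completion of the call identified by cookie. */
static void runtime_unblock(int cookie)
{
  Pending p = pending[cookie];
  pending[cookie].used = 0;
  p.k(p.arg);                      /* resume the user-level computation */
}

/* The rest of a user computation, resumed after the "OS call" completes. */
static void after_read(void *arg) { printf("read finished, got %d bytes\n", *(int *)arg); }

int main(void)
{
  static int nbytes = 4096;
  int cookie = runtime_block(after_read, &nbytes);  /* OS call would block    */
  /* ... the hardware thread returns to the runtime and runs other tasks ...  */
  runtime_unblock(cookie);                          /* OS completes the call  */
  return 0;
}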
Memory Mapping The MTA has a large virtual data address space, making high-performance memory address translation challenging. To address this issue, segmentation is used instead of paging, with a -bit virtual address comprising a -bit segment number and a -bit offset. Contiguous allocations of memory larger than MB can use multiple segments. Segment size granularity is KB. Each protection domain has base and limit registers that define its own range of segment numbers. A segment map entry specifies minimum privilege levels for loads and stores, and whether physical addresses are to be distributed across the multiple memory banks of the system by “scrambling” the address bits []. When memory is distributed in this fashion there are virtually no bank conflicts due to strided accesses. Program memory, as addressed by the program counters, is handled differently using a paging scheme with KB pages. The program space is mapped to data space via a non-distributed (i.e., local) segment.
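A toy version of such a segment-based translation is sketched below. The field widths (a 32-bit segment number above a 32-bit offset) and the scramble function are illustrative assumptions, since the actual widths are not reproduced here; only the overall structure (per-domain base/limit check, per-segment length check, optional bank scrambling) follows the description above.

/* Conceptual sketch of segment-based translation with per-domain base/limit
 * registers.  Field widths and the scramble step are illustrative assumptions,
 * not the MTA's actual layout. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t phys_base; uint64_t length; int distributed; } Segment;
typedef struct { uint32_t seg_base, seg_limit; const Segment *map; } Domain;

/* Toy "scramble" spreading consecutive addresses across memory banks. */
static uint64_t scramble(uint64_t pa) { return pa ^ ((pa >> 6) & 0xFF); }

static int translate(const Domain *d, uint64_t va, uint64_t *pa)
{
  uint32_t seg = (uint32_t)(va >> 32);      /* assumed: high half = segment   */
  uint64_t off = va & 0xFFFFFFFFu;          /* assumed: low half  = offset    */
  if (seg < d->seg_base || seg >= d->seg_limit) return -1;   /* protection    */
  const Segment *s = &d->map[seg - d->seg_base];
  if (off >= s->length) return -1;
  *pa = s->phys_base + off;
  if (s->distributed) *pa = scramble(*pa);  /* spread across banks            */
  return 0;
}

int main(void)
{
  Segment map[1] = {{ 0x100000, 1 << 20, 1 }};
  Domain d = { 0, 1, map };
  uint64_t pa;
  if (translate(&d, ((uint64_t)0 << 32) | 0x1234, &pa) == 0)
    printf("physical address 0x%llx\n", (unsigned long long)pa);
  return 0;
}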
Arithmetic

The instruction set supports the usual variety of two's complement and unsigned integers and both - and -bit IEEE floating point. Division for signed and unsigned integers [] along with floating point division and square root all use Newton's method but nevertheless implement full accuracy and correct semantics. Excepting these few operations, denormalized arithmetic is fully supported in hardware. High-precision integer arithmetic is abetted by a -bit unsigned integer multiply instruction and the ability to propagate carry bits easily. There is also support for a -bit floating point format using pairs of -bit floats; the smaller value is insignificant with respect to the larger, thereby yielding bits of significand precision or more. The existence of a fused multiply-add makes this format relatively inexpensive to implement and use, but instruction modifications were needed to mitigate a variety of pathologies. A true -bit IEEE format would doubtless be preferable.
MTA- and MTA- The MTA- was a water-cooled system built from Gallium Arsenide logic. To provide adequate memory bandwidth, the processors sparsely populated a D toroidal mesh interconnection network. As an example, routing nodes were required for processors. To make network wiring implementable, one-third of the mesh links were elided. The first and only MTA- system was delivered to the San Diego Supercomputer Center in June . The MTA- was a major improvement in manufacturability over the MTA-. It used CMOS logic and had an interconnection network based on notions from group theory []. It was first delivered in to the US Naval Research Laboratory in Washington, DC. A few other MTA- systems were built before deliveries of the Cray XMT (q.v.) began in .
Related Entries

Cray MTA
Cray XMT
Data Flow Computer Architecture
Denelcor HEP
EPIC Processors
Futures
Interconnection Networks
Latency Hiding
Little's Law
Memory Wall
MIMD (Multiple Instruction, Multiple Data) Machines
Modulo Scheduling and Loop Pipelining
Multilisp
Multi-Threaded Processors
Networks, Direct
Networks, Multistage
Processes, Tasks, and Threads
Processors-in-Memory
Shared-Memory Multiprocessors
SPMD Computational Model
Synchronization
Ultracomputer, NYU
VLIW Processors
. Anderson T, Bershad B, Lazowska E, Levy H () Scheduler activations: effective kernel support for the user-level management of parallelism. ACM T Comput Syst ():– . Norton A, Melton E () A class of Boolean linear transformations for conflict-free power-of-two stride access. In: Proceedings of the international conference on parallel processing, St. Charles, IL . Alverson R () Integer division using reciprocals. In: Proceedings of the th IEEE symposium on computer arithmetic, Grenoble . Akers S, Krishnamurthy B () A group-theoretic model for symmetric interconnection networks. IEEE T Comput C():– . Alverson G, Kahan S, Korry R, McCann C, Smith B () Scheduling on the Tera MTA. In: Proceedings of the first workshop on job scheduling strategies for parallel processing, Santa Barbara. Lecture Notes in Computer Science :–
Bibliography . Alverson R, Callahan D, Cummings D, Koblenz B, Porterfield A, Smith B () The Tera computer system. In: Proceedings of the international conference on supercomputing, Amsterdam . Smith BJ () Architecture and applications of the HEP multiprocessor computer system. Proc SPIE Real-Time Signal Process IV :– . Kuehn JT, Smith BJ () The Horizon supercomputing system: architecture and software. In: Proceedings of the ACM/IEEE conference on supercomputing, Orlando . Thistle MR, Smith BJ () A processor architecture for Horizon. In: Proceedings of the ACM/IEEE conference on supercomputing, Orlando . Callahan D, Smith B () A future-based parallel language for a general-purpose highly-parallel computer. In: Selected papers of the second workshop on languages and compilers for parallel computing, Irvine . Halstead RH () MultiLisp: a language for concurrent symbolic computation. ACM T Program Lang Syst ():– . Alverson G, Briggs P, Coatney S, Kahan S, Korry R () Tera hardware-software cooperation. In: Proceedings of supercomputing, San Jose . Ladner RE, Fischer MJ () Parallel prefix computation. J ACM ():– . Callahan D () Recognizing and parallelizing bounded recurrences. In: Proceedings of the fourth workshop on languages and compilers for parallel computing, Santa Clara . Lam M () Software pipelining: an effective scheduling technique for VLIW machines. In: Proceedings of the ACM SIGPLAN conference on programming language design and implementation, Atlanta . Polychronopoulos C, Kuck D () Guided self-scheduling: a practical scheduling scheme for parallel supercomputers. IEEE T Comput C-():– . Alverson G, Alverson R, Callahan D, Koblenz B, Porterfield A, Smith B () Exploiting heterogeneous parallelism on a multithreaded multiprocessor. In: Proceedings of the international conference on supercomputing, Washington, DC
Terrestrial Ecosystem Carbon Modeling

Dali Wang, Daniel Ricciuto, Wilfred Post, Michael W. Berry
Oak Ridge National Laboratory, Oak Ridge, TN, USA
The University of Tennessee, Knoxville, TN, USA
Synonyms

Carbon cycle research; System integration; Terrestrial ecosystem modeling; Uncertainty quantification
Definition

A Terrestrial Ecosystem Carbon Model (TECM) is a category of process-based ecosystem models that describe carbon dynamics of plants and soils within global terrestrial ecosystems. A TECM generally uses spatially explicit information on climate/weather, elevation, soils, vegetation, and water availability as well as soil- and vegetation-specific parameters to make estimates of important carbon fluxes and carbon pool sizes in terrestrial ecosystems.
Discussion

Introduction

Terrestrial ecosystems are a primary component of research on global environmental change. Observational and modeling research on terrestrial ecosystems at the global scale, however, has lagged behind
their counterparts for oceanic and atmospheric systems, largely because of the unique challenges associated with the tremendous diversity and complexity of terrestrial ecosystems. There are eight major types of terrestrial ecosystem: tropical rain forest, savannas, deserts, temperate grassland, deciduous forest, coniferous forest, tundra, and chaparral. The carbon cycle is an important mechanism in the coupling of terrestrial ecosystems with climate through biological fluxes of CO . The influence of terrestrial ecosystems on atmospheric CO can be modeled via several means at different timescales to incorporate several important processes, such as plant dynamics, change in land use, as well as ecosystem biogeography. Over the past several decades, many terrestrial ecosystem models (see the “Model Developments” section) have been developed to understand the interactions between terrestrial carbon storage and CO concentration in the atmosphere, as well as the consequences of these interactions. Early TECMs generally adapted simple box-flow exchange models, in which photosynthetic CO uptake and respiratory CO release are simulated in an empirical manner with a small number of vegetation and soil carbon pools. Demands on kinds and amount of information required from global TECMs have grown. Recently, along with the rapid development of parallel computing, spatially explicit TECMs with detailed process-based representations of carbon dynamics become attractive, because those models can readily incorporate a variety of additional ecosystem processes (such as dispersal, establishment, growth, mortality, etc.) and environmental factors (such as landscape position, pest populations, disturbances, resource manipulations, etc.), and provide information to frame policy options for climate change impact analysis.
Key Components of TECM

. Fundamental terrestrial ecosystem carbon dynamics
Terrestrial carbon processes can be described by an exchange between four major compartments: () foliage where photosynthesis occurs; () structural material, including roots and wood; () surface detritus or litter; and () soil organic matter (including peat). Nearly all life on Earth depends (directly or indirectly) on photosynthesis, in which carbon dioxide and water are used,
and oxygen is released. The majority of the carbon in the living vegetation of terrestrial ecosystems is found in woody material, which constitutes a major carbon reservoir in the carbon cycle. Detritus refers to leaf litter and other organic matter on or below the soil surface. Dead woody material, often called coarse woody debris, is a large component of the surface detritus in forest ecosystems. Detritus is typically colonized by communities of microorganisms which act to decompose (or remineralize) the material. Transformation and translocation of detritus is the source of soil organic matter, another major component in the global carbon cycle. Globally three times as much carbon is stored in soils as in the atmosphere with peatlands contributing a third of this. Thus even relatively small changes in soil C stocks might contribute significantly to atmospheric CO concentrations and thus global climate change. The soil carbon pool is vulnerable to impacts of human activity especially agriculture. A simplified scheme of carbon cycle dynamics is shown in Fig. . . Terrestrial carbon observations and experiments Early research generally focused on determining characteristics of individual plants and small soil samples, often in a laboratory setting. This type of research continues today and provides a wealth of information that is used to develop and to parameterize TECMs. However, successful modeling of the carbon cycle also requires understanding the structure and response of entire ecosystems. Observation networks involving ecosystems have expanded greatly in the past two decades. One important development for in situ monitoring of ecosystem-level carbon exchange has been the establishment of flux towers that use the eddy covariance method. Atmospheric CO concentration measurements using satellites, tall towers, and aircraft provide information about carbon dioxide fluxes over a larger scale. Finally, remote sensing products provide important information about changes in land use and vegetation characteristics (e.g., total leaf area) that can be used to either drive or validate TECMs. While these observations are important for characterizing the carbon cycle at present, they do not provide information about how the carbon cycle may change in the future as a result of climate change. In order to address those challenging questions, several large-scale ecosystem-level experiments have been conducted to mimic possible
Terrestrial Ecosystem Carbon Modeling. Fig. Simplified schematic of the carbon dynamics (photosynthesis, autotrophic respiration, allocation, and heterotrophic respiration) within a typical TECM model
future conditions (such as rising CO concentrations, potential future precipitation patterns and the future temperature scenarios) and associated impacts and mitigation options for terrestrial ecosystems. Figure illustrates some of these carbon observation systems and experiments. . Terrestrial ecosystem carbon model developments In the early s, several process-based conceptual models were developed to study the primary productivity of the biosphere and the uptake of anthropogenic CO emissions. Early box models were improved in the s to include more spatially explicit ecological representations of terrestrial ecosystems along with a significant push to understand the relationships between climatic measurements and properties of ecosystem processes. The concept of biomes was used to categorize terrestrial ecosystems using several climatically and geographically related factors (i.e., plant structure, leaf types, and climate), instead of the traditional classification by taxonomic similarity. In the s, rapid developments of general circulation models and scientific computing, along with the increasing availability of remote sensing data (from satellites), led to the development of land-surface models. These
models used satellite images to obtain information about the spatial distribution of surface properties (such as vegetation type, phenology, and density) along with spatially explicit forcing from numerical weather prediction reanalysis or coupled general circulation models (e.g., temperature and precipitation) to improve prediction and enhance the model representation of land-atmosphere water and energy interactions within global climate models. Recently, emphasis has been focused on improving the predictive capacity of climate models at the decadal to century scale through a better characterization of carbon cycle feedbacks with climate. For example, several TECMs are incorporating nutrient cycles and shifts in vegetation distribution, in response and a potential feedback to climate change.
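A minimal box-flow compartment model of the kind described under the fundamental carbon dynamics above can be written in a few lines of C; the sketch below uses four pools with first-order transfers and a daily Euler step. All pool sizes, rate constants, and the uptake forcing are invented for illustration and are not taken from any published TECM.

/* Illustrative four-pool box-flow carbon model (foliage, wood/roots, litter,
 * soil organic matter) with first-order transfers, in the spirit of the early
 * compartment models described above.  All values are invented for illustration. */
#include <stdio.h>

enum { FOLIAGE, WOOD, LITTER, SOIL, NPOOLS };

int main(void)
{
  double C[NPOOLS] = { 0.5, 5.0, 1.0, 10.0 };   /* kg C m^-2, arbitrary       */
  const double gpp = 0.003;                     /* daily carbon uptake        */
  const double k[NPOOLS] = { 0.002, 0.0001, 0.001, 0.00002 };  /* turnover    */
  const double resp_frac = 0.5;                 /* autotrophic respiration    */

  for (int day = 0; day < 3650; day++) {        /* 10 years, daily Euler step */
    double npp = gpp * (1.0 - resp_frac);
    double fol_loss  = k[FOLIAGE] * C[FOLIAGE];
    double wood_loss = k[WOOD]    * C[WOOD];
    double lit_loss  = k[LITTER]  * C[LITTER];  /* decomposition              */
    double soil_loss = k[SOIL]    * C[SOIL];    /* heterotrophic respiration  */

    C[FOLIAGE] += 0.5 * npp - fol_loss;         /* allocation to foliage      */
    C[WOOD]    += 0.5 * npp - wood_loss;        /* allocation to wood/roots   */
    C[LITTER]  += fol_loss + wood_loss - lit_loss;
    C[SOIL]    += 0.3 * lit_loss - soil_loss;   /* part of the litter humifies */
  }
  printf("foliage %.2f  wood %.2f  litter %.2f  soil %.2f (kg C m^-2)\n",
         C[FOLIAGE], C[WOOD], C[LITTER], C[SOIL]);
  return 0;
}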
The Contributions of Parallel Computing to TECM Developments The contributions of parallel computing to the TECM can be classified into three separate categories: () model construction, () model integration, and () model behavior controls. As more processes are incorporated into TECMs to replace simple empirical relationships, computational demands have increased. Since these processes are very
Terrestrial Ecosystem Carbon Modeling. Fig. Illustrated carbon observations and experiments
sensitive to environmental heterogeneity inherent in spatial patterns of temperature, radiation, precipitation, soil characteristics, etc., increased spatial resolution improved simulation accuracy. A new class of models is becoming more widespread for global scale TECMs. This class, instead of using a traditional system of differential equations, is agent-based and requires a fine grid spatial representation to represent the competition for resources among the coexisting agents. This approach dramatically increases the computational demand, but is more compatible with experimental and observational data and population scale vegetation change processes. Parallel computing has enabled models to be constructed with these additional complexities. Over several decades of research, TECMs have dramatically changed in structure and in the amount and kind of information required and produced making model integration a challenge. In addition, these models are now being incorporated into climate models to form Earth System Models (ESMs). From a parallel computing perspective, there is huge demand from the modeling community to develop a parallel model coupling framework (Earth System Modeling Frame (ESMF) is one of these kinds of efforts) to enable further parallel model developments and validations.
Instead of rewriting a package wrapper for each component, memory-based IO staging systems may provide an alternative method for fast and seamless coupling. There are two basic methods to provide climate forcing information for a TECM, from observation data or coupling to a global climate model simulation. Currently, model simulation can provide global data, but only at low spatial resolutions. Observation datasets are available at those fine resolutions, and can be used for validation over observed time frames at those observation stations. However, further research and parallel computing will be needed for gap-filling and downscaling those observation datasets for global terrestrial ecosystem carbon modeling. One consequence of TECM complexity is the increasing demand for better methodologies that can exploit ever-increasing rich datastreams and thereby improve model behavior (e.g., the ecosystem model’s sensitivity to the model parameters, and software structure). Quantitative methods need to be established to determine model uncertainty and reduce uncertainty through model-data analysis. As computers become larger and larger in the number of CPU cores, not necessarily faster and faster at the single CPU core level, we envision that ultrascale software designs for systematic
uncertainty quantification for TECM will become one research area that will require taking full advantage of parallel computing and statistics. As our understanding of the global carbon cycle improves, high-fidelity, process-based models will continue to be developed, and the increasing complexity of these ecosystem model systems will require that parallel computing play an increasingly important role. We have explained several key components of terrestrial ecosystem carbon modeling, and have identified three categories in which parallel computing can make significant contributions to TECM developments. It is our view that parallel computing will increasingly be an integral part of modern terrestrial ecosystem modeling efforts, which require solid, strong partnerships between the high-performance computing community and the carbon cycle science community. Through such partnerships these two communities can share a common mission to advance our understanding of global change using computational sciences.
Related Entries

Analytics, Massive-Scale
Computational Sciences
Exascale Computing
Bibliographic Notes and Further Reading As mentioned in the model development section, terrestrial carbon modeling started in the early s [, ], when beta-factor concept was developed to account for CO fertilization using a nonspatial representation of terrestrial carbon dynamics. In , Lieth described a model (MIAMI, the first gridded model) [] to estimate the primary productivity of the biosphere. A carbon accounting model was developed at Marine Biological Laboratory (MBL) at Woods Hole to track carbon fluxes associated with land-use change. Along with the success of International Biological Program, spatially distributed compartment models representing different ecosystem types responding to local environmental conditions were developed. Widely used examples include the Terrestrial Ecosystem Model (www.mbl.edu/eco/) at MBL, and CENTURY (www. nrel.colostate.edu/projects/century/) at Colorado State University. As satellite measurements of basic terrestrial
properties became available, several models were developed that utilized this information directly, including the Ames-Stanford Approach (CASA) (geo.arc.nasa. gov/sge/casa/bearth.html) and Biome-BGC (www.ntsg. umt.edu/models/bgc/). In the s, land surface components of climate models incorporated an aspect of terrestrial carbon cycling, namely photosynthesis, for the purpose of providing a mechanistic model of latent heat exchange with the atmosphere. The Simple Biosphere (SiB) biophysical model [] and BiosphereAtmosphere Transfer Scheme (BATS) [] at National Center for Atmospheric Research (NCAR), and STOMATE [] at Laboratoire des Sciences du Climat et de l’Environnement (LSCE) are examples. Later at NCAR, additional components of terrestrial carbon cycle were included in the Land Surface Model (LSM) (www.cgd.ucar.edu/tss/lsm/). The Community Land Model (CLM-CN) (www.cgd.ucar.edu/tss/clm/) is the successor of LSM and is being further developed as a community-based model. Agent-based models at the global scale, a class of what are called Dynamic Global Vegetation Models (DGVM), have been developed independently because of their data and computation demands. Developments in parallel computer systems are making incorporation of such dynamics plausible for earth system models. HYBRID [] was an early experimental model, and now prototypes exist for the NCAR CCSM land surface CLM-CN which is based on the Lund-Postdam-Jena (LPJ) model [] and called CLM-CN-DV. Evolved from the Ecosystem Demography (ED) model [, ], ENT [] is another Dynamic Global Terrestrial Ecosystem Model (DGTEM), being coupled with NASA’s GEOS- General Circulation Models. Observation networks involving ecosystems have expanded greatly in the past two decades. AmeriFlux (public.ornl.gov/ameriflux/) is an effort to use flux towers to monitor ecosystem-level carbon exchange with atmosphere. Since , more than such flux towers have been established on six continents representing every major biome. First established in Mauna Loa in , the CO measurements have become a global operation. Inversion techniques are used to infer the pattern of CO fluxes required to produce the observed CO concentrations; one such product using this technique is CarbonTracker (www.esrl.noaa.gov/ gmd/ccgg/carbontracker/), which provides weekly flux
estimates that can be compared against output from process-based TECMs. Currently, a variety of remote sensing products (such as Moderate Resolution Imaging Spectroradiometer (MODIS)) are available to either drive or validate TECMs. Several experiments have been conducted or initiated to understand potential climate change impacts. The Free-Air CO Enrichment (FACE) (public.ornl. gov/face/global_face.shtml) experiment has been running for over a decade at several sites in different biomes to study the potential effects of higher CO concentration. The Throughfall Displacement Experiment (TDE) (tde.ornl.gov/) used elaborate systems to alter the amount of precipitation that is available to an ecosystem. A new experiment has been initiated at Oak Ridge National Laboratory to assess the responses of northern peatland ecosystems to increased temperature and exposures to elevated atmospheric CO concentrations (mnspruce.ornl.gov). More information on terrestrial ecosystem carbon modeling can be found in books devoted to this subject [, ].
. Friend AD, Stevens AK, Knox RG, Cannell MGR () A process-based, terrestrial biosphere model of ecosystem dynamics (Hybrid v.). Ecol Model :– . Prentice IC, Heimann M, Sitch S () The carbon balance of the terrestrial biosphere: ecosystem models and atmospheric observations. Ecol Appl :– . Moorcroft P, Hurtt GC, Pacala SW () A method for scaling vegetation dynamics: the ecosystem demography model (ED). Ecol Monogr ():– . Govindarajan S, Dietze MC, Agarwal PK, Clark JS () A scalable simulator for forest dynamics. In: Proceedings of the twentieth annual symposium on computational geometry SCG , Brooklyn, NY, pp –, doi:./. . Yang W, Ni-Meister W, Kiang NY, Moorcroft P, Strahler AH, Oliphant A () A clumped-foliage canopy radiative transfer model for a Global Dynamic Terrestrial Ecosystem Model II: Comparison to measurements. Agricultural and Forest Meteorology, ():–, doi:./j.agrformet... . Trabalka JR, Reichle DE (ed) () The changing carbon cycle: a global analysis. Springer-Verlag, Berlin . Field CB, Raupach MR (ed) () The global carbon cycle: integrating human, climate, and the natural world. Island, Washington, DC
Terrestrial Ecosystem Modeling Bibliography . Bacastow RB, Keeling CD () Atmospheric carbon dioxide and radiocarbon in the natural carbon cycle: II. Changes from AD to as deduced from a geochemical model. In: Woodwell GM, Pecan EV (eds) Carbon and the biosphere. CONF-. National Technical Information Service, Springfield, Virginia, pp – . Emanuel WR, Killough GG, Post WM, Shugart HH () Modeling terrestrial ecosystems in the global carbon cycle with shifts in carbon storage capacity by land-use change. Ecology (): – . Lieth H () Modeling the primary productivity of the world. In: Lieth H, Wittaker RH (eds) Primary productivity of the biosphere, ecological studies, vol . Springer-Verlag, New York, pp – . Sellers JP, Randell DA, Collatz GJ, Berry JA, Field CB, Dazlich DA, Zhang C, Collelo GD, Bounua L () A revised land surface parametrization (SiB ) for atmospheric GCMs. Part I: model formulation. J Climate :– . Dickinson R, Henderson-sellers A, Kennedy P () Biosphereatmosphere transfer scheme (BATS) version as coupled to the NCAR community climate model. Technical report, National Center for Atmospheric Research . Ducoudré N, Laval K, Perrier A () SECHIBA, a new set of parametrizations of the hydrologic exchanges at the land/ atmosphere interface within the LMD atmospheric general circulation model. J Climate ():–
Terrestrial Ecosystem Carbon Modeling
The High Performance Substrate

HPS Microarchitecture
Theory of Mazurkiewicz-Traces

Trace Theory
Thick Ethernet

Ethernet
Thin Ethernet

Ethernet
Thread Level Speculation (TLS) Parallelization

Speculation, Thread-Level
Thread-Level Data Speculation (TLDS)

Speculative Parallelization of Loops
Speculation, Thread-Level
Thread-Level Speculation

Speculative Parallelization of Loops
Speculation, Thread-Level
Threads

Processes, Tasks, and Threads
Tiling

François Irigoin
MINES ParisTech/CRI, Fontainebleau, France
Synonyms

Blocking; Hyperplane partitioning; Loop blocking; Loop tiling; Supernode partitioning
Definition

Tiling is a program transformation used to improve the spatial and/or temporal memory locality of a loop nest by changing its iteration order, and/or to reduce its synchronization or communication overhead by controlling the granularity of its parallel execution. Tiling adds some control overhead because the number of loops is doubled, and reduces the amount of parallelism available in the outermost loops. The n initial loops are replaced by n outer loops used to enumerate the tiles and n inner loops used to execute all the iterations within a tile.

Discussion

Introduction
Discussion
Introduction
Tiling is useful for most recent parallel computer architectures, with shared or distributed memory, since they all rely on locality to exploit their memory hierarchies and on parallelism to exploit several cores. It is also useful for heterogeneous architectures with hardware accelerators, and for monoprocessors with caches. Unlike many loop transformations, tiling is not a unimodular transformation. Iterations that are geometrically close in the loop nest iteration set are grouped in so-called tiles to be executed together atomically. Tiles are also called blocks when their edges are parallel to the axes or, more generally, when their facets are orthogonal to the base vectors. For instance, a parallel stencil written in C along these lines (the loop bounds and statement are only indicative):

for (int i1 = 1; i1 < n1; i1++)
  for (int i2 = 1; i2 < n2; i2++)
    out[i1][i2] = 0.25 * (a[i1-1][i2] + a[i1+1][i2]
                        + a[i1][i2-1] + a[i1][i2+1]);

can be transformed into:

#pragma omp parallel for
for (int t1 = 1; t1 < n1; t1 += b1)
  for (int t2 = 1; t2 < n2; t2 += b2)
    for (int i1 = t1; i1 < t1 + b1 && i1 < n1; i1++)
      for (int i2 = t2; i2 < t2 + b2 && i2 < n2; i2++)
        out[i1][i2] = 0.25 * (a[i1-1][i2] + a[i1+1][i2]
                            + a[i1][i2-1] + a[i1][i2+1]);
where t1 and t2 give the tile origins (the tile coordinates) and b1 and b2 are the tile or block sizes. Initially, this tiling transformation was called loop blocking by Allen & Kennedy and tiling by Wolfe [, ] before it was extended to slanted tiles under the name of supernode partitioning by Irigoin and Triolet [].
Tiling. Fig. Iteration space with dependence vectors

Tiling. Fig. Tiled iteration set
Wolfe suggested systematically using the name tiling, as it is short and easy to understand. He uses it in his textbook about program transformations []. But loop blocking is still used because it is a proper subset of tiling: see, for instance, Allen and Kennedy []. Slanted tiles can be used with these sequential Fortran loops taken from Xue []:

do i1 = 1, 9
  do i2 = 1, 5
    a(i1,i2) = a(i1,i2-2) + a(i1-3,i2+1)
  enddo
enddo
whose 2-D iteration set and dependence vectors are shown in Fig. (All figures are taken or derived from Figure ., page , of Xue [] by courtesy of the publisher. Some were adapted or derived to fit the notations used in this entry.). These two loops can be transformed into:

do t1 = 0, 2
  do t2 = max(t1-1,0), (t1+4)/2
    do i1 = 3*t1+1, 3*t1+3
      do i2 = max(-t1+2*t2+1,1), min((-t1+6*t2+9)/3,5)
        a(i1,i2) = a(i1,i2-2) + a(i1-3,i2+1)
      enddo
    enddo
  enddo
enddo
using slanted tiles with vectors (3, −1) and (0, 2) shown in Fig. . The sets of integer points in each tile are not slanted in this case, but they are not horizontally aligned, as shown by the grey areas. The tile iteration set is shown in Fig. . Note that Fortran allows negative array indices, which makes references to a(1,-1) and a(-2,2) possibly legal.
Tiling. Fig. Iteration set for tiles, with tile dependence vectors
Mathematically speaking, this grouping/blocking is a partition of the loop iteration space that induces a renumbering and a reordering of the iterations. This reordering should not modify the program semantics. Hence, several issues are linked to tiling, as with any other program transformation: Why should tiling be considered? What are the legal tilings for a given loop nest? How is an optimal tiling chosen? How is the transformed code generated?
Motivations for Tiling
Tiling has several positive impacts. Depending on the target architecture, it reduces the synchronization overhead, the communication overhead, the cache coherency traffic, the number of external memory accesses, and the amount of memory required to execute a loop nest in an accelerator or a scratch pad memory, or the number of cache misses. Thus, the execution time and/or the energy used to execute the loop
nest are reduced, or the execution with a small memory is made possible. Tiling also has two possibly negative impacts. The control overhead is increased, if only because the number of loops is doubled, and the degree of parallelism at the tile level is smaller because the initial parallelism is partly transferred within each tile and traded for locality and for lower communication and synchronization overheads. The control overhead depends on the code generation phase, especially when partial tiles are needed to cover the boundaries of the iteration set. Tile selection depends on the target architecture. For shared memory multiprocessors, including multicores, the primary bottleneck is the memory bandwidth, and tiling is used to improve the cache hit ratio by reducing the memory footprint, that is, the set of live variables that should be kept in cache, and to reduce the cache coherency traffic. Some array references in the initial code must exhibit some spatial and/or temporal locality for tiling to be beneficial. Tiling can be applied again, recursively or hierarchically, to increase locality at the different cache levels (L1, L2, L3, …) and even at the register level by using very small tiles compatible with the number of registers. These register tiles are fully unrolled to exploit the registers. The tiling of tiles is also known as multilevel tiling. Tiling can also be used to increase locality at the virtual memory page level, as shown by McKeller and Coffman in []. For vector multiprocessors, the size of the tiles must be large enough to use the vector units efficiently, but small enough for their memory footprint to fit in the local cache, which is one of many trade-offs encountered in tile selection. For distributed memory multiprocessors, tiling is used to generate distributed code automatically. The tiles are mapped on the processors and the processors communicate data on or close to the tile boundaries. Let p be the edge length of an n-dimensional tile. The idea is to compute O(p^n) values and exchange only O(p^(n−1)) values, so as to overlap the computation with the communication, even though an individual computation is faster than an individual communication. The amount of memory on each processor is supposed to be large enough not to be a constraint, but the value of p is adjusted to trade parallelism against communication and synchronization overheads. As mentioned above, these large tiles can be tiled again if
locality or parallelism is an issue at the elementary processor level. Heterogeneous processors using hardware accelerators, FPGA- or GPGPU-based, require the same kind of trade-offs. Either large tiles must be executed on the accelerator to benefit from its parallel architecture, or only small tiles are possible because of the limited local memory, but in both cases communications between the host and the accelerator must take place asynchronously during the computation. Tiling is used to meet the local memory or vector register size constraints, and to generate opportunities for asynchronous transfers overlapped with the computation. Multiprocessor Systems-on-Chip (MPSoC) designed for embedded processing may combine some local internal memories, a.k.a. scratchpad memories, and a global external memory, which makes them distributed systems with a global memory. Tiling is used to meet the local memory constraints, but communications between the processors or between the processors and the external global memory must be generated. Other transformations, such as loop fusion and array contraction (see Parallelization, Automatic), are used in combination with tiling to reduce the communication and the execution time. A combination of loop fusion and loop distribution may give a better result than the tiling of fused loops, at least when the scratchpad memory is very small. Tiling can also be applied to multidimensional arrays instead of loop nests. This approach is used by High-Performance Fortran (see High Performance Fortran (HPF)). The code generation is derived from the initial code and from mapping constraints, based, for instance, on the owner-computes rule: the computation must be located where the result is stored. This idea may also be applied to speed up array I/Os and out-of-core computations. Finally, tiling can be applied to more general spaces and sets. For instance, Griebl [] uses the Polyhedron Model to map larger pieces of code to a unique space of large dimension. Sequences of loops can be mapped onto such a space and be tiled globally. Because of the many machine architectures that can benefit from tiling, numerous papers have been published on the subject. It is important to check what kind of architecture is targeted before reading or comparing them.
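As a back-of-the-envelope illustration of the surface-to-volume argument above, the following minimal sketch compares the O(p^n) compute volume of a tile with its O(p^(n−1)) communication surface. All constants (t_comp, t_comm, the factor 2n for the number of faces) are made-up, illustrative assumptions, not measurements.

#include <stdio.h>

/* Toy surface-to-volume model for an n-dimensional tile of edge length p:
   compute time ~ t_comp * p^n, communication time ~ t_comm * 2*n * p^(n-1).
   All constants are illustrative assumptions. */
int main(void) {
    const int n = 3;              /* assumed tile dimensionality */
    const double t_comp = 1.0;    /* assumed cost per computed point */
    const double t_comm = 20.0;   /* assumed cost per exchanged point */
    for (int p = 8; p <= 128; p *= 2) {
        double volume = 1.0, surface = 1.0;
        for (int k = 0; k < n; k++) volume *= p;
        for (int k = 0; k < n - 1; k++) surface *= p;
        double compute = t_comp * volume;
        double comm = t_comm * 2 * n * surface;
        printf("p=%4d  compute=%10.0f  comm=%10.0f  %s\n", p, compute, comm,
               compute >= comm ? "communication can be hidden"
                               : "communication dominates");
    }
    return 0;
}

With these made-up constants, computation outweighs communication only for p >= 2*n*t_comm/t_comp = 120, which is exactly the kind of lower bound on the tile edge length discussed above.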
Tiling. Fig. First hyperplane partitioning with h = ( , ) and o = ( , )

Tiling. Fig. Same hyperplane partitioning with a smaller h = ( , )
Legality of Tiling
The general definition and legality of tiling were introduced by Irigoin and Triolet, who gave sufficient legality conditions in []. Necessary conditions were added later by Xue []. The basic idea is to use parallel hyperplanes defined by a normal vector h to slice the iteration space Z^n, where n is the number of nested loops, to obtain a partition. The partition of a set E is a set P of nonempty subsets of E, whose two-by-two intersection is always empty and whose union is equal to E. Each slice is, mathematically speaking, a part (there is no agreement about the naming of the elements of P: part, block, or cell are used; we chose to use part), and two iteration vectors j1 and j2 belong to the same part if ⌊h·(j1 − o)⌋ = ⌊h·(j2 − o)⌋, where · denotes the scalar product, ⌊ ⌋ the floor function, and o is an offset vector. For instance, using the normal vector h = ( , ) and the offset o = ( , ), the iteration set of Fig. is partitioned in three parts shown in Fig. . Because of this definition, the slices, or parts, become larger when the norm of h decreases (see Fig. ). To make sure that the parts can be executed one after the other, dependence cycles between two parts must be avoided. For instance, the diagonal partition in Fig. creates a cycle between the two subsets. Iteration ( , ) must be executed before iteration ( , ), so the left subset must be executed before the right subset, but iteration ( , ) must be executed before iteration ( , ), which is incompatible. Cycles between subsets are avoided if each dependence vector d in the loop nest meets the condition h·d ≥ 0. This condition depends neither on ∥h∥, the norm of h, nor on the iteration set, which makes all such legal tilings scalable to meet the different needs enumerated above.
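The floor-based definition above is easy to turn into code. The following minimal sketch computes the part (slice) index ⌊h·(j − o)⌋ of each iteration of a small 2-D set; the normal vector h and the offset o used here are illustrative, assumed values, not the ones used in the figures of this entry.

#include <math.h>
#include <stdio.h>

/* Part index of iteration j for one hyperplane partitioning: floor(h.(j - o)).
   h and o below are illustrative, assumed values. */
static int part_index(const double h[2], const double o[2], const int j[2]) {
    return (int)floor(h[0] * (j[0] - o[0]) + h[1] * (j[1] - o[1]));
}

int main(void) {
    const double h[2] = {0.5, 0.25};   /* assumed normal vector */
    const double o[2] = {1.0, 1.0};    /* assumed offset */
    for (int i2 = 3; i2 >= 1; i2--) {          /* print rows top-down */
        for (int i1 = 1; i1 <= 6; i1++) {
            int j[2] = {i1, i2};
            printf("%2d ", part_index(h, o, j));
        }
        printf("\n");
    }
    return 0;
}

Iterations printed with the same number belong to the same part; scaling h down (halving both components, say) makes the parts larger, as noted above.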
Tiling. Fig. Dependence cycle between two parts

Tiling. Fig. Second hyperplane partitioning with h = ( , ) and o = ( , )
Several hyperplane partitionings h1, h2, … can be combined to increase the number of parts and reduce their sizes (see Figs. and ). The vectors h1, h2, … are usually grouped together in a matrix, H (Xue [] uses H to denote the transpose of the matrix used here, that is, the h vectors are treated as affine forms).
Tiling. Fig. Combined 2-D hyperplane partitioning

Tiling. Fig. Hyperplane matrix H = (h1 h2) = [1/3 1/6; 0 1/2]

Tiling. Fig. Partitioning or clustering matrix P = (p1 p2) = [3 0; −1 2]
When the number of different hyperplane families hi used is equal to the number of nested loops n, and when the hi are linearly independent vectors, the part sizes are bounded, regardless of the iteration set, and all parts of the iteration space are equal up to a translation if H^(−1) is an integer matrix. However, the parts of the iteration set L may differ because of the loop boundaries. See, for instance, tile ( , ), the upper right tile in Fig. , which contains only three iterations. The transpose of H^(−1), the partitioning matrix P, contains the edges of the tile, and its determinant is the number of iterations within each tile. So P (Fig. ) is easier to visualize than H (Fig. ). Note that det(P) = 6, which is the maximum number of iterations within one tile, as can be seen in Fig. . A tiling H is defined by the number of hyperplane sets used, by the directions of the normal vectors hi and by their relative norms, that is, the tile shape, and by the number of iterations within a tile, that is, the tile size. The tile shape is defined when det(H) = 1. The tile size is controlled by a scaling coefficient. The tiling origin is another parameter, impacting mostly the code generation, but also the execution time. Often, a
tiling selection is decomposed into the selection of a shape, the selection of a size, and finally the choice of an offset. The legality conditions are summed up by H^T R ≥ 0, where R is a matrix made of the rays of a convex cone containing all possible dependence vectors d, because the condition h·d ≥ 0 is convex. The computation of R by a dependence test is explained by Irigoin and Triolet [], and the dependence cone is one of many approximations of the dependence vector set. When dependences are uniform (see Dependences), R can be built directly with the dependence vectors. The valid hyperplanes h belong to another cone, the dual of R. A necessary and sufficient condition, ⌊H^T d⌋ ⪰ 0, was introduced by Xue [], where ⪰ is the lexicographic order, but it does not bring any practical improvement over the previous sufficient condition. Xue also provided an exact legality test based on integer programming and using information about the iteration set L(j), that is, the loop bounds of the initial loop nest. Note also that the subset of tilings such that det(H) = 1 is the set of unimodular transformations, and that H^T d ⪰ 0 is their legality condition (see Loop Nest Parallelization).
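To make the sufficient condition concrete, the sketch below checks h·d ≥ 0 for the running example. The two uniform dependence vectors d1 = (0, 2) and d2 = (3, −1) come from the Fortran statement above; the normals are taken as the columns of H, recovered here from P = [3 0; −1 2] via H = (P^T)^(−1) and scaled by 6 so the test stays in integer arithmetic. This is only an illustrative check of this particular example, not a general dependence test.

#include <stdio.h>

/* Sufficient legality test for the example: every hyperplane normal h must
   satisfy h . d >= 0 for every dependence vector d.  6*h1 = (2, 0) and
   6*h2 = (1, 3) are the scaled columns of H = (P^T)^{-1}. */
int main(void) {
    int h6[2][2] = {{2, 0}, {1, 3}};   /* 6*h1, 6*h2 */
    int d[2][2]  = {{0, 2}, {3, -1}};  /* uniform dependence vectors d1, d2 */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            int dot6 = h6[i][0] * d[j][0] + h6[i][1] * d[j][1]; /* 6*(h_i.d_j) */
            printf("h%d . d%d = %d/6 : %s\n", i + 1, j + 1, dot6,
                   dot6 >= 0 ? "no cycle between parts" : "illegal");
        }
    return 0;
}

All four dot products are nonnegative (two are exactly zero, corresponding to dependences parallel to a tile facet), so the slanted tiling of the example is legal.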
Tile Selection and Optimal Tiling
Tile selection requires some choice criterion, and optimal tile selection some cost function. The cost function is obviously dependent on the target machine, which makes many optimal tilings possible since many target architectures can benefit from tiling. Also, if the cost function is the execution time, the tiling per se is only part of the compilation scheme. The execution time of one tile depends on the scheduling of the local iterations, for instance because each processor has some
vector capability. It also depends on the tiles previously executed on the same processor, whether a cache or a local memory is used, that is, it depends on the mapping of tiles on processors. And finally, the total execution time of the tiled nest depends on the schedule and on the mapping of the tiles on the logical or physical resources, threads, or cores. In other words, any optimal tiling is optimal with respect to a cost function modeling the execution time or the energy for a given target. Models used to derive analytical optimal solutions often assume that the execution and the communication times are respectively proportional to the computation and communication volumes, which is not realistic, especially with multicores, superword parallelism, and several levels of cache memories. Models may also assume that the number of processors available (because of multicore and multithreaded architectures, the definition of a processor, virtual or physical, is not clear-cut; here, the number of processors is not the number of chips, but rather the total number of physical threads in a multicore or the total number of user processes running simultaneously in the machine) is greater than the number of tiles that can be executed simultaneously, which simplifies the mapping of tiles on processors. When the target architecture has some kind of implicit vector capability, for instance when the cache lines are loaded, a partial tiling with rank(H) < n, that is, a set of hyperplane partitionings, may be more effective than a full tiling. Tiles are then no longer bounded by the hyperplanes, but they remain bounded by the initial loop nest iteration set. Because the execution time of a real machine becomes more and more complex with the number of transistors used, iterative compilation is used when performance is key. The code is compiled and executed with different tile sizes and the best tile size is retained. Symbolic tiling is useful to speed up the process. Some decisions can be made at run time. For instance, Rastello et al. [] use run-time scheduling to speed up the execution. Note that they overlap the computations and the communications related to one tile, which somehow breaks the tile atomicity constraint.
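As a toy instance of such a cost model, the sketch below picks the largest square tile whose working set fits in a cache. The model (three double-precision arrays of b × b elements, a 256 KB cache) is an assumption made purely for illustration; realistic selection must also account for associativity, interference misses, vector width, and the available parallelism, as discussed above.

#include <stdio.h>

/* Toy tile-size selection: largest b such that the assumed working set
   (3 arrays of b*b doubles) fits in an assumed 256 KB cache. */
int main(void) {
    const long cache_bytes = 256 * 1024;    /* assumed cache capacity */
    const long arrays = 3, elem_bytes = 8;  /* assumed working set shape */
    long b = 1;
    while ((b + 1) * (b + 1) * arrays * elem_bytes <= cache_bytes)
        b++;
    printf("largest square tile under this model: %ld x %ld\n", b, b);
    return 0;
}

Iterative compilation, mentioned above, would instead time the generated code for several candidate values of b and keep the best one.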
Tiled Code Generation
As for tiling optimality, tile code generation depends on the target machine. A parallel machine requires the
mapping and the parallel execution of the tiles. A distributed memory machine also requires communication generation. A processor with a memory hierarchy and/or a vector capability requires loop optimization at the tile level. The minimal requirement is that all iterations of the initial loop nest are performed by the tiled nest. Let vector j be an iteration of loop nest L, t a tile coordinate, and l the local coordinate of an iteration within a tile t. Since no redundant computations are added by hyperplane partitioning, the relationship between j and (t, l) is a one-to-one mapping from the initial set of iterations L(j) to the new iteration set T(t, l). Ancourt and Irigoin [] show that an affine relationship can be built between j and (t, l) and that the new loop bounds for t and l can be derived from this relationship and from L when the matrix H is numerically known and when L is a parametric polyhedron, that is, when the loop bounds are affine functions of loop indices and parameters. To simplify array subscript expressions, the code may be generated using t and j instead of t and l. This optimization is used in the code examples given above. This loop nest generation is sufficient for shared memory machines, although it is better to generate several versions of the tile code, one for the full tiles and several ones for the partial tiles on the iteration set boundaries, in order to reduce the average control overhead. Multilevel tiling is also used to reduce the overhead due to partial tiles on the boundaries (see the top left and top right tiles in Fig. ). This does not specify the mapping of tiles onto threads or cores when the tile parallelism is greater than the number of processors. But locality-aware scheduling lets tiles inherit data from other tiles previously executed on the same thread, as suggested by Xue and Huang in [], who minimize the number of partitioning hyperplanes, that is, the rank of H. Parametric tiling does not require H to be numerically known at compile time. The tile size, if not the tile shape, can be adjusted at run time or optimized dynamically. A technique is proposed by Renganarayana et al. [] for multilevel tiling, and another one by Hartono et al. [, ]. Note that parallelism within tiles or across tiles is obtained by wavefronting, a unimodular loop transformation (see Loop Nest Parallelization), unless the initial loop nest is fully parallel, in which case the cone R is empty or reduced to {0}.
For instance, the tiles ( , ) and ( , ) in Fig. can be computed in parallel.
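For the rectangular (loop blocking) case, the one-to-one mapping between an original iteration j and its tile/local coordinates (t, l) reduces to an integer division and a remainder. The sketch below shows both directions of the mapping in one dimension; the block size b and origin j0 are illustrative, assumed values.

#include <stdio.h>

/* Rectangular case of the j <-> (t, l) mapping, shown in 1-D:
   t = (j - j0) / b, l = (j - j0) % b, and back j = j0 + b*t + l.
   b and j0 are assumed, illustrative values. */
int main(void) {
    const int b = 4, j0 = 1;
    for (int j = 1; j <= 10; j++) {
        int t = (j - j0) / b;        /* tile coordinate */
        int l = (j - j0) % b;        /* local coordinate within the tile */
        int back = j0 + b * t + l;   /* reconstruct j: the mapping is one-to-one */
        printf("j=%2d -> (t=%d, l=%d) -> j=%2d\n", j, t, l, back);
    }
    return 0;
}

For slanted tiles, the same idea holds, but the division is replaced by the affine relationship between j and (t, l) mentioned above.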
Applicability
Tiling and hyperplane partitioning are defined for perfectly nested loops only, but many algorithms, including matrix multiply, are made of non-perfectly nested loops. It is possible to move all non-perfectly nested statements into the loop nest body by adding guards (a.k.a. statement sinking), but these guards must then be carefully moved or removed when the tile code is generated. The issue is tackled directly by Ahmed et al. [] and Griebl [, ], who avoid statement sinking by mapping all statements into another space (see Polyhedron Model) and by applying transformations, including tiling, on this space before code generation, and by Hartono et al. [], who use a polyhedral representation of the code to generate multilevel tilings of imperfectly nested loops. See the Polyhedron Model entry for more information about the mapping of a piece of code onto a polyhedral space. Tiling can also be applied to loop nests containing commutative and associative reductions, but this does not fit the general legality condition H^T R ≥ 0. Tiling can be applied by the programmer. For instance, -D tiling improved with array padding has been used to optimize -D PDE solvers and -D stencil codes. Tiling has been used to optimize some instances of dynamic programming, the resolution of the heat equation, and even some sparse computations. Because tiling is difficult to apply by hand, source-to-source tilers have been developed. Finally, Guo et al. suggest in [] supporting tiling at the programming language level, using hierarchically tiled arrays (HTA) to keep the code readable, while letting the programmer be in control.
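The sketch below illustrates statement sinking on a made-up imperfect nest (an initialization followed by an accumulation): the outer-loop statement is moved into the inner loop under a guard, producing a perfect nest to which tiling could then be applied.

#include <stdio.h>
#define N 4
#define M 3

int main(void) {
    double a[N][M], b[N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = i + j;

    /* Original, imperfect nest:
         for (i) { b[i] = 0.0; for (j) b[i] += a[i][j]; }
       After statement sinking, the initialization is guarded inside: */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) {
            if (j == 0) b[i] = 0.0;   /* guard inserted by statement sinking */
            b[i] += a[i][j];
        }

    printf("b[0] = %g\n", b[0]);
    return 0;
}

As noted above, such guards must be removed or specialized again when the tiled code is generated, which is why polyhedral approaches avoid introducing them in the first place.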
Related Loop Transformations
The partitioning matrix P = (H^T)^(−1) can be built step-by-step by a combination of loop skewing, or more generally any unimodular loop transformations, strip-minings (1-D tilings), and loop interchanges. Loop skewing is a unimodular transformation used to change the iteration coordinates and to make loop blocking legal, because the new loop nest obtained is fully permutable. In other words, the P matrix is replaced by the product of a diagonal matrix Λ, which defines a rectangular tiling, a.k.a. loop blocking, and of a suitably normalized version of P, and the tiling is replaced by a sequence of easier transformations. This is advocated by Allen & Kennedy in their textbook []. Reducing tiling to blocking via basis changes, for example, using a Smith normal form of H, is also often used to optimize the tile shape and size, but some of the tile shape problem remains and the constraints of the iteration set L usually become more complex. Strip-mining is a 1-D tiling, a degenerate case of hyperplane partitioning. It is used to adapt the parallelism available to the hardware resources, for instance vector registers. Loop interchange is a unimodular transformation. Like all unimodular transformations, it is an extreme case of tiling with no tiling effect because det(H) = 1, that is, each tile contains only one element. It is often used to increase locality. Loop unroll-and-jam first unrolls an outer loop by some factor k, that is, it is a strip-mining followed by a full unroll of the new loop. Then, the replicated innermost loops are fused (jammed). This is equivalent to a rectangular hyperplane partitioning with blocking factors (k, 1, 1, …), followed by an unrolling of the tile loop. Unroll-and-jam can be applied to several outer loops with several factors, which again is equivalent to a rectangular tiling with the same factors followed by an unrolling of the tile loops. Unroll-and-jam is used to increase locality and is effective, like tiling, if some references exhibit temporal locality along outer loops. Tiling is designed to forbid redundant computations. However, overlap between tiles can reduce communications at the expense of additional computation. Data overlaps are also used to compile HPF (High-Performance Fortran) using the owner-computes rule. Finally, tiling is also related to the partitioning of systolic arrays, used to fit a large parametric-size iteration set on a fixed-size chip.
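Strip-mining, the 1-D degenerate case just mentioned, can be written directly; in the fragment below (array, bounds, and strip size are illustrative), one loop is split into a strip loop and an element loop without changing the execution order.

#include <stdio.h>
#define N 100
#define STRIP 16

int main(void) {
    double a[N];
    /* Original loop:  for (int i = 0; i < N; i++) a[i] = 2.0 * i;       */
    /* Strip-mined version, i.e., a 1-D rectangular tiling of size STRIP: */
    for (int is = 0; is < N; is += STRIP)                /* strip (tile) loop */
        for (int i = is; i < is + STRIP && i < N; i++)   /* element loop */
            a[i] = 2.0 * i;
    printf("a[N-1] = %g\n", a[N - 1]);
    return 0;
}

Unroll-and-jam then corresponds to strip-mining an outer loop by a factor k and fully unrolling the resulting strip loop after jamming, as described above.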
Future Directions
Although tiling is a powerful transformation by itself, and quite complex to use, it does not include some other key transformations such as loop fusion or loop peeling. Furthermore, the loop body is handled as a unique statement, although it may contain sequences, tests, and loops.
So more complex code transformations were advocated by Wolf and Lam [] to optimize parallelism and locality. More recently, Griebl [] and Bondhugula et al. [] use the polyhedral framework (see Polyhedron Model) to handle each elementary statement individually, at least within static control pieces of code. This is also attempted within gcc with the Graphite plug-in. Otherwise, it is possible to move away from the complexity of tiling by replacing it with sequences of simpler transformations, including hyperplane partitioning. The difficulty is then to decide which sequence leads to an optimal, or at least to a well-performing, code. In case the execution time of each iteration is different or even very different, the tile equality constraint could be lifted to obtain a nonuniform partitioning. This has already been done in the 1-D case. In such cases, strip-mining is replaced by more complex partitions to map the parallel iterations onto the processors. The larger partitions are executed first to reduce the imbalance between processors at the end without increasing the control overhead at the beginning (see Nested Loops Scheduling).
Related Entries
Code Generation
Dependences
Dependence Abstractions
Dependence Analysis
Distributed-Memory Multiprocessor
HPF (High Performance Fortran)
Locality of Reference and Parallel Processing
Loop Nest Parallelization
Parallelization, Automatic
Polyhedron Model
Shared-Memory Multiprocessors
Systolic Arrays
Unimodular Transformations
Bibliographic Notes and Further Reading
The best reference about tiling is the book written by Xue []. It provides the necessary background on linear algebra and program transformations. It starts with rectangular tilings before moving to slanted, that is, parallelepiped, tilings. Code generation for distributed
memory machines and tiling optimizations are finally addressed. Tile size optimization is addressed explicitly by Coleman and McKinley [] to eliminate cache capacity, self-interference, and cross-interference misses, that is, for locality improvement. For shared memory machines, Högstedt et al. [] introduce the concepts of idle time and rise for a tiling, and optimal tile shape for a multiprocessor with a memory hierarchy. They propose an algorithm to select a non-rectangular optimal tile shape for a shared memory multiprocessor, with enough processors to use all parallelism available. They assume that the tile execution time is proportional to its volume. Rastello and Robert [] provide a closed form for the tile shape that minimizes the number of cache misses during the execution of a rectangular tile for a given cache size and for parallel loops, that is, without tiling legality constraints. They also provide a heuristic to optimize the same function for any shape of tiles. For distributed memory machines, the contributions to the quest for analytic solutions are numerous. Boulet et al. [] carefully include a question mark in their paper title, (Pen)-ultimate tiling?. Hodzic and Shang give several closed forms for the size and the relative side lengths []. Xue [] uses communication/computation overlap and provides a closed form for the optimal tile size. Tile shapes are restrained to orthogonal shapes by Andonov et al. [] in order to find an optimal solution. For -D iteration spaces, using the BSP model and an unbounded number of processors, Andonov et al. [] provide closed forms for the optimal tiling parameters and the optimal number of processors. Agarwal et al. [, ] introduce a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches and for distributed shared-memory multiprocessors. More recently, Bondhugula et al. [] tile sequences of imperfectly nested loops for locality and parallelism. They use an analytical model and integer linear optimization. Carter et al. introduce hierarchical tiling for superscalar machines [], but the tuning is handmade. Renganarayana and Rajopadhye [] determine optimal tile sizes with a BSP-like model. Strzodka et al. [] use multilevel tiling to speed up stencil computations
by optimizing simultaneously locality, parallelism, and vectorization. For distributed memory machines, communication code must be generated too. See Ancourt [], Tang [], Xue [], Chapters and , and finally Goumas et al. [], who generate MPI code automatically. Goumas et al. [] propose a tile code generation algorithm for parallelepiped tiles. This can be used for general tiles thanks to changes of basis.
Bibliography . Agarwal A, Kranz D, Natarajan V () Automatic partitioning of parallel loops for cache-coherent multiprocessors. In: International conference on parallel processing (ICPP), Syracuse University, Syracuse, NY, – August , vol , pp – . Agarwal A, Kranz DA, Natarajan V (September ) Automatic partitioning of parallel loops and data arrays for distributed shared-memory multiprocessors. IEEE Trans Parallel Distrib Syst ():– . Ahmed N, Mateev N, Pingali K () Synthesizing transformations for locality enhancement of imperfectly-nested loop nests. In: Proceedings of the th international conference on supercomputing, Santa Fe, – May , pp – . Allen R, Kennedy K () Optimizing compilers for modern architectures: a dependence-based approach. Morgan-Kaufmann. San Francisco, pp – . Ancourt C, Irigoin F () Scanning polyhedra with DO loops. In: Third ACM symposium on principles and practice of parallel programming, Williamsburg, VA, pp – . Andonov R, Balev S, Rajopadhye S, Yanev N (July ) Optimal semi-oblique tiling. In: Proceedings of the th annual ACM symposium on parallel algorithms and architectures, Crete Island, pp – . Andonov R, Rajopadhye SV, Yanev N () Optimal orthogonal tiling. In: Proceedings of the fourth international Euro-Par conference on parallel processing, Southampton, – Sept , pp – . Bondhugula U, Baskaran M, Krishnamoorthy S, Ramanujam J, Rountev A, Sadayappan P () Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In: Proceedings of the joint European conferences on theory and practice of software th international conference on compiler construction, Budapest, Hungary, March– April . Bondhugula U, Hartono A, Ramanujam J, Sadayappan P (June ) A practical automatic polyhedral parallelizer and locality optimizer. In: PLDI . ACM SIGPLAN Not () . Boulet P, Darte A, Risset T, Robert Y () (Pen)-ultimate tiling? Integr: VLSI J :– . Carter L, Ferrante J, Hummel SF () Hierarchical tiling for improved superscalar performance. In: Proceedings of the ninth international symposium on parallel processing, Santa Barbara, – April , pp –
. Coleman S, McKinley KS (June ) Tile size selection using cache organization and data layout. In: PLDI’; ACM SIGPLAN Not ():– . Goumas G, Athanasaki M, Koziris N () Automatic code generation for executing tiled nested loops onto parallel architectures. In: Proceedings of the ACM symposium on applied computing, Madrid, Spain, – March . Goumas G, Drosinos N, Athanasaki M, Koziris N (November ) Message-passing code generation for non-rectangular tiling transformations. Parallel Computing (): – . Griebl M (July ) On tiling space-time mapped loop nests. In: Proceedings of the th annual ACM symposium on parallel algorithms and architectures, Crete Island, pp – . Griebl M (June ) Automatic parallelization of loop programs for distributed memory architectures. Habilitation thesis, Department of Informatics and Mathematics, University of Passau. http://www.fim.uni-passau.de/cl/publications/docs/Gri.pdf . Guo J, Bikshandi G, Fraguela BB, Garzaran MJ, Padua D () Programming with tiles. In: Proceedings of the th ACM SIGPLAN symposium on principles and practice of parallel programming, Salt Lake City, UT, USA, – Feb . Hartono A, Manikandan Baskaran M, Bastoul C, Cohen A, Krishnamoorthy S, Norris B, Ramanujam J, Sadayappan P () Parametric multi-level tiling of imperfectly nested loops. In: Proceedings of the rd international conference on supercomputing, Yorktown Heights, NY, USA, – June . Hodzic E, Shang W (December ) On time optimal supernode shape. IEEE Trans Parallel Distrib Syst ():– . Högstedt K, Carter L, Ferrante J (March ) On the parallel execution time of tiled loops. IEEE Trans Parallel Distrib Syst ():– . Irigoin F, Triolet R () Supernode partitioning. In: Fifteenth annual ACM symposium on principles of programming languages, San Diego, CA, pp – . Jiménez M, Llabería JM, Fernández A (July ) Register tiling in nonrectangular iteration spaces. ACM Trans Program Lang Syst ():– . Manikandan Baskaran M, Hartono A, Tavarageri S, Henretty T, Ramanujam J, Sadayappan P () Parameterized tiling revisited. In: CGO’: proceedings of the eighth annual IEEE/ACM international symposium on code generation and optimization, pp – . McKeller AC, Coffman EG () The organization of matrices and matrix operations in a paged multiprogramming environment. Commun ACM ():– . Rastello F, Rao A, Pande S (February ) Optimal task scheduling at run time to exploit intra-tile parallelism. Parallel Comput ():– . Rastello F, Robert Y (May ) Automatic partitioning of parallel loops with parallelepiped-shaped tiles. IEEE Trans Parallel Distrib Syst ():– . Renganarayana L, Rajopadhye S () A geometric programming framework for optimal multi-level tiling. In: Proceedings of the ACM/IEEE conference on supercomputing, Pittsburgh, PA, – Nov , p
. Renganarayanan L, Kim D, Rajopadhye S, Strout MM (June ) Parameterized tiled loops for free. In: PLDI’, ACM SIGPLAN Not () . Strzodka R, Shaheen M, Pajak D, Seidel H-P () Cache oblivious parallelograms in iterative stencil computations. In: ICS’: proceedings of the th ACM international conference on supercomputing, Tsukuba, Japan, pp – . Tang P, Xue J () Generating efficient tiled code for distributed memory machines. Parallel Comput ():– . Wolf ME, Lam MS (October ) A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans Parallel Distrib Syst ():– . Wolfe MJ () Iteration space tiling for memory hierarchies. In: Rodrigue G (ed) Parallel processing for scientific computing. SIAM, Philadelphia, pp – . Wolfe MJ () More iteration space tiling. In: Proceedings of the ACM/IEEE conference on supercomputing, Reno, NV, – Nov , pp – . Wolfe MJ () High performance compilers for parallel computing. Addison-Wesley Longman, Boston . Xue J () Loop tiling for parallelism. Kluwer, Boston . Xue J, Cai W (June ) Time-minimal tiling when rise is larger than zero. Parallel Comput ():– . Xue J, Huang C-H (December ) Reuse-driven tiling for improving data locality. Int J Parallel Program ():–
Titanium
Katherine Yelick, Susan L. Graham, Paul Hilfinger, Dan Bonachea, Jimmy Su, Amir Kamil, Kaushik Datta, Phillip Colella, Tong Wen
University of California at Berkeley and Lawrence Berkeley National Laboratory, Berkeley, CA, USA; University of California, Berkeley, CA, USA; Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Definition
Titanium is a parallel programming language designed for high-performance scientific computing. It is based on Java™ and uses a Single Program Multiple Data (SPMD) parallelism model with a Partitioned Global Address Space (PGAS).
Discussion
Introduction
Titanium is an explicitly parallel dialect of Java™ designed for high-performance scientific programming
[, ]. The Titanium project started in , at a time when custom supercomputers were losing market share to PC clusters. The motivation was to create a language design and implementation that would enable portable programming for a wide range of parallel platforms by striking an appropriate balance between expressiveness, user-provided information about concurrency and memory locality, and compiler and runtime support for parallelism. The goal was to design a language that could be used for high performance on some of the most challenging applications, such as those with adaptivity in time and space, unpredictable dependencies, and sparse, hierarchical, or pointer-based data structures. The strategy was to build on the experience of several Partitioned Global Address Space (PGAS) languages, but to design a higher-level language offering object orientation with strong typing and safe memory management in the context of applications requiring high performance and scalable parallelism. Titanium uses Java as the underlying base language, but is neither a strict superset nor subset of that language. Titanium adds general multidimensional arrays, support for extending the value types in the language, and an unordered loop construct. In place of Java threads, which are used for both program structuring and concurrency, Titanium uses a static thread model with a partitioned address space to allow for locality optimizations.
Titanium’s Parallelism Model Titanium uses a Single Program Multiple Data (SPMD) parallelism model, which is familiar to users of message-passing models. The following simple Titanium program illustrates the use of built-in methods Ti.numProcs() and Ti.thisProc(), which query the environment for the number of threads (or processes) and the index within that set of the executing thread. The example prints these indices in arbitrary order. The number of Titanium threads need not be equal to the number of physical processors, a feature that is often useful when debugging parallel code on singleprocessor machines. However, high-performance runs typically use a one-to-one mapping between Titanium threads and physical processors. class HelloWorld { public static void main (String [] argv) {
    System.out.println("Hello from proc " + Ti.thisProc() +
                       " out of " + Ti.numProcs());
  }
}
Titanium supports Java’s synchronized blocks, which are useful for protecting asynchronous accesses to shared objects. Because many scientific applications use a bulk-synchronous style, Titanium also has a barriersynchronization construct, Ti.barrier(), as well as a set of collective communication operations to perform broadcasts, reductions, and scans. A novel feature of Ti- tanium’s parallel execution model is that barriers must be textually aligned in the program – not only must all threads reach a barrier before any one of them may proceed, but they must all reach the same textual barrier. For example, the following program is not legal in Titanium: if (Ti.thisProc() == 0) Ti.barrier(); //illegal barrier else Ti.barrier();//illegal barrier
Aiken and Gay developed the static analysis the compiler uses to enforce this alignment restriction, based on two key concepts []:
● A single method is one that must be invoked by all threads collectively. Only single methods can execute barriers.
● A single-valued expression is an expression that is guaranteed to take on the same sequence of values on all processes. Only single-valued expressions may be used in conditional expressions that affect which barriers or single-method calls get executed.
The compiler automatically determines which methods are single by finding barriers or (transitively) calls to other single methods. Single-valued expressions are required in statements that determine the flow of control to barriers, ensuring that the barriers are executed by all threads or by none. Titanium extends the Java type system with the single qualifier. Variables of single-qualified type may only be assigned values from single-valued expressions. Literals and values that have been broadcast are simple examples of single-valued expressions. The following example illustrates these concepts. Because the loop contains barriers, the expressions in the for-loop header must be single-valued. The compiler can check that property statically, since the variables
are declared single and are assigned from single-valued expressions.

int single allTimestep = 0;
int single allEndTime = broadcast inputTimeSteps from 0;
for (; allTimestep < allEndTime; allTimestep++) {
  <read values belonging to other threads>
  Ti.barrier();
  <compute new local values>
  Ti.barrier();
}
Barrier analysis is entirely static and provides compile-time prevention of barrier-based deadlocks. It can also be used to improve the quality of concurrency analysis used in optimizations. Single qualification on variables and methods is a useful form of program design documentation, improving readability by making replicated quantities and collective methods explicitly visible in the program source and subjecting these properties to compiler enforcement.
Titanium’s Memory Model The two basic mechanisms for communicating between threads are accessing shared variables and sending messages. Shared memory is generally considered easier to program, because communication is one-sided: Threads can access shared data at any time without interrupting other threads, and shared data structures can be directly represented in memory. Titanium is based on a Partitioned Global Address Space (PGAS) model, which is similar to shared memory but with an explicit recognition that access time is not uniform. As shown in Fig. , memory is partitioned such that each partition has affinity to one thread. Memory is also partitioned orthogonally into private and shared memory, with stack variables living in private memory, and heap objects, by default, living in the shared space. A thread may access any variable that resides in shared space, but has fast access to variables in its own partition. Objects created by a given thread will reside in its own part of the memory space. Titanium statically makes an explicit distinction between local and global references: A local reference must refer to an object within the same thread partition, while a global reference may refer to either a remote or
Titanium. Fig. Titanium’s partitioned global address space memory model
local partition. In Fig. , instances of l are local references, whereas g and nxt are global references and can therefore cross partition boundaries. The motivation for this distinction is performance. Global references are more general than local ones, but they often incur a space penalty to store affinity information and a time penalty upon dereference to check whether communication is required. References in Titanium are global by default, but may be designated local using the local type qualifier. The compiler performs type inference to automatically label variables as local []. The partitioned memory model is designed to scale well on distributed memory platforms without the need for caching of remote data and the associated coherence protocols. Titanium also runs well on shared memory multiprocessors and uniprocessors, where the partitioned-memory model may not correspond to any physical locality on the machine and the global references generally incur no overhead relative to local ones. Naively written Titanium programs may ignore the partitioned-memory model and, for example, allocate all data structures in one thread’s shared memory partition or perform fine-grained accesses on remote data. Such programs would run correctly on any platform but would likely perform poorly on a distributed memory platform. In contrast, a program that carefully manages its data-structure partitioning and access behavior in order to scale well on distributed memory hardware is likely to scale well on shared memory platforms as well. The partitioned model provides the ability to start with a functional, shared memory style code and incrementally tune performance for distributed memory hardware by reorganizing the affinity of key data structures or adjusting access patterns in program bottlenecks to improve communication performance.
Titanium Arrays
Java arrays do not support sub-array objects that are shared with larger arrays, nonzero base indices, or true multidimensional arrays. Titanium retains Java arrays for compatibility, but adds its own multidimensional array support, which provides the same kinds of sub-array operations available in Fortran 90. Titanium arrays are indexed by integer tuples known as points and built on sets of points, called domains. The design is taken from that of FIDIL, a language for finite difference calculations designed by Colella and Hilfinger []. Points and domains are first-class entities in Titanium – they can be stored in data structures, specified as literals, passed as values to methods, and manipulated using their own set of operations. For example, the NAS multigrid (MG) benchmark requires a 256 × 256 × 256 grid. The problem has periodic boundaries, which are implemented using a one-deep layer of surrounding ghost cells, resulting in a 258 × 258 × 258 grid. Such a grid can be constructed with the following declaration:

double [3d] gridA = new double [[-1,-1,-1]:[256,256,256]];
The 3-D Titanium array gridA has a rectangular index set that consists of all points [i, j, k] with integer coordinates such that −1 ⩽ i, j, k ⩽ 256. Titanium calls such an index set a rectangular domain of Titanium type RectDomain, since all the points lie within a rectangular box. Titanium also has a type Domain that represents an arbitrary set of points, but Titanium arrays can only be built over RectDomains. Titanium arrays may start at an arbitrary base point, as the example with a [−1, −1, −1] base shows. In this example, the grid was designed to have space for ghost regions, which are
all the points that have either −1 or 256 as a coordinate. On machines with hierarchical memory systems, gridA resides in memory with affinity to exactly one process, namely the process that executes the above statement. Similarly, objects reside in a single logical memory space for their entire lifetime (there is no transparent migration of data), though they are accessible from any process in the parallel program. The power of Titanium arrays stems from array operators that can be used to create alternative views of an array's data, without an implied copy of the data. While this is useful in many scientific codes, it is especially valuable in hierarchical grid algorithms like Multigrid and Adaptive Mesh Refinement (AMR). In a Multigrid computation on a regular mesh, there is a set of grids at various levels of refinement, and the primary computations involve sweeping over a given level of the mesh performing nearest neighbor computations (called stencils) on each point. To simplify programming, it is common to separate the interior computation from computation at the boundary of the mesh, whether those boundaries come from partitioning the mesh for parallelism or from special cases used at the physical edges of the computational domain. Since these algorithms typically deal with many kinds of boundary operations, the ability to name and operate on sub-arrays is useful.
Domain Calculus
Titanium's domain calculus operators support subarrays both syntactically and from a performance standpoint. The tedious business of index calculations and array offsets has been migrated from the application code to the compiler and runtime system. For example, the following Titanium code creates two blocks that are logically adjacent, with a boundary of ghost cells around each to hold values from the adjacent block. The shrink operation creates a view of gridA by shrinking its domain on all sides, but does not copy any of its elements. Thus, gridAInterior will have indices from [0, 0, 0] to [255, 255, 255] and will share corresponding elements with gridA. The copy operation in the last line updates one plane of the ghost region in gridB by copying only those elements in the intersection of the two arrays. Operations on Titanium arrays such as copy are not opaque method calls to the Titanium compiler.
The compiler recognizes and treats such operations specially, and thus can apply optimizations to them, such as turning blocking operations into non-blocking ones.

double [3d] gridA = new double [[-1,-1,-1]:[256,256,256]];
double [3d] gridB = new double [[-1,-1,256]:[256,256,512]];
//define interior without creating a copy
double [3d] gridAInterior = gridA.shrink(1);
//update overlapping ghost cells from neighboring block
//by copying values from gridA to gridB
gridB.copy(gridAInterior);
The above example appears in a NAS MG implementation in Titanium [], except that gridA and gridB are themselves elements of a higher-level array structure. The copy operation as it appears here performs contiguous or noncontiguous memory copies, and may perform interprocessor communication when the two grids reside in different processor memory spaces. The use of a global index space across distinct array objects (made possible by the arbitrary index bounds of Titanium arrays) makes it easy to select and copy the cells in the ghost region, and is also used in the more general case of adaptive meshes.
Unordered Loops, Value Types, and Overloading
The foreach construct provides an unordered looping construct designed for iterating through a multidimensional space. In the foreach loop below, the point p plays the role of a loop index variable.

foreach (p in gridAInterior.domain()) {
  gridB[p] = applyStencil(gridAInterior, p);
}
The applyStencil method may safely refer to elements that are one point away from p, since the loop is over the interior of a larger array. This one loop concisely expresses an iteration over a multidimensional domain that would correspond to a multi-level loop nest in other languages. A common class of loop bounds and indexing errors is avoided by having the compiler and runtime system automatically manage the iteration boundaries for the multidimensional traversal. The foreach loop is a purely serial iteration construct – it is not a data-parallel construct. In addition, if the order of loop execution is irrelevant to a computation, then using a foreach loop
to traverse the points in a domain explicitly allows the compiler to reorder loop iterations to maximize performance – for instance, by performing automatic cache blocking and tiling optimizations []. It also simplifies bounds-checking elimination and array access strength-reduction optimizations. The Titanium immutable class feature provides language support for defining application-specific primitive types (often called “lightweight” or “value” classes), allowing the creation of user-defined unboxed objects, analogous to C structs. Immutables provide efficient support for extending the language with new types which are manipulated and passed by value, avoiding pointer-chasing overheads which would otherwise be associated with the use of tiny objects in Java. One compelling example of the use of immutables is for defining a Complex number class, which was used in a Titanium implementation of the NAS FT benchmark. Titanium also allows for operator overloading, a feature that was strongly desired by application developers on the team, and was used in the FT example to simplify the expressions on complex values.
Distributed Arrays
Titanium also supports the construction of distributed array data structures, which are built from local pieces rather than declared as distributed types. This reflects the design emphasis on adaptive and sparse data structures in Titanium, rather than the simpler "regular array" computations that could be supported with simpler flat arrays. The general pointer-based distribution mechanism combined with the use of arbitrary base indices for arrays provides an elegant and powerful mechanism for shared data. The following code is a portion of the parallel Titanium code for a multigrid computation. It is run on every processor and creates the blocks3D distributed array, which can access any processor's portion of the grid.
Point<3> startCell = myBlockPos * numCellsPerBlockSide;
Point<3> endCell = startCell + (numCellsPerBlockSide - [1,1,1]);
double [3d] myBlock = new double[startCell:endCell];
//"blocks" is used to create "blocks3D" array
double [1d] single [3d] blocks =
    new double [0:(Ti.numProcs()-1)] single [3d];
blocks.exchange(myBlock);
//create local "blocks3D" array
double [3d] single [3d] blocks3D =
    new double [[0,0,0]:numBlocksInGridSide - [1,1,1]] single [3d];
//map from "blocks" to "blocks3D" array
foreach (p in blocks3D.domain())
    blocks3D[p] = blocks[procForBlockPosition(p)];
Each processor computes its start and end indices by performing arithmetic operations on Points. These indices are used to create a local myBlock array. Every processor also allocates its own 1-D array blocks. Then, by combining the myBlock arrays using the exchange operation, blocks becomes a distributed data structure. As shown in Fig. , the exchange operation performs an all-to-all broadcast and stores each processor's contribution in the corresponding element of its local blocks array. To create a more natural mapping, a 3-D processor array is used, with each element containing a reference to a particular local block. By using global indices in the local block – meaning that each block has a different set of indices that overlap only in the area of ghost regions – the copy operations described above can be used to update the ghost cells. The generality of Titanium's distributed data structures is not fully utilized in the example of a uniform mesh, but in an adaptive block structured mesh, a union of rectangles can be used to fill a spatial area, and the global indexing and global address space can be used to simplify much more complicated ghost region updates.
Titanium. Fig. Distributed 3-D array in Titanium's PGAS address space (logical arrangement of blocks based on indices). The pointers in the blocks3D array are shown only for thread t0 for simplicity
Implementation Techniques and Research
The Titanium compiler translates Titanium code into C code, and then hands that code off to a C compiler to be compiled and linked with the Titanium runtime system and, in the case of distributed memory back ends, with the GASNet communication system []. The choice of C as a target was made to achieve portability, and produces reasonable performance without the overhead of a virtual machine. GASNet is a one-sided communication library that is used within a number of other PGAS language implementations, including Co-Array Fortran, Chapel, and multiple UPC implementations. GASNet is itself designed for portability, and it runs on top of Ethernet (UDP) and MPI, but there are optimized implementations for most of the high-speed networks that are used in cluster and supercomputer designs. Titanium can also run on shared memory systems using a runtime layer based on POSIX Threads, and on combinations of shared and distributed memory by combining this with GASNet. Titanium, like Java, is designed for memory safety, and the Titanium runtime system includes the Boehm-Weiser garbage collector for shared memory code. To handle distributed memory environments, the runtime system tracks references that leak to remote nodes, but also adds a scalable region-based memory management concept to the language along with compiler analysis []. Aggressive program analysis is crucial for effective optimization of parallel code. In addition to serial loop optimizations [] and some standard optimizations to reduce the size and complexity of generated C code, the compiler performs a number of novel analyses on parallelism constructs. For example, information about what sections of code may operate concurrently is useful for many optimizations and program analyses. In combination with alias analysis, it allows the detection of potentially erroneous race conditions, the removal of unnecessary synchronization operations, and the ability to provide stronger memory consistency guarantees. Titanium's textually aligned barriers divide the code into independent phases, which can be exploited to improve the quality of concurrency analysis. The single-valued
expressions are also used to improve concurrency analysis on branches. These two features allow a simple graph encoding of the concurrency in a program based on its control-flow graph. We have developed quadratic-time algorithms that can be applied to the graph in order to determine all pairs of expressions that can run concurrently. Alias analysis identifies pointer variables that may, must, or cannot reference the same object. The Titanium compiler uses alias analysis to enable other analyses (such as locality and sharing analysis), and to find places where it is valid to introduce restrict qualifiers in the generated C code, enabling the C compiler to apply more aggressive optimizations. The Titanium compiler's alias analysis is a Java derivative of Andersen's points-to analysis with extensions to handle multiple threads. The modified analysis is only a constant factor slower than the sequential analysis, and its running time is independent of the number of runtime threads.
Application Experience
A number of benchmarks and larger applications have been written in Titanium, starting with some of the NAS Benchmarks []. In addition, Yau developed a distributed matrix library that supports blocked-cyclic layouts and implemented Cannon's Matrix Multiplication algorithm, Cholesky and LU factorization (without pivoting). Balls and Colella built a D version of their Method of Local Corrections algorithm for solving the Poisson equation for constant coefficients over an infinite domain []. Bonachea, Chapman, and Putnam built a Microarray Optimal Oligo Selection Engine for selecting optimal oligonucleotide sequences from an entire genome of simple organisms, to be used in microarray design. The most ambitious efforts have been applications frameworks for Adaptive Mesh Refinement (AMR) algorithms and Immersed Boundary Method simulations [] by Tong Wen and Ed Givelberg, respectively. In both cases, these application efforts have taken a few years and were preceded by implementations of Titanium codes for specific problem instances, e.g., AMR Poisson by Luigi Semenzato, AMR gas dynamics [] by Peter McCorquodale and Immersed Boundaries for simulation of the heart by Armando Solar-Lezama and cochlea by Ed Givelberg, with various optimization and analysis efforts by Sabrina Merchant, Jimmy Su, and Amir Kamil.
The performance results show good scalability of these application problems on up to hundreds of distributed-memory nodes, and performance that is in some cases comparable to applications written in C++ or FORTRAN with message passing. The compiler is a research prototype and does not have all of the static and dynamic optimizations one would expect from a commercial compiler, but even serial running-time comparisons show competitive performance. No formal productivity studies involving humans have been done, but a variety of case studies have shown that the global address space, combined with a powerful multidimensional array abstraction and the data abstraction support derived from Java, leads to code that is elegant and concise.
Related Entries
Coarray Fortran
PGAS (Partitioned Global Address Space) Languages
UPC

Web Documentation
GASNet Home Page. http://gasnet.cs.berkeley.edu/
Titanium Project Home Page. http://titanium.cs.berkeley.edu

Bibliography
. Aiken A, Gay D () Barrier inference. In: Principles of programming languages, San Diego, CA
. Balls GT, Colella P () A finite difference domain decomposition method using local corrections for the solution of Poisson's equation. J Comput Phys ():–
. Bonachea D () GASNet specification. Technical report CSD-, University of California, Berkeley
. Datta K, Bonachea D, Yelick K () Titanium performance and potential: an NPB experimental study. In: th international workshop on languages and compilers for parallel computing (LCPC), Hawthorne, NY, October
. Gay D, Aiken A () Language support for regions. In: SIGPLAN conference on programming language design and implementation, Washington, DC, pp –
. Givelberg E, Yelick K () Distributed immersed boundary simulation in Titanium. http://titanium.cs.berkeley.edu
. Hilfinger PN, Colella P () FIDIL: a language for scientific processing. In: Grossman R (ed) Symbolic computation: applications to scientific computing. SIAM, Philadelphia, pp –
. Kamil A, Yelick K () Hierarchical pointer analysis for distributed programs. In: Static Analysis Symposium (SAS), Kongens Lyngby, Denmark, August
. Kamil A, Yelick K () Enforcing textual alignment of collectives using dynamic checks. In: nd international workshop on languages and compilers for parallel computing (LCPC), October. Also appears in Lecture notes in computer science. Springer, Berlin, pp –
. Liblit B, Aiken A () Type systems for distributed data structures. In: The th ACM SIGPLAN-SIGACT symposium on principles of programming languages (POPL), Boston, January
. McCorquodale P, Colella P () Implementation of a multilevel algorithm for gas dynamics in a high-performance Java dialect. In: International parallel computational fluid dynamics conference (CFD')
. Pike G, Semenzato L, Colella P, Hilfinger PN () Parallel D adaptive mesh refinement in Titanium. In: th SIAM conference on parallel processing for scientific computing, San Antonio, TX, March
. Su J, Yelick K () Automatic support for irregular computations in a high-level language. In: th International Parallel and Distributed Processing Symposium (IPDPS)
. Yelick K, Hilfinger P, Graham S, Bonachea D, Su J, Kamil A, Datta K, Colella P, Wen T () Parallel languages and compilers: perspective from the Titanium experience. Int J High Perform Comput App :–
. Yelick K, Semenzato L, Pike G, Miyamoto C, Liblit B, Krishnamurthy A, Hilfinger P, Graham S, Gay D, Colella P, Aiken A () Titanium: a high-performance Java dialect. Concur: Pract Exp :–
TLS Speculation, Thread-Level Speculative Parallelization of Loops
TOP500 Jack Dongarra, Piotr Luszczek University of Tennessee, Knoxville, TN, USA
Definition TOP500 is a list of the 500 fastest supercomputers in the world, ranked by the performance they achieve on the LINPACK Benchmark. The list is assembled twice a year and officially presented at two supercomputing conferences: one in Europe and one in the USA. The list has been put together since 1993.
Discussion Statistics on high-performance computers are of major interest to manufacturers, users, and potential users. These people wish to know not only the number of systems installed, but also the location of the various supercomputers within the high-performance computing community and the applications for which a computer system is being used. Such statistics can facilitate the establishment of collaborations, the exchange of data and software, and provide a better understanding of the high-performance computer market. Statistical lists of supercomputers are not new. Every year since 1986, Hans Meuer has published system counts of the major vector computer manufacturers, based principally on those presented at the Mannheim Supercomputer Seminar. In the early 1990s, a new definition of supercomputer was needed to produce meaningful statistics. After experimenting with metrics based on processor count in 1992, the idea was born at the University of Mannheim to use a detailed listing of installed systems as the basis. In early 1993, Jack Dongarra was convinced to join the project with the LINPACK benchmark. A first test version was produced in May 1993. Today the TOP500 list is compiled by Hans Meuer of the University of Mannheim, Germany, Jack Dongarra of the University of Tennessee, Knoxville, and Erich Strohmaier and Horst Simon of NERSC/Lawrence Berkeley National Laboratory. New statistics are required that reflect the diversification of supercomputers, the enormous performance difference between low-end and high-end models, the increasing availability of massively parallel processing (MPP) systems, and the strong increase in computing power of the high-end models of workstation suppliers (SMP). To provide a new statistical foundation, the authors of the TOP500 decided in 1993 to assemble and maintain a list of the 500 most powerful computer systems. The list is updated twice a year. The first of these updates always coincides with the International Supercomputer Conference in June (submissions are accepted until April), the second one is presented in November at the IEEE Supercomputing Conference in the USA (submissions are accepted until October 1st). The list is assembled with the help of high-performance computer experts, computational scientists, and manufacturers.
In the present list (called the TOP500), computers are ranked by their performance on the LINPACK Benchmark. The list is freely available at http://www.top500.org/, where users can create additional sublists and statistics out of the TOP500 database on their own. The main objective of the TOP500 list is to provide a ranked list of general-purpose systems that are in common use for high-end applications. A general-purpose system is expected to be able to solve a range of scientific problems. The TOP500 list shows the most powerful commercially available computer systems known. To keep the list as compact as possible, only a part of the information is shown:
● Nworld – Position within the TOP500 ranking
● Manufacturer – Manufacturer or vendor
● Computer – Type indicated by manufacturer or vendor
● Installation Site – Customer
● Location – Location and country
● Year – Year of installation/last major update
● Field of Application
● #Proc. – Number of processors (cores)
● Rmax – Maximal LINPACK performance achieved
● Rpeak – Theoretical peak performance
● Nmax – Problem size for achieving Rmax
● N1/2 – Problem size for achieving half of Rmax
In the TOP500 list table, the computers are ordered first by their Rmax value. In the case of equal performance (Rmax value) for different computers, the order is determined by Rpeak. For sites that have the same computer, the order is by memory size and then alphabetically.
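This ordering rule can be expressed compactly as a sort key. The Python fragment below is an illustrative sketch only; the dictionary keys it uses (rmax, rpeak, memory, site) are hypothetical field names, not the official TOP500 schema.

# Hedged sketch of the TOP500 ordering rule: descending Rmax, then descending
# Rpeak, then descending memory size, then alphabetically by installation site.
def top500_order(systems):
    return sorted(
        systems,
        key=lambda s: (-s["rmax"], -s["rpeak"], -s["memory"], s["site"]),
    )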
Method of Solution In an attempt to obtain uniformity across all computers in performance reporting, the algorithm used in solving the system of equations (Ax = b, for a dense matrix A) in the benchmark procedure must conform to LU factorization with partial pivoting. In particular, the operation count for the algorithm must be 2/3 n³ + O(n²) floating point operations (the Rmax value is computed by dividing this operation count by the time taken to solve the system). Here a floating point operation is an addition or multiplication of 64-bit operands. This excludes the use of a fast matrix multiply algorithm like "Strassen's
Method" or algorithms which compute a solution in a precision lower than full precision (64-bit floating point arithmetic) and refine the solution using an iterative approach. This is done to provide a comparable set of performance numbers across all computers. Submitters of the results are free to implement their own solution as long as the above criteria are met. A reference implementation of the benchmark called HPL (High Performance LINPACK) is provided at http://www.netlib.org/benchmark/hpl/. In addition to satisfying the rules, HPL also verifies the result with a numerical check of the obtained solution.
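The Rmax computation itself is a one-line formula. The sketch below is a hypothetical helper, not part of HPL: it converts a problem size and a measured solve time into a GFLOP/s figure using the mandated 2/3 n³ operation count, with the O(n²) terms neglected for simplicity.

# Hedged sketch (not the official HPL timing code): convert a LINPACK run's
# problem size n and wall-clock solve time into GFLOP/s using 2/3 n^3 operations.
def linpack_gflops(n, seconds):
    flops = (2.0 / 3.0) * n**3          # LU factorization with partial pivoting
    return flops / seconds / 1.0e9

# Example: a hypothetical run with n = 100000 solved in 3600 s
# yields roughly 185 GFLOP/s.
print(linpack_gflops(100_000, 3_600.0))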
Restrictions The main objective of the TOP500 list is to provide a ranked list of general-purpose systems that are in common use for high-end applications. The authors of the TOP500 reserve the right to independently verify submitted LINPACK Benchmark [] results, and to exclude from the list systems that are not valid or not general purpose in nature. A system is considered to be general purpose if it can be used to solve a range of scientific problems. Any system designed specifically to solve the LINPACK benchmark problem, or whose major purpose is a high TOP500 ranking, will be disqualified. The systems in the TOP500 list are expected to be persistent and available for use for an extended period of time. During that period, new results may be submitted, and these supersede any prior submissions; thus, an improvement over time is allowed. The TOP500 authors reserve the right to deny inclusion in the list if it is suspected that a system violates these conditions. The TOP500 list keepers can be reached by sending email to info at top500.org. The TOP500 list can be found at www.top500.org.
Related Entries Benchmarks HPC Challenge Benchmark LINPACK Benchmark Livermore Loops
Bibliography . Dongarra JJ, Luszczek P, Petitet A () The LINPACK benchmark: past, present, and future. Concurr Comput Pract Exp :–
Topology Aware Task Mapping Abhinav Bhatele University of Illinois at Urbana-Champaign, Urbana, IL, USA
Synonyms Graph embedding; MPI process mapping
Definition Topology aware task mapping refers to the mapping of communicating parallel objects, tasks, or processes in a parallel application on nearby physical processors to minimize network traffic, by considering the communication of the objects or tasks and the interconnect topology of the machine.
Discussion Introduction Processors in modern supercomputers are connected together using a variety of interconnect topologies: meshes, tori, fat-trees, and others. Increasing size of the interconnect leads to increased sharing of resources (network links and switches) among messages and hence to network contention. This can potentially lead to significant performance degradation for certain classes of parallel applications. Sharing of links can be avoided by minimizing the distance traveled by messages on the network. This is achieved by mapping communicating objects or tasks on nearby physical processors on the network topology and is referred to as topology aware task mapping. Topology aware mapping is a technique to minimize communication traffic over the network and hence optimize the performance of parallel programs. It is becoming increasingly relevant for obtaining good performance on current supercomputers. The general mapping problem is known to be NP-hard [, ]. Apart from parallel computing, topology aware mapping also has applications in graph embedding in mathematics and in VLSI circuit design. The problem of embedding one graph on another while minimizing some metric has been well studied in mathematics. Layout of VLSI circuits to minimize the length of the longest wire is another problem that requires mapping of one grid onto another. However, the problems
to be tackled are different in several respects in parallel computing from those in mathematics or circuit layout. For example, in VLSI, the size of the host graph can be larger than that of the guest graph whereas, in parallel computing, the host graph is typically equal to or smaller than the guest graph. Research on topology aware mapping in parallel computing began in the 1980s with a paper by Bokhari []. Work in this area has primarily involved the development of heuristics that target different mapping scenarios. Heuristics typically provide close-to-optimal solutions in a reasonable time. Arunkumar et al. [] categorize the various heuristic techniques into three classes – deterministic, randomized, and random start heuristics. Over the years, specific techniques better suited to certain architectures were developed – first for hypercubes and array processors, and later for meshes and tori. In recent years, developers of certain parallel applications have also developed application-specific mapping techniques to map their codes onto modern supercomputers [–]. Prior to a survey of the various heuristic techniques for task mapping, a brief description of the existing
interconnect topologies and the kinds of communication graphs that are prevalent in parallel applications is essential.
Interconnect Topologies Various common and radical topologies have been deployed in supercomputers, ranging from hypercubes to fat-trees to three-dimensional tori and meshes. They can be divided into two categories:
1. Direct networks: In direct networks, each processor is connected to a few other processors directly. A message travels from source to destination by going through several links connecting the processors. Hypercubes, tori, meshes, etc., are all examples of direct networks (see Figs. and ). Several modern supercomputers currently have a three-dimensional (3D) mesh or torus interconnect topology. IBM Blue Gene/L and Blue Gene/P machines are 3D tori built from blocks of a torus of size 8 × 8 × 8 nodes. Cray XT and XE machines are also 3D tori. The primary difference between IBM and Cray machines is that on IBM machines, each allocated job partition
Topology Aware Task Mapping. Fig. A three-dimensional mesh and a two-dimensional torus
Topology Aware Task Mapping. Fig. A four-dimensional hypercube and a three-level fat-tree network
is a contiguous mesh or torus. However, on Cray machines, nodes are randomly selected for a job and do not constitute a complete torus.
2. Indirect networks: Indirect networks have switches which route the messages to the destination. No two processors are connected directly, and messages always have to go through switches to reach their destination. Fat-tree networks are examples of indirect networks (see Fig. ). Infiniband, IBM's Federation interconnect, and SGI Altix machines are examples of fat-tree networks. LANL's RoadRunner also has a fat-tree network.
Some of these networks benefit from topology aware task mapping more than others. A significant percentage of the parallel machines in the 1980s had a hypercube interconnect and hence, much of the research then was directed toward such networks. More recent work involves optimizing applications on 3D meshes and tori.
Communication Graphs Tasks in parallel applications can interact in a variety of ways in terms of the specific communication partners, the number of communicators, global versus localized communication, and so on. All applications can be classified into a few different categories based on the parameters governing their communication patterns: Static versus dynamic communication: Depending on whether the communication graph of the application changes at runtime, the graph can be classified as static or dynamic. If the communication graph is static, topology aware mapping can be done offline and used for the entirety of the run. If the communication is dynamic, periodic remapping depending on the changes in the communication graph is required. Several categories of parallel applications such as Lattice QCD, Ocean Simulations, and Weather Simulations have a stencil-like communication pattern that does not change during the run. On the other hand, molecular dynamics and cosmological simulations have dynamic communication patterns. Regular versus irregular communication: Communication in a parallel application can be regular (structured) or irregular (unstructured). An example of regular communication is a five-point stencil-like application where every task communicates with four of its neighbors.
When no specific pattern can be attributed to the communication graph, it is classified as irregular or unstructured. Unstructured grid computations are examples of applications with irregular communication graphs. Point-to-point versus collective communication: Some applications primarily use point-to-point messages with minimal global communication. Others, however use collective operations such as broadcasts, reductions, and all-to-alls between all or a subset of processors. Different mapping algorithms are required to optimize different types of communication patterns. Parallel applications can also be classified into computation bound or communication bound depending on the relative amount of communication involved. A large body of parallel applications spend a small portion of their overall execution time doing communication. Such applications will typically be unaffected by topology aware mapping. Communication-bound latency-sensitive applications benefit most in terms of performance from topology aware task mapping.
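For the regular case, the communication graph can be generated mechanically. The sketch below is an illustration only: it builds the five-point stencil graph mentioned above for an nx × ny grid of tasks, where each task exchanges messages with its north, south, east, and west neighbors.

# Hedged sketch: communication graph of a five-point stencil on an nx-by-ny
# task grid, returned as a set of directed (sender, receiver) pairs.
def five_point_stencil_graph(nx, ny):
    def tid(x, y):
        return x * ny + y
    edges = set()
    for x in range(nx):
        for y in range(ny):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                x2, y2 = x + dx, y + dy
                if 0 <= x2 < nx and 0 <= y2 < ny:
                    edges.add((tid(x, y), tid(x2, y2)))
    return edges

print(len(five_point_stencil_graph(4, 4)))   # 48 directed neighbor pairs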
The Mapping Process An algorithm for mapping of tasks in an application requires two inputs – the communication graph of an application and the machine topology of the allocated job partition. Given these two inputs, the aim is to map communicating objects or tasks close to one another on nearby physical processors. The success of a mapping algorithm is evaluated in terms of minimizing or maximizing some function which correlates well with the contention on the network or actual application performance.
Objective Functions Mapping algorithms aim at minimizing some metric referred to as an objective function, which should be chosen carefully. A good objective function is one that does an accurate evaluation of a mapping solution in terms of yielding better performance. Objective functions are also important to compare the optimality of different mapping solutions. Several objective functions which have been used for different mapping algorithms are listed below:
● Overlap between guest and host graph edges: One metric to determine the quality of the mapping is the number of edges in the guest graph which fall on the host graph. This metric is referred to as the cardinality of the mapping by Bokhari []. The mapping which yields the highest cardinality is the best.
● Maximum dilation: This metric is used for architectures and applications where the longest edge in the communication graph determines the performance. In other words, the message that travels the maximum number of hops or links on the network determines the overall performance [, ]:

Maximum dilation = max_{1 ≤ i ≤ n} |d_i|

The mapping which leads to the smallest dilation for any edge in the guest graph on the processor interconnect is the best.
● Hop-bytes: This is the weighted sum of all edges in the communication graph multiplied by their dilation on the processor graph as per the mapping algorithm [, ]:

Hop-bytes = ∑_{i=1}^{n} d_i × b_i

where d_i is the number of hops or links traveled by message i on the network and b_i is the size of the message in bytes. Hop-bytes is a measure of the total communication traffic on the network and hence an approximate indication of the contention. A smaller value for hop-bytes indicates less contention on the network. Average hops per byte is another way of expressing the same metric:

Average hops per byte = (∑_{i=1}^{n} d_i × b_i) / (∑_{i=1}^{n} b_i)
The last two objective functions, maximum dilation and hop-bytes, are typically used today and are applicable in different scenarios. The choice of one over the other depends upon the parallel application and the architecture for which the mapping is being performed.
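As a concrete illustration, the sketch below (an assumption-laden example, not code from any of the cited mapping libraries) evaluates a given task-to-processor mapping on a 3D torus, computing hop-bytes, maximum dilation, and average hops per byte with wraparound Manhattan distance as the hop count.

# Hedged sketch: evaluate a mapping on a 3D torus with dimensions dims = (X, Y, Z).
# comm is a list of (task_a, task_b, bytes); mapping[task] = (x, y, z) coordinates.
def torus_hops(p, q, dims):
    # Wraparound (torus) Manhattan distance between two processor coordinates.
    return sum(min(abs(a - b), d - abs(a - b)) for a, b, d in zip(p, q, dims))

def evaluate_mapping(comm, mapping, dims):
    hop_bytes, max_dilation, total_bytes = 0, 0, 0
    for a, b, nbytes in comm:
        d = torus_hops(mapping[a], mapping[b], dims)
        hop_bytes += d * nbytes
        total_bytes += nbytes
        max_dilation = max(max_dilation, d)
    avg_hops_per_byte = hop_bytes / total_bytes if total_bytes else 0.0
    return hop_bytes, max_dilation, avg_hops_per_byte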
Heuristic Techniques for Mapping Owing to the general applicability of mapping in various fields, a huge body of work exists targeting this problem. Many techniques used for solving combinatorial optimization problems can be used for obtaining
solutions to the mapping problem. Simulated annealing, genetic algorithms, and neural network–based heuristics are examples of such physical optimization techniques. Other heuristic techniques are recursive partitioning, pairwise exchanges, and clustering and geometry-based mapping. Arunkumar et al. [] categorize the various heuristic techniques into three classes – deterministic, randomized, and random start heuristics. The following sections discuss some of the mapping techniques classified into these categories.
Deterministic Heuristics In this class, the choice of search path is deterministic, and typically a fixed search strategy is used, taking domain-specific knowledge about the parallel application into account. Yu et al. [] present folding and embedding techniques to obtain deterministic solutions for mapping two- and three-dimensional grids onto 3D mesh topologies. Their topology mapping library provides support for MPI virtual topology functions on IBM Blue Gene machines. Bhatele [] uses domain-specific knowledge and communication patterns of parallel applications for heuristic techniques such as "affine transformation"–inspired mapping and guided graph traversals to map onto 3D tori. The mapping library developed as a result can map application graphs that are regular (n-dimensional grids) as well as those that are irregular. Several application developers, such as those of Blue Matter [], Qbox [], and OpenAtom [], have developed application-specific mapping algorithms to map tasks onto processor topologies. Recursive graph partitioning–based strategies, which partition both the application and processor graph for mapping, also fall under this category []. Deterministic algorithms are typically the fastest among the three categories.
Randomized Heuristics This category of solutions does not depend on domain-specific knowledge and uses search techniques that are randomized, yielding different solutions in successive executions. Neural networks, genetic algorithms, and simulated annealing–based heuristics are examples of this class. Bokhari's algorithm of pairwise exchanges accompanied by probabilistic jumps also falls under this category.
In genetic algorithm–based heuristics [], possible mapping solutions are first encoded in some manner and a random population of such patterns is generated. Then different genetic operators such as crossover and mutation are applied to derive new generations from old ones. Certain criteria are used to estimate the fitness of a selection and unfit solutions are rejected. Given a termination rule, the best solution among the population is taken to be the solution at termination. Obtaining an exact solution to the mapping problem is difficult and iterative algorithms tend to produce solutions that are not globally optimal. The technique of simulated annealing provides a mechanism to escape local optima and hence is a good fit for mapping problems. The most important considerations for a simulated annealing algorithm are deciding a good objective function and an annealing schedule. This technique has been used for processor and link assignment by Midkiff et al. [] and Bhanot et al. [].
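A minimal simulated annealing loop for the mapping problem might look like the sketch below. This is a generic illustration, not the algorithm of Midkiff et al. or Bhanot et al.: it perturbs the mapping by swapping two tasks and accepts a worse mapping with a probability that decays with the temperature, using a hop-bytes-style cost function.

import math, random

# Hedged sketch of simulated annealing for task mapping. `cost` could be the
# hop-bytes evaluation shown earlier; `mapping` is a list: task index -> processor.
def anneal(mapping, cost, temp=1.0, cooling=0.995, steps=10000):
    current = cost(mapping)
    for _ in range(steps):
        i, j = random.sample(range(len(mapping)), 2)
        mapping[i], mapping[j] = mapping[j], mapping[i]          # swap two tasks
        new = cost(mapping)
        # Accept improvements always; accept worse mappings with probability
        # exp(-delta / temp) so the search can escape local optima.
        if new <= current or random.random() < math.exp((current - new) / temp):
            current = new
        else:
            mapping[i], mapping[j] = mapping[j], mapping[i]      # undo the swap
        temp *= cooling                                          # annealing schedule
    return mapping, current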
Random Start Heuristics In some algorithms, a random initial mapping is chosen and then improved iteratively. Such solutions fall under the category of random start heuristics. Techniques such as pairwise exchanges and recursive partitioning fall under this category. The technique of pairwise exchanges, which starts from an initial assignment, is a simple brute-force method that has been used with different variations to tackle the mapping problem []. The basic idea is simple: an objective function or metric to be optimized is selected, and then an initial mapping of the guest graph on the host graph is determined. Then, a pair of nodes is chosen, either randomly or based on some selection criteria, and their mappings are interchanged. If the metric or objective function becomes better, the exchange is preserved and the process is repeated, until some termination criterion is achieved. Another technique in this class is task clustering followed by cluster allocation. In the clustering phase, tasks are clustered into groups equal to the number of processors using recursive min-cut algorithms. Then these clusters are allocated to the processors by starting with a random assignment and iteratively improving it by local exchanges. The first phase aims at minimizing intercluster communication without compromising load balancing, while the second phase aims at minimizing
inter-processor communication. This is especially useful for models such as Charm++ where the number of tasks is much larger than the number of processors.
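Unlike the annealing loop shown earlier, a pure random-start pairwise exchange never accepts a worsening swap. The sketch below is illustrative only, again assuming a hop-bytes-style cost function over a list mapping task index to processor.

import random

# Hedged sketch of random-start pairwise exchange: start from a random
# assignment and keep a swap only when it lowers the objective function.
def pairwise_exchange(num_tasks, cost, iterations=10000):
    mapping = list(range(num_tasks))
    random.shuffle(mapping)                      # random initial mapping
    best = cost(mapping)
    for _ in range(iterations):
        i, j = random.sample(range(num_tasks), 2)
        mapping[i], mapping[j] = mapping[j], mapping[i]
        new = cost(mapping)
        if new < best:
            best = new                           # keep the improving exchange
        else:
            mapping[i], mapping[j] = mapping[j], mapping[i]  # revert
    return mapping, best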
Future Directions The emergence of new architectures and network topologies requires modifying existing algorithms and developing new ones to suit them. As an example, the increase in number of cores per node adds another dimension to the network topology and should be taken into account. Algorithms also need to be developed for new parallel applications. There is a growing need for runtime support in the form of an automated mapping framework that can map applications intelligently on to the processor topology. This will reduce the burden on application developers to map individual applications and will also help reuse algorithms across similar communication graphs. Bhatele et al. [] are making some efforts in this direction. There is an increasing demand for support in the MPI runtime for mapping of MPI virtual topology functions []. The increase in size of parallel machines and in the number of threads in a parallel program requires parallel and distributed techniques for mapping. Gathering the entire communication graph on one processor and applying sequential centralized techniques will not be feasible in the future. Hence, an effort should be made towards developing strategies which are distributed, scalable, and can be run in parallel. Hierarchical multilevel graph partitioning techniques are one such effort in this direction.
Related Entries Cray XT and Cray XT Series of Supercomputers Cray XT and Seastar -D Torus Interconnect Graph Partitioning Hypercubes and Meshes IBM Blue Gene Supercomputer Infiniband Interconnection Networks Load Balancing, Distributed Memory Locality of Reference and Parallel Processing Processes, Tasks, and Threads Routing (Including Deadlock Avoidance) Space-Filling Curves Task Graph Scheduling
Bibliographic Notes and Further Reading Bokhari [] wrote one of the first papers on task mapping for parallel programs. A good discussion of the various objective functions used for comparing mapping algorithms can be found in []. Fox et al. [] divide the various mapping algorithms into physical optimization and heuristic techniques. Arunkumar et al. [] provide another classification into deterministic, randomized, and random start heuristics. Application developers attempting to map their parallel codes can gain insights from mapping algorithms developed by individual application groups [–]. Bhatele and Kale have been developing an automatic mapping framework for mapping of Charm++ and MPI applications to the processor topology []. They are also developing techniques for parallel and distributed topology aware mapping.
Bibliography . Bokhari SH () On the mapping problem. IEEE Trans Comput ():– . Kasahara H, Narita S () Practical multiprocessor scheduling algorithms for efficient parallel processing. IEEE Trans Comput :– . Arunkumar S, Chockalingam T () Randomized heuristics for the mapping problem. Int J High Speed Comput (IJHSC) ():– . Fitch BG, Rayshubskiy A, Eleftheriou M, Ward TJC, Giampapa M, Pitman MC () Blue matter: approaching the limits of concurrency for classical molecular dynamics. In: SC’: Proceedings of the ACM/IEEE conference on Supercomputing, Tampa, ACM Press, New York, – Nov . Gygi F, Draeger EW, Schulz M, Supinski BRD, Gunnels JA, Austel V, Sexton JC, Franchetti F, Kral S, Ueberhuber C, Lorenz J () Large-scale electronic structure calculations of high-Z metals on the blue gene/L platform. In: SC’: Proceedings of the ACM/IEEE conference on Supercomputing, ACM Press, New York . Bhatelé A, Bohm E, Kalé LV () Optimizing communication for Charm++ applications by reducing network contention. Concurr Comput ():– . Lee S-Y, Aggarwal JK () A mapping strategy for parallel processing. IEEE Trans Comput ():– . Berman F, Snyder L () On mapping parallel algorithms into parallel architectures. J Parallel Distrib Comput ():– . Ercal F, Ramanujam J, Sadayappan P () Task allocation onto a hypercube by recursive mincut bipartitioning. In: Proceedings of the rd conference on Hypercube concurrent computers and applications, ACM Press, New York, pp – . Agarwal T, Sharma A, Kalé LV () Topology-aware task mapping for reducing communication contention on large parallel
machines, In: Proceedings of IEEE International Parallel and Distributed Processing Symposium , Rhodes Island, – Apr . IEEE, Piscataway Yu H, Chung I-H, Moreira J () Topology mapping for blue gene/L supercomputer. In: SC’: Proceedings of the ACM/IEEE conference on Supercomputing, Tampa, – Nov . ACM, New York, p Bhatele A () Automating topology aware mapping for supercomputers. Ph.D. thesis, Dept. of Computer Science, University of Illinois. http://hdl.handle.net// (August ) Kernighan BW, Lin S () An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J ():– Bollinger SW, Midkiff SF () Processor and link assignment in multicomputers using simulated annealing. In: ICPP, vol , Aug , pp – Bhanot G, Gara A, Heidelberger P, Lawless E, Sexton JC, Walkup R () Optimizing task layout on the blue gene/L supercomputer. IBM J Res Dev (/):– Bhatele A, Gupta G, Kale LV, Chung I-H () Automated mapping of regular communication graphs on mesh interconnects. In: Proceedings of International Conference on High Performance Computing & Simulation (HiPCS) , Caen, June– July . IEEE, Piscataway Mansour N, Ponnusamy R, Choudhary A, Fox GC () Graph contraction for physical optimization methods: a quality-cost tradeoff for mapping data on parallel computers. In: ICS’: Proceedings of the th International Conference on Supercomputing, Tokyo, – July . ACM, New York, pp –
Torus Networks, Direct
Total Exchange Allgather
Trace Scheduling Stefan M. Freudenberger Zürich, Switzerland
Definition Trace scheduling is a global acyclic instruction scheduling technique in which the scheduling region consists
of a linear acyclic sequence of basic blocks embedded in the control flow graph. Trace scheduling differs from other global acyclic scheduling techniques by allowing the scheduling region to be entered after the first instruction. Trace scheduling was the first global instruction scheduling technique that was proposed and successfully implemented in both research and commercial compilers. By demonstrating that simple microcode operations could be statically compacted and scheduled on multi-issue hardware, trace scheduling provided the basis for making large amounts of instruction-level parallelism practical. Its first commercial implementation demonstrated that commercial codes could be statically compiled for multi-issue architectures, and thus greatly influenced and contributed to the performance of superscalar architectures. Today, the ideas of trace scheduling and its descendants are implemented in most compilers.
Discussion Introduction Global scheduling techniques are needed for processors that expose instruction-level parallelism (ILP), that is, processors that allow multiple operations to execute simultaneously. This situation may independently arise for two reasons: either because a processor issues more than a single operation during each clock cycle, or because a processor allows issuing independent operations while deeply pipelined operations are still executing. The number of independent operations that need to be found for an ILP processor is a function of both the number of operations issued per clock cycle, and the latency of operations, whether computational or memory. The latency of computational operations depends upon the design of the functional units. The latency of memory operations depends upon the design and latencies of caches and main memory, as well as on the availability of prefetch and cache-bypassing operations. Global scheduling techniques are needed for these processors because the number of independent operations available in a typical basic block is too small to fully utilize their available hardware resources. By expanding the scheduling region, more operations become available for scheduling. Global scheduling techniques differ from other global code motion techniques (such as
loop-invariant code motion or partial redundancy elimination) because they take into account the available hardware resources (such as available functional units and operation issue slots). Instruction scheduling techniques can be broadly classified based on the region that they schedule, and whether this region is cyclic or acyclic. Algorithms that schedule only single basic blocks are known as local scheduling algorithms; algorithms that schedule multiple basic blocks at once are known as global scheduling algorithms. Global scheduling algorithms that operate on entire loops of a program are known as cyclic scheduling algorithms, while methods that impose a scheduling barrier at the end of a loop body are known as acyclic scheduling algorithms. Global scheduling regions include regions consisting of a single basic block as a “degenerate” form of region, and acyclic schedulers may consider entire loops but, unlike cyclic schedulers, stop at the loops’ back edges (a back edge points to an ancestor in a depth-first traversal of the control flow graph; it captures the flow from one iteration of the loop to the start of the next iteration). All scheduling algorithms can benefit from hardware support. When control-dependent operations that can cause side effects move above their controlling branch, they need to be either executed conditionally so that their effects only arise if the operation is executed in the original program order, or any side effects must be delayed until the point at which the operation would have been executed originally. Hardware techniques to support this include predication of operations, implicit or explicit register renaming, and mechanisms to suppress or delay exceptions in order to prevent an incorrect exception to be signaled. Predication of operations controls whether the side effects of the predicated operations become visible to the program state through an additional predicate operand. The predicate operand can be implicit (such as the conditional execution of operations in branch delay slots depending on the outcome of the branch condition) or explicit (through an additional machine register operand); in the latter case, the predicate operand could simply be the same predicate that controls the conditional branch on which the operation was controldependent in the original flow graph (in which case the predicated operation could move just above a single conditional branch). Register renaming refers to the technique where additional machine registers are
used to hold the results of an operation until the point where the operation would have occurred in the original program order. Global scheduling algorithms principally consist of two phases: region formation and schedule construction. Algorithms differ in the shape of the region and in the global code motions permitted during scheduling. Depending on the region and the allowed code motions, compensation code needs to be inserted at appropriate places in the control flow graph to maintain the original program semantics; depending on the code motions allowed during scheduling, compensation code needs to be inserted during the scheduling phase of the compiler. Trace scheduling allows traces to be entered after the first operation and before the last operation. This complicates the determination of compensation code because the location of rejoin points cannot be determined before a trace has been scheduled. This leads to the following overall trace scheduling loop:
while (unscheduled operations remain) {
    select trace T
    construct schedule for T
    bookkeeping:
        determine rejoin points to T
        generate compensation code
}
The remainder of this entry first discusses region formation and schedule construction in general and as it applies to trace scheduling, and then compares trace scheduling to other acyclic global scheduling techniques. Cyclic scheduling algorithms are discussed elsewhere.
Region Formation – Trace Picking Traces were the first global scheduling region proposed, and represent contiguous linear paths through the code (Fig. ). More formally, a trace consists of the operations of a sequence of basic blocks B1, B2, …, Bn with the properties that:
● Each basic block is a predecessor of the next in the sequence (i.e., for each k = 1, …, n − 1, Bk is a predecessor of Bk+1, and Bk+1 is a successor of Bk in the control flow graph).
Trace Scheduling. Fig. Trace selection. The left diagram shows the selected trace. The right diagram illustrates the mutual-most-likely trace picking heuristic: assume that A is the last operation of the current trace, and that B is one of A’s successors. Here B is the most likely successor of A, and A is the most likely predecessor of B
● For any j, k there is no path Bj → Bk → Bj except for those that include B1 (i.e., the code is cycle free, except that the entire region can be part of some encompassing loop).
Note that this definition does not exclude forward branches within the region, nor control flow that leaves the region and reenters it at a later point. This generality has been controversial in the research community because many felt that the added complexity of its implementation was not justified by its added benefit and has led to several alternative approaches that are discussed below. Of the many ways in which one can form traces, the most popular algorithm employs the following simple trace formation algorithm: ●
Pick the as-yet unscheduled operation with the largest expected execution frequency as the seed operation of the trace.
● Grow the trace both forward in the direction of the flow graph as well as backward, picking the mutually most-likely successor (predecessor) operation to the currently last (first) operation on the trace.
● Stop growing a trace when either no mutually most-likely successor (predecessor) exists, or when some heuristic trace length limit has been reached.
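This growth loop can be sketched in a few lines; the version below is only an illustration under simplifying assumptions (edge frequencies available as freq[(a, b)] for every CFG edge, and a set of operations already placed on earlier traces), not the Multiflow implementation. The mutuality test it relies on is defined formally just after this sketch.

# Hedged sketch of mutual-most-likely trace growing. succ/pred map an operation
# to its CFG successors/predecessors; freq[(a, b)] is the expected execution
# frequency of edge a -> b; `scheduled` holds operations on earlier traces.
def most_likely(candidates, excluded, weight):
    free = [c for c in candidates if c not in excluded]
    return max(free, key=weight) if free else None

def pick_trace(seed, succ, pred, freq, scheduled, max_len=64):
    trace = [seed]
    while len(trace) < max_len:                    # grow forward
        a = trace[-1]
        b = most_likely(succ.get(a, ()), scheduled | set(trace),
                        lambda c: freq[(a, c)])
        # mutuality: a must also be b's most likely not-yet-scheduled predecessor
        if b is None or most_likely(pred.get(b, ()), scheduled,
                                    lambda c: freq[(c, b)]) != a:
            break
        trace.append(b)
    while len(trace) < max_len:                    # grow backward (symmetric)
        b = trace[0]
        a = most_likely(pred.get(b, ()), scheduled | set(trace),
                        lambda c: freq[(c, b)])
        if a is None or most_likely(succ.get(a, ()), scheduled,
                                    lambda c: freq[(a, c)]) != b:
            break
        trace.insert(0, a)
    return trace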
The mutually most-likely successor S of an operation P is the operation with the properties that: ● ●
S is the most likely successor of P; P is the most likely predecessor of S.
For this definition, it is immaterial whether the likelihood that S follows P (P precedes S) is based on available profile data collected during earlier runs of the program, has been determined by a synthetic profile, or is based on source annotations in the program. Of course, the more benefit is derived from having picked the correct trace, the greater is the penalty when picking the wrong trace. Trace picking is the region formation technique used for trace scheduling. Other acyclic region formation techniques and their relationship to trace scheduling are discussed below.
Region Enlargement Trace selection alone typically does not expose enough ILP for the instruction scheduler of a typical ILP processor. Once the limit on the length of a “natural” trace has been reached (e.g., the entire loop body), region-enlargement techniques can be employed to further increase the size of the region, albeit at the cost of a larger code size for the program. Many enlargement techniques exploit the fact that programs iterate
and grow the size of a region by making extra copies of highly iterated code, leading to a larger region that contains more ILP. These code-replicating techniques have been criticized by advocates of other approaches, such as cyclic scheduling and loop-level parallel processing, because comparable benefits to larger schedule regions may be found using other techniques. However, no study appears to exist that quantifies such claims. The simplest and oldest region-enlargement technique is loop unrolling (Fig. ): to unroll a loop, duplicate its body several times, change the targets of the back edges of each copy but the last to point to the header of the next copy (so that the back edges of the last copy point back to the loop header of the first copy). Variants of loop unrolling include pre-/post-conditioning of a loop by k for counted for loops with unknown loop bounds (leading to two loops: a “fixup loop” that executes up to k iterations; and a “main loop” that is unrolled by k and has its internal exits removed; the fixup loop can precede or follow the main loop), and loop peeling by the expected small iteration count. When the iteration count of the fixup loop of a pconditioned loop is small (which it typically is), the fixup loop is completely unrolled. Typically, loop unrolling is done before region formation so that the enlarged region becomes available
Original loop:
L: if … goto E
   body
   goto L
E:

Unrolled by 4:
L: if … goto E
   body
   if … goto E
   body
   if … goto E
   body
   if … goto E
   body
   goto L
E:

Pre-conditioned by 4:
   if … goto L
   body
   if … goto L
   body
   if … goto L
   body
L: if … goto E
   body
   body
   body
   body
   goto L
E:

Post-conditioned by 4:
L: if … goto X
   body
   body
   body
   body
   goto L
X: if … goto E
   body
   if … goto E
   body
   if … goto E
   body
E:

Trace Scheduling. Fig. Simplified illustration of variants of loop unrolling. "if" and "goto" represent the loop control operations; "body" represents the part of the loop without loop-related control flow. In the general case (e.g., a while loop) the loop exit tests remain inside the loop. This is shown in the second variant ("Unrolled by 4"). For counted loops (i.e., for loops), the compiler can condition the unrolled loop so that the loop conditions can be removed from the main body of the loop. Two variants of this are shown last. Modern compilers will typically precede the loop with a zero trip count test and place the loop condition at the bottom of the loop. This removes the unconditional branch from the loop
Unrolled by 4:
   i = 0
L: if (i > N) goto E
   body(i)
   i = i + 1
   if (i > N) goto E
   body(i)
   i = i + 1
   if (i > N) goto E
   body(i)
   i = i + 1
   if (i > N) goto E
   body(i)
   i = i + 1
   goto L
E:

After renaming:
   i = 0
L: if (i > N) goto E
   body(i)
   i1 = i + 1
   if (i1 > N) goto E
   body(i1)
   i2 = i1 + 1
   if (i2 > N) goto E
   body(i2)
   i3 = i2 + 1
   if (i3 > N) goto E
   body(i3)
   i = i3 + 1
   goto L
E:

After copy propagation:
   i = 0
L: if (i > N) goto E
   body(i)
   i1 = i + 1
   if (i1 > N) goto E
   body(i1)
   i2 = i + 2
   if (i2 > N) goto E
   body(i2)
   i3 = i + 3
   if (i3 > N) goto E
   body(i3)
   i = i + 4
   goto L
E:

Trace Scheduling. Fig. Typical induction variable manipulations for loops. Downward arrows represent flow dependences; upward arrows represent anti dependences. Only the critical dependences are shown
to the region selector. This is done to keep the region selector simpler but may lead to phase-ordering issues, as loop unrolling has to guess the “optimal” unroll amount. At the same time, when loops are unrolled before region formation then the resulting code can be scalar optimized in the normal fashion; in particular height-reducing transformations that remove dependences between the individual copies of the unrolled loop body can expose a larger amount of parallelism between the individual iterations (Fig. ). Needless to say, if no parallelism between the iterations exists or can be found, loop unrolling is ineffective. Loop unrolling in many industrial compilers is often rather effective because a heuristically determined small amount of unrolling is sufficient to fill the resources of the target machine.
Region Compaction – Instruction Scheduler Once the scheduling region has been selected, the instruction scheduler assigns functional units of the target machine and time slots in the instruction schedule to each operation of the region. In doing so, the scheduler attempts to minimize an objective cost function while maintaining program semantics and obeying the resource limitations of the target architecture. Often, the objective cost function is the expected execution time,
but other objective functions are possible (for example, code size and energy efficiency could be part of an objective function). The semantics of a program defines certain sequential constraints or dependences that must be maintained by a valid execution. These dependences preclude some reordering of operations within a program. The data flow of a program imposes data dependences, and the control flow of a program imposes control dependences. (Note the difference between control flow and control dependence: block B is control dependent on block A if A precedes B along some path, but B does not postdominate A. In other words, the result of the control decision made in A directly affects whether or not B is executed.) There are three types of data dependences: read-after-write dependences (also called RAW, flow, or true dependences), write-after-read dependences (also called WAR or anti dependences), and write-after-write dependences (also called WAW or output dependences). The latter two types are also called false dependences because they can be removed by renaming. There are two types of control dependences: split dependences may prevent operations from moving below the exit of a basic block, and join dependences may prevent operations from moving above the entrance to a basic block. Control dependence does not constrain the relative order of operations within a
basic block but rather expresses constraints on moving operations between basic blocks. Both data and control dependences represent ordering constraints on the program execution, and hence induce a partial ordering on the operations. Any partial ordering can be represented as a directed acyclic graph (DAG), and DAGs are indeed often used by scheduling algorithms. Variants to the simple DAG are the data dependence graph (DDG), and the program dependence graph (PDG). All these graphs represent operations as nodes and dependences as edges (some graphs only express data dependences, while others include both data and control dependences).
Code Motion Between Adjacent Blocks Two fundamental techniques, predication and speculation, are employed by schedulers (or earlier phases) to transform or remove control dependence. While it is sometimes possible to employ either technique, they represent independent techniques, and usually one is more natural to employ in a given situation. Speculation is used to move operations above a branch that is highly weighted in one direction; predication is used to collapse short sequences of alternative operations following a branch that is nearly equally likely in each direction. Predication can also play an important role in software pipelining. Speculative code motion (or code hoisting and sometimes code sinking) moves operations above controldominating branches (or below joins for sinking). In principle, this transformation does not always maintain the original program semantics, and in particular it may change the exception behavior of the program. If an operation may generate an exception and the exception recovery model does not allow speculative exceptions to be dismissed (ignored), then the compiler must generate recovery code that raises the exception at the original program point of the speculated operation. Unlike predication, speculation actually removes control dependences, and thus potentially reduces the length of the critical path of execution. Depending on the shape and size of recovery code, and if multiple operations are speculated, the addition of recovery code can lead to a substantial amount of code. Predication is a technique where with hardware support operations have an additional input operand, the predicate operand, which determines whether any
T
effects of executing the operations are seen by the program execution. Thus, from an execution point of view, the operation is conditionally executed under the control of the predicate input. Hence changing a controldependent operation to its predicated equivalent that depends on a predicate that is equivalent to the condition of the control dependence turns control dependence into data dependence.
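As a small illustration of this if-conversion, the sketch below (an invented three-operand form, not any particular machine's ISA) shows how a control-dependent assignment such as "if p: x = a + b" becomes a predicated operation "(p) x = a + b" that always issues but only commits its result when the predicate holds.

# Hedged sketch: interpret predicated instructions of the form
# (predicate_or_None, destination, function, source_names).
def run_predicated(instrs, env):
    for pred, dest, fn, srcs in instrs:
        result = fn(*(env[s] for s in srcs))   # compute unconditionally
        if pred is None or env[pred]:          # commit only if the predicate holds
            env[dest] = result
    return env

env = {"a": 2, "b": 3, "p": False, "x": 0}
prog = [("p", "x", lambda a, b: a + b, ("a", "b"))]   # (p) add x = a, b
print(run_predicated(prog, env)["x"])   # 0: the write is squashed since p is false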
Trace Compaction There are many different scheduling techniques, which can broadly be classified by features into cycle versus operation scheduling, linear versus graph-based, cyclic versus acyclic, and greedy versus backtracking. However, for trace scheduling itself the scheduling technique employed is not of major concern; rather, trace scheduling distinguishes itself from other global acyclic scheduling techniques by the way the scheduling region is formed, and by the kind of code motions permitted during scheduling. Hence these techniques will not be described here, and in the following, a greedy graph-based technique, namely list scheduling, will be used.
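A minimal greedy list scheduler, working cycle by cycle, might look like the sketch below. This is an illustrative skeleton only (per-opcode latencies, a DAG of dependences, and a fixed issue width are assumptions), not the scheduler of any particular compiler.

# Hedged sketch of greedy cycle-by-cycle list scheduling. ops: name -> latency;
# deps: list of (pred, succ) dependence edges (assumed acyclic); width: issue slots.
def list_schedule(ops, deps, width):
    preds = {o: set() for o in ops}
    for a, b in deps:
        preds[b].add(a)
    finish, schedule, cycle = {}, {}, 0
    unscheduled = set(ops)
    while unscheduled:
        # An operation is ready once all its predecessors have produced results.
        ready = [o for o in unscheduled
                 if all(p in finish and finish[p] <= cycle for p in preds[o])]
        # Simple greedy priority: issue longer-latency operations first.
        ready.sort(key=lambda o: -ops[o])
        for o in ready[:width]:
            schedule[o] = cycle
            finish[o] = cycle + ops[o]
            unscheduled.remove(o)
        cycle += 1
    return schedule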
Compensation Code During scheduling, typically only a very small number of operations can be moved freely between basic blocks without changing program semantics. Other operations may be moved only when additional compensation code is inserted at an appropriate place in order to maintain original program semantics. Trace scheduling is quite general in this regard. Recall that a trace may be entered after the first instruction, and exited before the last instruction. In addition, trace scheduling allows operations in the region (trace) to move freely during scheduling relative to entries (join points) to and exits (split points) from the current trace. A separate bookkeeping step restores the original program semantics after trace compaction through the introduction of compensation code. It is this freedom of code motion during scheduling, and the introduction of compensation code between the scheduling of individual regions, that represents a major difference between trace scheduling and other acyclic scheduling techniques.
Trace Scheduling. Fig. Basic scenarios for compensation code: (a) no compensation, (b) split compensation, (c) join compensation, (d) join–split compensation. In each diagram, the left part shows the selected trace, the right part shows the compacted code where operation B has moved above operation A
Since trace scheduling allows operations to move above join points as well as below split points (conditional branches) in the original program order, the bookkeeping process includes the following kinds of compensation. Note that a complete discussion of all the intricacies of compensation code is well beyond the scope of this entry; however, the following is a list of the simple concepts that form the basis of many of the compensation techniques used in compilers. No compensation (Fig. a). If the global motion of an operation on the trace does not change the relative order of operations with respect to split and join points, no compensation code is needed. This covers the situation when an operation moves above a split, in which case the operation becomes speculative, and requires compensation depending on the recovery model of exceptions: in the case of dismissible speculation, no compensation code is needed; in the case of recovery speculation, the compiler has to emit a recovery block to guarantee the timely delivery of exceptions for correctly speculated operations.
Split compensation (Fig. b). When an operation A moves below a split operation B (i.e., a conditional branch), a copy of A (called A′ ) must be inserted on the off-trace split edge. When multiple operations move below a split operation, they are all copied on the offtrace edge in source order. These copies are unscheduled, and hence will be picked and scheduled later during the trace scheduling of the program. Join compensation (Fig. c). When an operation B moves above a join point A, a copy of B (called B′ ) must be copied on the off-trace join edge. When multiple operations move above a join point, they are all copied on the off-trace edge in source order. Join–Split compensation (Fig. d). When splits are allowed to move above join points, the situation becomes more complicated: when the split is copied on the rejoin edge, it must account for any split compensation and therefore introduce additional control paths with additional split copies. These rules define the compensation code required to correctly maintain the semantics of the original
Trace Scheduling. Fig. Compensation copy suppression. The left diagram shows the selected trace. The middle diagram shows the compacted code where operation C has moved above operation A together with the normal join compensation. The right diagram shows the result of compensation copy suppression assuming that C is available at Y
program. The following observations can be used to heuristically control the amount of compensation code that is generated. To limit split compensation, the Multiflow Trace Scheduling compiler [], the first commercial compiler to implement trace scheduling, required that all operations that precede a split on the trace precede the split on the schedule. While this limits the amount of available parallelism, the intuitive explanation is that a trace represents the most likely execution path; the ontrace performance penalty of this restriction is small; and off-trace the same operations would have to be executed in the first place. Multiflow’s implementation excluded memory-store operations from this heuristic because in Multiflow’s Trace architecture stores were unconditional and hence could not move above splits; they were allowed to move below splits to avoid serialization between stores and loop exits in unrolled loops. The Multiflow compiler also restricted splits to remain in source order. Not only did this reduce the amount of compensation code, it also ensured that all paths created by compensation code are subsets of paths (possibly rearranged) in the flow graph before trace scheduling. Another observation concerns the possible suppression of compensation copies [] (Fig. ): sometimes an operation C that moves above a join point following an operation B actually moves to a position on the trace that dominates the join point. When this happens, and the result of C is still available at the join point, no copy of C is needed. This situation often arises when
loops with internal branches are unrolled. Without copy suppression, such loops can generate large amounts of redundant compensation code.
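The split and join rules described above can be phrased operationally. The following sketch is a simplified illustration (it ignores join–split interactions and copy suppression, and assumes each operation appears once on the trace): given the original trace order and the scheduled order, it lists which operations must be copied onto each off-trace edge.

# Hedged sketch of bookkeeping for a scheduled trace. original: ops in source
# order; scheduled: the same ops in scheduled order; splits/joins: operations
# that are off-trace conditional branches / join points onto the trace.
def compensation_copies(original, scheduled, splits, joins):
    pos = {op: i for i, op in enumerate(scheduled)}
    src = {op: i for i, op in enumerate(original)}
    split_copies = {s: [op for op in original
                        if src[op] < src[s] and pos[op] > pos[s]]
                    for s in splits}     # ops that moved below the split
    join_copies = {j: [op for op in original
                       if src[op] > src[j] and pos[op] < pos[j]]
                   for j in joins}       # ops that moved above the join
    return split_copies, join_copies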
Bibliographic Notes and Further Reading The simplest form of a scheduling region is a region where all operations come from a single-entry single-exit straight-line piece of code (i.e., a basic block). Since these regions do not contain any internal control flow, they can be scheduled using simple algorithms that maintain the partial order given by data dependences. (For simplicity, it is best to require that operations that could incur an exception must end their basic block, allowing the exception to be caught by an exception handler.) Traces and trace scheduling were the first regionscheduling techniques proposed. They were introduced by Fisher [, ] and described more carefully in Ellis’ thesis []. By demonstrating that simple microcode operations could be statically compacted and scheduled on multi-issue hardware trace scheduling provided the basis for making VLIW machines practical. Trace scheduling was implemented in the Multiflow compiler []; by demonstrating that commercial codes could be statically compiled for multi-issue architectures, this work also greatly influenced and contributed to the performance of superscalar architectures. Today, ideas of trace scheduling and its descendants are implemented in most compilers (e.g., GCC, LLVM, Open, Pro, as well as commercial compilers).
Trace scheduling inspired several other global acyclic scheduling techniques. The most important linear acyclic region-scheduling techniques are presented next.
Superblocks
Hwu and his colleagues on the IMPACT project developed a variant of trace scheduling called superblock scheduling. Superblocks are traces with the added restriction that the superblock must be entered at the top [, ]. Hence superblocks can be joined only before the first or after the last operation of the superblock. As such, superblocks are single-entry, multiple-exit traces. Since superblocks do not contain join points, scheduling a superblock cannot generate any join or join–split compensation. By also prohibiting motion below splits, superblock scheduling avoids the need to generate compensation code outside the region being scheduled, and hence does not require a separate bookkeeping step. With these restrictions, superblock formation can be completed before scheduling starts, which simplifies its implementation. Superblock formation often includes a technique called tail duplication to increase the size of the superblock: tail duplication copies any operations that follow a rejoin in the original control flow graph and that are part of the superblock into the rejoin edge, thus effectively lowering the rejoin point to the end of the superblock. This is done at superblock formation time, before any compaction takes place []. A variant of superblock scheduling that allows speculative code motion is sentinel scheduling [].
Hyperblocks
A different approach to global acyclic scheduling also originated with the IMPACT project. Hyperblocks are superblocks that have eliminated internal control flow using predication []. As such, hyperblocks are single-entry, multiple-exit traces (superblocks) that use predication to eliminate internal control flow.
Treegions
Treegions [, ] consist of the operations from a list of basic blocks B1, B2, …, Bn with the properties that:
● For each j > 1, Bj has exactly one predecessor.
● For each j > 1, the predecessor Bi of Bj is also on the list, where i < j.
Hence, treegions represent trees of basic blocks in the control flow graph. Since treegions do not contain any side entrances, each path through a treegion yields a superblock. Like superblock compilers, treegion compilers employ tail duplication and other region-enlarging techniques. More recent work by Zhou and Conte [, ] shows that treegions can be made quite effective without significant code growth.
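The two defining properties translate directly into a small membership test; the Java sketch below assumes the control flow graph is given as a predecessor map keyed by hypothetical block names.

    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    final class TreegionCheck {
        // blocks = B1, B2, ..., Bn in list order; preds maps a block to its CFG predecessors.
        static boolean isTreegion(List<String> blocks, Map<String, Set<String>> preds) {
            for (int j = 1; j < blocks.size(); j++) {            // every Bj with j > 1
                Set<String> p = preds.getOrDefault(blocks.get(j), Set.of());
                if (p.size() != 1) return false;                 // exactly one predecessor
                String pred = p.iterator().next();
                int i = blocks.indexOf(pred);
                if (i < 0 || i >= j) return false;               // predecessor Bi must be on the list with i < j
            }
            return true;
        }
    }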
Nonlinear Regions
Nonlinear region approaches include percolation scheduling [] and DAG-based scheduling []. Trace scheduling- [] extends treegions by removing the restriction on side entrances. However, its implementation proved so difficult that its proposer eventually gave up on it, and no formal description or implementation of it is known to exist.
Related Entries
Modulo Scheduling and Loop Pipelining
Bibliography . Aiken A, Nicolau A () Optimal loop parallelization. In: Proceedings of the SIGPLAN conference on programming language design and implementation, June , pp – . Chang PP, Warter NJ, Mahlke SA, Chen WY, Hwu WW () Three superblock scheduling models for superscalar and superpipelined processors. Technical Report CRHC--. Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign . Chang PP, Mahlke SA, Chen WY, Warter NJ, Hwu WW () IMPACT: an architectural framework for multiple-instructionissue processors. In: Proceedings of the th annual international symposium on computer architecture, May , pp – . Ellis JR () Bulldog: a compiler for VLIW architectures. PhD thesis, Yale University . Fisher JA () Global code generation for instruction-level parallelism: trace scheduling-. Technical Report HPL--. Hewlett-Packard Laboratories . Fisher JA () Trace scheduling: a technique for global microcode compaction, IEEE Trans Comput, July , ():– . Fisher JA () The optimization of horizontal microcode within and beyond basic blocks. PhD dissertation. Technical Report COO--. Courant Institute of Mathematical Sciences, New York University, New York, NY
. Freudenberger SM, Gross TR, Lowney PG () Avoidance and suppression of compensation code in a trace scheduling compiler, ACM Trans Program Lang Syst, July , ():– . Havanki WA () Treegion scheduling for VLIW processors. MS thesis. Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC . Havanki WA, Banerjia S, Conte TM () Treegion scheduling for wide issue processors. In: Proceedings of the fourth international symposium on high-performance computer architecture, February , pp – . Hwu WW, Mahlke SA, Chen WY, Chang PP, Warter NJ, Bringmann RA, Ouellette RG, Hank RE, Kiyohara T, Haab GE, Holm JG, Lavery DM (May ) The superblock: an effective technique for VLIW and superscalar compilation. J Supercomput, (–):– . Lowney PG, Freudenberger SM, Karzes TJ, Lichtenstein WD, Nix RP, O’Donnell JS, Ruttenberg JC () The Multiflow trace scheduling compiler, J Supercomput, May , (-):– . Mahlke SA, Lin DC, Chen WY, Hank RE, Bringmann RA () Effective compiler support for predicated execution using the hyperblock. In: Proceedings of the th annual international symposium on microarchitecture, , pp – . Mahlke SA, Chen WY, Bringmann RA, Hank RE, Hwu WW, Rau BR, Schlansker MS () Sentinel scheduling: a model for compiler-controlled speculative execution, ACM Trans Comput Syst, November , ():– . Moon SM, Ebcioglu K () Parallelizing nonnumerical code with selective scheduling and software pipelining, ACM Trans Program Lang Syst, November , ():– . Zhou H, Conte TM () Code size efficiency in global scheduling for ILP processors. In: Proceedings of the sixth annual workshop on the interaction between compilers and computer architectures, February , pp – . Zhou H, Jennings MD, Conte TM () Tree traversal scheduling: a global scheduling technique for VLIW/EPIC processors. In: Proceedings of the th annual workshop on languages and compilers for parallel computing, August , pp –
Trace Theory
Volker Diekert, Anca Muscholl
Universität Stuttgart FMI, Stuttgart, Germany
Université Bordeaux, Talence, France
Synonyms
Partial commutation; Theory of Mazurkiewicz-traces
Definition
Trace Theory denotes a mathematical theory of free partially commutative monoids from the perspective of concurrent or parallel systems. Traces, or equivalently, elements in a free partially commutative monoid, are given by a sequence of letters (or atomic actions). Two sequences are assumed to be equal if they can be transformed into each other by equations of type ab = ba, where the pair (a, b) belongs to a predefined relation between letters. This relation is usually called partial commutation or independence. With an empty independence relation, that is, without independence, the setting coincides with the classical theory of words or strings.
Discussion
Introduction
The analysis of sequential programs describes a run of a program as a sequence of atomic actions. On an abstract level such a sequence is simply a string in a free monoid over some (finite) alphabet of letters. This purely abstract viewpoint embeds program analysis into a rich theory of combinatorics on words and a theory of automata and formal languages. The approach has been very fruitful from the early days, when the first compilers were written, until now, when research groups in academia and industry develop formal methods for verification. Efficient compilers use autoparallelization, which provides a natural example of independence of actions resulting in a partial commutation relation. For example, let a; b; c; a; d; e; f be a sequence of arithmetic operations where:
(a) x := x + y,
(b) x := x − z,
(c) y := y⋅z,
(d) w := w,
(e) z := y⋅z,
(f) z := x + y⋅w.
A concurrent-read-exclusive-write protocol yields a list of pairs of independent operations (a, d), (a, e), (b, c), (b, d), (c, d), and (d, e), which can be performed concurrently or in any order. The sequence can therefore be performed in four parallel steps {a}; {b, c}; {a, d, e}; {f}, but as d commutes with a, b, c, the result of a; b; c; a; d; e; f is equal to a; d; b; c; a; e; f, and two processors are actually enough to guarantee minimal parallel execution time, since another possible schedule is {a, d}; {b, c}; {a, e}; {f}. Trace theory yields a tool to do such (data-independent) transformations automatically.
Parallelism and concurrency demand specific models, because a purely sequential description is neither accurate nor possible in all cases, for example, if asynchronous algorithms are studied and implemented. Several formalisms have been proposed in this context. Among these models there are Petri nets, Hoare's CSP and Milner's CCS, event structures, and branching temporal logics. The mathematical analysis of Petri nets is however quite complicated, and much of the success of Hoare's and Milner's calculi is due to the fact that they stay close to the traditional concept of sequential systems, relying on a unified and classical theory of words. Trace theory follows the same paradigm; it enriches the theory of words by a very restricted, but essential, formalism to capture the main aspects of parallelism: In a static way a set I of independent pairs of letters (a, b) is fixed, and sequences are identified if they can be transformed into each other by using equations of type ab = ba for (a, b) ∈ I. In computer science this approach appeared for the first time in the paper by Keller on Parallel Program Schemata and Maximal Parallelism published in . Based on the ideas of Keller and the behavior of elementary net systems, Mazurkiewicz introduced the notion of trace theory and made its concept popular to a wider computer science community. Mazurkiewicz's approach relies on a graphical representation for a trace. This is a node-labeled directed acyclic graph, where arcs are defined by the dependence relation, which is by definition the complement of the independence relation I. Thereby, a concurrent run has an immediate graphical visualization, which is obviously convenient for practice. The two parallel executions {a}; {b, c}; {a, d, e}; {f} and {a, d}; {b, c}; {a, e}; {f} can be depicted as follows; the pictures represent (the Hasse diagrams of) isomorphic labeled partial orders.
[Figure: the two Hasse diagrams, with nodes labeled a, b, c, d, e, f, depicting the isomorphic labeled partial orders of the two executions]
Moreover, the graphical representation yields immediately a correct notion of infinite trace, which is not clear when working with partial commutations. In the following years it became evident that trace theory indeed copes with some important phenomena such
as true concurrency. On the other hand it is still close to the classical theory of word languages describing sequential programs. In particular, it is possible to transfer the notion of finite sequential state control to the notion of asynchronous state control. This important result is due to Zielonka; it is one of the highlights of the theory. There is a satisfactory theory of recognizable languages relating finite monoids, rational operations, asynchronous automata, and logic. This leads to decidability results and various effective operations. Moreover, it is possible to develop a theory of asynchronous Büchi automata, which enables in trace theory the classical automata theory-based approach to automated verification.
Mathematical Definitions and Normal Forms
Trace theory is founded on a rigorous mathematical approach. The underlying combinatorics of partial commutation were studied in mathematics already in the seminal Lecture Notes in Mathematics volume Problèmes combinatoires de commutation et réarrangements by Cartier and Foata. The mathematical setting uses a finite alphabet Σ of letters and the specification of a symmetric and irreflexive relation I ⊆ Σ × Σ, called the independence relation. Conveniently, its complement D = (Σ × Σ) ∖ I is called the dependence relation. The dependence relation has a direct interpretation as a graph as well; for the dependence relation used in the first example above it looks as follows:
[Figure: the dependence graph D on the letters a, b, c, d, e, f of the first example]
The intended semantics is that independent letters commute, but dependent letters must be ordered. Taking ab = ba with (a, b) ∈ I as defining relations one obtains a quotient monoid M(Σ, I), which has been called the free partially commutative monoid or simply the trace monoid in the literature. Its elements are finite (Mazurkiewicz-)traces. For I = ∅, traces are just words in Σ∗; for a full independence relation, that is, D = id_Σ, traces are vectors in some N^k, hence Parikh-images of words. The general philosophy is that the extrema Σ∗ and N^k are
well understood (which is far from being true), but the interesting and difficult problems arise when M(Σ, I) is neither free nor commutative. For effective computations and the design of algorithms appropriate normal forms can be used. For the lexicographic normal form it is assumed that the alphabet Σ is totally ordered, say a < b < c < ⋯ < z. This defines a lexicographic ordering on Σ∗ exactly the same way words are ordered in a standard dictionary. The lexicographic normal form of a trace is the minimal word in Σ∗ representing it. For example, if I is given by {(a, d), (d, a), (b, c), (c, b)}, then the trace defined by the sequence badacb is the congruence class of six words: {baadbc, badabc, bdaabc, baadcb, badacb, bdaacb}. Its lexicographic normal form is the first word, baadbc. An important property of lexicographic normal forms has been stated by Anisimov and Knuth. A word is in lexicographic normal form if and only if it does not contain a forbidden pattern, which is a factor bua where a < b ∈ Σ and the letter a commutes with all letters appearing in bu ∈ Σ∗. As a consequence, the set of lexicographic normal forms is a regular language. The other main normal form is due to Foata. It is a normal form that encodes a maximal parallel execution. Its definition uses steps, where a step means here a subset F ⊆ Σ of pairwise independent letters; thus, a step requires only one parallel execution step. A step F yields a trace by taking the product Πa∈F a over all its letters in any order. The Foata normal form is a sequence of steps F1 ⋯ Fk such that F1, . . . , Fk are chosen from left to right with maximal cardinality. The sequence {a, d}; {b, c}; {a, e}; {f} above is the Foata normal form of abcadef. The graphical representation of a trace due to Mazurkiewicz can be viewed as a third normal form. It is called the dependence graph representation, and it is closely related to the Foata normal form. Say a trace t is specified by some sequence of letters t = a1 ⋯ an. Each index i ∈ V = {1, . . . , n} is labeled by the letter ai, and arcs (i, j) ∈ E are introduced if and only if both (ai, aj) ∈ D and i < j. In this way an acyclic directed graph G(t) is defined which is another unique representation of t. The information about t is also contained in the induced partial order (i.e., the transitive closure
of G(t)) or in its Hasse-diagram (i.e., removing all transitive arcs from G(t)).
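The forbidden-pattern characterization of Anisimov and Knuth gives an easy test for lexicographic normal forms. The following Java sketch hard-codes the independence relation of the badacb example and orders letters by their character codes; both choices are assumptions made only for the illustration.

    import java.util.Set;

    final class LexNormalForm {
        // Symmetric independence relation I of the example: a,d commute and b,c commute.
        static final Set<String> I = Set.of("ad", "da", "bc", "cb");

        static boolean independent(char x, char y) { return I.contains("" + x + y); }

        // A word is a lexicographic normal form iff it contains no factor b u a
        // with a < b such that a commutes with b and with every letter of u.
        static boolean isLexNormalForm(String w) {
            for (int j = 1; j < w.length(); j++) {
                char a = w.charAt(j);
                for (int i = j - 1; i >= 0; i--) {
                    char b = w.charAt(i);
                    if (!independent(a, b)) break;   // a cannot be moved past b
                    if (a < b) return false;         // forbidden pattern found
                }
            }
            return true;
        }

        public static void main(String[] args) {
            System.out.println(isLexNormalForm("baadbc"));  // true: the normal form
            System.out.println(isLexNormalForm("badacb"));  // false: equivalent, but not minimal
        }
    }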
Computation of Normal Forms
There are efficient algorithms that compute normal forms in polynomial time. A very simple method uses a stack for each letter of the alphabet Σ. An input word is scanned from right to left, so the last letter is read first. When processing a letter a, it is pushed on its own stack, and a marker is pushed on the stack of every letter b (b ≠ a) that does not commute with a. Once the word has been processed, its lexicographic normal form, the Foata normal form, and the Hasse-diagram of the dependence graph representation can be obtained straightforwardly. For example, the sequence a; b; c; a; d; e; f (with the dependence relation depicted above) yields stacks as follows:
[Figure: the resulting per-letter stacks after scanning a; b; c; a; d; e; f from right to left]
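A sketch of the stack method in Java, specialized to the running example a; b; c; a; d; e; f: the word is scanned from right to left, each letter is pushed on its own stack, and a marker ('#', an arbitrary symbol chosen here) is pushed on the stacks of all dependent letters; the Foata steps are then peeled off from the stack tops. The hard-coded alphabet and dependence relation are assumptions of the example, not part of any library.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    final class FoataViaStacks {
        static final String ALPHABET = "abcdef";
        // Independence relation I of the running example; everything else is dependent.
        static final Set<String> INDEP =
            Set.of("ad", "da", "ae", "ea", "bc", "cb", "bd", "db", "cd", "dc", "de", "ed");

        static boolean dependent(char x, char y) { return !INDEP.contains("" + x + y); }

        static List<Set<Character>> foataNormalForm(String word) {
            Map<Character, Deque<Character>> stacks = new HashMap<>();
            for (char a : ALPHABET.toCharArray()) stacks.put(a, new ArrayDeque<>());
            // Scan right to left: push the letter on its own stack and a marker '#'
            // on the stack of every other letter that depends on it.
            for (int i = word.length() - 1; i >= 0; i--) {
                char c = word.charAt(i);
                for (char b : ALPHABET.toCharArray()) {
                    if (b == c) stacks.get(b).push(c);
                    else if (dependent(b, c)) stacks.get(b).push('#');
                }
            }
            // Repeatedly peel off a maximal step: all letters sitting on top of their own stack.
            List<Set<Character>> steps = new ArrayList<>();
            while (stacks.values().stream().anyMatch(s -> !s.isEmpty())) {
                Set<Character> step = new TreeSet<>();
                for (char a : ALPHABET.toCharArray()) {
                    Deque<Character> s = stacks.get(a);
                    if (!s.isEmpty() && s.peek() == a) step.add(a);
                }
                for (char a : step) {
                    stacks.get(a).pop();                        // remove the letter itself
                    for (char b : ALPHABET.toCharArray())       // and one marker per dependent letter
                        if (b != a && dependent(a, b)) stacks.get(b).pop();
                }
                steps.add(step);
            }
            return steps;
        }

        public static void main(String[] args) {
            // Prints [[a, d], [b, c], [a, e], [f]], the Foata normal form stated above.
            System.out.println(foataNormalForm("abcadef"));
        }
    }

The lexicographic normal form can be read off the same stacks by always removing the smallest letter currently on top of its own stack (instead of a whole step) and popping its markers in the same way.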
Regular Sets
A fundamental concept in formal languages is the notion of a regular set. Kleene's Theorem says that a regular set can be specified either by a finite deterministic (resp. nondeterministic) automaton, DFA (resp. NFA), or, equivalently, by a regular expression. Regular expressions are also called rational expressions. They are defined inductively by saying that every finite set denotes a rational expression, and if R, S are rational, then R ∪ S, R ⋅ S, and R∗ are rational expressions, too. The semantics of a rational expression is defined in any monoid M, since the semantics of R ∪ S and R⋅S is obvious, and R∗ can be viewed as the union ⋃k∈N R^k. For star-free expressions one does not allow the star-operation, but one adds complementation, denoted, for example, by ¬R, with the semantics M ∖ R. In trace theory a direct translation of Kleene's Theorem fails, but it can be replaced by a generalization due to Ochmański. If (a, b) is a pair of independent letters, then (ab)∗ is a rational expression, but due to ab = ba it represents all strings with an equal number
of a’s and b’s which is clearly not regular. With three pairwise independent letters (abc)∗ is not even contextfree. A general formal language theory distinguishes between recognizable and rational sets. A subset L of a trace monoid is called recognizable, if its closure is a regular word language. Here the closure refers to all words in Σ ∗, which represent some trace in L. A subset L is called rational, if L can be specified by some regular (and hence rational) expression. Using the algebraic notion of homomorphism this can be rephrased as follows. Let φ be the canonical homomorphism of Σ∗ onto M(Σ, I), which simply means the interpretation of a string as its trace. Now, L is recognizable if and only if φ− (L) is a regular word language, and L is rational if and only if L = φ(K) for some regular word language K. As a consequence of Kleene’s Theorem all recognizable trace languages are rational, but the converse fails as soon as there is a pair of independent letters, that is, the trace monoid is not free. Given a recognizable trace language L, the corresponding word language φ− (L) is accepted by some NFA (actually some DFA), which satisfies the so-called I-diamond property. This means whenever it holds (a, b) ∈ I and a state p leads to a state q by reading the word ab, then it is in state p also possible to read ba and this leads to state q, too. NFAs satisfying the I-diamond property accept closed languages only. Therefore they capture exactly the notion of recognizability for traces. It has been shown that the concatenation of two recognizable trace languages is recognizable, in particular star-free languages (i.e., given by star-free expressions) are recognizable. However, the example (ab)∗ above shows that the star-operation leads to non-recognizable sets as soon as the trace monoid is not free. Métivier and Ochma´nski have introduced a restricted version where the star-operation is allowed only when applied to languages L where all traces t ∈ L are connected. This means the dependence graph G(t) is connected or, equivalently, there is no nontrivial factorization t = uv where all letters in u are independent of all letters in v. A theorem shows that L∗ is still recognizable, if L is connected (i.e., all t ∈ L are connected) and recognizable. Ochma´nski’s Theorem yields also the converse: A trace language L is recognizable if and only if it can be specified by a rational expression where the star-operation is restricted to connected subsets. As word languages are always connected this is a proper generalization of the
classical Kleene's Theorem. Yet another characterization of recognizable trace languages is as follows: They are in one-to-one correspondence with regular subsets inside the regular set LexNF ⊆ Σ∗ of lexicographic normal forms. The correspondence associates with L ⊆ M(Σ, I) the set K = φ⁻¹(L) ∩ LexNF. A rational expression for K is a rational expression for L, where the star-operation is restricted to connected languages.
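Checking the I-diamond property on a concrete automaton is mechanical. The Java sketch below assumes a (possibly partial) DFA encoded as a nested map and an independence relation given as two-letter strings; both encodings are chosen only for the example.

    import java.util.Map;
    import java.util.Set;

    final class IDiamondCheck {
        // delta.get(p).get(a) is the a-successor of state p (absent entries model a
        // partial transition function); indep contains "ab" for every pair (a, b) in I.
        static boolean hasIDiamondProperty(Set<String> states,
                                           Map<String, Map<Character, String>> delta,
                                           Set<String> indep) {
            for (String p : states) {
                for (String pair : indep) {
                    char a = pair.charAt(0), b = pair.charAt(1);
                    String qab = step(delta, step(delta, p, a), b);  // read ab from p
                    String qba = step(delta, step(delta, p, b), a);  // read ba from p
                    if (qab == null && qba == null) continue;        // neither word readable: nothing to check
                    if (qab == null || !qab.equals(qba)) return false;
                }
            }
            return true;
        }

        private static String step(Map<String, Map<Character, String>> delta, String p, char a) {
            if (p == null) return null;
            Map<Character, String> row = delta.get(p);
            return row == null ? null : row.get(a);
        }
    }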
Decidability Questions
The Star Mystery
The Star Problem is to decide for a given recognizable trace language L ⊆ M(Σ, I) whether L∗ is recognizable. It is not known whether the star problem is decidable, even if it is restricted to finite languages L. The surprising difficulty of this problem has been coined the star mystery by Ochmański. It has been shown by Richomme that the Star Problem is decidable if (Σ, I) does not contain any C4 (cycle of four letters) as an induced subgraph.
Undecidability Results for Rational Sets
For rational languages (unlike for recognizable languages) some very basic problems are known to be undecidable. The following list contains undecidable decision problems, where the input for each instance consists of an independence alphabet (Σ, I) and rational trace languages R, T ⊆ M(Σ, I) specified by rational expressions.
● Inclusion question: Does R ⊆ T hold?
● Equality question: Does R = T hold?
● Universality question: Does R = M(Σ, I) hold?
● Complementation question: Is M(Σ, I) ∖ R rational?
● Recognizability question: Is R recognizable?
● Intersection question: Does R ∩ T = ∅ hold?
On the positive side, if I is transitive, then all six problems above are decidable. This is also a necessary condition for the first five problems in the list. Transitivity of the independence alphabet means in algebraic terms that the trace monoid is a free product of free and free commutative monoids, like, for example, {a, b}∗ ∗ N . The intersection problem is simpler. It is known that the problem Intersection is decidable if and only if (Σ, I)
is a transitive forest. It is also well known that transitive forests are characterized by the forbidden induced subgraphs C4 and P4 (the cycle and the path, respectively, on four letters).
Asynchronous Automata
Whereas recognizable trace languages can be defined as word languages accepted by DFAs or NFAs with the I-diamond property, there is an equivalent distributed automaton model called asynchronous automata. Such an automaton is a parallel composition of finite-state processes synchronizing over shared variables, whereas a DFA satisfying the I-diamond property is still a device with centralized control. An asynchronous automaton A has, by definition, a distributed finite state control such that independent actions may be performed in parallel. The set of global states is modeled as a direct product Q = ∏p∈P Qp, where the Qp are the states of the local component p ∈ P and P is some finite index set (a set of processors). For each letter a ∈ Σ there is a read domain R(a) ⊆ P and a write domain W(a) ⊆ P, where for simplicity W(a) ⊆ R(a). Processors p and q share a variable a if and only if p, q ∈ R(a). The transitions are given by a family of partially defined functions δp, where each processor p reads the status in the local components of its read domain and changes states in the local components of its write domain. According to the read-and-write conflicts being allowed, four basic types are distinguished:
● Concurrent-Read-Exclusive-Write (CREW), if R(a) ∩ W(b) = ∅ for all (a, b) ∈ I.
● Concurrent-Read-Owner-Write (CROW), if R(a) ∩ W(b) = ∅ for all (a, b) ∈ I and W(a) ∩ W(b) = ∅ for all a ≠ b.
● Exclusive-Read-Exclusive-Write (EREW), if R(a) ∩ R(b) = ∅ for all (a, b) ∈ I.
● Exclusive-Read-Owner-Write (EROW), if R(a) ∩ R(b) = ∅ for all (a, b) ∈ I and W(a) ∩ W(b) = ∅ for all a ≠ b.
The local transition functions (δp)p∈P give rise to a partially defined transition function on global states δ : (∏p∈P Qp) × Σ → ∏p∈P Qp. If A is of any of the four types above, then the action of a trace t ∈ M(Σ, I) on global states is well defined. This allows one to view an asynchronous automaton as an I-diamond DFA. There are effective translations from
one model to the other. The most compact versions can be obtained with the CREW model, which is therefore of primary practical interest. Zielonka has shown in his thesis (published in ) the following deep theorem in trace theory: Every recognizable trace language can be accepted by some finite asynchronous automaton. The proof of this theorem is very technical and complicated. Moreover, the original construction was doubly exponential in the size of an I-diamond automaton for the language L. Therefore, it is part of ongoing research to simplify the construction, in particular since efficient constructions are necessary to make the result applicable in practice. The best result to date is due to Genest et al. They provide a construction where the size of the obtained asynchronous automaton is polynomial in the size of a given DFA and simply exponential in the number of processes. They also show that the construction is optimal within the class of automata produced by Zielonka-type constructions, which yields a nontrivial lower bound on the size of asynchronous automata. A rather direct construction of asynchronous automata is known for triangulated dependence alphabets, which means that all chordless cycles have length 3. For example, complete graphs and forests are triangulated.
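The four read/write disciplines above can be checked directly from the read and write domains. In the following Java sketch the letters, processor identifiers, and map-based encoding are assumptions made for illustration; CROW is then CREW plus the owner-write condition, and EROW is EREW plus owner-write.

    import java.util.Map;
    import java.util.Set;

    final class AsyncAutomatonTypes {
        // R and W map each letter to its read and write domains (sets of processor ids);
        // indep contains "ab" for every pair (a, b) of the independence relation I.
        static boolean isCREW(Set<Character> sigma, Map<Character, Set<Integer>> R,
                              Map<Character, Set<Integer>> W, Set<String> indep) {
            for (char a : sigma)
                for (char b : sigma)
                    if (indep.contains("" + a + b) && !disjoint(R.get(a), W.get(b))) return false;
            return true;
        }

        static boolean isEREW(Set<Character> sigma, Map<Character, Set<Integer>> R, Set<String> indep) {
            for (char a : sigma)
                for (char b : sigma)
                    if (indep.contains("" + a + b) && !disjoint(R.get(a), R.get(b))) return false;
            return true;
        }

        // Owner write: distinct letters never write to the same local component.
        static boolean isOwnerWrite(Set<Character> sigma, Map<Character, Set<Integer>> W) {
            for (char a : sigma)
                for (char b : sigma)
                    if (a != b && !disjoint(W.get(a), W.get(b))) return false;
            return true;
        }

        private static boolean disjoint(Set<Integer> x, Set<Integer> y) {
            for (int i : x) if (y.contains(i)) return false;
            return true;
        }
    }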
Infinite Traces
The theory of infinite traces has its origins in the mid-1980s, when Flé and Roucairol considered the problem of serializability of iterated transactions in databases. A suitable definition of an infinite trace uses the dependence graph representation due to Mazurkiewicz. Just as in the finite case, an infinite sequence t = a1 a2 ⋯ of letters yields an infinite node-labeled acyclic directed graph G(t), where now each i ∈ V = N is labeled by the letter ai, and again arcs (i, j) ∈ E are introduced if and only if both (ai, aj) ∈ D and i < j. It is useful to consider finite and infinite objects simultaneously, as an infinite trace may split into connected components some of which might be finite. The notion of real trace has been introduced to denote either a finite or an infinite trace. If t1, t2, . . . is a (finite or infinite) sequence of finite traces, then the product t1 t2 ⋯ is a well-defined real trace. It is a finite trace if almost all ti are empty and an infinite trace otherwise. In particular, one can define the ω-product Lω for every set L of finite traces, and one enriches the set of rational expressions by this operation.
The set R(Σ, I) of real traces can be embedded into a monoid of complex traces where the imaginary component is a subset of Σ. This alphabetic information is necessary in order to define an associative operation of concatenation. (Over complex traces Lω is defined for all subsets L.) Many results from the theory of finite traces transfer to infinite traces according to the same scheme as for finite and infinite words.
Logics
MSO and First-Order Logic
Formulae in monadic second-order logic (MSO) are built upon first-order variables x, y, . . . ranging over vertices and second-order variables X, Y, . . . ranging over subsets of vertices. There are Boolean constants true and false, the logical connectives ∨, ∧, ¬, and quantification ∃, ∀ for the first- and second-order variables. In addition there are four types of atomic formulae: x ∈ X, x = y, (x, y) ∈ E, and λ(x) = a. A first-order formula is a formula without any second-order variable. A sentence is a closed formula, that is, a formula without free variables. The semantics of an MSO-sentence is defined for every node-labeled graph [V, E, λ] (here: V = set of vertices, E = set of edges, λ : V → Σ = vertex labeling). Identifying a trace t with its dependence graph G(t), the truth value of t ⊧ ψ is therefore well defined for every sentence ψ. The trace language defined by a sentence ψ is L(ψ) = {t ∈ R(Σ, I) ∣ t ⊧ ψ}. This yields a notion of first-order and (monadic) second-order definability of trace languages.
Temporal Logic
Linear temporal logic, LTL, can be inductively defined inside first-order logic as formulae with one free variable, as soon as the transitive closure (x, y) ∈ E∗ is expressible in first order (as is the case for trace monoids). There are no quantifiers, but all Boolean connectives. The atomic formulae are λ(x) = a. If φ(x), ψ(x) are LTL-formulae, then EX φ(x) and (φ U ψ)(x) are LTL-formulae. In temporal logic, (x, y) ∈ E∗ means that y is in the future of the node x. The semantics of EX φ(x) is exists next: φ(y) holds for a direct successor y of x. The semantics of (φ U ψ)(x) reflects an until operator; it says that in the future of x there is some z that satisfies ψ(z), and all y in the future of x but in the strict
past of z satisfy φ(y). Hence, condition φ holds until ψ becomes true. There are dual past-tense operators, but they do not add expressivity. For LTL one can also give a syntax without any free variable and a global semantics where the evaluation is based on the prefix relation of traces. The local semantics as defined above is for traces a priori expressively weaker, but it was shown that both the global and the local LTL have the same expressive power as first-order logic. This was done by Thiagarajan and Walukiewicz for global LTL and by Diekert and Gastin for local LTL, respectively. Both results extend a famous result of Kamp from words to traces. The complexity of the satisfiability problem (or model checking) is however quite different: In the global semantics it is nonelementary, whereas in the local semantics it is in PSPACE (the class of problems solvable on a Turing machine in polynomial space).
Fragments
For various applications fragments of first-order logic suffice. This has the advantage that simpler constructions are possible and that the complexity of model checking is possibly reduced. A prominent fragment is first-order logic with at most two names for variables. Two-variable logics capture the core features of XML navigational languages like XPath. Over words and over traces, two-variable logic FO2[E] can be characterized algebraically via the variety of monoids DA (referring to the fact that regular D-classes are aperiodic semigroups), in logic by Next-Future and Yesterday-Past operators, and in terms of rational expressions via unambiguous polynomials. It turns out that the satisfiability problem for two-variable logic is NP-complete (if the independence alphabet is not part of the input). The extension of these results from words to traces is due to Kufleitner.
Logics, Algebra, and Automata
The connection between logic and recognizability uses algebraic tools from the theory of finite monoids. If h : M(Σ, I) → M is a homomorphism to a finite monoid M and L ⊆ R(Σ, I) is a set of real traces, then one says that h recognizes L if for all t ∈ L and all factorizations t = t1 t2 ⋯ into finite traces ti the following inclusion holds: h⁻¹(h(t1)) h⁻¹(h(t2)) ⋯ ⊆ L. This allows one to speak of
aperiodic languages if some recognizing monoid is aperiodic. A monoid M is aperiodic if for all x ∈ M there is some n ∈ N such that xⁿ⁺¹ = xⁿ. A deep result states that a language is first-order definable if and only if it is recognized by a homomorphism to a finite aperiodic monoid. Algebraic characterizations lead to decidability of fragments. For example, it is decidable whether a recognizable language is aperiodic or whether it can be expressed in two-variable first-order logic. Another way to define recognizability is via Büchi automata. A Büchi automaton for real traces is an I-diamond NFA with a set of final states F and a set of repeated states R. It accepts a trace if the run stops in F or if repeated states are visited infinitely often. If its transformation monoid is aperiodic, it is called aperiodic, too. There is also a notion of asynchronous (cellular) Büchi automaton, and it is known that every I-diamond Büchi automaton can be transformed into an equivalent asynchronous cellular Büchi automaton. The main result connecting logic, recognizability, rational expressions, and algebra can be summarized by saying that the statements in the first block (second block, resp.) below are equivalent for all trace languages L ⊆ R(Σ, I):
MSO definability:
1. L is definable in monadic second-order logic.
2. L is recognizable by some finite monoid.
3. L is given by a rational expression where the star is restricted to connected languages.
4. L is accepted by some asynchronous Büchi automaton.
First-order definability:
1. L is definable in first-order logic.
2. L is definable in LTL (with global or local semantics).
3. L is recognizable by some finite and aperiodic monoid.
4. L is star-free.
Automata-Based Verification
The automata theoretical approach to verification uses the fact that systems and specifications are both modeled with finite automata. More precisely, a system is given as a finite transition system A, which is typically realized as an NFA without final states. So, the system
allows finite and infinite runs. The specification is written in some logical formalism, say in the linear temporal logic LTL. So the specification is given by some formula φ, and its semantics L(φ) defines the runs that obey the specification. Model checking means verifying the inclusion L(A) ⊆ L(φ), which is equivalent to L(A) ∩ L(¬φ) = ∅. Once an automaton B with L(B) = L(¬φ) has been constructed, standard methods yield a product automaton for L(A) ∩ L(B). The check for emptiness becomes a reachability problem in directed graphs. A main obstacle is the combinatorial explosion when constructing the automaton B. Nevertheless, this works reasonably well in practice, because typical specifications are simple enough to be understood (hopefully) by the designer, and so they are short. From a theoretical viewpoint the complexity of model checking for MSO and first-order logic is nonelementary, but for (local) LTL it is still in PSPACE. This approach is mostly applied, and very successful, where runs can be modeled as sequences. Trace theory provides the necessary tools to extend these methods to asynchronous systems. A first step in this direction has been implemented in the framework of partial order reduction. Another application of trace theory is the analysis of communication protocols.
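The product-and-reachability step at the core of this approach is easy to sketch for finite words. The Java code below intersects two NFAs over a common alphabet and reports whether the intersection is nonempty; it deliberately ignores Büchi acceptance for infinite runs (which needs a more refined cycle check) and uses a naive string encoding of product states, both simplifications made only for the illustration.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Queue;
    import java.util.Set;

    final class ProductEmptiness {
        // A simple NFA: transitions.get(p).get(a) = set of a-successors of p.
        static final class Nfa {
            final Set<String> initial, accepting;
            final Map<String, Map<Character, Set<String>>> transitions;
            Nfa(Set<String> initial, Set<String> accepting,
                Map<String, Map<Character, Set<String>>> transitions) {
                this.initial = initial; this.accepting = accepting; this.transitions = transitions;
            }
            Set<String> succ(String p, char a) {
                return transitions.getOrDefault(p, Map.of()).getOrDefault(a, Set.of());
            }
        }

        // Is L(system) ∩ L(bad) nonempty? BFS over the reachable product states.
        static boolean intersectionNonEmpty(Nfa system, Nfa bad, Set<Character> alphabet) {
            Queue<String[]> work = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            for (String p : system.initial)
                for (String q : bad.initial) { work.add(new String[]{p, q}); seen.add(p + "|" + q); }
            while (!work.isEmpty()) {
                String[] pq = work.remove();
                if (system.accepting.contains(pq[0]) && bad.accepting.contains(pq[1])) return true;
                for (char a : alphabet)
                    for (String p2 : system.succ(pq[0], a))
                        for (String q2 : bad.succ(pq[1], a)) {
                            String key = p2 + "|" + q2;
                            if (seen.add(key)) work.add(new String[]{p2, q2});
                        }
            }
            return false;
        }
    }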
Traces and Asynchronous Communication
Trace automata like asynchronous ones model concurrency in the same spirit as Petri nets, using shared variables. A more complex model arises when concurrent processes cooperate over unbounded fifo communication channels. A communicating automaton is defined over a set P of processes, together with point-to-point communication channels Ch ⊆ {(p, q) ∈ P × P ∣ p ≠ q}. It consists of a tuple of NFAs Ap, one for each process p ∈ P. Each NFA Ap has a set of local states Qp and a transition relation δp ⊆ Qp × Σp × Qp. The set Σp of local actions of process p consists of send-actions p!q(m) (of message m to process q, for (p, q) ∈ Ch) and receive-actions p?r(m) (of message m from process r, for (r, p) ∈ Ch). The semantics of such an automaton is defined through configurations consisting of a tuple of local states (one for each process) and a tuple of word contents (one for each channel). In terms of partial orders, the semantics of runs corresponds to message sequence charts
(MSCs), a graphical notation for fifo message exchange. In contrast with asynchronous automata, communicating automata have an infinite state space and are actually Turing powerful; thus, most algorithmic questions about them are undecidable. The theory of recognizable trace languages enjoys various nice results known from word languages, for example, in terms of logics and automata. Since communicating automata are Turing powerful, one needs restrictions in order to obtain, for example, logical characterizations. A natural restriction consists in imposing bounds on the size of the channels. Such bounds come in two versions, namely as universal and existential bounds. The existential version of channel bounds is optimistic and considers all those runs that can be rescheduled on bounded channels. The universal version is pessimistic and considers only those runs that, independent of the scheduling, can be executed with bounded channels. Thus, communicating automata with a universal channel bound are finite-state, whereas with an existential channel bound they are infinite-state systems. Kuske proposed an encoding of runs of communicating automata with bounded channels into trace languages. Using this encoding, the set of runs (MSCs) of a communicating automaton is the projection of a recognizable trace language (for a universal bound), respectively the set of MSCs generated by the projection of a recognizable trace language (for an existential bound). This correspondence has the same flavor as the distinction between recognizable and rational trace languages. The logic MSO over MSCs is defined with an additional binary message-predicate relating matching send and receive events. Henriksen et al. and Genest et al. have shown that the equivalence between MSO and automata extends to communicating automata with a universal and an existential channel bound, respectively. Another equivalent characterization exists in terms of MSC-graphs, similar to star-connected expressions for trace languages. These expressiveness results are complemented by decidable instances of the model-checking problem.
Related Entries
Asynchronous Iterative Algorithms
CSP (Communicating Sequential Processes)
Formal Methods–Based Tools for Race, Deadlock, and Other Errors
Multi-Threaded Processors
Parallel Computing
Parallelization, Automatic
Peer-to-Peer
Petri Nets
Reordering
Synchronization
Trace Scheduling
Verification of Parallel Shared-Memory Programs, Owicki-Gries Method of Axiomatic
Bibliographic Notes and Further Reading
Trace theory has its origin in enumerative combinatorics, when Cartier and Foata found a new proof of the MacMahon Master Theorem in the framework of partial commutation by combining algebraic and bijective ideas []. The Foata normal form was defined in this Lecture Note. In computer science the key idea of using partial commutation as a tool to investigate parallel systems goes back to Keller [], but it was only through the influence of the technical report of Mazurkiewicz [] that these ideas spread to a wider computer science community, in particular to the Petri-net community. It was also Mazurkiewicz who coined the notion Trace theory and who introduced the notion of dependence graphs as a visualization of traces. The characterization of lexicographic normal forms by forbidden patterns is due to Anisimov and Knuth []. The investigation of recognizable (regular, rational resp.) languages is central in the theory of traces. The characterization of recognizable languages in terms of star-connected regular expressions is due to Ochmański []. The notion of asynchronous automaton is due to Zielonka. The major theorem showing that all recognizable languages can be accepted by asynchronous automata is his work (built on his thesis) []. The research on asynchronous automata is still an important and active area. The best constructions so far are due to Genest et al., who also established nontrivial lower bounds []. The theory of infinite traces has its origin in the mid-1980s. A definition of a real trace as a prefix-closed and directed subset of finite traces and its characterization by dependence graphs is given in a survey by
Mazurkiewicz []. The theory of recognizable real trace languages was initiated by Gastin. The generalization of the Kleene–Büchi–Ochmański Theorem to real traces is due to Gastin, Petit, and Zielonka []. Diekert and Muscholl gave a construction for deterministic asynchronous Muller automata accepting a given recognizable real trace language. Ebinger initiated the study of LTL for traces in his thesis. But it took quite an effort until Diekert and Gastin were able to show that LTL (in local semantics) has the same expressive power as first-order logic []. The advantage of a local LTL is that model checking is in PSPACE, whereas in the global semantics it becomes nonelementary by a result of Walukiewicz []. The PSPACE-containment has been shown for a much wider class of logics by Gastin and Kuske []. Diekert, Horsch, and Kufleitner [] give a survey on fragments of first-order logic in trace theory. The Büchi-like equivalence between automata and MSO for existentially bounded communicating automata has been shown by Genest, Kuske, and Muscholl []. The translation from MSO into automata uses the equivalence for trace languages, but needs an additional, quite technical construction specific to communicating automata. Much of the material used in the present discussion can be found in The Book of Traces, edited by Diekert and Rozenberg []. The book also surveys a notion of semi-commutation (introduced by Clerbout and Latteux), and it provides many hints for further reading. Current research efforts concentrate on the topic of distributed games and controller synthesis for asynchronous automata.
Bibliography . Anisimov AV, Knuth DE () Inhomogeneous sorting. Int J Comput Inf Sci :– . Cartier P, Foata D () Problèmes combinatoires de commutation et réarrangements. Lecture notes in mathematics, vol . Springer, Heidelberg . Diekert V, Gastin P () Pure future local temporal logics are expressively complete for Mazurkiewicz traces. Inf Comput :–. Conference version in LATIN , LNCS : –, . Diekert V, Horsch M, Kufleitner M () On first-order fragments for Mazurkiewicz traces. Fundamenta Informaticae :– . Diekert V, Rozenberg G (eds) () The book of traces. World Scientific, Singapore . Gastin P, Kuske D () Uniform satisfiability in pspace for local temporal logics over Mazurkiewicz traces. Fundam Inf (–): –
T
. Gastin P, Petit A, Zielonka WL () An extension of Kleene’s and Ochma´nski’s theorems to infinite traces. Theoret Comput Sci :–, x . Genest B, Gimbert H, Muscholl A, Walukiewicz I () Optimal Zielonka-type construction of deterministic asynchronous automata. In: Abramsky S, Gavoille C, Kirchner C, Meyer auf der Heide F, Spirakis PG (eds) ICALP (). Lecture notes in computer science, vol . Springer, pp – . Genest B, Kuske D, Muscholl A () A Kleene theorem and model checking algorithms for existentially bounded communicating automata. Inf Comput :– . http://dx.doi.org/./j.ic...;DBLP, http://dblp. uni-trier.de . Keller RM () Parallel program schemata and maximal parallelism I. Fundamental results. J Assoc Comput Mach ():– . Mazurkiewicz A () Concurrent program schemes and their interpretations. DAIMI Rep. PB , Aarhus University, Aarhus . Mazurkiewicz A () Trace theory. In: Brauer W et al. (eds) Petri nets, applications and relationship to other models of concurrency. Lecture notes in computer science, vol . Springer, Heidelberg, pp – . Ochma´nski E (Oct ) Regular behaviour of concurrent systems. Bull Eur Assoc Theor Comput Sci (EATCS) :– . Walukiewicz I () Difficult configurations – on the complexity of LTrL. In: Larsen KG, et al. (eds) Proceedings of the th International Colloquium Automata, Languages and Programming (ICALP’), Aalborg (Denmark). Lecture notes in computer science, vol . Springer, Heidelberg, pp – . Zielonka WL () Notes on finite asynchronous automata. R.A.I.R.O. Informatique Théorique et Applications :–
Tracing Performance Analysis Tools
Scalasca
TAU
Transactional Memories
Maurice Herlihy
Brown University, Providence, RI, USA
Synonyms
Locks; Monitors; Multiprocessor synchronization
Introduction
Transactional memory (TM) is an approach to structuring concurrent programs that seeks to provide better scalability and ease-of-use than conventional approaches based on locks and conditions. The term is
commonly used to refer to ideas that range from programming language constructs to hardware architecture. This entry will survey how transactional memory affects each of these domains. The major chip manufacturers have, for the time being, given up trying to make processors run faster. Moore's law has not been repealed: Each year, more and more transistors fit into the same space, but their clock speed cannot be increased without overheating. Instead, attention has turned toward chip multiprocessing (CMP), in which multiple computing cores are included on each processor chip. In the medium term, advances in technology will provide increased parallelism, but not increased single-thread performance. As a result, system designers and software engineers can no longer rely on increasing clock speed to hide software bloat. Instead, they must learn to make more effective use of increasing parallelism. This adaptation will not be easy. Conventional programming practices typically rely on combinations of locks and conditions, such as monitors [], to prevent threads from concurrently accessing shared data. Locking makes concurrent programming possible because it allows programmers to reason about certain code sections as if they were executed atomically. Nevertheless, the conventional approach suffers from a number of shortcomings. First, programmers must decide between coarse-grained locking, in which a large data structure is protected by a single lock, and fine-grained locking, in which a lock is associated with each component of the data structure. Coarse-grained locking is relatively easy to use, but permits little or no concurrency, thereby preventing the program from exploiting multiple cores. By contrast, fine-grained locking is substantially more complicated because of the need to ensure that threads acquire all necessary locks (and only those, for good performance), and because of the need to avoid deadlock when acquiring multiple locks. Such designs are further complicated because the most efficient engineering solution may be platform dependent, varying with different machine sizes, workloads, and so on, making it difficult to write code that is both scalable and portable. Second, locking provides poor support for code composition and reuse. For example, consider a lock-based queue that provides atomic enq() and deq()
methods. Ideally, it should be easy to transfer an item atomically from one queue to another, but such elementary composition simply does not work. It is necessary to lock both queues at the same time to make the transfer atomic. If the queue methods synchronize internally, then there is no way to acquire and hold both locks simultaneously. If the queues export their locks, then modularity and safety are compromised, because the integrity of the objects depends on whether their users follow ad hoc conventions correctly. Finally, such basic issues as the mapping from locks to data, that is, which locks protect which data, and the order in which locks must be acquired and released, are all based on convention, and violations are notoriously difficult to detect and debug. For these and other reasons, today’s software practices make concurrent programs too difficult to develop, debug, understand, and maintain.
The Transactional Model
A transaction is a sequence of steps executed by a single thread. Transactions are atomic: Each transaction either commits (it takes effect) or aborts (its effects are discarded). Transactions are linearizable []: They appear to take effect in a one-at-a-time order. Transactional memory supports a computational model in which each thread announces the start of a transaction, executes a sequence of operations on shared objects, and then tries to commit the transaction. If the commit succeeds, the transaction's operations take effect; otherwise, they are discarded. Sometimes we refer to these transactions as memory transactions. Memory transactions satisfy the same formal serializability and atomicity properties as the transactions used in conventional database systems, but they are intended to address different problems. Unlike database transactions, memory transactions are short-lived activities that access a relatively small number of objects in primary memory. Database transactions are persistent: When a transaction commits, its changes are backed up on a disk. Memory transactions need not be persistent, and involve no explicit disk I/O. To illustrate why memory transactions are attractive from a software engineering perspective, consider the problem of constructing a concurrent FIFO queue that permits one thread to enqueue items at the tail of
the queue at the same time another thread dequeues items from the head of the queue, at least while the queue is non-empty. Any problem so easy to state, and that arises so naturally in practice, should have an easily devised, understandable solution. In fact, solving this problem with locks is quite difficult. In , Michael and Scott published a clever and subtle solution []. It speaks poorly for fine-grained locking as a methodology that solutions to such simple problems are challenging enough to be publishable. By contrast, it is almost trivial to solve this problem using transactions. Figure shows how the queue’s enqueue method might look in a language that provides direct support for transactions. It consists of little more than enclosing sequential code in a transaction
[Figure: code figure showing the queue class with a transactional enq() method; the code of this figure did not survive reproduction]
atomic {
  x = q1.deq();
} orElse {
  x = q2.deq();
}
Transactional Memories. Fig. The orElse statement: waiting on multiple conditions
block. In practice, of course, a complete implementation would include more details (such as how to respond to an empty queue), but even so, this concurrent queue implementation is a remarkable achievement: It is not, by itself, a publishable result. Conditional synchronization can be accomplished in the transactional model by means of the retry construct []. As illustrated in Fig. , if a thread attempts to dequeue from an empty queue, it executes retry , which rolls back the partial effects of the atomic block, and re-executes that block later when the object’s state has changed. The retry construct is attractive because it is not subject to the lost wake-up bug that can arise using monitor conditions. Transactions also admit compositions that would be impossible using locks and conditions. Waiting for one of several conditions to become true is impossible using objects with internal monitor condition variables. A novel aspect of retry is that such composition becomes easy. Figure shows a code snippet illustrating the orElse statement, which joins two or more code blocks. Here, the thread executes the first block. If that block calls retry , then that subtransaction is rolled back, and the thread executes the second block. If that block also calls retry , then the orElse as a whole pauses, and later reruns each of the blocks (when something changes) until one completes.
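Since the code of the queue figure did not survive reproduction, the following sketch shows one plausible shape for it, written in the same hypothetical Java-like transactional syntax as the orElse figure (atomic blocks and retry are language extensions assumed by the text, and the Node helper and field names are illustrative guesses, not the figure's actual code):

    class Queue<T> {
        private Node<T> head, tail;         // Node<T> is an assumed helper holding a value and a next link

        void enq(T item) {
            atomic {                        // the whole update commits or aborts as one transaction
                Node<T> node = new Node<T>(item);
                if (tail == null) head = node; else tail.next = node;
                tail = node;
            }
        }

        T deq() {
            atomic {
                if (head == null) retry;    // roll back and re-execute once the queue changes
                T item = head.value;
                head = head.next;
                if (head == null) tail = null;
                return item;
            }
        }
    }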
Motivation
TM is commonly used to address three distinct problems: first, a simple desire to make highly concurrent data structures easy to implement; second, a more ambitious desire to support well-structured large-scale concurrent programs; and third, a pragmatic desire to make conventional locking more concurrent. Here is a survey of each area.
Lock-Free Data Structures
A data structure is lock-free if it guarantees that infinitely often some method call finishes in a finite number of steps, even if some subset of the threads halt in arbitrary places. A data structure that relies on locking cannot be lock-free, because a thread that acquires a lock and then halts can prevent non-faulty threads from making progress. Lock-free data structures are often awkward to implement on today's architectures, which typically rely on compare-and-swap for synchronization. The compare-and-swap instruction takes three arguments: an address a, an expected value e, and an update value u. If the value stored at a is equal to e, then it is atomically replaced with u; otherwise it is unchanged. Either way, the instruction sets a flag indicating whether the value was changed. Often, the most natural way to define a lock-free data structure is to make an atomic change to several fields. Unfortunately, because compare-and-swap allows only one word (or perhaps a small number of contiguous words) to be changed atomically, designers of lock-free data structures are forced to introduce complex multistep protocols or additional levels of indirection that create unwelcome overhead and conceptual complexity. The original TM paper [] was primarily motivated by a desire to circumvent these restrictions.
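As a concrete illustration of programming directly against compare-and-swap, here is a standard Treiber-style lock-free stack in Java, using AtomicReference.compareAndSet as the CAS primitive; it is a textbook pattern, not code from the cited paper, and shows the retry loop that single-word CAS forces on the designer.

    import java.util.concurrent.atomic.AtomicReference;

    final class LockFreeStack<T> {
        private static final class Node<E> {
            final E value; final Node<E> next;
            Node(E value, Node<E> next) { this.value = value; this.next = next; }
        }

        private final AtomicReference<Node<T>> top = new AtomicReference<>();

        public void push(T value) {
            while (true) {
                Node<T> oldTop = top.get();
                Node<T> newTop = new Node<>(value, oldTop);
                // compareAndSet succeeds only if no other thread changed top in the meantime.
                if (top.compareAndSet(oldTop, newTop)) return;
            }
        }

        public T pop() {
            while (true) {
                Node<T> oldTop = top.get();
                if (oldTop == null) return null;             // empty stack
                if (top.compareAndSet(oldTop, oldTop.next)) return oldTop.value;
            }
        }
    }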
Software Engineering
TM is appealing as a way to help programmers structure concurrent programs because it allows the programmer to focus on what the program should be doing, rather than on the detailed synchronization mechanisms needed. For example, TM relieves the programmer of tasks such as devising specialized locking protocols for avoiding deadlocks, and conventions associating locks with data. A number of programming languages and libraries have emerged to support TM. These include Clojure [], .Net [], Haskell [], Java [, ], C++ [], and others. Several groups have reported experiences converting programs from locks to TM. The TxLinux [] project replaced most of the locks in the Linux kernel with transactions. Syntactically, each transaction appears to be a lock-based critical section, but that code is executed speculatively as a transaction (see the discussion of lock elision below). If an I/O call is detected, the transaction is rolled
back and restarted using locks. Using transactions primarily as an alternative way to implement locks minimized the need to rewrite and restructure the original application. Damron et al. [] transactionalized the Berkeley DB lock manager. They found the transformation more difficult than expected, because simply changing critical sections into atomic blocks often resulted in a disappointing level of concurrency. Critical sections often shared data unnecessarily, usually in the form of global statistics or shared memory pools. Later on, we will see other work reinforcing the notion that the need to avoid gratuitous conflicts means that concurrent transactional programs must be structured differently than concurrent lock-based programs. Pankratius et al. [] conducted a user study where twelve students, working in pairs, wrote a parallel desktop search engine. Three randomly chosen groups used a compiler supporting TM, and three used conventional locks. The best TM group was much faster to produce a prototype, the final program performed substantially better, and the team reported less time spent on debugging. However, the TM teams found performance harder to predict and to tune. Overall, the TM code was deemed easier to understand, but the TM teams did still make some synchronization errors. Rossbach et al. [] conducted a user study in which undergraduates implemented the same programs using coarse-grained and fine-grained locks, monitors, and transactions. Many students reported that they found transactions harder to use than coarse-grained locks, but slightly easier than fine-grained locks. Code inspection showed that students using transactions made many fewer synchronization errors: Over % of students made errors with fine-grained locking, while less than % made errors using transactions.
Lock Elision
Transactions can also be used as a way to implement locking. In lock elision [], when a thread requests a lock, rather than waiting to acquire that lock, the thread starts a speculative transaction. If the transaction commits, then the critical section is complete. If the transaction aborts because of a synchronization conflict, then the thread can either retry the transaction, or it can actually acquire the lock.
Here is why lock elision is attractive. Locking is conservative: A thread must acquire a lock if it might conflict with another thread, even if such conflicts are rare. Replacing lock acquisition with speculative execution enhances concurrency if actual conflicts are rare. If conflicts persist, the thread can abandon speculative execution and revert to using locks. Lock elision has the added advantage that it does not require code to be restructured. Indeed, it can often be made to work with legacy code. Azul Systems [] has a JVM that uses (hardware) lock elision for contended Java locks, with the goal of accelerating “dusty deck” Java programs. The run-time system keeps track of how well the hardware transactional memory (HTM) is doing, and decides when to use lock elision and when to use conventional locks. The results work well for some applications, modestly well for others, and poorly for a few. The principal limitation seems to be the same as observed by Damron et al. []: many critical sections are written in a way that introduces gratuitous conflicts, usually by updating performance counters. Although these are not real conflicts, the HTM has no way to tell. Rewriting such code can be effective, but requires abandoning the goal of speeding up “dusty deck” programs.
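Schematically, lock elision wraps a critical section in a speculate-then-fall-back policy. In the Java sketch below, Txn is a hypothetical hardware-transaction interface (real processors expose this through special instructions rather than such an API), and the retry budget is an arbitrary illustrative choice.

    import java.util.concurrent.locks.ReentrantLock;

    final class ElidedLock {
        private final ReentrantLock fallback = new ReentrantLock();
        private static final int MAX_ATTEMPTS = 3;   // illustrative retry budget

        // Txn is a hypothetical hardware-transaction wrapper: begin() starts a speculative
        // region, commit() returns false if the transaction aborted (e.g., on a conflict).
        void withLock(Runnable criticalSection, Txn txn) {
            for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
                txn.begin();
                if (fallback.isLocked()) { txn.abort(); continue; }  // someone fell back to the lock: do not speculate
                criticalSection.run();
                if (txn.commit()) return;                            // speculation succeeded: the lock was never acquired
            }
            fallback.lock();                                         // conflicts persist: actually take the lock
            try { criticalSection.run(); } finally { fallback.unlock(); }
        }

        interface Txn { void begin(); boolean commit(); void abort(); }
    }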
Hardware Transactional Memory
Most hardware transactional memory (HTM) proposals are based on straightforward modifications to standard multiprocessor cache-coherence protocols. When a thread reads or writes a memory location on behalf of a transaction, that cache entry is flagged as being transactional. Transactional writes are accumulated in the cache or write buffer, but are not written back to memory while the transaction is active. If another thread invalidates a transactional entry, a data conflict has occurred, and that transaction is aborted and restarted. If a transaction finishes without having had any of its entries invalidated, then the transaction commits by marking its transactional entries as valid or as dirty, and allowing the dirty entries to be written back to memory in the usual way. One limitation of HTM is that in-cache transactions are limited in size and scope. Most hardware transactional memory proposals require programmers to be aware of platform-specific resource limitations such as cache and buffer sizes, scheduling quanta,
and the effects of context switches and process migrations. Different platforms provide different cache sizes and architectures, and cache sizes are likely to change over time. Transactions that exceed resource limits or are repeatedly interrupted will never commit. Ideally, programmers should be shielded from such complex, platform-specific details. Instead, TM systems should provide full support even for transactions that cannot execute directly in hardware. Techniques that substantially increase the size of hardware transactions include signatures [] and permissions-only caches []. Other proposals support (effectively) unbounded transactions by allowing transactional metadata to overflow caches, and for transactions to migrate from one core to another. These proposals include TCC [], VTM [], OneTM [], UTM [], TxLinux [], and LogTM [].
Software Transactional Memory Software transactional memory (STM) is an alternative to direct hardware support for TM. STM is a software system that provides programmers with a transactional model through a library or compiler interface. In this section, we describe some of the questions that arise when designing an STM system. Some of these questions concern semantics, that is, how the STM behaves, and others concern implementation, that is, how the STM is structured internally.
Weak vs Strong Isolation How should threads that execute transactions interact with threads executing non-transactional code? One possibility is strong isolation [] (sometimes called strong atomicity), which guarantees that transactions are atomic with respect to non-transactional accesses. The alternative, weak isolation (or weak atomicity), makes no such guarantees. HTM systems naturally provide strong atomicity. For STM systems, however, strong isolation may be too expensive. The distinction between strong and weak isolation leaves unanswered a number of other questions about STM behavior. For example, what does it mean for an unhandled exception to exit an atomic block? What does I/O mean if executed inside a transaction? One appealing approach is to say that transactions behave as if they were protected by a single global lock (SGL) [, , ].
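A useful way to read the SGL criterion is as a reference implementation: the sketch below, which funnels every atomic block through one global lock, is trivially correct (though entirely serial), and a weakly atomic STM is well behaved to the extent that programs cannot tell the difference. The class and method names are illustrative, not any system's API.

import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

public final class SingleGlobalLockSTM {
    // One lock shared by every atomic block in the program: the reference
    // point against which more concurrent STM implementations are judged.
    private static final ReentrantLock GLOBAL = new ReentrantLock();

    public static <T> T atomic(Supplier<T> block) {
        GLOBAL.lock();
        try {
            return block.get();   // runs in isolation; no other atomic block overlaps
        } finally {
            GLOBAL.unlock();
        }
    }
}

For example, SingleGlobalLockSTM.atomic(() -> counter++) behaves exactly as the SGL semantics prescribes, at the cost of admitting no concurrency at all.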
One limitation of the SGL semantics is that it does not specify the behavior of zombie transactions: transactions that are doomed to abort because of synchronization conflicts, but continue to run for some duration before the conflict is discovered. In some STM implementations, zombie transactions may see an inconsistent state before aborting. When a zombie aborts, its effects are rolled back, but while it runs, observed inconsistencies could provoke it to pathological behavior that may be difficult for the STM system to protect against, such as dereferencing a null pointer or entering an infinite loop. Opacity [] is a correctness condition that guarantees that all uncommitted transactions, including zombies, see consistent states.
I/O and System Calls What does it mean for a transaction to make a system call (such as I/O) that may affect the outside world? Recall that transactions are often executed speculatively, and a transaction that encounters a synchronization conflict may be rolled back and restarted. If a transaction creates a file, opens a window, or has some other external side effect, then it may be difficult or impossible to roll everything back. One approach is to allow irrevocable transactions [, , ] that are not executed speculatively, and so never need to be undone. An irrevocable transaction cannot explicitly abort itself, and only one such transaction can run at a time, because of the danger that multiple irrevocable transactions could deadlock. An alternative approach is to provide a mechanism to escape from the transactional system. Escape actions [] and open nested transactions [] allow a thread to execute statements outside the transaction system, scheduling application-specific commit and abort handlers to be called if the enclosing transaction commits or aborts. For example, an escape action might create a file, and register a handler to delete that file if the transaction aborts. Escape mechanisms can be misused, and often their semantics are not clearly defined. When using open nested transactions, for example, care must be taken to ensure that abort handlers do not deadlock.
Exploiting Object Semantics STM systems typically synchronize on the basis of read/write conflicts. As a transaction executes, it records
the data items it read in a read set, and the data items it wrote in a write set. Two transactions conflict if one transaction’s read or write set intersects the other’s write set. Conflicting transactions cannot both commit. Synchronizing via read/write conflicts has one substantial advantage: it can be done automatically without programmer participation. It also has a substantial disadvantage: It can severely and unnecessarily restrict concurrency for certain shared objects. If these objects are subject to high levels of contention (that is, they are “hot-spots”), then the performance of the system as a whole may suffer. This problem can be addressed by open nested transactions, as described above in Section ., but open nested transactions are difficult to use correctly, and lack the expressive power to deal with certain common cases []. Another approach is to use type-specific synchronization and recovery to exploit concurrency inherent in an object’s high-level specification. One such mechanism is transactional boosting [], which allows thread-safe (but non-transactional) object implementations to be transformed into highly concurrent transactional implementations by allowing method calls to proceed in parallel as long as their high-level specifications are commutative.
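The read/write-set bookkeeping described above can be sketched as follows. The class and field names are illustrative, and a real STM would use far more compact metadata (ownership records, version numbers, or signatures) rather than hash sets of item identifiers.

import java.util.HashSet;
import java.util.Set;

public class TxnFootprint {
    // Identifiers of the data items this transaction has read and written.
    final Set<Object> readSet = new HashSet<>();
    final Set<Object> writeSet = new HashSet<>();

    void onRead(Object item)  { readSet.add(item); }
    void onWrite(Object item) { writeSet.add(item); }

    /** Two transactions conflict if either one's write set intersects the
     *  other's read set or write set. */
    static boolean conflict(TxnFootprint a, TxnFootprint b) {
        return intersects(a.writeSet, b.readSet)
            || intersects(a.writeSet, b.writeSet)
            || intersects(b.writeSet, a.readSet);
    }

    private static boolean intersects(Set<Object> s, Set<Object> t) {
        for (Object x : s) if (t.contains(x)) return true;
        return false;
    }
}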
Eager vs Lazy Update There are two basic ways to organize transactional data. In an eager update system, data objects are modified in place, and each transaction maintains an undo log allowing it to undo its changes if it aborts. The dual approach is lazy (or deferred) update, where each transaction computes optimistically on its local copy of the data, installing the changes if it commits, and discarding them if it aborts. An eager system makes committing a transaction more efficient, but makes it harder to ensure that zombie transactions see consistent states.
Eager vs Lazy Conflict Detection STM systems differ according to when they detect conflicts. In eager conflict detection schemes, conflicts are detected before they arise. When one transaction is about to create a conflict with another, it may consult a contention manager, defined below, to decide whether to pause, giving the other transaction a chance to finish, or to proceed and cause the other to abort. By contrast,
a lazy conflict detection scheme detects conflicts when a transaction tries to commit. Eager detection may abort transactions that would have been able to commit under lazy detection, but lazy detection discards more computation, because transactions are aborted later.
Contention Managers In many STM proposals, conflict resolution is the responsibility of a contention manager [] module. Two transactions conflict if they access the same object and one access is a write. If one transaction discovers it is about to conflict with another, then it can pause, giving the other a chance to finish, or it can proceed, forcing the other to abort. Faced with this decision, the transaction consults a contention management module that encapsulates the STM’s conflict resolution policy. The literature includes a number of contention manager proposals [–], ranging from exponential backoff to priority-based schemes. Empirical studies have shown that the choice of a contention manager algorithm can affect transaction throughput, sometimes substantially.
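The pluggable-policy idea can be rendered, hypothetically, as the interface below, shown with a simple exponential-backoff policy. Neither the interface nor the constants are taken from any particular STM; they only illustrate the decision the text describes (pause and retry, or abort the other transaction).

public interface ContentionManager {
    /** Called when {@code me} is about to conflict with {@code other}.
     *  Returns true if {@code me} should abort the other transaction and
     *  proceed, false if it should pause and then retry. */
    boolean shouldAbortOther(Object me, Object other, int attempt);
}

/** Exponential backoff: wait longer after each failed attempt, and only
 *  abort the other transaction once patience runs out. */
class BackoffManager implements ContentionManager {
    private static final int MAX_ATTEMPTS = 10;        // arbitrary
    private static final long BASE_DELAY_NANOS = 1_000; // arbitrary

    @Override
    public boolean shouldAbortOther(Object me, Object other, int attempt) {
        if (attempt >= MAX_ATTEMPTS) return true;       // give up waiting
        long delay = BASE_DELAY_NANOS << Math.min(attempt, 20);
        java.util.concurrent.locks.LockSupport.parkNanos(delay);
        return false;                                   // pause, then retry
    }
}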
Visible vs Invisible Reads Early STM systems [] used either invisible reads, in which each transaction maintains per-read metadata to be revalidated after each subsequent read, or visible reads, in which each reader registers its operations in shared memory, allowing a conflicting writer to identify when it is about to create a conflict. Invisible read schemes are expensive because of the need for repeated validation, while visible read schemes are complex, expensive, and not scalable. More recent STM systems such as TL2 [] or SkySTM [] use a compromise solution, called semi-visible reads, in which read operations are tracked imprecisely. Semi-visible reads conservatively indicate to the writer that a read-write conflict might exist, avoiding expensive validation in the vast majority of cases.
Privatization It is sometimes useful for a thread to privatize [] a shared data structure by making it inaccessible to other threads. Once the data structure has been privatized, the owning thread can work on the data structure directly, without incurring synchronization costs. In principle,
privatization works correctly under SGL semantics, in which every transaction executes as if it were holding a “single global lock.” Unfortunately, care is required to ensure that privatization works correctly. Here are two possible hazards. First, the thread that privatizes the data structure must observe all changes made to that data by previously committed transactions, which is not necessarily guaranteed in an STM system where updates are lazy. Second, a doomed (“zombie”) transaction must not be allowed to perform updates to the data structure after it has been privatized.
Bibliographic Notes and Further Reading The most comprehensive TM survey is the book Transactional Memory by Larus and Rajwar []. Of course, this area changes rapidly, and the best way to keep up with current developments is to consult the Transactional Memory Online web page at: http://www.cs.wisc.edu/trans-memory/.
Bibliography . Hoare CAR () Monitors: an operating system structuring concept. Commun ACM ():– . Herlihy MP, Wing JM () Linearizability: a correctness condition for concurrent objects. ACM T Progr Lang Sys ():– . Michael MM, Scott ML () Simple, fast, and practical nonblocking and blocking concurrent queue algorithms. In: PODC, Philadelphia. ACM, New York, pp – . Harris T, Marlow S, Peyton-Jones S, Herlihy M () Composable memory transactions. In: PPoPP ’: Proceedings of the tenth ACM SIGPLAN symposium on principles and practice of parallel programming, Chicago. ACM, New York, pp – . Herlihy M, Moss JEB (May ) Transactional memory: architectural support for lock-free data structures. In: International symposium on computer architecture, San Diego . Hickey R () The clojure programming language. In: DLS ’: Proceedings of the symposium on dynamic languages, Paphos. ACM, New York, pp – . Microsoft Corporation. Stm.net. http://msdn.microsoft.com/ en-us/devlabs/ee.aspx . Korland G Deuce STM. http://www.deucestm.org/ . S. Microsystems. DSTM. http://www.sun.com/download/ products.xml?id=fbe . Intel Corporation. C++ STM compiler. http://software.intel.com/ en-us/articles/intel-c-stm-compiler-prototype-edition-/ . Rossbach CJ, Hofmann OS, Porter DE, Ramadan HE, Aditya B, Witchel E () TxLinux: using and managing hardware transactional memory in an operating system. In: SOSP ’: Proceedings of twenty-first ACM SIGOPS symposium on operating systems principles, Stevenson. ACM, New York, pp –
. Damron P, Fedorova A, Lev Y, Luchangco V, Moir M, Nussbaum D () Hybrid transactional memory. In: ASPLOSXII: Proceedings of the th international conference on architectural support for programming languages and operating systems, Boston. ACM, New York, pp – . Pankratius V, Adl-Tabatabai A-R, Otto F (Sept ) Does transactional memory keep its promises? Results from an empirical study. Technical Report -, University of Karlsruhe . Rossbach CJ, Hofmann OS, Witchel E (Jun ) Is transactional memory programming actually easier? In: Proceedings of the th annual workshop on duplicating, deconstructing, and debunking (WDDD), Austin . Rajwar R, Goodman JR () Speculative lock elision: enabling highly concurrent multithreaded execution. In: MICRO : Proceedings of the th annual ACM/IEEE international symposium on microarchitecture, Austin. IEEE Computer Society, Washington, DC, pp – . Click C (Feb ) Experiences with hardware transactional memory. http://blogs.azulsystems.com/cliff///and-nowsome-hardware-transactional-memory-comments.html . Yen L, Bobba J, Marty MR, Moore KE, Volos H, Hill MD, Swift MM, Wood DA () LogTM-SE: decoupling hardware transactional memory from caches. In: HPCA ’: Proceedings of the IEEE th international symposium on high performance computer architecture, Phoenix. IEEE Computer Society, Washington, DC, pp – . Blundell C, Devietti J, Lewis EC, Martin M (Jun ) Making the fast case common and the uncommon case simple in unbounded transactional memory. In: International symposium on computer architecture, San Diego . Hammond L, Carlstrom BD, Wong V, Hertzberg B, Chen M, Kozyrakis C, Olukotun K () Programming with transactional coherence and consistency (TCC). ACM SIGOPS Oper Syst Rev ():– . Rajwar R, Herlihy M, Lai K (Jun ) Virtualizing transactional memory. In: International symposium on computer architecture, Madison . Ananian CS, Asanovi´c K, Kuszmaul BC, Leiserson CE, Lie S (Feb ) Unbounded transactional memory. In: Proceedings of the th international symposium on high-performance computer architecture (HPCA’), San Franscisco, pp – . Blundell C, Lewis EC, Martin MMK (Jun ) Deconstructing transactions: the subtleties of atomicity. In: Fourth annual workshop on duplicating, deconstructing, and debunking, Wisconsin . Larus J, Rajwar R () Transactional memory (Synthesis lectures on computer architecture). Morgan & Claypool, San Rafael . Menon V, Balensiefer S, Shpeisman T, Adl-Tabatabai A-R, Hudson RL, Saha B, Welc A () Single global lock semantics in a weakly atomic STM. SIGPLAN Notices ():– . Guerraoui R, Kapalka M () On the correctness of transactional memory. In: PPoPP ’: Proceedings of the th ACM SIGPLAN symposium on principles and practice of parallel programming, Salt Lake City. ACM, New York, pp –
. Welc A, Saha B, Adl-Tabatabai A-R () Irrevocable transactions and their applications. In: SPAA ’: Proceedings of the twentieth annual symposium on parallelism in algorithms and architectures, Munich. ACM, New York, pp – . Moravan MJ, Bobba J, Moore KE, Yen L, Hill MD, Liblit B, Swift MM, Wood DA () Supporting nested transactional memory in logTM. SIGPLAN Notices ():– . Herlihy M, Koskinen E () Transactional boosting: a methodology for highly-concurrent transactional objects. In: PPoPP ’: Proceedings of the th ACM SIGPLAN symposium on principles and practice of parallel programming, Salt Lake City. ACM, New York, pp – . Herlihy M, Luchangco V, Moir M, Scherer W (Jul ) Software transactional memory for dynamic-sized data structures. In: Symposium on principles of distributed computing, Boston . Guerraoui R, Herlihy M, Pochon B () Toward a theory of transactional contention managers. In: PODC ’: Proceedings of the twenty-fourth annual ACM symposium on principles of distributed computing, Las Vegas. ACM, New York, pp – . Scherer WN III, Scott ML (Jul ) Contention management in dynamic software transactional memory. In: PODC workshop on concurrency and synchronization in java programs, St. John’s . Attiya H, Epstein L, Shachnai H, Tamir T () Transactional contention management as a non-clairvoyant scheduling problem. In: PODC ’: Proceedings of the twenty-fifth annual ACM symposium on principles of distributed computing, Denver. ACM, New York, pp – . Dice D, Shalev O, Shavit N () Transactional locking II. In: Proceedings of the th international symposium on distributed computing, Stockholm . Lev Y, Luchangco V, Marathe V, Moir M, Nussbaum D, Olszewski M () Anatomy of a scalable software transactional memory. In: TRANSACT , Raleigh . Spear MF, Marathe VJ, Dalessandro L, Scott ML () Privatization techniques for software transactional memory. In: PODC ’: Proceedings of the twenty-sixth annual ACM symposium on principles of distributed computing, Portland. ACM, New York, pp –
Transactions, Nested J. Eliot B. Moss University of Massachusetts, Amherst, MA, USA
Synonyms Multi-level transactions; Nested spheres of control
Definition Nested Transactions extend the traditional semantics of transactions by allowing meaningful nesting of one or
more child transactions within a parent transaction. Considering the traditional ACID properties of transactions, atomicity, consistency, isolation, and durability, a child transaction possesses these properties in a relative way, such that a parent transaction effectively provides a universe within which its children act similarly to ordinary transactions in a non-nested system. In parallel computation, the traditional property of durability in the face of various kinds of system failures may not be required.
Discussion Transactions Transactions are a way of guaranteeing atomicity of more or less arbitrary sequences of code in a parallel computation. Unlike locks, which identify computations that may need serialization according to the identity of held and requested locks, transaction serialization is based on the specific data accessed by a transaction while it runs. In some cases, it is possible usefully to pre-declare the maximal set of data that a transaction might access, but it is not possible in general. Transactions specify semantics, while locks express an implementation of serialization. The usual semantics of transactions are that concurrent execution of a collection of transactions must be equivalent to execution of those transactions one at a time, in some order. This property is called serializability. The ACID properties capture transaction semantics in a slightly different way. Atomicity requires that transactions are all-or-nothing: Either all of a transaction’s effects occur, or the transaction fails and has no effect. Consistency requires that each transaction take the state of the world from one consistent state to another. Isolation requires that no transaction perceive any state in the middle of execution of another transaction. Durability requires that the effects of any transaction, once the transaction is accepted by the system, not disappear. In general, in parallel computing, as opposed to database systems where transactions had their origins, the consistency and durability properties are often less important. Consistency may be ignored in that most commonly there are no explicitly stated consistency
constraints and no mechanism to enforce them. However, a correct transaction system is still required not to leave effects of partially executed or failed transactions. Durability is often ignored in that a parallel computing system may have no permanent state. If the system has distributed memory, then it may achieve significant durability by keeping multiple copies of data in different units of the distributed memory. Of course, parallel systems can also use one or more nonvolatile copies to achieve durability, according to a system’s durability requirements. It is important to distinguish transactions from concurrency-safe data structures. A concurrency-safe data structure generally offers a guarantee of linearizability: If two actions on the data structure by different threads overlap in their execution, then the effect is as if the actions are executed in one order or the other. That is, a set of concurrent actions by different threads appears to occur in some linear order. This is similar to serializability, but what transactions and serializability add is the possibility for a given thread to execute a whole sequence of actions a1, a2, ..., an without any intervening actions of other threads. A concurrency-safe data structure guarantees only that the individual ai execute correctly as defined by the data type, but permits actions of other threads to interleave between the ai. In discussing transactions and their nesting, some additional terms will be useful. Transactions are said to commit (succeed) or abort (fail). They may fail for many reasons, one of them being serialization conflicts with other transactions. Transactional concurrency control may be pessimistic, also called early conflict detection, or optimistic, also called late conflict detection. Pessimistic conflict detection usually employs some kind of locks, while optimistic generally uses some kind of version numbers or timestamps on data and transactions to determine conflicts. It is even possible to maintain multiple versions so as to allow more transactions to commit, while still enforcing serializability. In general, locking schemes require some kind of deadlock avoidance or detection protocol. However, if a set of transactions is in deadlock, the system has a way out: It can abort one of the transactions to break the deadlock. Thus, deadlock is not a fatal problem as it is when using just locks for synchronization.
A system may update data in place, which requires an undo log to support removing the effects of a failed transaction. Alternatively, a system may create new copies of data, and install them only if a transaction commits, which in general requires a redo log. Updating in-place requires early conflict detection if the system guarantees that a transaction will not see effects of other uncommitted transactions. It is possible to allow such effects to be visible, but serializability then requires that the observing transaction commit only if the transaction it observed also commits. However, the system must still prevent two transactions from observing each other’s effects, since then they cannot be serialized. More advanced models have also been explored but are not discussed here.
Semantics of Nesting The simplest motivation for nesting is to make it easy to compose software components into larger systems. If a library routine uses transactions, and the programmer wishes to use that routine within an application transaction, then there will be nesting of transaction begin/end pairs. One can simply treat this as one large transaction, effectively ignoring the inner transaction begin/end pairs. However, it is also possible to attribute transactional semantics to them, as follows. Consider a transaction T and a transaction U contained within it, i.e., whose begin and end are between the begin and end of T. T is a parent transaction and U a child transaction or subtransaction of T. If T is not contained within any enclosing transaction, it is top-level. Only proper nesting of begin/end transaction pairs is legal, so a top-level transaction and subtransactions at all depths form a tree. Conflict semantics of non-nested transactions extend to nested transactions straightforwardly in terms of relationships in the forest of transaction trees. If action A conflicts with action B when executed by two different non-nested transactions T1 and T2, then A conflicts with B in the nested setting if neither of T1 and T2 is an ancestor of the other. Why is there no conflict in the case where, say, T1 is an ancestor of T2? It is because T1 is providing an environment or universe within which additional transactions can run, compete, and be serialized.
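This ancestor rule can be stated compactly in code. The Txn class below is an illustrative sketch (no particular TM system's representation): two actions that would conflict between unrelated transactions are treated as conflicting only when neither transaction is an ancestor of the other.

public class Txn {
    final Txn parent;                 // null for a top-level transaction
    Txn(Txn parent) { this.parent = parent; }

    boolean isAncestorOf(Txn other) {
        // By convention here a transaction counts as its own ancestor.
        for (Txn t = other; t != null; t = t.parent) {
            if (t == this) return true;
        }
        return false;
    }

    /** Nested conflict rule: a conflict between the underlying actions is
     *  ignored if one of the two transactions encloses the other. */
    static boolean nestedConflict(Txn t1, Txn t2, boolean baseConflict) {
        if (!baseConflict) return false;
        return !t1.isAncestorOf(t2) && !t2.isAncestorOf(t1);
    }
}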
It is simplest, however, to envision nesting where a parent does not execute actions directly, but rather always creates a child transaction to perform them. Alternatively, a parent might perform actions directly, but only when it has no active subtransactions. This model leads to serializability among the subtransactions of any given transaction T. However, as viewed from outside of the transaction tree that includes T and its descendants, T and its subtransactions form a single transaction that must itself be serializable with transactions outside of T. As with non-nested transactions, a nested transaction can fail (abort). In that case, it is as if the failing transaction, and all of its descendants, never ran. Thus, commit of a child transaction is not final, but only relative to its parent, while abort of the parent is final and aborts all descendants, even those that have (provisionally) committed.
Example Closed Nesting Implementation Approach Consider adding support for nesting to a non-nested transaction implementation that employs in-place update and early conflict detection based on locking. As each transaction runs, it accumulates a set of locks and a list of undos. If the transaction aborts, the system applies the transaction’s undos (in reverse order) and then discards the transaction’s locks. Note that a transaction T can be granted a lock L provided that the only conflicting holders of L are ancestors of T. Also note that discarding a child’s lock does not discard any ancestor’s lock on the same item. If a child transaction commits, then the system adds the child’s locks to those held by the parent, and appends the child’s undo list to that of the parent. If a top-level transaction commits, the system simply discards its held locks and its undo list. Moss [, ] described this protocol. It is also possible to devise timestamp-based approaches, possibly supporting late conflict detection, as articulated by Reed []. Notice that these nesting schemes use the same set of possible actions at each level of nesting. They provide temporal grouping of actions, and in a distributed
system can also be used for spatial grouping within temporal groups.
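A minimal sketch of the lock-and-undo bookkeeping just described is given below, assuming undo entries are recorded in execution order and ignoring lock modes and the ancestor test used when deciding whether a lock can be granted.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NestedTxn {
    final NestedTxn parent;                           // null for a top-level transaction
    final Set<Object> locks = new HashSet<>();        // items this transaction has locked
    final List<Runnable> undoLog = new ArrayList<>(); // in execution order, oldest first

    NestedTxn(NestedTxn parent) { this.parent = parent; }

    void recordLock(Object item)   { locks.add(item); }
    void recordUndo(Runnable undo) { undoLog.add(undo); }

    /** Commit of a child is only relative to its parent: the parent inherits
     *  the child's locks and undos. A committing top-level transaction simply
     *  discards both. */
    void commit() {
        if (parent != null) {
            parent.locks.addAll(locks);
            parent.undoLog.addAll(undoLog);   // child's undos are newer than the parent's
        }
        locks.clear();
        undoLog.clear();
    }

    /** Abort applies this transaction's undos in reverse order, then
     *  discards only this transaction's locks. */
    void abort() {
        for (int i = undoLog.size() - 1; i >= 0; i--) {
            undoLog.get(i).run();
        }
        locks.clear();
        undoLog.clear();
    }
}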
Motivations for Closed Nesting There are two primary advantages of closed nesting. One is that failure of a child does not require immediate failure of its parent. Thus, if a transaction desires to execute an action that has higher than usual likelihood of causing failure, it can execute that action within a child transaction and avoid immediate failure of itself should the action cause an abort. In a centralized system aborts might most likely be caused by conflicts with other transactions, but in a decentralized system, failure of a remote node or communication link is also possible. Thus, remote calls are natural candidates to execute as subtransactions. If a child does fail, the parent can retry it, which may often make sense, or the parent can perform some alternate action. For example, in a distributed system, if one node of a replicated database is down, the parent could try another one. A possibly stronger motivation for closed nesting is safe transaction execution when the application desires to exploit concurrency within a transaction. It is easy to see this by considering that if there is concurrency within a transaction, then proper semantics and synchronization or serialization within the transaction present the same issues that led to proposing transaction mechanisms in the first place. Even if, at first blush, it appears that the space of data that concurrent actions might update is disjoint, and thus that there can be no conflict, that property can be a delicate one, and difficult to enforce in complex software systems having many layers. For example, transaction T at node A might make apparently disjoint concurrent remote calls to nodes B and C. However, B and C, unknown to T, use a common service at node D, and should have their actions properly serialized there. If the work at B and C is not performed in distinct concurrent child transactions of T, the work at D might not be properly serialized.
Open Nesting While closed nested transactions indeed support safe concurrency within transactions, and also offer limited
recoverability from partial failure, they have a significant limitation: Transactions that are “big,” either in terms of how long they run or the volume of data they access, tend to conflict with other transactions. While this cannot always be avoided, many conflicts are false conflicts at the level of application semantics. For example, consider a transaction T that adds a number of new records to a data structure organized as a B-tree. Logically speaking, if other transactions do not access these records or otherwise inquire directly or indirectly about their presence or absence in the data structure, then they do not conflict with T. However, straightforward mechanisms for guaranteeing safe transactional access to the B-tree might acquire locks on B-tree nodes and hold them until T commits. Thus, other transactions could be locked out of whole nodes of the tree, even though they are not (logically) affected by the changes T is making. The solution discovered in the context of databases applies also to the case of parallel computing. It is to make a distinction between different levels of semantics, and requires recognizing certain data as being part of a coherent and distinct data abstraction. For example, in the case of a B-tree, each B-tree node is part of a given B-tree, and should be visible and manipulated only by actions on that B-tree. That is, the B-tree nodes are encapsulated within their owning B-tree. This allows B-tree actions to apply conflict management and undo or redo to B-tree nodes, during execution of those actions, and for the system to switch to abstract concurrency control and abstract undo or redo once a B-tree action is complete. How does this solve the problem? The concurrency control and undo/redo on B-tree nodes allows safe concurrent (transactional) execution of B-tree actions themselves. This could also be achieved by nontransactional locking, lock-free or wait-free algorithms, or any other means that guarantees linearizability, but open nesting is generally taken to refer to the recursive use of transaction-like mechanisms, while wrapping a not necessarily transactional data type with abstract concurrency control and recovery is called transactional boosting []. More significantly, though, the conflicts between full B-tree actions will be much fewer than the (internal, temporary) conflicts on B-tree nodes during
those actions. For example, looking up record r1 and adding record r2 do not conflict logically, but if they lie in the same B-tree node, there will be a (physical) conflict on that node. While open nesting is by no means restricted to use with such collection data types, it is certainly very useful in allowing higher concurrency for them.
An Example Open Nesting Protocol A fleshed-out example protocol for open nesting may be helpful to build understanding. This example uses in-place update and employs locks for early conflict detection, but other protocols are possible for other approaches to update and conflict detection. For understanding the protocol, a specific example data structure is instructive. Consider a Set abstraction implemented using a linked list. Suppose it supports actions add(x) and remove(x) to manipulate whether x is in the set, size() to return the set’s cardinality, and contains(x) to test whether x is in the set. Suppose the items a, b, and c have been added to the set in that order, and that add appends new elements to the end (since it must scan to the end anyway in order to avoid entering duplicates). Assume these elements are committed and there are no transactions pending against the set. Now suppose that a transaction adds a new member d and continues with other work. During the add operation, the transaction observes a, b, and c and the list links, then it creates a new list node containing d and modifies c’s link to refer to the new node. Suppose that another transaction concurrently queries whether the set contains e. At the physical level this contains query conflicts with the uncommitted add, because the query will read the link value in c’s list node, etc. Likewise a transaction that tries to remove b will conflict with both the add and the contains actions because it will try to modify the link in a’s node, to unchain b’s node from the list. To guarantee correct manipulation of the list each operation can acquire transactional locks on list nodes, acquiring an exclusive (X) mode lock when modifying a node and a share (S) mode lock when only observing it. Two S mode locks on the same object do not conflict, but all other mode combinations conflict. The pointer to the first node is likewise protected with the same kind of
locks. This locking protocol works fine for closed nesting, but it is easy to see that it leads to many needless conflicts. Open nesting requires identifying the abstract conflicts between operations. This example protocol uses abstract locks. These locks include an S and X mode lock for each possible element of the set, and an additional lock with modes Read (R) and Modify (M) for the cardinality of the set. Two R mode locks do not conflict, and two M mode locks also do not conflict, but R and M mode conflict with each other. Here is a table showing the abstract locks acquired by each action on a Set:
add(x)         X mode on x; M mode on cardinality
remove(x)      X mode on x; M mode on cardinality
contains(x)    S mode on x
size()         R mode on cardinality
This can be refined to acquire only an S mode lock on x if add or remove does not actually change the membership of the set, and in that case also not to acquire the M mode lock on the cardinality. Assuming that each action on the set is run as an open nested transaction, then before an action completes it must acquire the specified abstract locks. If it cannot do that, it is in conflict and some transaction must be aborted. Once the action is complete and holds the abstract locks, the nested transaction commits and releases the lower-level locks on list nodes. The parent transaction will hold the abstract locks until it itself commits. Similar to closed nesting, if an uncommitted open nested transaction aborts, it can simply unwind back to where it started and try again. Often such cases arise because of temporary conflicts on the physical data structure. However, if the conflict is because of an abstract lock, then retry is not likely to help – either other transactions need to complete or abort to get out of this transaction’s way, or this transaction needs to abort higher up in the nested transaction tree. Abstract locks are just one way of implementing detection of abstract conflicts. In general what is required is an encoding of abstract conflict predicates into conflict checking code. These conflict predicates
indicate which actions on a data type conflict with other actions. Here is an example table for Set:
             add(y)   remove(y)   contains(y)   size()
add(x)       x = y    x = y       x = y         true
remove(x)    x = y    x = y       x = y         true
contains(x)  x = y    x = y       false         false
size()       true     true        false         false
In this table, the left action is considered to have been performed by one transaction, and the right action is requested by another. The entry in the table indicates the condition under which the new request conflicts with the older, not yet committed, action. The table above is expressed in terms of the operations and their arguments. However, it is possible to refine these predicates if they can refer to the state of the set. In general this might include the state after the first operation as well as the state before it. The refined table below uses references to the state S before the first operation:
             add(y)          remove(y)       contains(y)     size()
add(x)       x = y ∧ x ∉ S   x = y           x = y ∧ x ∉ S   x ∉ S
remove(x)    x = y           x = y ∧ x ∈ S   x = y ∧ x ∈ S   x ∈ S
contains(x)  x = y ∧ x ∉ S   x = y ∧ x ∈ S   false           false
size()       y ∉ S           y ∈ S           false           false
Open nesting involves more than just conflict detection. Until an open nested action commits, aborting it works like aborting a closed nested action: Simply apply its (lower level) undos in reverse order and release its (lower level) locks. However, once an open nested action commits, it does not work to undo it using the list of lower level undos it accumulated while it ran. The lower level undos are guaranteed to work properly only if the lower level locks are still held. To undo a committed open nested action, the system applies an abstract undo, also called an inverse or compensating action. Here is a table of inverses for actions on the Set abstraction; a — entry means that no inverse is needed (the action did not change the state):
Action         Inverse
add(x)         if x ∉ S then remove(x)
remove(x)      if x ∈ S then add(x)
contains(x)    —
size()         —
Notice that the appropriate inverse can depend on the state in which the original action ran. If add or remove does not actually change the state, then they can simply omit adding an inverse. These inverses are added to the parent transaction’s undo list, to apply if the parent transaction needs to abort. The abstract concurrency control will guarantee that these inverses still make sense when they are applied. They should be run as open nested transactions, and if they fail, it will only be because of temporary conflicts on the physical data structure, so they should simply be retried until they succeed.
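Putting the pieces together, here is a sketch of how add(x) might be wrapped for this Set. AbstractLocks and Txn are illustrative placeholders for the abstract lock manager and the parent transaction's bookkeeping, and the physical linked list with its node-level locking is reduced to a synchronized HashSet; lock-acquisition failure and running the inverse itself as an open nested transaction are not modeled.

class OpenNestedSet<T> {
    interface AbstractLocks {
        void lockElementExclusive(Object x);   // X mode on x
        void lockElementShared(Object x);      // S mode on x
        void lockCardinalityModify();          // M mode on cardinality
    }
    interface Txn {
        void registerInverse(Runnable inverse); // applied if the parent aborts
    }

    private final java.util.Set<T> impl = new java.util.HashSet<>(); // stands in for the linked list
    private final AbstractLocks locks;

    OpenNestedSet(AbstractLocks locks) { this.locks = locks; }

    /** Runs as an open nested action: once it completes, only the abstract
     *  locks and the inverse survive; the physical (node-level) locks,
     *  modeled here by 'synchronized', are released. */
    synchronized void add(T x, Txn parent) {
        boolean changed = impl.add(x);          // true iff x was not already in S
        if (changed) {
            locks.lockElementExclusive(x);
            locks.lockCardinalityModify();
            parent.registerInverse(() -> impl.remove(x));  // inverse only when the state changed
        } else {
            locks.lockElementShared(x);         // refined case: no change, S mode suffices
        }
    }
}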
Coarse-Grained Transactions Transactions at the more abstract level, employing abstract concurrency control, and, if using in-place update, abstract undos, can be more generally termed coarse-grained transactions. As previously noted, the individual actions need not be run as transactions under a transaction mechanism – all that is required is that they are linearizable. However, using a transaction mechanism does offer the advantage of being able to abort and retry an action in case of conflict, while other approaches must guarantee absence of conflict. Thus, if the system does not use nested transactions to implement the actions, then, if using in-place update, it will need to acquire abstract locks before running the action. This implies that it cannot base the lock acquisition on the state of the data abstraction or on the result of the action. However, if executing the action reveals that the originally acquired abstract lock is stronger than necessary, the implementation can then downgrade the lock. As noted before, the underlying implementation might use non-transactional locks to synchronize (for example, one mutual exclusion lock on the whole data structure will work, at the cost of reduced concurrency), or might use lock-free, obstruction-free, or wait-free techniques to obtain linearizability. Upon first consideration, in-place updates may appear more complicated, since they require specifying, implementing, tracking, and applying undos. However, providing new copies of an abstract data structure has its own problems. One difficulty is being clear as to what needs to be copied. A second problem is cost. To reduce cost, a coarse-grained transaction implementation might use Bloom filters, which record
and examine an ongoing transaction’s changes in a side data structure, private to the transaction. Transactions must also record additional information even for read-only actions, in order to check for conflicts later.
Correct Abstract Concurrency Control The tables above gave conflict predicates without indicating how to derive or verify them. A conflict predicate is safe if the actions commute when the predicate is false. [Some make a distinction between actions moving to the left and to the right, but in most practical cases actions either commute (move both ways) or they do not (neither way).] Two actions commute if executing them in either order allows them to return the same results, and also does not affect the outcome of any future actions. Assuming that all relevant aspects of the state are observable, then this can be rephrased as: Actions commute if, when executed in either order, they produce the same results and the same final state. There are some subtleties lurking in this definition. First, the “same state” means the same abstract state. For example, in the case of Set implemented as a linked list, the abstract state consists in what elements are in, and not in, the set. The order in which the members occur on the linked list does not matter. Therefore, even though add(x) and add(y) result in a different linked list when executed in the opposite order, abstractly there is no difference. Thus it is important to have clarity about what the abstract state is. Second, interfaces vary in what they reveal. For example, add(x) might return nothing, not revealing whether x was previously in the set. As far as concurrency control goes, the less revealing interface reduces conflicts: two transactions could both do add(x) without conflict as perceived via this interface. Third, if the system uses undos, then the undo added to a transaction’s undo list is part of the result to consider when determining conflicts. So, if x is not initially in a set, and then two transactions each invoke add(x), even if the add actions return no result, the actions conflict since the undo for the first one is remove(x) and the undo for the second is “do nothing.” In this respect, late conflict detection sometimes allows more concurrency. However, in general it
requires making a copy (at least an effective copy) of the data structure, and if any transaction commits changes to the data structure while transaction T is running, T’s actions must be redone on the primary copy of the data structure rather than directly installing the new state that T constructed. Fourth, certain non-mutating operations entail concurrency control obligations that may at first seem surprising. For example, if x is not in a set and transaction T runs the query contains(x), the set must guarantee that any other transaction that adds x will conflict with T. Thus, if the system uses abstract locking, contains(x) must in this case lock the absence of x. Hence, an abstract lock is not necessarily a lock attached to some piece of the original data structure. (In some database implementations the locks are mixed with the actual records, and the system creates a new record for a lock like this, a record that goes away at the end of the transaction. This is called a phantom record.) A similar case occurs with an ordered set abstraction when a call to getNextHigher(x) returns y: The transaction must lock the fact that the ordered set has no value between x and y. Thus, read-only actions still require checking and recording, and this applies equally to late conflict detection as to early detection.
Extended Semantics Another use for open nesting is to break out of strict serializability (at the programmer’s risk, of course). It is sometimes useful, even necessary, to keep some effects of a transaction even if it is aborted. For example, in processing a commercial transaction, a system might discover that the credit card presented is on a list of stolen cards. While most effects of the purchase should be undone, information about the attempted use of the card should go to a log that will definitely not be undone. This is easy to do by giving the log action’s inverse as “do nothing.” (This is harder to do in a system that is not doing in-place updates, and would require a special notation.) In this way open nesting can be abused to achieve irrevocable effects. Similarly, a programmer can understate conflict predicates and allow communication between transactions. The “extended semantics” of open nesting abused in these ways may depend on the underlying implementation.
Nesting in Transactional Memory Both closed and open nesting have been proposed for use with transactional memory (TM), for both software (STM) and hardware (HTM) approaches. The primary difficulty in implementing closed nesting for TM is its more complex conflict rule. It no longer suffices to check for equality or inequality of transaction identifiers – the test must distinguish an ancestor transaction from a non-ancestor. (This assumes that only transactions that are currently leaves of the transaction tree can execute.) HTM designs must also deal with the reality that hardware resources are always limited, and thus, there may be hard limits on the nesting depth, for example. HTM will also not be aware of abstract locks and abstract concurrency control; they will always be implemented in software. However, the number of conflict checks required for abstract concurrency control is strictly less than for physical units such as words or cache lines. It is particularly more complex to check for conflict between concurrent subtransactions running under nested TM. However, if a transaction has at most one child at once, and only leaf transactions can execute, then the implementation is only slightly more complex than for non-nested transactions. Because the transaction tree in this case consists of a single line of descent from a top-level transaction, it is called linear nesting. Linear nesting admittedly forgoes one of the strong advantages of nesting, namely concurrent sibling subtransactions, but it retains partial rollback and thus remains potentially more useful than non-nested transactions.
Bibliographic Notes and Further Reading The early exposition of nested transactions is marked by Davies [], Reed [], and Moss [, ]. Open nesting (also called multi-level transactions) was articulated by Beeri et al. [], Moss et al. [], and Weikum and Schek []. Nested transactions for hardware transactional memory are explored in Yen et al. [] and Moss and Hosking [], and Ni et al. [] describe a prototype that supports open nesting in software transactional memory. Transactional boosting was introduced by Herlihy and Koskinen []. Gray and Reuter [] provide
comprehensive coverage of transaction processing concepts and techniques.
Bibliography . Beeri C, Bernstein PA, Goodman N () A concurrency control theory for nested transactions. In: Proceedings of the ACM Symposium on Principles of Distributed Computing. ACM Press, New York, pp – . Davies CT Jr () Recovery semantics for a DB/DC system. In: ACM ’ Proceedings of the ACM Annual Conference. ACM, New York, pp –. doi: http://doi.acm.org/./. . Gray J, Reuter A () Transaction processing: concepts and techniques. Data Management Systems. Morgan Kaufmann, Los Altos, CA . Herlihy M, Koskinen E () Transactional boosting: a methodology for highly-concurrent transactional objects. In: Chatterjee S, Scott ML (eds) Proceedings of the th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP , Salt Lake City, UT, – Feb . ACM, New York, pp –, ISBN ---- . Moss JEB () Nested transactions: an approach to reliable distributed computing. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA (Also published as MIT Laboratory for Computer Science Technical Report ) . Moss JEB () Nested transactions: an approach to reliable distributed computing. MIT Press, Cambridge, MA . Moss JEB, Hosking AL () Nested transactional memory: model and architecture sketches. Sci Comput Progr : – . Moss JEB, Griffeth ND, Graham MH () Abstraction in recovery management. In: Proceedings of the ACM Conference on Management of Data. Washington, DC. ACM SIGMOD, ACM Press, New York, pp – . Ni Y, Menon V, Adl-Tabatabai AR, Hosking AL, Hudson RL, Moss JEB, Saha B, and Shpeisman T () Open nesting in software transactional memory. In: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Jose, CA. ACM, New York, pp – . Reed DP () Naming and synchronization in a decentralized computer system. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA (Also published as MIT Laboratory for Computer Science Technical Report ) . Weikum G, Schek HJ () Concepts and applications of multilevel transactions and open nested transactions. Morgan Kaufmann, Los Altos, CA, pp –, ISBN --- . Yen L, Bobba J, Marty MR, Moore KE, Volos H, Hill MD, Swift MM, Wood DA () LogTM-SE: decoupling hardware transactional memory from caches. In: Proceedings of the th International Conference on High-Performance Computer Architecture (HPCA- ), – February , Phoenix, Arizona, USA, IEEE Computer Society, Washington, DC, pp –
Transpose All-to-All
TStreams Concurrent Collections Programming Model
Tuning and Analysis Utilities TAU
U Ultracomputer, NYU Allan Gottlieb New York University, New York, NY, USA
Definition The NYU Ultracomputer is a class of highly parallel, shared-memory multiprocessors featuring two innovations: the fetch-and-add process coordination primitive, and a hardware method to combine (nearly simultaneous) memory references directed at the same location. These references can be loads, stores, or fetch-and-adds. The combining procedure generalizes to fetch-and-phi for any associative operator phi, for example, max or min. Combining often enables all simultaneous memory requests from many processors to be completed in essentially the time required for just one such request. This property, together with the power of fetch-and-add/phi, has enabled construction of coordination algorithms with the important property of imposing no serialization beyond that specified in the problem itself. For example, fetch-and-add-based readers-writers algorithms, during periods of no writer activity, can satisfy multiple requests in essentially constant time. Similarly, fetch-and-add-based queue algorithms, during periods when the queue is neither full nor empty, can satisfy multiple inserts and deletes in essentially constant time. Such implementations are sometimes called “bottleneck free.” The Ultracomputer project constructed a few small-scale prototype implementations, the largest being a fully functional, -processor prototype with a processor-to-memory interconnection network whose full-custom VLSI switches successfully combined memory references, including fetch-and-adds. Implemented software included two generations of the Unix operating system tailored to run on the multiprocessor. The first version required nearly all kernel
activity to occur on one processor; the second was itself a parallel program that made extensive use of locally developed, highly parallel (aka scalable) process coordination algorithms. All the software was compiled by the locally developed port of GCC (the GNU Compiler Collection, formerly the GNU C Compiler).
Discussion Early History The NYU Ultracomputer project was launched when Jack Schwartz wrote a highly influential paper [] itself entitled Ultracomputers. However, the primary architecture described in that paper is not what became the NYU Ultracomputer. Instead, influenced by the Burroughs proposal for NASA’s Numerical Aerodynamic Simulation Facility, Schwartz and his NYU colleagues increased their consideration of shared-memory MIMD (Multiple Instruction stream Multiple Data stream) designs. This led to the discovery of both fetch-and-add, together with several effective process coordination algorithms using this primitive, and the hardware design enabling nearly simultaneous fetch-and-adds to be combined into a single memory operation. In fact fetch-and-add had been discovered previously. Dijkstra [] considered essentially the same primitive and showed that the natural fetch-and-add implementation of a semaphore had a fatal race condition. The NYU group, ignorant of this earlier work, not only found the same primitive but also implemented essentially the same flawed semaphore, thinking it was correct. However, they subsequently discovered the same race condition, and found another semaphore implementation, which they proved correct.
Fetch-and-Add (and Friends) Fetch-and-add is essentially an indivisible add-to-memory operation. Its format is F&A(X,v), where X is an integer variable and v is an integer value. This function
with side effects is defined to both return the (old) value of X and replace the memory value of X by X + v. That is, F&A(X,v) fetches X and adds v to it. Simultaneous fetch-and-adds directed at the same integer variable X are required to satisfy the serialization principle (cf. Eswaran et al. []), meaning that their effect must be the same as if they were executed serially in some unspecified order. That is, fetch-and-add is atomic.
Fetch-and-ϕ and Replace-Add The fetch-and-add function can easily be generalized to fetch-and-ϕ, where ϕ is a binary operator. For example, fetch-and-max(X,v) would return the old value of X while atomically setting X := max(X,v). As shall be seen below, the ability to combine simultaneous fetch-and-ϕ’s requires that ϕ be associative. The early versions of fetch-and-add actually used a slightly different function, Replace-Add(X,v), which returned the new (incremented) value of X. Thus, it was an atomic version of ++X rather than X++. For an invertible function such as addition, there is little to choose between the two variants, since
F&A(X,v) is equivalent to Replace-Add(X,v) − v
Replace-Add(X,v) is equivalent to F&A(X,v) + v
However, when generalizing to non-invertible ϕ’s such as max, the fetch-and-ϕ version is superior to the replace-ϕ version, since there is no analogue of the first equivalence for max. This observation becomes especially important when simultaneous fetch-and-ϕ(X,v)’s are performed on the same variable X.
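Java's AtomicInteger happens to expose exactly these shapes of operation, which makes a convenient illustration of the semantics (the JVM, of course, provides no hardware combining): getAndAdd is fetch-and-add, and getAndAccumulate with an associative operator is a fetch-and-ϕ.

import java.util.concurrent.atomic.AtomicInteger;

public class FetchAndPhiDemo {
    public static void main(String[] args) {
        AtomicInteger x = new AtomicInteger(10);

        // Fetch-and-add: returns the old value, leaves X = X + v.
        int old = x.getAndAdd(3);                        // old == 10, x is now 13

        // Fetch-and-phi with phi = max: returns the old value,
        // leaves X = max(X, v).
        int before = x.getAndAccumulate(7, Math::max);   // before == 13, x stays 13

        System.out.println(old + " " + before + " " + x.get());  // prints: 10 13 13
    }
}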
A Fetch-and-Add Semaphore Perhaps the simplest coordination algorithm is the busy-waiting binary semaphore (aka spin-lock) used to enforce mutual exclusion for critical sections.
loop
    P(S)
    critical section
    V(S)
end loop
The idea behind a fetch-and-add-based semaphore S is to treat S as an integer variable, letting S = 1 represent
“open” and S ≤ 0 represent “closed.” S is then atomically incremented and decremented with fetch-and-add. A successful P(S) operation finds S = 1. It executes F&A(S,-1), receiving 1 and setting S to 0. The V(S) operation executes F&A(S,+1), thereby undoing the effect of its successful P. But what about an unsuccessful P(S), one that finds S < 1 when it does F&A(S,-1)? The unsuccessful P(S) simply does F&A(S,+1) to undo its decrement and tries again, hoping that some V(S) will have occurred in the interim. Code implementing this (faulty) pair of algorithms is presented in Section Code Examples; the corresponding functions are named NaiveP(S) and V(S). These two functions are essentially the same code that Dijkstra correctly criticized for having a fatal race condition in which no process is in the critical section, but several processes continually try and fail to enter. For example, consider the following scenario, beginning with the semaphore initially open (i.e., S = 1) and three processes A, B, and C arriving at NaiveP(S). Assume all three execute the F&A(S,-1) simultaneously and the serial order effected is A, B, C. Thus A receives +1 and enters the critical section. B and C receive 0 and −1, respectively; they each add back a 1 and retry. As long as A is inside the critical section (i.e., has not executed V(S)), S remains less than 1, correctly preventing B and C from succeeding. When A does execute V(S), its initial decrement is undone and either B or C should be able to succeed. Indeed, if both B and C have just executed F&A(S,+1), the semaphore will be restored to its initial value of +1 and the next process to execute F&A(S,-1) will receive +1 and succeed. However, this may never happen since the following endless scenario may occur instead. Assume that, after A increments S, both B and C are at their increment instruction (at this point S = −1). Suppose that the effected serialization is that B increments and then decrements prior to C executing. The result is that S moves to 0 and then back to −1, and B receives 0 from F&A(S,-1); thus B was not successful. Now the same sequence can occur for C, and then again for B, etc. Although this race condition is quite unlikely to occur for three processes, it is possible, and becomes steadily more probable as the number of processes increases.
If NaiveP(S) is replaced by TrueP(S), also from the section Code Examples, the race is no longer possible. Indeed, using TrueP(S), a correct implementation of P(S), Gottlieb et al. [] proved that if, at time T, N > processes are executing the P(S)-V(S) loop above and no process ceases execution, then there is a time t > T when exactly one process is executing its critical section. At first glance, TrueP(S) appears to simply add a redundant test to NaiveP(S). As just shown, however, the extra test is needed. In fact, the resulting test-decrement-retest paradigm has proved fruitful for other algorithms as well (Gottlieb et al. []). Tighter coding of several of these algorithms can be found in [], and application to a highly parallel version of Unix can be found in [].
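The article's Code Examples section contains the authoritative versions of these routines; the sketch below, which uses Java's AtomicInteger.getAndAdd as a stand-in for hardware fetch-and-add, conveys the difference between the naive decrement-and-undo loop and the test-decrement-retest form.

import java.util.concurrent.atomic.AtomicInteger;

public class FaaSemaphore {
    private final AtomicInteger s = new AtomicInteger(1);   // 1 = open, <= 0 = closed

    /** Faulty version: decrement, and if unsuccessful add the 1 back and retry.
     *  Subject to the endless race described above. */
    void naiveP() {
        while (true) {
            if (s.getAndAdd(-1) > 0) return;     // got the semaphore
            s.getAndAdd(+1);                     // undo and retry
        }
    }

    /** Test-decrement-retest version: only attempt the decrement when the
     *  semaphore looks open, then recheck what the decrement returned. */
    void trueP() {
        while (true) {
            if (s.get() > 0) {                   // test
                if (s.getAndAdd(-1) > 0) return; // decrement and retest
                s.getAndAdd(+1);                 // lost the race: undo
            }
        }
    }

    void v() {
        s.getAndAdd(+1);
    }
}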
The Ultracomputer Architecture The basic Ultracomputer design is that of a symmetric multiprocessor, that is, there are multiple processors sharing a central memory. Although the memory can be packaged together with the processors, the prototype hardware did not do this and neither do the diagrams in this article. Figure in the Section Diagrams shows the basic block diagram with multiple processors connected to multiple memory modules.
The Processors The Ultracomputer processors are fairly standard; the primary novelty is that they can issue fetch-and-add instructions. The prototype hardware used AMD microprocessors with locally developed support hardware to implement fetch-and-add. The Memory Modules The memory modules were also straightforward. An adder was included at each module to enable fetchand-add. P
P
P
Interconnection network
M M
M
Ultracomputer, NYU. Fig. Ultracomputer block diagram
U
The Interconnection Network The primary novelty of the hardware design is the ability of the interconnection network to combine (nearly) simultaneous requests to the same memory location, in particular, the ability to combine multiple fetch-andadd operations into a single memory operation. The Topology The interconnection topology used was an N × N Ω-Network composed of × crossbar switches. See Fig. for an example with N = and, for the moment ignore the colors. The very left and right columns contain – written in binary, they correspond to the eight processors and memory modules; the rectangles are bidirectional crossbar switches that can connect either left port with either right port; and the inter-column wires implement the so-called shuffle map. Consider the red wire. Counting from zero it connects the third horizontal line on the left to the fifth on the right. Note that = , = , and RightRotate() = . That is the defining property of the shuffle map; it holds for all the wires. To see that any of the eight far left ports can be connected to any of eight far right, first note that any crossbar can connect xyz to either xy or xy, that is, each crossbar can determine the low-order bit of the right-hand row independent of the low-order bit of the left-hand row. So any column can “correct” the low-order bit and the shuffle map ensures that each successive column operates on a new low-order bit. Thus, after logN columns and shuffles, the message is at the desired row. When traversed “backward” (i.e., from right to left), the network can again correct the low-order bit in each column, but now gets the next low-order bit via the inverse shuffle, which corresponds to a left rotate. Thus, each of the processors can access each memory module and vice versa. Now consider the blue lines in Fig. . They show the paths from each processor to memory module number . Note that = and that every path successively exits switch ports , , and . Thus the Ω-network is the superposition of N trees, one rooted at each memory module. It is also the superposition of N other trees, one rooted at each processor. It will be important for combining to note that the network has a unique path π from a given processor to a
U
U
Ultracomputer, NYU
000
0
0
0
0
0
0
000
001
1
1
1
1
1
1
001
010
0
0
0
0
0
0
010
011
1
1
1
1
1
1
011
100
0
0
0
0
0
0
100
101
1
1
1
1
1
1
101
110
0
0
0
0
0
0
110
111
1
1
1
1
1
1
111
Ultracomputer, NYU. Fig. × Omega network
given memory module and that the (unique) path from that memory to that processor is just the reverse of π. It is possible to use a network with multiple paths but then forward request must generate enough information to ensure that the reply traverses the reverse path. The Switches As mentioned, each Ultracomputer switch is a × crossbar. The switches are buffered so that if two requests entering the left side wish to exit the same right-side port, one is stored at the switch (assuming the available buffer space has not been exhausted). In addition to providing the throughput advantages of buffered switches, buffering is essential for the chosen combining implementation.
Combining Loads and Stores It is fairly easy to see how to combine two loads or two stores that are directed at the same memory location and are present in a single switch at the same time. For two loads, the switch sends one forward and retains the other. When the memory response arrives, two replies are sent back, one for each request. For two stores, the switch again sends just one forward. When this store is performed at memory, the effect is as if the non-forwarded store was executed immediately before. When the memory acknowledgment for the forwarded store arrives, the switch acknowledges both requests. It is important for program
F&A(X,v)
F&A(X,v + w)
T
F&A(X,w) T+v
T
v
Ultracomputer, NYU. Fig. Combining two fetch-and-adds
semantics not to acknowledge either request before the memory acknowledgment has arrived. Note that combined loads and combined stores appear to the next stage switch just like ordinary loads and stores and thus can be further combined. In the best case, each processor can issue a request to the same memory location and all N requests will be combined along the tree to the memory module so that only a single request need be serviced at the memory itself.
Combining Fetch-and-Adds Combining fetch-and-adds is not as easy as combining loads or stores []. Figure in section Diagrams shows the actions performed when F&A(X,v ) meets F&A(X,w ) at a switch. The switch computes the sum v + w, transmits the combined request F&A(X,v +w ), and stores the value v in its local memory. Eventually, a value T will be returned to the switch in response to
Ultracomputer, NYU
F&A(X,v +w ). At this point, the switch returns T to satisfy the request F&A(X,v ), returns T + v to satisfy the request F&A(X,w ), and deletes v from its local memory. If the combined request F&A(X,v +w ) was not itself combined, then T is simply X. In this case, the values returned are X and X+v which is consistent with the serialization order “F&A(X,v ) followed immediately by F&A(X,w ).” The memory location has been incremented to X + (v + w). By the associativity of addition, this value equals (X + v) + w, which is again consistent with “F&A(X,v ) followed immediately by F&A (X,w ).” It is not hard to apply this argument recursively and see that, thanks to associativity, if combined requests are further combined while on their way to memory, the results returned to each requesting processor are consistent with some serialization of the constituent fetch-and-adds. For example, Fig. shows four fetchand-adds with increments , , , and being combined and de-combined to achieve the serialization order F&A(X,8), F&A(X,4), F&A(X,1), F&A(X,2).
Combining Heterogeneous Requests Having seen how to combine loads, stores, and fetchand-adds, it is natural to ask if the switch can combine two different request types. Since load X can be
U
viewed as F&A(X,0), the only heterogeneous case that need be considered is combining a fetch-and-add with a store, which can be combined as follows. When F&A(X,v ) meets Store(X,w ), the switch transmits Store(w +v ) and remembers w. When the acknowledgment of this “combined” store is received, the switch acknowledges the original store, returns w to satisfy F&A(X,v ), and deletes w from its local memory. These actions have the same effect as “Store(X,w ) followed immediately by F&A(X,v ).”
Combining Switch Design As shown above, the physical combining of memory requests occurs at the network switches. The first externally available description of combining switches was [], a later description can be found in [], and a detailed description together with analysis and simulation results occurs in Dickey’s thesis []. A block diagram of a single × switch is shown in Fig. . Unlike previous diagrams in which lines represented bidirectional data flow, each line in this figure is unidirectional. The processors would be to the left and memory to the right. In terms of the combining description given in section Combining Fetch-and-Adds, F&A(X,v ) meets F&A(X,w ) in the Combining Queue associated with the right-side port on the path to X . The value v
F&A(X,1) X : 12
F&A(X,2) X : 13
1
F&A(X,3)
F&A(X,15)
X : 12
X:0
X = 0 (start) X = 15 (end)
U F&A(X,12) F&A(X,4) X:8
X:0
F&A(X,8) X:0
8
Ultracomputer, NYU. Fig. Combining four fetch-and-adds
12
U
Ultracomputer, NYU
Combining queue
In Wait buffer
Out
Ultracomputer, NYU. Fig. × Systolic queue Non-combining queue
Systolic Queues The point of departure for the NYU
Non-combining queue
Combining queue
Wait buffer
Ultracomputer, NYU. Fig. × Switch
is stored in the Wait Buffer, together with additional information identifying the requests. The eventual response from memory is compared with entries in the Wait Buffer and matching entries are sent to the Noncombining Queue(s) associated with left-side port(s) on the paths to the processors that issued the original requests. There are challenges designing each of the components and in partitioning the design to meet the constraints imposed by the low pin-count chip packages available at the time. However, the most interesting component is likely the enhanced systolic queue used to implement the combining queue, so that component is described in more detail. (Semi-)Systolic (Combining) Queues Before turning our attention to the actual queue designs, note that Fig. shows combining queues with two inputs. Designs of this kind were studied, but the actual implementation of the two-input queues consisted of two one-input queues followed by a two-input multiplexer (see Ref. [] for a detailed comparison).
combining queue, was the systolic (non-combining) queue design of Guibas and Liang [], the basic idea of which is shown in Fig. . This queue implementation is called systolic since communication within the queue is local, unlike a solution using a RAM with head and tail pointers. Avoiding global signals is advantageous for VLSI. Items enter the In row and exit the Out row. If the output destination can always accept an item, then only the first column is used. In a single cycle, an item enters In, goes to the first column, and then drops to the bottom row, from which it exits on the subsequent cycle. More interesting is the behavior when the downstream queue is full and blocks the current queue. In this case the item will find that it cannot go down (since that slot is full and the current occupant cannot leave) and it remains in the top row. In the next cycle, the item in question again moves right and then moves down, providing the correspond slot is available. Thus, if downstream remains full, the current queue will itself become full as all columns become occupied. As the careful reader will have noticed, the single cycle referred to above is (internal to the queue) divided into two subcycles, one each for horizontal and vertical motion. The NYU designs for both the non-combining queues just described and the combining queues to be described next used two global control signals in_moving and out_moving and thus are properly called quasi- or semi-systolic. The global signals can be precomputed at the expense of an extra column to accept an incoming message from each predecessor after the switch announces that it is full. The signals can be implemented efficiently as qualified clocks (see Ref. []). The semi-systolic design used in the implementation permits insertions and deletions to occur every
Ultracomputer, NYU
Message ID
Addr
Chute row
Addr
In row
Wait buffer
Data Switch input
Compare Compare Compare Compare Compare Chute Data
Switch output
U
Addr
ALU Out
Data
Addr
Data
Out row
Ultracomputer, NYU. Fig. × Combining queue
cycle, an important property that is not supported by the purely systolic design of [].
Semi-Systolic Combining Queues The key observation enabling a variation of the systolic queues to support combining is that as an item moves left along Out, it passes directly under all subsequent items as they move right along In. When one item is directly above another, local logic can compare them and determine if they are destined for the same address. (Since both In and Out can move during the same cycle, it would appear that items can pass by without having a chance for comparison. However, all Ultracomputer memory references use an even number of packets, the first of which contains the address. Thus, with some extra logic, the system can assure that “address packets” are vertically adjacent at the end of a cycle.) Figure , adapted from [], is a block diagram of the Ultracomputer combining queue. When this diagram is compared to Fig. , the non-combining queue, three additions become apparent: the new Chute row, the comparators between the In and Out rows, and the “stuff on the left.” Each is discussed in turn. When combinable messages are detected (one traversing Out, the other In), the message traversing In is moved to the Chute. As a result it will never be moved to Out and will not be sent to the memory. For example, Fig. shows this event occurring in column . The Chute moves in synchrony with Out and thus both
the message traversing Out and the corresponding message traversing Chute will leave their respective rows together. If the needed slot in Chute is already full when combinable messages are detected, the second combining opportunity is lost. (The Out message has already been combined with another message from In.) Thus, having only one Chute supports only pairwise combining in a single switch. As noted above, these pairwise combined messages can themselves combine as they pass through subsequent switches. Adding a second Chute (together with enhancing the ALU to accept three inputs, and other changes) would permit three-way combining within a switch. The comparators between the In and Out rows detect references to the same address. As mentioned previously, while the address packet of a message traverses Out, it will sit directly below all address packets traversing In. When the comparator in a column detects equal addresses in the In and Out rows, logic in the column moves the address packet (and subsequently the remaining packets) to the Chute (unless this chute slot is full due to a previous combine, as described earlier in this section.). The logic on the left is essentially an extra column, without the comparators, together with an ALU needed to sum the increments when two fetch-and-adds are combined. The lack of a comparator in this leftmost column does mean that some combining opportunities are lost. It was omitted to prevent the increased cycle
U
U
Ultracomputer, NYU
time that would result from one column implementing both addition and comparison. When a combined message moves from the rectangular array of cells to the left-side logic, the memory address plus some identifying information is stored in the wait buffer. If the operation was fetch-and-add, the Out-row increment is also stored. This enables the eventual memory response to be decombined into responses for the two originating requests.
Code Examples procedure NaiveP(S) -- faulty implementation of P(S) OK := False; repeat if F&A(S,-1) > 0 OK := True; else F&A(S,+1); end if; until OK; end NaiveP; procedure V(S) -- correct implementation F&A(S,+1); end V; procedure TrueP(S) -- correct implementation OK := False; repeat if S > 0 then if F&A(S,-1)> 0 then OK := True; else F&A(S,+1); end if; end if; until OK; end P;
Related Entries Buses and Crossbars Cedar Multiprocessor Flynn’s Taxonomy
Interconnection Networks Memory Models MIMD (Multiple Instruction, Multiple Data) Machines Networks, Multistage Non-Blocking Algorithms Nonuniform Memory Access (NUMA) Machines Parallel Computing PRAM (Parallel Random Access Machines) Reduce and Scan Scan for Distributed Memory, Message-Passing Systems Shared-Memory Multiprocessors Switch Architecture Switching Techniques Synchronization
Bibliographic Notes and Further Reading Perhaps the most polished account of the NYU Ultracomputer can be found in the book by Almasi and Gottlieb []. More information, plus an extensive bibliography, can be found in []. An early Ultracomputer paper [] was selected to appear in a collection of ISCA papers together with a retrospective on the project [].
Bibliography . Almasi G, Gottlieb A () Highly parallel computing, nd edn. Benjamin/Cummings, Redwood City . Dickey S, Kenner R, Snir M () An implementation of a combining network for the NYU ultracomputer. Ultracomputer Note , NYU . Dickey SR () Systolic combining switch designs. PhD thesis, NYU, May . Dijkstra EW () Hierarchical orderings of sequential processes. In: Hoare CAR, Perrot RH (eds) Operating systems techniques. Academic Press, New York . Edler J () Practical structures for parallel operating systems. PhD thesis, NYU, May . Eswaran KP, Gray JN, Lorie RA, Traiger IL (Nov ) The notion of consistency and predicate locks in a database system. Commun ACM ():– . Freudenthal E () Comparing and improving centralized and distributed techniques for coordinating massively parallel sharedmemory systems. PhD thesis, NYU, May . Gottlieb A () An overview of the NYU ultracomputer project. In: Dongarra JJ (ed) Experimental parallel computing architectures. North Holland, Amsterdam, pp – . Gottlieb A () A personal retrospective on the NYU ultracomputer. In: Sohi GS (ed) years of the international symposia
Unimodular Transformations
.
.
. . . .
on computer architecture (selected papers), Barcelona, ACM, New York, pp – Gottlieb A, Grishman R, Kruskal CP, McAuliffe KP, Rudolph L, Snir M () The NYU ultracomputer∣designing a MIMD shared memory parallel machine. In: International symposium on computer architecture, Austin, Apr , pp – Gottlieb A, Lubachevsky BD, Rudolph L (Apr ) Basic techniques for the efficient coordination of very large numbers of cooperating sequential processors. Trans Program Lang Syst ():– Guibas LJ, Liang F () Systolic stacks, queues, and counters. In: MIT conference on research in VLSI, Cambridge, pp – Rudolph LS () Software structures for ultraparallel computing. PhD thesis, NYU, Feb Schwartz JT (Oct ) Ultracomputers. ACM Toplas (): – Snir M, Solworth J () The ultraswitch—a VLSI network node for parallel processing. Ultracomputer Note , NYU, Jan
Uncertainty Quantification Terrestrial Ecosystem Carbon Modeling
Underdetermined Systems Linear Least Squares and Orthogonal Factorization
Unified Parallel C UPC
Unimodular Transformations Utpal Banerjee University of California at Irvine, Irvine, CA, USA
Definition A Unimodular Transformation is a general loop transformation based on a unimodular matrix. It changes a perfect nest of sequential loops into a similar nest with the same set of iterations executing in a different sequential order.
U
Discussion Introduction This essay gives the basic introduction to Unimodular Transformations. More advanced results on this topic are given in the essay Loop Nest Parallelization in this encyclopedia, where other transformations are considered as well. The program model is a perfect nest L of m sequential loops. An iteration of L is an instance of the body of L. The program consists of a certain set of iterations that are to be executed in a certain sequential order. This execution order imposes a dependence structure on the set of iterations, based on how they access different memory locations. A loop in L carries a dependence if the dependence between two iterations is due to the unfolding of that loop. A loop can run in parallel if it carries no dependence. A unimodular matrix is a square integer matrix with determinant or −. An m×m unimodular matrix U can be used to transform L into a perfect nest L′ of m loops, with the same set of iterations executing in a different sequential order. This is a unimodular transformation. The new nest L′ is equivalent to the old nest L (i.e., they give the same results), if whenever an iteration depends on another iteration in L, the first iteration is executed after the second in L′ . If a loop in L′ carries no dependence, then that loop can run in parallel. It turns out that a unimodular transformation always exists, such that L′ is equivalent to L and all loops in L′ , except possibly one, can run in parallel. There is a section on mathematical preliminaries to prepare the reader for what comes next. It is followed by a section on basic concepts. The general unimodular transformation is then explained in detail by examples. Finally, some important special features of special unimodular transformations like loop permutation and loop skewing are pointed out.
Mathematical Preliminaries For basic Linear Algebra concepts used in this essay, see any standard text. The set of real numbers is denoted by R and the set of integers by Z. For any positive integer m, Rm denotes the m-dimensional vector space over R consisting of all real m-vectors, where vector addition and scalar multiplication are defined coordinate-wise in the usual way. The zero vector (, , . . . , ) is written as .
U
U
Unimodular Transformations
The subset of Rm consisting of all integer m-vectors is denoted by Zm .
Lexicographic Order For ≤ ℓ ≤ m, define a relation ≺ℓ in Zm as follows: If i = (i , i , . . . , im ) and j = (j , j , . . . , jm ) are vectors in Zm , then i ≺ℓ j if i = j , i = j , . . . , i ℓ− = jℓ− , and i ℓ < jℓ . The lexicographic order ≺ in Zm is then defined by requiring that i ≺ j, if i ≺ℓ j for some ℓ in ≤ ℓ ≤ m. The notation i ⪯ j means either i ≺ j or i = j. Note that ⪯ is a total order in Zm . The associated relations ≻ and ⪰ are defined in the usual way: j ≻ i means i ≺ j, and j ⪰ i means i ⪯ j. (⪰ is also a total order in Zm .) An integer vector i is positive if i ≻ , nonnegative if i ⪰ , and negative if i ≺ . For x ∈ R, the value of sgn(x) is , −, or , according as x > , x < , or x = , respectively. The direction vector of a vector (i , i , . . . , im ) ∈ Zm is the vector of signs: (sgn(i ), sgn(i ), . . . , sgn(im )). Note that a vector is positive (negative) if and only if its direction vector is positive (negative). For more details on lexicographic order, see [].
Fourier’s Method of Elimination Fourier’s method of elimination can be used to decide if a system of linear inequalities in real variables has a solution []. Consider a set of n inequalities in m real variables x , x , . . . , xm , of the form aj x + aj x + ⋯ + amj xm ≤ cj
( ≤ j ≤ n),
()
where the aij ’s and the cj ’s are real constants. To solve this system, eliminate the variables one at a time in the order: xm , xm− , . . . , x . If an inequality c ≤ turns up during the elimination process where c is a positive constant, then the system has no solution. Otherwise, the final result is a system of inequalities of the form: bi (x , x , . . . , xi− ) ≤ xi ≤ Bi (x , x , . . . , xi− ) ( ≤ i ≤ m),
()
where each bi is either −∞ or the maximum of a set of linear functions of x , x , . . . , xi− , and each Bi is either ∞ or the minimum of a set of linear functions of x , x , . . . , xi− . In this case, b ≤ B . The set of all real solutions (x , x , . . . , xm ) to () is given by (): Pick any
x such that b ≤ x ≤ B , then pick any x such that b (x ) ≤ x ≤ B (x ), and so on. For the purpose of finding loop limits after a unimodular transformation, Fourier’s method is extended to check for any integer solutions to a system of the form (), where the coefficients aij ’s and cj ’s are integer constants. If () has no real solution, then obviously it has no integer solution. Assume then () has a set of real solutions (x , x , . . . , xm ) given by (). The set of all integer solutions (z , z , . . . , zm ) to () is then given by α i (z , z , . . . , zi− ) ≤ zi ≤ β i (z , z , . . . , zi− ) ( ≤ i ≤ m),
()
where αi and β i are defined by α i (z , z , . . . , zi− ) = ⌈bi (z , z , . . . , zi− )⌉ and β i (z , z , . . . , zi− ) = ⌊Bi (z , z , . . . , zi− )⌋ for each i. If α > β , then there is no integer solution. Otherwise, proceed by enumeration: Take any integer z such that α ≤ z ≤ β , then check if there is an integer z such that α (z ) ≤ z ≤ β (z ), and so on. The process described above is illustrated by the following example. For a detailed description of Fourier’s method including an algorithm, see []. Example Consider the system of linear inequalities ⎫ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ≤ − ⎪ ⎪ ⎪ ⎬ ⎪ ⎪ ≤ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ≤ −. ⎪ ⎪ ⎭
x − x ≤ −x + x x + x −x
()
Rearrange the inequalities so that the ones with a positive coefficient for x are at the top; they are followed by the ones where the coefficient for x is negative; and then come the inequalities where x is absent: ⎫ ⎪ −x + x ≤ − ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ x + x ≤ ⎪ ⎪ ⎪ ⎬ ⎪ ⎪ x − x ≤ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ≤ −. ⎪ −x ⎪ ⎭
()
Unimodular Transformations
Divide each of the first three inequalities by the corresponding coefficient of x : ≤ ≥ − .
− x + x ≤ − x
+ x
− x + x
⎫ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎬ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎭
Note how the direction of the third inequality is reversed after division, as the coefficient of x in it was negative. Isolate x in this system to get x x
x − ≤ − x + ≤
x − ≤ x .
⎫ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎬ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎭
⎫ ⎪ ⎪ x ≤ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎬ − x ≤ − ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ −x ≤ −. ⎪ ⎪ ⎭ Divide each inequality by the corresponding coefficient of x (and reverse the direction if that coefficient is negative): ⎫ ⎪ ⎪ x ≤ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎬ x ≥ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ x ≥ . ⎪ ⎪ ⎭ , ) or The minimum value of x is max ( maximum value is . Since the range
x − B (x ) = min ( x − , − x + ) . b (x ) =
≤ x − ≤ − x + .
⎫ ⎪ ⎪ ⎪ ⎪ ⎪ ⎬ ⎪ ⎪ ⎪ ⎪ ⎪ ⎭
Simplify these inequalities and bring in the last member of (): ⎫ ⎪ ⎪ − x ≤ − ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎬ x ≤ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ −x ≤ −. ⎪ ⎪ ⎭ There is no inequality where x is absent. Rearrange the system so that the single inequality with a positive
and the
()
is nonempty, the original system () has a real solution. All solutions have the form (x , x ), where x is a real number in the range () and x satisfies () for that value of x . Clearly, (x , x ) = (, ) is one such solution. Next, explore if () has an integer solution. The set of all integer vectors (z , z ) satisfying the four inequalities in () is given by
Now, eliminate x by setting each of its lower bounds less than or equal to each of its upper bounds: x − x −
,
≤ x ≤
()
where
coefficient for x goes to the top:
Thus, when x is given, x must satisfy b (x ) ≤ x ≤ B (x ),
U
α ≤ z ≤ β α (z ) ≤ z ≤ β (z ), where α = ⌈
⌉=
and
α (z ) = ⌈ z − ⌉ β (z ) = ⌊min ( z −
β = ⌊
⌋ = ,
U and , − z + )⌋ .
The only possible values of z are and . It is easy to show that for either value of z , a corresponding integral value for z does not exist. Thus, there are real solutions to the system of inequalities (), but no integer solutions.
U
Unimodular Transformations
Unimodular Matrices An m × m real matrix A = (art ) represents a linear mapping of Rm into itself that maps a vector x = (x , x , . . . , xm ) into the vector xA = (y , y , . . . , ym ), where m
yt = ∑ art xr
( ≤ t ≤ m).
r=
This mapping is a bijection (one-to-one and onto) if and only if A is invertible, that is, det A ≠ . A unimodular matrix is a square integer matrix whose determinant is or −. The product of two unimodular matrices is unimodular. A unimodular matrix is invertible and its inverse is also unimodular. Thus, an m × m unimodular matrix U maps each integer vector x ∈ Rm into an integer vector xU ∈ Rm , and for each integer vector y ∈ Rm , there is a unique integer vector x ∈ Rm such that y = xU. In other words, an m × m unimodular matrix defines a bijection of Zm onto itself. See [, ] for detailed discussions on unimodular matrices. For any m, the m × m unit matrix is unimodular. More unimodular matrices can be easily constructed from a unit matrix by elementary row (or column) operations. One gets a reversal matrix by multiplying a row of a unit matrix by −, an interchange matrix by interchanging two rows of a unit matrix, and a skewing matrix by adding an integer multiple of one row to another row. These are called the elementary matrices. The × matrices ⎛ ⎜ ⎜ ⎜ − ⎜ ⎜ ⎝
⎞ ⎛ ⎟ ⎜ ⎟ ⎜ ⎟,⎜ ⎟ ⎜ ⎟ ⎜ ⎠ ⎝
⎞ ⎟ ⎟ ⎟, ⎟ ⎟ ⎠
and
⎛ ⎞ ⎟ ⎜ ⎟ ⎜ ⎜ ⎟ ⎟ ⎜ ⎟ ⎜ ⎝ ⎠
are examples of a reversal, an interchange, and a skewing matrix, respectively. Elementary matrices are all unimodular. And each unimodular matrix can be expressed as the product of a finite number of elementary matrices. (See Lemma . in [].) A permutation of a finite set is a one-to-one mapping of the set onto itself. A permutation matrix is any matrix that can be obtained by permuting the columns (or rows) of a unit matrix. More formally, a permutation matrix is a square matrix of ’s and ’s, such that each row has exactly one and each column has exactly one . An interchange matrix is a permutation matrix. A product of interchange matrices is clearly a permutation
matrix. The converse is also true: every permutation matrix can be expressed as a finite product of interchange matrices. In fact, every permutation matrix can be expressed as a finite product of interchange matrices each of which interchanges two adjacent columns (or rows). A permutation matrix is unimodular. (See Sect. . in [].) Let U denote any m × m permutation matrix. For ≤ i ≤ m, let π(i) denote the number of the row containing the unique in column i. Then, the function π : i ↦ π(i) is a permutation of the set {, , . . . , m}, and it completely specifies the matrix U. For any m-vector x = (x , x , . . . , xm ), one has xU = (xπ() , xπ() , . . . , xπ(m) ) .
()
Basic Concepts The program model is a perfect nest of loops L = (L , L , . . . , Lm ) shown in Fig. . For ≤ r ≤ m, the index variable Ir of Lr runs from pr to qr in steps of , where pr and qr are integer-valued linear functions of I , I , . . . , Ir− . The index vector of the loop nest is I = (I , I , . . . , Im ). An index point or index value of the nest is a possible value of the index vector, that is, a vector i = (i , i , . . . , im ) ∈ Zm such that pr ≤ ir ≤ qr for ≤ r ≤ m. The subset R of Zm consisting of all index points is the index space of the loop nest. During sequential execution of the nest, the index vector traverses the entire index space in the lexicographic order. The body of the loop nest L is denoted by H(I). A given index value i defines a particular instance H(i) of H(I), which is an iteration of L. The program of Fig. consists of the set of iterations {H(i) : i ∈ R}, and they are executed sequentially in such a way that an iteration H(i) comes before an iteration H(j) if and only if i ≺ j. There is dependence between two distinct iterations H(i) and H(j) if there is a memory location that is referenced (read or written) by both. (Sometimes, it may be required that at least one of the references be a “write.”) Suppose there is dependence between H(i) and H(j). L1 : L2 : .. . Lm :
do I1 = p1, q1 do I2 = p2, q2 .. . do Im = pm, qm H(I)
Unimodular Transformations. Fig. Loop nest L
Unimodular Transformations
Then H(j) depends on H(i) if H(j) is executed after H(i), that is, if i ≺ j. If H(j) depends on H(i), then d = j − i is a (dependence) distance vector of the loop nest. Since i ≺ j, a distance vector is necessarily positive. To keep matters simple, it is assumed that each distance vector d is uniform in the sense that whenever i and i+d are two index points, the iteration H(i + d) depends on the iteration H(i). Let N denote the total number of distinct distance vectors. Then the distance matrix of L is an N ×m matrix D whose rows are the distance vectors. The distance matrix is unique up to a permutation of its rows. If d is a distance vector of L, then the direction vector of d is a (dependence) direction vector of the loop nest. A direction vector of L is necessarily positive. The direction matrix of L is a matrix Δ with m columns whose rows are the direction vectors. The direction matrix is unique up to a permutation of its rows. A loop Lr in the nest L, where ≤ r ≤ m, carries a dependence if there is a distance vector d with d ≻r . A loop can run in parallel if it carries no dependence. According to this definition, a trivial loop with a single iteration can run in parallel. This is helpful in avoiding special cases. Let U denote an m × m unimodular matrix. The image of R under U is the set {IU : I ∈ R}. By using Fourier’s method of elimination, one can get a suitable description for this image. Note that the index space R consists of all integer m-vectors I = (I , I , . . . , Im ) that satisfy the system of inequalities: pr ≤ Ir ≤ qr
( ≤ r ≤ m),
()
where pr and qr are integer-valued linear functions of I , I , . . . , Ir− . Let K = IU, so that I = KU− . If K = (K , K , . . . , Km ) and U− = (vtr ), then Ir = ∑m t= vtr Kt for each r. Replacing each Ir in () with the corresponding expression in K , K , . . . , Km , one gets a system of linear inequalities in K , K , . . . , Km . Fourier’s method then yields an equivalent system of inequalities of the following type: α r ≤ Kr ≤ β r
( ≤ r ≤ m),
()
where α r and βr are integer-valued functions of K , K , . . . , Kr− . The image of R under U consists of all integer m-vectors K = (K , K , . . . , Km ) that satisfy (). The loop nest L′ of Fig. , where H ′ (K) = H(KU− ), defines the unimodular transformation of the loop nest L
L1 :
L2 : .. . Lm :
U
do K1 = α1, β1 do K2 = α2, β2 .. . do Km = αm, βm H(K)
Unimodular Transformations. Fig. Loop nest L′
induced by the unimodular matrix U. The transformed program has the same set of iterations as the original program, executing in a different sequential order. Two iterations H(i) and H(j) in the original nest L become iterations H ′ (k) and H ′ (l), respectively, in the transformed nest L′ , where k = iU and l = jU. In L, H(i) precedes H(j) if i ≺ j. But in L′ , H ′ (k) precedes H ′ (l) if k ≺ l, that is, if iU ≺ jU. The loop nest L′ is equivalent to L, if whenever H(j) depends on H(i) in L, H ′ (k) precedes H ′ (l) in L′ , that is, iU ≺ jU. The transformation L ⇒ L′ is valid if L′ is equivalent to L. Theorem The transformation L ⇒ L′ induced by a unimodular matrix U is valid if and only if dU ≻ for each distance vector d of L. Proof The “if ” Part. Suppose dU ≻ for each distance vector d of L. Let an iteration H(j) depend on an iteration H(i). Since j − i is a distance vector of L, one gets jU − iU = (j − i)U ≻ . Hence, iU ≺ jU. Therefore, L′ is equivalent to L, and the transformation L ⇒ L′ is valid. The “only if ” Part. Suppose the transformation L ⇒ L′ is valid. Let d denote any distance vector of L. There exist index points i and j of L, such that d = j − i and the iteration H(j) depends on the iteration H(i). By hypothesis, iU ≺ jU, so that dU = jU − iU ≻ . ∎ Corollary If the transformation L ⇒ L′ by a unimodular matrix U is valid, then the distance vectors of L′ are the vectors dU where d is any distance vector of L. Proof There is dependence between two iterations H ′ (k) and H ′ (l) of L′ if and only if there is dependence between the iterations H(i) and H(j) of L, where k = iU and l = jU. Since l − k = (j − i)U, the distance vectors of L′ are vectors of the form ±dU where d is a distance vector of L. Now a distance vector has to be positive. Since dU ≻ for each d (Theorem ), it follows that the distance vectors of L′ are precisely the vectors dU. ∎
U
U
Unimodular Transformations
Parallelization by Unimodular Transformation The goal of this section is to explain by examples how a loop nest can be transformed by a unimodular matrix, such that the innermost or outermost loops in the transformed program can run in parallel. Detailed algorithms are given in the essay Loop Nest Parallelization in this encyclopedia.
(, )U ≻ (, ) are both false. Thus, to get an equivalent nest L′ whose inner loop can run in parallel, one must have (, )U ≻ (, ) and (, )U ≻ (, ), that is, (u , u ) ≻ (, ) and (u , u ) ≻ (, ), or u > and u > . A simple unimodular matrix U that fits these constraints, and its inverse are shown below: ⎛ U=⎜ ⎜ ⎝
Example (Inner Loop Parallelization) Consider a nest L of two loops: L : L :
do I = , do I = , H(I) : A(I , I ) = A(I − , I ) + A(I , I − )
The index space R of L is given by
I2
3 2 1
2
3
U
⎛ =⎜ ⎜ ⎝ −
⎞ ⎟. ⎟ ⎠
⎫ ⎪ ⎪ ⎪ ⎪ ⎬ ⎪ ⎪ ≤ min(, K ). ⎪ ⎪ ⎭
≤ K ≤
The index space and the dependence between iterations are shown in Fig. . The only distance vectors of L are (, ) and (, ). Each of the two loops carries a dependence, and hence cannot run in parallel. Let U = (urt ) denote any × unimodular matrix. Let the unimodular transformation of L induced by U result in a loop nest L′ = (L′ , L′ ). By Theorem , L′ is equivalent to L if (, )U ≻ (, ) and (, )U ≻ (, ). Assume L′ is equivalent to L. Then, by Corollary to Theorem , the distance vectors of L′ are (, )U and (, )U. The inner loop L′ of L′ can run in parallel if it carries no dependence, that is, if (, )U ≻ (, ) and
1
and
−
The equation (I , I ) = (K , K )U− gives I = K − K and I = K . Since ≤ I ≤ and ≤ I ≤ , one gets the inequalities ≤ K − K ≤ and ≤ K ≤ that lead to the ranges (by Fourier’s method):
R = {(I , I ) : ≤ I ≤ , ≤ I ≤ }.
0
⎞ ⎟ ⎟ ⎠
I1
Unimodular Transformations. Fig. Index space of L
max(, K − ) ≤ K
Thus, the given program L is equivalent to the loop nest L′ : L′ : L′ :
do K = , do K = max(, K − ), min(, K ) H ′ (K) :A(K − K , K ) = A(K − K − , K ) + A(K − K , K − )
where the new body H ′ (K) is obtained from H(I) by replacing I with K − K and I with K . The index space of the transformed loop nest L′ is shown in Fig. . The distance vectors of L′ are (, ) and (, ). The inner loop L′ can now run in parallel since it carries no dependence. In Fig. , consider the set of parallel lines I +I = c, where ≤ c ≤ . They correspond to the values of K . On any given line, the index points correspond to the values of K for a fixed K . Executing L′ sequentially and L′ in parallel means that these lines are taken sequentially from c = to c = , while iterations for index points on each given line are executed in parallel. Since the lines appear to constitute a “wave” through the index space of L, the name wavefront method is often given to this process of executing a loop nest. In the essay Loop Nest Parallelization, a general algorithm is given that finds a valid unimodular transformation for any given loop nest such that all but
Unimodular Transformations I2
6 5 4 3 2 1
0
1
2
3
4
5
I1
6
Unimodular Transformations. Fig. Wave through the index space of L
Let U = (urt ) denote any × unimodular matrix. Let the unimodular transformation of L induced by U result in a loop nest L′ = (L′ , L′ ). By Theorem , L′ is equivalent to L if (, )U ≻ (, ) and (, )U ≻ (, ). Assume L′ is equivalent to L. Then, by Corollary to Theorem , the distance vectors of L′ are (, )U and (, )U. The outer loop L′ of L′ can run in parallel if it carries no dependence, that is, if (, )U ≻ (, ) and (, )U ≻ (, ) are both false. Thus, to get an equivalent nest L′ whose outer loop can run in parallel, one must have (, )U ≻ (, ) and (, )U ≻ (, ). The conditions on the elements of U are then ⎫ ⎪ u + u = ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ u + u > ⎪ ⎪ ⎪ ⎬ ⎪ ⎪ u + u = ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ u + u > . ⎪ ⎪ ⎭
⎛ − U=⎜ ⎜ ⎝
3
⎞ ⎟ ⎟ ⎠
and
⎛ U− = ⎜ ⎜ ⎝
⎞ ⎟. ⎟ ⎠
The limits on I and I are given by ≤ I ≤ and ≤ I ≤ . Since
2 1
1
2
3
4
5
6
(I , I ) = (K , K )U− = (K , K + K ),
K1
the constraints in terms of K and K are
Unimodular Transformations. Fig. Index space of L′ in Example
≤
K ≤
≤ K + K ≤ the outermost loop in the transformed nest can run in parallel. See also Remark . Example (Outer Loop Parallelization) Consider the loop nest L: L : L :
A simple unimodular matrix U that fits these constraints, and its inverse are shown below:
K2
0
U
do I = , do I = , H(I) : X(I , I ) = X(I − , I − ) + X(I − , I − )
The only distance vectors of L are (, ) and (, ). Since the outer loop carries a dependence, it cannot run in parallel.
.
By Fourier’s method of elimination, one gets − ≤ K ≤ ⌈max(, − K /)⌉ ≤ K ≤ ⌊min(, − K /)⌋ . Thus, the given program L is equivalent to the loop nest L′ : L′ : L′ :
do K = −, do K = ⌈max(, − K /)⌉, ⌊min(, − K /)⌋ H′ (K) : X (K , K + K ) = X(K − , K + K − ) + X(K − , K + K − )
U
U
Unimodular Transformations
where the new body H ′ (K) is obtained from H(I) by replacing I with K and I with K + K . The distance vectors of L′ are (, ) and (, ). The outer loop L′ can now run in parallel since it carries no dependence.
Corollary If the loop permutation L ⇒ L′ is valid, then the direction vectors of L′ are the vectors (σπ() , σπ() , . . . , σπ(m) ), where (σ , σ , . . . , σm ) is any direction vector of L.
In general, if the rank of the distance matrix for L is ρ, then a unimodular matrix U can always be found such that the outermost (m − ρ) loops L′ , L′ , . . . , L′m−ρ of L′ can run in parallel. Using inner loop parallelization techniques on top of this, one may get an equivalent loop nest where all loops except L′m−ρ+ can run in parallel. Details are given in the essay Loop Nest Parallelization. There are some special unimodular transformations that can be handled more easily than a general transformation. They are described next.
A direction vector (σ , σ , . . . , σm ) ≻ prevents the loop permutation defined by a permutation π of {, , . . . , m} if (σ π(), σπ() , . . . , σπ(m) ) ≺ . A very useful permutation is the interchange of two loops Lr and Lr+ in L, where ≤ r < m. The only direction vectors that would prevent it have the form (, , . . . , , , −, ∗, ∗, . . . , ∗), where there are (r − ) zeros in the front and “∗” indicates any member of the set {, , −}. Algorithm . in [] shows how to find all direction vectors that prevent the permutation of a loop nest L defined by any given permutation π of {, , . . . , m}.
Loop Permutation
Remark Loop permutation is often used to achieve a desirable memory access pattern. A few comments are made here about inner and outer loop parallelization using permutation (see []). Let ≤ r ≤ m. Note that the loop Lr carries a dependence if and only if there is a direction vector of the form (, , . . . , , , ∗, . . . , ∗) with (r − ) leading zeros. Consider these special cases:
Again, the model program is the loop nest L of Fig. , and the transformation of L induced by a unimodular matrix U results in the loop nest L′ of Fig. . Let U denote a permutation matrix. Then the transformation L ⇒ L′ is a loop permutation. If π is the corresponding permutation of the set {, , . . . , m}, then the index vector K = (K , K , . . . , Km ) of L′ is given by Kr = I π(r) for ≤ r ≤ m. (See ().) One important feature of loop permutations is that their validity can be determined from a knowledge of direction vectors alone. This is quite useful since the number of direction vectors is never more (and often less) than the number of distance vectors, and they may be easier to compute and store than distance vectors. Theorem Let U denote an m × m permutation matrix and π the corresponding permutation of the set {, , . . . , m}. The loop permutation L ⇒ L′ induced by U is valid, if and only if for each direction vector σ = (σ , σ , . . . , σm ) of L, the vector (σπ() , σπ() , . . . , σπ(m) ) is positive. Proof The “if ” Part. Suppose the condition holds. Let d = (d , d , . . . , dm ) denote a distance vector of L. Let σ be the direction vector of d. Since σi = sgn(di ), it follows that (sgn(dπ() ), sgn(dπ() ), . . . , sgn(dπ(m) )) is positive. This means (dπ() , dπ() , . . . , dπ(m) ) ≻ , that is, dU ≻ . (See ().) Hence the transformation L ⇒ L′ is valid by Theorem . The “only if ” Part can be proved similarly. ∎
. Column r of the direction matrix Δ has only ’s. By a suitable loop permutation, Lr can be moved inward or outward to any position, and it can run in parallel there. (When a loop moves, its limits may change.) . Column r of Δ has only ’s and ’s. Then Lr can move outward to any position. . Column r of Δ has only ’s. Then Lr can move outward to any position. If it is made the outermost loop, then no other loop in L′ will carry a dependence and each inner loop can run in parallel. . Lr carries no dependence. Then Lr can move inward to any position where it can run in parallel. (Even when Lr carries a dependence, it may still be possible to move it inward to a position where it runs in parallel.) . If there is no − in the direction matrix Δ, then every permutation of the nest L is valid.
Loop Skewing Let a and b denote two integers such that ≤ a < b ≤ m. Let U denote the matrix obtained from the m × m unit matrix by replacing the on row a and column b with an
Unimodular Transformations
integer z. Then U is a skewing matrix and unimodular. The transformation L ⇒ L′ induced by U is a skewing of the loop Lb by the loop La with skewing factor z. (More precisely, it is an upper skewing.) The index vector K of L′ is given by
Proof Let d = (d , d , . . . , dm ) denote any distance vector of L. Then
Example Consider a loop nest L = (L , L , L ) with the distance matrix ⎛ − ⎞ ⎜ ⎟ ⎜ ⎟ ⎟. D=⎜ − ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎝ ⎠
K = IU = (I , . . . , Ia , . . . , Ib− , Ib + zIa , Ib+ , . . . , Im ). Theorem Let ≤ a < b ≤ m. A skewing of the loop Lb in the nest L by the loop La is always valid.
U
By applying loop skewing repeatedly and a loop permutation at the end, D can be transformed into the form described in Remark :
dU = (d , d , . . . , db− , db + zda , db+ , . . . , dm ). Since d ≻ , it follows that (d , d , . . . , db− ) ⪰ . If (d , d , . . . , db− ) ≻ , then dU ≻ . Suppose (d , d , . . . , db− ) = . Then da = since ≤ a < b. Hence, dU = (, , . . . , , db , db+ , . . . , dm ) = d ≻ . Now apply Theorem . ∎ Corollary After a skewing of a loop Lb by a loop La with a skewing factor z, the distance vectors of the transformed nest L′ are vectors of the form (d , d , . . . , db− , db + zda , db+ , . . . , dm ), where (d , d , . . . , dm ) is any distance vector of L. Remark Let d = (d , d , . . . , dm ) be a distance vector of L. Since d is positive, it must be of the form d = (, . . . , , da , da+ , . . . , dm ) where ≤ a ≤ m and da > . Suppose there is a b such that a < b ≤ m and db < . Then skewing of the loop Lb by the loop La with the positive skewing factor ⌈−db /da ⌉ will change d into a distance vector (d , d , . . . , db− , d′b , db+ , . . . , dm ), where d′b ≥ . For a given distance vector d with a leading element da , it is possible to skew all such loops Lb in one step via a single unimodular matrix that is the product of a number of skewing matrices. By systematically considering all distance vectors of the above type and applying loop skewing repeatedly, it is thus possible to find a loop nest L′ equivalent to L such that the distance matrix D′ of L′ has all nonnegative entries. Furthermore, by a slight extension, one can make column m of D′ a vector of positive integers. Then column m of the direction matrix Δ′ of L′ will be a column of ’s. By Remark ., the innermost loop L′m of L′ can be made the outermost loop to create an equivalent program where all the (m − ) inner loops can run in parallel. See the example below.
The index vector of the loop nest transforms as follows: (I , I , I ) ⇒ (I , I , I + I ) ⇒ (I , I + I , I + I + I ) ⇒ (I , I + I , I + I + I ) ⇒ (I + I + I , I , I + I ) = (K , K , K ),
where (K , K , K ) is the index vector of the final loop nest. Thus, (I , I , I ) = (K , −K + K , K + K − K ). Limits for K , K , K can be computed from the limits for I , I , I by Fourier’s method in the usual way. The two inner loops of the final nest do not carry a dependence, and hence they can run in parallel.
Related Entries Code Generation Parallelization, Automatic Loop Nest Parallelization Parallelism Detection in Nested Loops, Optimal
Bibliographic Notes and Further Reading Research on unimodular transformations goes back to Leslie Lamport’s paper [] in , although he did not
U
U
Universal VLSI Circuits
use this particular term. This essay is based on the theory of unimodular transformations as developed in the author’s books on loop transformations [, ]. The general theory grew out of the theory of unimodular transformations of double loops described in []. (Watch for some notational differences between those references and the current essay.) The paper by Michael Wolf and Monica Lam [] covers many aspects of unimodular transformations in detail. See also Michael Dowling [], Erik H. D’Hollander [], and François Irigoin and Rémi Triolet []. Loop interchange has been studied extensively by Michael Wolfe [, ], and by Randy Allen and Ken Kennedy [, ]. Implementation of the wavefront method by loop skewing and interchange was discussed by Wolfe in []. The same technique is described in a general setting in []. Wei Li and Keshav Pingali [] discuss a loop transformation framework based on the class of nonsingular matrices (that includes unimodular matrices).
. Schrijver A (Oct ) Theory of linear and integer programming. Wiley, New York . Williams HP (Nov ) Fourier’s method of linear programming and its dual. The American Mathematical Monthly ():– . Wolf ME, Lam MS (Oct ) A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans Parallel Distrib Syst ():– . Wolfe M (Aug ) Loop skewing: the wavefront method revisited. Int J Parallel Program ():– . Wolfe M () Advanced loop interchanging. In: Proceedings of the international conference on parallel processing, St. Charles, – Aug , pp –. IEEE Computer Society Press, Los Angeles . Wolfe MJ () Optimizing supercompilers for supercomputers. The MIT Press, Cambridge
Universal VLSI Circuits Universality in VLSI Computation
Bibliography . Allen R, Kennedy K (Oct ) Automatic translation of FORTRAN programs to vector form. ACM Trans Program Lang Syst ():– . Allen R, Kennedy K (Oct ) Optimizing compilers for modern architectures. Morgan Kaufmann, San Francisco . Banerjee U () Unimodular transformations of double loops. In: Proceedings of the third workshop on languages and compilers for parallel computing, Irvine, – Aug . Available as Nicolau A, Gelernter D, Gross T, Padua D (eds) () Advances in languages and compilers for parallel computing. The MIT Press, Cambridge, pp – . Banerjee U () Loop transformations for restructuring compilers: the foundations. Kluwer Academic, Norwell . Banerjee U () Loop transformations for restructuring compilers: loop parallelization. Kluwer Academic, Norwell . D’Hollander EH (July ) Partitioning and labeling of loops by unimodular transformations. IEEE Trans Parallel Distrib Syst ():– . Dowling ML (Dec ) Optimal code parallelization using unimodular transformations. Parallel Comput (–):– . Irigoin F, Triolet R () Dependence approximation and global parallel code generation for nested loops. Cosnard M et al. (eds) Parallel distrib algorithms. Elsevier (North-Holland), New York, pp – . Lamport L (Feb ) The parallel execution of DO loops. Commun ACM ():– . Li W, Pingali K (Apr ) A singular loop transformation framework based on non-singular matrices. Int J Parallel Program ():–
Universality in VLSI Computation Gianfranco Bilardi , Geppino Pucci University of Padova, Padova, Italy Università of di Padova, Padova, Italy
Synonyms Area-universal networks; Universal VLSI circuits
Definition A VLSI circuit U(A) is said to be area-universal if it can be configured to emulate every VLSI circuit of a given area A. If U(A) has area AU , then it has blowup α = AU /A. If any circuit with areatime bounds (A, T) is emulated by U(A) in time TU ≤ σ T, then U(A) has slowdown σ . Clearly, smaller blowup and smaller slowdown reflect a better quality of a universal circuit. An analogous formulation enables the study of area-universal general-purpose routing. The broad goal of research on area-universality is to characterize blowup/slowdown trade-offs, that is, what is the minimum blowup achievable for any given slowdown.
Universality in VLSI Computation
Discussion Models of VLSI Computation The VLSI model of computation (see Article [] of this Encyclopedia) was formulated in the late seventies to capture the resource trade-offs of computation by large-scale integrated circuits. In this model, a computational engine is an interconnection of basic Boolean gates and storage elements (flip-flops), with a specified space layout defining the geometric placement of these building blocks (nodes) and of their interconnections (wires). Although the details of the model were developed with integrated circuit technology in mind, the spirit of the model is quite general and captures constraints that apply to any reasonable physical computing device. These key physical contraints can be stated as follows. P – Principle of information density. There is a lower bound to the amount of space required to store a bit. P – Principle of bandwidth density. There is an upper bound to the amount of information that can be transmitted through a unit of surface in a unit of time. P – Principle of computation delay. There is a lower bound to the time necessary to compute a Boolean function of a given constant number of inputs. P – Principle of communication delay. There is an upper bound to the speed at which information can travel. The bounds stated by the above principles are assumed to be fixed positive constants. Any given technology typically comes to within constant factors of these bounds and technological progress aims, by and large, to an improvement of these factors. Most of the research on VLSI computing has been based on Thompson’s model [], which does not actually include principle P, as it assumes that information can be transmitted in constant time along a wire, irrespective of its length. The rationale was that, in the VLSI systems of the time, transmission delays did not add substantially to computation delays. An alternate VLSI model, including principle P, was proposed by Chazelle and Monier []. A detailed study of how the predicted development of silicon technology would have impacted the choice between the two models was carried out by Bilardi, Pracchi, and Preparata in [], indicating (correctly, in retrospective)
U
that the Thompson model would prove to be a reasonable approximation for a decade or more, at least with reference to single-chip systems. Furthermore, even when not applicable to the entire computing system, Thompson’s model still leads to valuable insights on the design of sufficiently small regions of the system. In this entry, VLSI universality will be discussed in the regime where communication delays are negligible. In the complementary regime, governed by principle P, computing systems do exhibit different behaviors, with signifcant impact on machine organization and algorithm design, as discussed by Bilardi and Preparata in [].
The Question of Universality Many studies have utilized the VLSI model of computation to explore area-time trade-offs in special-purpose computing. Given a target problem, the objective is to determine, for any achievable computation time T, the minumum area A such that there is a VLSI circuit that can be laid out in area A and can solve the problem in time T. The area-time trade-off has been determined for several key problems. For example, the cumulative product of N elements drawn from a semigroup of fixed size can be computed in area A = Θ(N/T), where Ω(log N) ≤ T ≤ O(N). For the addition of two N-bit numbers, the area becomes A = Θ((N/T) log(N/T)), where Ω(log N) ≤ T ≤ O(N). Finally, the integer multiplication of two N-bit numbers, the sorting of n integer keys of log n bits each (hence N = n log n input bits overall), and the n-point Discrete Fourier Transform in the ring of the integers modulo m (hence N = n log m input bits overall), are all problems which feature optimal circuits √ of area A = Θ(N /T ), where Ω(log N) ≤ T ≤ O ( N). The designs achieving the optimal area-time performance are typically based on different topologies for different problems, or even for different computation times concerning the same problem. Thus, a variety of interconnection topologies have been exploited for various operations, including meshes of various dimensions, the tree, the shuffle-exchange, the cube connected cycles and its pleated version, the meshof-trees, and several others. This state of affairs seems to lead to an unwelcome conclusion: if the optimal execution of different computations requires different architectures, then no architecture is universally good,
U
U
Universality in VLSI Computation
and general-purpose computing is inherently suboptimal. Is this conclusion really warranted? The theory of area-time universality in VLSI is a way to address this fundamental question. To be specific and quantitative, the question for the case of circuits can be formulated as follows (the adaptation to routers will be discussed at the end of the next section). A VLSI circuit U(A) is area-universal if it can be programmed to emulate every circuit of a given area A. If U(A) has area AU , then its blowup is α = AU /A. If any circuit with area-time bounds (A, T) is emulated by UA in time TU ≤ σ T, then U(A) has slowdown σ . Clearly, smaller blowup and smaller slowdown reflect a better quality of a universal circuit. The general goal is to characterize the blowup–slowdown trade-off, that is, what is the minimum blowup achievable for any given slowdown. In the VLSI model, the basic processing elements are Boolean gates with a fixed number, say q, of inputs and outputs. If the unit of length is chosen to reflect the feature size of the technology, then the area of an elementary gate is a small, integer constant. It is straightforward to design a q-universal gate of constant area aq , programmable to compute any Boolean function of q inputs and outputs, in constant time τ q . A universal circuit U(A) can then be conceived as a set of A universal gates connected by a routing network programmable to simulate the interconnection of any specific circuit to be emulated. Since the universal Boolean gates contribute a constant factor both to the blowup and to the slowdown, the pivotal issue is the area-time performance of the routing network. Although a general permutation network could easily be adapted for the purpose, it would result in an inefficient solution. In fact, a data exchange that can be realized in unit time by a set of wires laid out in area A is considerably more constrained than a general permutation. To be area-time efficient, a router must take advantage of such constraints, which are formulated quantitatively in the next section.
Bandwidth and Area
A key insight of the theory of VLSI layout is that the area of a graph is closely related to how well such a graph can be embedded into a binary tree. This insight is systematically and beautifully developed by Bhatt and Leighton
in []. Below, the results relevant to the present discussion are reviewed and cast in the unifying terminology of graph embedding. In an embedding, each vertex v ∈ V of a guest graph G = (V, E) is mapped onto a node ϕ(v) ∈ W of a host graph H = (W, D), and each edge (a, b) ∈ E of G is mapped onto a path ψ(a, b) joining node ϕ(a) to node ϕ(b) in H. The load ℓ(w) of a host node w ∈ W is the number of guest vertices mapped onto w. The dilation d(a, b) of a guest edge (a, b) ∈ E is the length (number of edges) of the corresponding host path ψ(a, b). The congestion c(u, w) of a host edge (u, w) ∈ D is the number of guest edges whose corresponding path includes edge (u, w). Good embeddings are characterized by small values of load, dilation, and congestion. A two-dimensional VLSI layout of a graph G, for simplicity assumed throughout to be of bounded degree, can be defined as an embedding of G into a rectangular grid, subject to three constraints: (1) the load of each grid node is at most one; (2) the congestion of each grid edge is at most one; and (3) the grid path associated with a guest edge (a, b) can only traverse grid nodes of zero load, except for ϕ(a) and ϕ(b). It turns out that a good layout of graph G leads rather straightforwardly to a good embedding of G in a binary tree; it is also the case that a good layout can be obtained from a good tree embedding, although the process is more subtle. Intuitively, searching for a good tree embedding is easier than searching for a good grid layout, due to the removal of constraints (1), (2), and (3) above and to the fact that a tree embedding is uniquely determined by vertex placement, if simple paths (which are unique for any choice of their endpoints) are used for hosting edges. The price to pay for the simplification is that, even if the given tree embedding is optimal, the area of the resulting layout may be suboptimal by up to two logarithmic factors. A tree embedding can be obtained from a layout as follows. By simple transformations that increase the area by at most a constant factor, the layout is made to fit a √A × √A square grid, where √A is an integer power of two. Next, each grid vertex is labeled with a pair of integer coordinates (i, j), with i, j = 0, . . . , √A − 1. The grid vertices are labeled so that vertex (0, 0) is the north-west corner of the grid, thus i increases eastward and j increases southward. Let H = (W, D) be a complete, ordered binary tree of A leaves, numbered 0, . . . ,
A − 1, from left to right. The tree embedding is based on the shuffle-major one-to-one correspondence that associates grid vertex (i, j) with leaf k of tree H, where the binary representation of k is obtained from those of i ≡ i_{(log A)/2−1} . . . i_1 i_0 and j ≡ j_{(log A)/2−1} . . . j_1 j_0 by interleaving their bits, as k = shm(i, j) ≡ j_{(log A)/2−1} i_{(log A)/2−1} . . . j_1 i_1 j_0 i_0 (a small C routine computing shm by bit interleaving is sketched at the end of this section). Then, each node of the tree is naturally associated with the region of the grid matched to its descendant leaves. This association can be understood as a recursive decomposition, with the root corresponding to the entire layout, the left and the right children of the root, respectively, corresponding to the west and to the east halves of the layout, the four grandchildren of the root corresponding, from left to right, to the north-west, south-west, north-east, and south-east quadrants, and so on. With the preceding setup, a layout of a graph G = (V, E) naturally induces an embedding of G in H where (1) a vertex v of G placed at vertex (i, j) of the grid is mapped onto leaf ϕ(v) = shm(i, j) of H and (2) an edge (a, b) is mapped onto the unique simple tree path from ϕ(a) to ϕ(b). In the resulting embedding, the load is one for the leaves and zero for the internal nodes. Furthermore, edges have dilation at most 2 log A. Finally, the congestion Cq of a tree edge joining a node w at level q + 1 to its parent u (at level q) is at most a constant multiple of √(A/2^q), with slightly different constants at even levels (q = 0 (root), 2, . . . , log A − 2) and at odd levels (q = 1, 3, . . . , log A − 1), since Cq cannot exceed the perimeter of the layout region corresponding to node w, as the layout of all the edges of G congesting tree edge (w, u) must cross the boundary of said region. A graph G is said to have tree-area Ã if Ã is the minimum value such that G admits a tree embedding with the properties just derived for graphs of area Ã. From a tree embedding of area Ã, a grid layout of area A = O(Ã log² Ã) can be constructed in polynomial time []. The procedure is subtle and exploits some powerful combinatorial lemmas. A variant of the procedure also yields a more sophisticated embedding of G in H satisfying the following properties. The load of any node at level q is at most λ√(A/2^q); the congestion of any edge joining a node w at level q + 1 to its parent is at most γ√(A/2^q); and the dilation of any edge is bounded by a constant, where λ and γ are suitable constants. The developments of [] revolve around the concept of the minimum √2-bifurcator of a graph G, denoted by F and related to the tree-area Ã as F = Θ(√Ã).
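Although the construction above is purely combinatorial, the shuffle-major map itself is simple to compute. The following C sketch is a hypothetical helper, not part of the cited constructions; it assumes the coordinates fit in 32 bits and that the bit count per coordinate, (log A)/2, is supplied by the caller.

    #include <stdint.h>
    #include <stdio.h>

    /* Shuffle-major correspondence: interleave the bits of the grid
     * coordinates (i, j), with j supplying the more significant bit
     * of each pair, so that k = ( ... j1 i1 j0 i0 ) in binary.
     * 'bits' is the number of bits per coordinate, i.e., (log A)/2. */
    static uint64_t shm(uint32_t i, uint32_t j, unsigned bits)
    {
        uint64_t k = 0;
        for (unsigned b = 0; b < bits; b++) {
            k |= (uint64_t)((i >> b) & 1u) << (2 * b);       /* bit i_b */
            k |= (uint64_t)((j >> b) & 1u) << (2 * b + 1);   /* bit j_b */
        }
        return k;
    }

    int main(void)
    {
        /* For a 4 x 4 grid (A = 16, so 2 bits per coordinate):
           vertex (1, 2) maps to leaf (j1 i1 j0 i0) = (1 0 0 1) = 9. */
        printf("%llu\n", (unsigned long long) shm(1, 2, 2));
        return 0;
    }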
Most machines meant for general-purpose computing consist of a set of processing elements that can exchange messages via a routing network. In this context, it becomes interesting to compare machines with the same type of processing elements, but with different routers. Analogously to the question raised in the previous subsection for circuits, one can ask how well a given general-purpose router R(A) can deliver messages, compared to any router G of area A. To this end, it is useful to define the load factor λ of a given set of messages, whose sources and destinations (called terminals) are placed in a square region of area A, as the maximum over all the edges of the tree H (defined above) of the ratio between the number of messages with source and destination on opposite sides of an edge and the congestion value Cq associated with that edge. It can be observed that λ is a lower bound on the routing time of a message set for any network that can be laid out in a square region of area A. Therefore, the quality of a universal VLSI router is conveniently captured by its blowup and by its routing time, expressed as a function of λ and A.
Universal Fat-Tree Architectures
Most efficient area-universal networks proposed in the literature exhibit a tree-like structure, with the bandwidth of child-to-parent channels growing from the leaves toward the root. Informally, these networks are called fat-trees. Usually, the leaves of a fat-tree act as processing elements (in universal circuits) or as terminals (in universal routers) and the subnetworks located at the internal nodes perform a mere switching role. However, in a few notable cases, internal nodes also play an active role. Typically, the bandwidth between a node and its parent doubles every other level, going from constant at the N leaves to √N at the root. All variants of fat-trees considered in the literature have layout area O(N log N). In [], Bilardi and Bay define the class of channel-sufficient fat-trees as those where the maximum number of edge-disjoint paths connecting any two sets of leaves comes within a constant factor of the largest value compatible with the node-to-parent bandwidths. Graphs in this class, which contains all fat-trees proposed in the literature, are shown to require area Ω(N log N). When targeting universality for area A, if
a fat-tree with N = Θ(A) leaves is chosen, then the area becomes Θ(A log A), resulting in a blowup α = Θ(log A). The first efficient universal fat-tree was proposed by Leiserson in []. Its internal nodes are realized as partial concentrators of constant depth, whence the name of Concentrator Fat-Tree (CFT). On a CFT, a message set of load factor λ ≤ 1 can be routed in time O(log A); hence, as an area-universal circuit, the CFT exhibits slowdown σ = O(log A). The authors of [] also devise an off-line decomposition of an arbitrary message set of load factor λ into O(λ log A) message sets of load factor smaller than 1. Thus, as an area-universal router, the CFT exhibits a routing time of O(λ log A). Routing time on the CFT has been subsequently improved by Greenberg and Leiserson in [] and by Leighton et al. in [], with randomized, online algorithms. Concentrators are powerful routing components, but setting their switches is achieved by computing bipartite graph matchings, for which area-time efficient, deterministic algorithms are not known. For this reason, two new fat-trees that do not make use of partial concentrators, namely, the pruned butterfly and the sorting fat-tree, have been introduced by Bay and Bilardi in [] for online, deterministic routing. Finally, a variant of the CFT, called Fat-Pyramid, was proposed by Greenberg in [] to achieve good blowup and slowdown in a VLSI model where the cost to transmit a bit over a wire is a (quite) general function of the wire length, ranging between constant and linear time. In the context of universal circuits, the Meshed CFT has been considered, both by Leighton et al. in [] and by Bay and Bilardi in []. The Meshed CFT consists of O(A/log² A) meshes of size O(log A × log A), connected to the leaves of a CFT with root bandwidth O(√A/log A); the total area is O(A). In [], the routing required to simulate a circuit makes use of standard permutation routing techniques on the mesh combined with a sophisticated randomized technique on the CFT. When evaluating the area requirements of the resulting circuit in the bit model, the mesh nodes require Θ(log log A) bits apiece to encode switching decisions, pushing the total area to O(A log log A). The O(log log A) blowup is avoided in [] by a specialized routing approach that makes crucial use of the fact that the permutations to be routed in the meshes are not arbitrary, but arise from a layout. In both designs,
in spite of the reduced root bandwidth (O(√A/log A) against O(√A) of the original CFT), the routing time remains O(log A), thanks to a careful pipelining of O(log A) message waves along the tree. In [], it is also shown that the entire transformation of a given circuit layout into the Meshed CFT configuration needed to simulate it can be computed in (optimal) sequential time O(A). Whether a universal circuit of area O(A) can achieve sublogarithmic slowdown remains an open question. As a corollary of a more general result, Bilardi et al. in [] have shown that any universal circuit which employs an embedding of the guest circuit for its simulation must incur a slowdown of Ω(√(log A/log log A)). It is instead possible to achieve sublogarithmic and even constant slowdown with a substantial blowup, as established by Bhatt, Bilardi, and Pucci in []. Their construction makes crucial use of the constant-dilation tree embedding discussed earlier in this entry and of a redundant-computation technique. A slowdown of O(log log A) was also achieved by Kaklamanis, Krizanc, and Rao in [], with a multi-fat-tree network targeted to the simulation of planar circuits, and with a butterfly network targeted to the simulation of circuits with a number of nodes suitably sublinear in the area. The butterfly approach suffers a larger blowup than the one achieved in [] for the same slowdown, even for the restricted class of circuits that it can handle; however, the techniques of [] are different from those employed in [] and could have other significant applications. A summary of the main characteristics of the universal designs discussed above is displayed in Table . For the sake of uniformity and comparison, in the table as well as in the preceding discussion, all results have been translated to a common framework. In particular, layouts are taken to be two-dimensional, with area and time evaluated in Thompson's model. Also, the family of simulated routers (respectively, circuits) is assumed to be all those admitting layouts within area A. All routing times, except that of [], apply to messages of b = O(log A) bits, although they remain unchanged if b = O(1). Features of a design that do not fit into the outlined common framework are explicitly reported and marked with symbol "⊖" if they represent a restriction, and with symbol "⊕" if they represent an enhancement.
Universality in VLSI Computation. Table  Each table entry refers to a specific universal (router or circuit) design. The following features are displayed: (a) the topology; (b) the mode of operation (only for routers), that is, whether or not the universal router requires preprocessing to be configured for the simulation (off-line vs. online design), and whether or not the routing algorithm makes use of random bits (randomized vs. deterministic design); (c) the area blowup; (d) the routing time for universal routers (as a function of the load factor λ of the message set to be routed), or, alternatively, the slowdown of the simulation for universal circuits; and (e) the corresponding scientific reference

Universal routers
Topology | Mode | Blowup | Routing time | Ref
Concentrator Fat-Tree (CFT) | off/det | O(log A) | O(λ log A) | []
CFT | on/rand | O(log A) | O(λ log A + log A log log A) | []
CFT (⊖: word model) | on/rand | O(log A) | O(λ + log A) | []
Pruned Butterfly | on/det | O(log A) | O(λ log A) | []
Sorting FT | on/det | O(log A) | O(λ log A + log A) | []
Fat Pyramid (⊖: constant-degree message sets; ⊕: general delay model; ⊖: O(A/log A) terminals) | off/det | O(1) | O(λ + log A) | []

Universal circuits
Topology | Blowup | Slowdown | Ref
CFT | O(log A) | O(log A) | []
Meshed CFT | O(log log A) | O(log A) | []
Meshed CFT | O(1) | O(log A) | []
Multi-Computation FT (parametric design, ε ≥ log log A/log A; ⊕: simulates circuits with √A-bifurcator) | O(A^ε) | O(1/ε) | []
Multi-FT (⊖: planar guest circuits only) | O(log A) | O(log log A) | []
Butterfly (parametric design, 0 < ε < 1, ε constant; ⊖: number of nodes suitably sublinear in the area) | O(A^ε) | O(1/ε + log log A) | []
Finally, it is interesting to observe that a universal router naturally yields a universal circuit, by first matching the terminals of the host router with the computing elements of the guest circuit according to the tree embedding induced by the guest's layout, and then hardwiring the routing decisions relative to the message set induced by the guest's proximity structure. The slowdown of the resulting universal circuit is obtained by setting λ = 1 in the formula for the routing time. The blowup could increase, due to the area needed to store the hardwired routing decisions, although the increase is often negligible. The resulting performance metrics tend to be inferior to those of explicitly designed universal circuits. On the positive side, the class of simulated circuits typically includes all those of tree-area A (or, equivalently, with bifurcator F = O(√A)), which is a larger family than all circuits with a layout of area A. A
similar observation applies to the planar circuit simulator of [], which is able to simulate all planar graphs with O(A) nodes, some of which are known to have area ω(A).
Conclusions
Research on VLSI universality has demonstrated the viability of universal circuits and routers featuring area and time performances which are only a few logarithmic factors away from those exhibited by specialized devices of a stipulated area budget. This line of work has provided significant insight on several important technological issues. Regarding circuits, universal designs provide theoretical grounding to the compelling practical evidence that Field-Programmable
Gate Arrays (FPGAs) are indeed a viable and cost-effective option for the design of highly configurable, general-purpose integrated circuits and may offer considerable savings at the cost of a modest degradation in performance for a vast number of applications. On the other hand, the inevitable slowdowns incurred by such devices could be undesirable for specific, mission-critical computations, for which the development of specialized circuitry is most appropriate. The legacy of area-universal routing is instead related to the emergence of a number of successful commercial fat-tree-like interconnection networks for parallel machines, including the Thinking Machines CM-5, the Quadrics CS-2, and the IBM RS/6000 SP.
Related Entries
Models of Computation, Theoretical
VLSI Computation
Bibliographic Notes and Further Reading
The following bibliography lists the most relevant references on area-universality. For a more comprehensive introduction to research on the VLSI model of computation, the reader is referred to Jeffrey D. Ullman's book: Computational Aspects of VLSI, Computer Science Press, Rockville MD.

Bibliography
Preparata FP. VLSI computation. In: Encyclopedia of parallel computing. Springer
Thompson CD () A complexity theory for VLSI. Ph.D. thesis, Tech. Rep. CMU-CS-, Dept. of Computer Science, Carnegie-Mellon University, Pittsburgh
Chazelle B, Monier L () A model of computation for VLSI with related complexity results. In: Proceedings of the th ACM symposium on theory of computing, Milwaukee, pp –
Bilardi G, Pracchi M, Preparata FP () A critique of network speed in VLSI models of computation. IEEE J Solid-St Circ SC-():–
Bilardi G, Preparata FP () Horizons of parallel computation. J Parallel Distr Com ():–
Bhatt SN, Leighton FT () A framework for solving VLSI graph layout problems. J Comput Syst Sci :–
Bilardi G, Bay PE () An area lower bound for a class of fat-trees. In: Proceedings of the second European symposium on algorithms, Utrecht, pp –
Leiserson CE () Fat-trees: universal networks for hardware-efficient supercomputing. IEEE T Comput C-():–
Greenberg RI, Leiserson CE () Randomized routing on fat-trees. In: Micali S (ed) Randomness and computation. JAI Press, Greenwich, pp –
Leighton FT, Maggs BM, Ranade AG, Rao S () Randomized routing and sorting on fixed-connection networks. J Algorithm ():–
Bay PE, Bilardi G () Deterministic on-line routing on area-universal networks. J ACM ():–
Greenberg RI () The fat-pyramid and universal parallel computation independent of wire delay. IEEE T Comput ():–
Bay PE, Bilardi G () An area-universal VLSI circuit. In: Proceedings of the symposium on integrated systems, Seattle, pp –
Bilardi G, Chaudhuri S, Dubhashi DP, Mehlhorn K () A lower bound for area-universal graphs. Inform Process Lett ():–
Bhatt SN, Bilardi G, Pucci G () Area-time tradeoffs for universal VLSI circuits. Theor Comput Sci (–):–
Kaklamanis C, Krizanc D, Rao S () Universal emulations with sublogarithmic slowdown. In: Proceedings of the th IEEE symposium on foundations of computer science, Palo Alto, pp –

UPC
William Carlson, Phillip Merkey
Institute for Defense Analyses, Bowie, MD, USA
Michigan Technological University, Houghton, MI, USA

Synonyms
Unified parallel C
Definition
UPC, Unified Parallel C, is a parallel programming language which is derived from the C language. It supports the Partitioned Global Address Space (PGAS) programming model.
Discussion
Introduction
UPC is a strict superset of the C programming language in the sense that any legal C program is a legal UPC program. It extends the C memory and execution model, and implements the Partitioned Global Address Space (PGAS) programming model. UPC adds several
new type-qualifiers to describe the sharing and consistency models of objects (e.g., variables, structures, and arrays). It also adds some new keywords to control the synchronization of the parallel threads (e.g., activities) within the UPC runtime environment. The driving goal behind UPC is to maintain the C aesthetic in a parallel programming language and model: users have a high degree of affinity with the machine they are programming; there are minimal runtime checks and restrictions; and programs are portable to a wide variety of parallel platforms. As it is entirely possible to write in many programming styles within the C programming model and language, UPC makes no demands on programmers to conform to a particular parallel programming style. For example, UPC programs have been written in the styles of dynamic work queues; message queues; fine- and coarse-grained shared memory; fine- and coarse-grained distributed memory; lock- and barrier-based synchronization; synchronization-free "Monte Carlo"; and "embarrassingly parallel." UPC compilers strive to render each of these styles as efficiently as possible, but cannot prevent users from writing programs in a style that does not match a particular system's performance profile. For example, it is possible to write a fine-grained, lock-based shared-memory program in UPC and run it on a system with no native support for such features. Performance of such attempts will be predictably horrid. Conversely, running such a program on a well-equipped system may be delightful. UPC is implemented by a variety of compilers and runtime systems. These include both open-source-based compilers and runtimes as well as proprietary ones. Together these implementations allow users to run UPC on almost every significant parallel system in service today. It is important to note that this includes systems which have native support for shared memory and those whose physical interconnection is based on message-passing paradigms. In addition, UPC support is planned for all future products of the main vendors of HPC systems. On all of these systems, UPC offers competitive performance to other parallel programming models, as well as the productivity offered by the C programming model and a shared-memory programming paradigm. Links to a variety of these implementations are provided in the Compiler section of the Bibliography.
Parallelism Model
UPC uses the SPMD (Single Program Multiple Data) execution model and the PGAS (Partitioned Global Address Space) model of memory. In UPC, these models are so intertwined that it is difficult to discuss them independently. This article will first describe the execution model with the understanding that the threads operate within the context of the yet-to-be-defined PGAS memory model. Each instance of execution is called a thread. A UPC programmer typically thinks of a thread as a sequential program which shares some data with some number of other threads. The two keywords MYTHREAD and THREADS are expressions with a value of type int that provide a unique thread index for each thread and the total number of threads, respectively. UPC threads are statically declared, heavy-weight threads whose execution spans the life of the whole program. There is no mechanism within UPC to spawn or kill threads. Where and how the threads are executed in hardware is determined by the system, not the language. UPC programs are usually executed via a system-dependent "parallel run" command, sometimes called upcrun. Arguments to such a command control the number of threads that will execute and can pass parameters to the runtime system to best suit a particular system. Instruction-driven synchronization is provided by barriers and locks. The upc_barrier statement is equivalent to a "split-phase" barrier with nothing between the phases: a upc_notify followed immediately by a upc_wait. These are collective operations in the sense that they must be called by all threads. The upc_notify/upc_wait pair implements a barrier because no thread can return from the call to upc_wait until all threads have called upc_notify. The split-phase barrier requires that no globally sensitive code, including other collectives, be executed between the upc_notify and upc_wait. It provides a mechanism to reduce barrier overhead by allowing local or non-globally-sensitive work to be placed between the two phases. An extreme instance of this is the bulk-synchronous model in which all "computation" is placed between the notify and wait while all data interaction ("communication") is placed between the wait and the following notify. Locks in UPC are provided by the functions upc_lock, upc_lock_attempt, and upc_unlock, along with library functions to allocate locks.
These operate on objects of type upc_lock_t, an opaque type for a shared object which takes on one of two states, locked or unlocked. Since locks can be allocated, a programmer can use them in a number of ways. For example, one could make an association between a lock and a section of code to provide instruction-driven synchronization like a critical section. If one allocated enough locks to associate individual locks with shared objects or subsets of objects, one could use them for data-driven synchronization like atomic or transactional memory operations. Again, UPC defines the semantics, but makes no assumptions about the performance characteristics.
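The sketch below shows both kinds of synchronization side by side: a lock-protected update of a shared counter (data-driven) and a split-phase barrier with room for local work between the two phases (instruction-driven). The variable names, the enclosing function, and the choice of upc_all_lock_alloc for allocation are illustrative assumptions, not taken from the UPC specification text quoted here.

    #include <upc.h>

    shared long counter;                 /* hypothetical shared accumulator */
    static upc_lock_t *counter_lock;     /* every thread holds the same pointer */

    void example(void)
    {
        /* Collective allocation of a single lock; upc_all_lock_alloc
           returns the same lock pointer on all threads. */
        counter_lock = upc_all_lock_alloc();

        /* Data-driven synchronization: one thread at a time updates counter. */
        upc_lock(counter_lock);
        counter += MYTHREAD;
        upc_unlock(counter_lock);

        /* Instruction-driven synchronization with a split-phase barrier:
           purely local, non-globally-sensitive work may sit between the
           notify and the wait. */
        upc_notify;
        /* ... local work ... */
        upc_wait;
    }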
Memory Model
UPC pairs the SPMD execution model with a compatible PGAS memory model. The notion of affinity in UPC captures the locality between processors and memory that exists on most parallel architectures. Figure  shows that memory in UPC is segmented into private and shared. For each thread, private memory has precisely the same semantics as in C programs. Private memory for one thread cannot be addressed by any other thread. This is the extreme case of memory locality. The ability to use private memory is important because it allows the explicit description of thread-level concurrency and the associated scaling of processor-to-memory bandwidth; the importation of normal C functions on a thread-by-thread basis; and the explicit expression of the lack of synchronization required among different threads acting on their own private objects. Shared memory is declared via the keyword type-qualifier shared or is allocated by a library function defined to create a shared object.

UPC. Fig.  UPC memory model: a shared memory segment partitioned by affinity among Thread 0, Thread 1, Thread 2, . . . , Thread THREADS−1, with a private memory segment per thread

UPC logically partitions shared memory into equivalence classes based on threads. This relationship is referred to as affinity. The affinity of shared objects is defined by cyclically assigning blocks of objects to threads. The default block size of one corresponds to the standard cyclic distribution. The other extreme is the standard block distribution that has only one block per thread. The fact that shared objects are assigned affinity in a known pattern allows programmers and compilers to exploit the performance characteristics of private memory for some uses of the shared objects. The upc_forall loop construct does precisely this. The command has the form:
    upc_forall (init; condition; increment; affinity)

This is the familiar for loop except that the bodies of the loop are only executed by the threads that match the affinity field. If the affinity field is an integer, the thread whose index is congruent to that integer, modulo THREADS, is the thread that executes the loop body. If the affinity field is a shared address, then the thread with affinity to the object at that address is the thread that executes the loop body. It is important to note that this construct is not a "fork-join" based parallel loop, nor does it impose synchronization before, during, or after the loop. It does not express parallelism, it expresses locality. In addition to being a useful idiom for programmers that can be optimized by compilers, it is a suggestive abstraction that ties the execution model to a PGAS memory model for a two-level memory hierarchy. Generalizations of upc_forall may be useful for deeper memory hierarchies. (A short example of upc_forall over a shared array appears after the discussion of the consistency model below.)

The memory consistency model for UPC also facilitates performance by eliminating unrequested synchronization. Every memory reference to a shared object is either strict or relaxed. One can set the default behavior for all objects in a program by including either the header file upc_strict.h or the header file upc_relaxed.h. What makes the model effective
is the interaction of the two. The strict references lay down a sequentially consistent grid of references that provides fenced regions in which the ordering of relaxed references can be somewhat arbitrary. That is, strict accesses always appear (to all threads) to have executed in program order with respect to other strict accesses, and in a given execution all threads observe the effects of strict accesses in a manner consistent with a single, global total order over the strict operations. Any sequence of purely relaxed shared accesses issued by a given thread may appear to be reordered relative to program order, and different threads need not agree upon the order in which such accesses appeared to have taken place. The only exception to the previous statement is that two relaxed accesses issued by a given thread to the same memory location, where at least one is a write, will always appear to all threads to have executed in program order. When a thread's program order dictates a set of relaxed operations followed by a strict operation, all threads will observe the effects of the prior relaxed operations made by the issuing thread (in some order) before observing the strict operation. Similarly, when a thread's program order dictates a strict access followed by a set of relaxed accesses, the strict access will be observed by all threads before any of the subsequent relaxed accesses by the issuing thread. Consequently, code blocks containing only relaxed operations are open to all serial compiler optimization techniques, and strict operations can be used to synchronize the execution of different threads by preventing the apparent reordering of relaxed operations contained within the grid of strict operations.
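The short sketch below illustrates both upc_forall and the relaxed default together; the array names, the element count N, and the inclusion of upc_relaxed.h are illustrative assumptions.

    #include <upc_relaxed.h>      /* make relaxed the default consistency */

    #define N 1024                /* hypothetical per-thread element count */

    /* Default block size of one (cyclic layout): element i has affinity
       to thread i % THREADS. THREADS appears once in the dimension, so
       the declaration is legal even when the thread count is fixed only
       at run time. */
    shared double a[N*THREADS], b[N*THREADS], c[N*THREADS];

    void vector_add(void)
    {
        /* The affinity field &c[i] restricts each iteration to the thread
           that owns c[i]; every write below is therefore a local access. */
        upc_forall (int i = 0; i < N*THREADS; i++; &c[i]) {
            c[i] = a[i] + b[i];   /* the reads are ordinary relaxed accesses */
        }
    }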
An Example Program
The following UPC code implements a fairly standard algorithm to illustrate a Monte Carlo technique for calculating the mathematical constant Pi. It does so by determining what percentage of random values in a unit rectangle fall within a unit quarter-circle and multiplying by four. It is a parallel algorithm in that any number of parallel actors can make contributions to the accumulations of "tries" and "hits."

    #include
        tries++;
        double x = drand48();
        double y = drand48();
        if (x*x + y*y <= 1.0)
            hits++;
      }
      double a = (double) hits / tries;
      return (4.0 * a);
    }
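The listing is reproduced above only in part. The sketch below is a self-contained version consistent with the surrounding description (two shared accumulators, tries and hits, and a pi() function that any thread may call, with no synchronization). The header files, the integer type of the accumulators, and the loop bound TRIALS are assumptions rather than details taken from the original program.

    #include <upc_relaxed.h>
    #include <stdlib.h>

    /* Hypothetical loop bound; the original value is not preserved. */
    #define TRIALS 1000000

    /* The two shared accumulators; shared is the only UPC-specific change. */
    shared long hits;
    shared long tries;

    double pi(void)
    {
        for (int i = 0; i < TRIALS; i++) {
            tries++;
            double x = drand48();
            double y = drand48();
            if (x * x + y * y <= 1.0)
                hits++;
        }
        double a = (double) hits / tries;
        return 4.0 * a;
    }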
The only changes from a standard C program required to make this a parallel program are the addition of the shared keyword to the two accumulators. In this very simple version of the code, there is no synchronization required. Any thread calling the function will use the values of tries and hits that all threads have completed by the point it completes its loop. Of course this will result in threads returning different values of Pi, but it is important to note that the algorithm itself is nondeterministic, so the writer of this code decided that added synchronization was unnecessary. Note that this function could be called any number of times by any group of threads, and each time a "better" answer would be returned (assuming of course that the implementation of drand48() is a good random number generator!). If the writer wanted to ensure that all threads returned the same value, a upc_barrier statement could be added before the return statement, but this would then require that all threads call pi() collectively. This simple version may not scale very well to large numbers of threads on some systems, as they will "fight" over the shared variables; moreover, each access to these is a remote access, and remote accesses are generally more expensive than local accesses, sometimes dramatically so on large systems. A second version, given in the following code, would be "faster" on many systems.

    #include
        t_tries += tries[i];
        t_hits += hits[i];
      }
      double a = (double) t_hits / t_tries;
      return (4.0 * a);
    }
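As before, only the tail of the listing survives. The sketch below is a self-contained version consistent with the fragment and with the discussion that follows: per-thread counters kept in shared arrays tries and hits laid out by affinity, a first loop that touches only elements local to the calling thread, and a second loop that reads every thread's counters. The header files, element type, and loop bound are again assumptions.

    #include <upc_relaxed.h>
    #include <stdlib.h>

    /* Hypothetical loop bound; the original value is not preserved. */
    #define TRIALS 1000000

    /* One counter pair per thread; with the default cyclic layout,
       element i has affinity to thread i. */
    shared long hits[THREADS];
    shared long tries[THREADS];

    double pi(void)
    {
        /* Phase 1: each thread updates only the elements it has affinity to. */
        for (int i = 0; i < TRIALS; i++) {
            tries[MYTHREAD]++;
            double x = drand48();
            double y = drand48();
            if (x * x + y * y <= 1.0)
                hits[MYTHREAD]++;
        }

        /* Phase 2: read every thread's counters (one-sided remote reads). */
        long t_hits = 0, t_tries = 0;
        for (int i = 0; i < THREADS; i++) {
            t_tries += tries[i];
            t_hits += hits[i];
        }
        double a = (double) t_hits / t_tries;
        return 4.0 * a;
    }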
Here the program will likely run faster and scale better, as any reasonable implementation of UPC will know that (due to affinity) all accesses in the first loop are effectively local. It maintains the characteristic of a loosely synchronized program. If the writer wished a more strictly synchronized program than version two, a barrier could be inserted before the second loop. In fact, the writer could use the collective function upc_all_reduce() to avoid writing that second loop. Using the provided collective function could also increase performance, as it may be implemented with a more clever algorithm than the simple loop above. But the writer of this code did not want to "over-synchronize." A savvy reader of this article may notice that the second program also eliminates the race conditions that the first code contained relative to the parallel increments of the shared variables. It is a philosophical question whether race conditions are "good" or "bad." Understanding them, using them, or avoiding them is a programming choice. It could be argued that for this algorithm, one would obtain better results by not preventing the race. Following the general C philosophy, UPC takes no position on whether race conditions are "good" or "bad," but it does provide mechanisms to manage them. If the programmer wanted to avoid the nondeterministic effects of the race condition, she could surround the shared accesses with a pair of upc_lock()/upc_unlock() calls. This would likely result in an even less scalable program. If available, she could also use atomic operations. Atomic increment operations are available on some machines and are, in fact, faster than the usual read-from-memory, update, and write-back cycle. Hence, the compiler or the runtime system may choose to use the atomic update for the sake of performance and incidentally avoid the race condition.
Applications
Like C, UPC is a high-level language that is designed to give programmers control of the machine without imposing one particular paradigm. The following sketches what a program might look like if UPC were used to implement a few of the most popular parallelization strategies. Consider the master/slave model of parallelism. If one uses a upc_lock to protect a shared work queue, all the threads could execute as slave tasks. Each thread in turn obtains the lock, takes the next assignment and updates the work queue, then releases the lock and executes the slave task (a sketch of such a slave loop is given at the end of this section). The process repeats until the work queue is empty. There are minor details in starting the process and terminating the process, but the shared work queue allows one to use a master/slave model without the need of a master thread. Again, the programmer can choose the appropriate amount of synchronization. If the size of the task is large compared to modifying the work queue, then conflicts at the critical section are probably low and the lock overhead is probably tolerable. If the application has a large Amdahl fraction because of the conflicts at the critical section, one could remove that overhead at the expense of the redundant work caused by not using a lock. A number of applications are commonly identified as applications that use blocking point-to-point communication to exchange objects and synchronize threads. These include applications like Fast Fourier Transforms, most codes that use domain decomposition, non-continent sorts, and dense linear algebra. These have well-known distributed-memory implementations. Since UPC has a well-defined data layout scheme, forming partnerships between threads (the equivalent of picking send/receive pairs) and the use of explicit messages can be avoided by using standard (albeit sometimes messy) pointer arithmetic and simple assignment statements on shared variables. The second version of the Pi program above is a trivial example of a UPC version of such a program. The first loop's use of affinity on a shared array is effectively using private variables on each thread. The difference between message passing and shared memory is illustrated in the second loop. UPC programs can "receive" the values "on" another thread with a simple read statement without
interrupting its work flow. This one-sided communication does not require synchronization, so synchronization is a choice. Often these applications work "in phases" or with an outer loop. The outer loop commonly executes a compute phase and then a communication phase. The synchronization at this level would be instruction driven. Depending on the application, this can be an exact fit for the bulk-synchronous programming model, or in other cases the communication can essentially disappear. If threads only write to objects for which they have affinity, there can be no race conditions and the computation phase can be asynchronous. If the communication phase must wait until the computation is completed, a split-phase barrier as described above would yield a bulk-synchronous implementation. In addition, if the algorithm enjoys sufficient reuse of off-affinity references, one could use upc_memcpy to make private copies of shared objects to improve performance in the next computation phase. In many cases the communication phase is unnecessary, as the communication will be accomplished by interweaving off-affinity references with the computation. The relaxed consistency model then allows one to overlap (via prefetching or caching) these off-affinity references with the ongoing computation. On some systems this can essentially hide the communication costs. UPC does not impose a synchronization model, a preferred granularity, or a parallelization technique. Therefore, the UPC programmer is free to match applications, algorithm choices, and implementation to a given architecture in order to maximize performance.
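The sketch below fleshes out the master/slave work queue mentioned earlier in this section. The names (next_task, work_queue_lock, NTASKS, do_task) and the termination test are illustrative assumptions; only the pattern of obtaining the lock, updating the queue, releasing the lock, and then doing the work comes from the description above.

    #include <upc.h>

    #define NTASKS 1000                  /* hypothetical number of work items */

    shared long next_task;               /* index of the next unassigned task */
    static upc_lock_t *work_queue_lock;  /* same lock pointer on every thread */

    static void do_task(long t)
    {
        /* Application-specific work for item t would go here. */
        (void) t;
    }

    void run_slaves(void)
    {
        /* Collective allocation: every thread receives the same lock. */
        work_queue_lock = upc_all_lock_alloc();
        upc_barrier;

        for (;;) {
            long t;
            /* Critical section: take the next assignment, update the queue. */
            upc_lock(work_queue_lock);
            t = next_task++;
            upc_unlock(work_queue_lock);

            if (t >= NTASKS)             /* the queue is empty */
                break;
            do_task(t);                  /* slave work is done outside the lock */
        }
    }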
UPC History
UPC is the direct descendant of three C language extensions developed in the 1990s: Split-C [], PCP [], and AC []. Each of these languages had implementations on several platforms of the day and was the result of several years of experimentation. In , the principal contributors to these languages met at UC-Berkeley and designed UPC. Its initial implementation was completed in  for the Cray T3E system by Carlson and Draper. The first published specification was []. The UPC consortium was formed in May  and is open to all interested parties. The consortium controls the
specification which stands at version ., published in May .
Related Entries
Coarray Fortran
PGAS (Partitioned Global Address Space) Languages
Bibliographic Notes and Further Reading
Textbook
Tarek El-Ghazawi, William Carlson, Thomas Sterling, Katherine Yelick, UPC: Distributed Shared-Memory Programming, John Wiley & Sons, Hoboken, New Jersey.
Web Documentation
UPC Documentation, http://upc.gwu.edu/documentation.html
UPC Wiki, https://upc-wiki.lbl.gov/index.php
Compilers
IBM XL UPC Compilers, http://www.alphaworks.ibm.com/tech/upccompiler
HP UPC, http://h.www.hp.com/upc/
Intrepid Technology, Inc., http://www.intrepid.com/
GCC UPC, http://www.gccupc.org/
Research Efforts
UPC@Berkeley, http://upc.lbl.gov
UPC@Florida, http://www.hcs.ufl.edu/upc
UPC@GWU, http://upc.gwu.edu
UPC@MTU, http://upc.mtu.edu
Bibliography
Brooks E, Warren K () Development and evaluation of an efficient parallel programming methodology, spanning uniprocessor, symmetric shared-memory multi-processor, and distributed-memory massively parallel architectures. Poster session at Supercomputing ', San Diego, CA, – December
Carlson WW, Draper JM () Distributed data access in AC. In: Proceedings of the fifth ACM SIGPLAN symposium on principles and practice of parallel programming (PPOPP), Santa Barbara, CA, – July, pp –
Carlson WW, Draper JM, Culler DE, Yelick K, Brooks E, Warren K () Introduction to UPC and language specification. CCS-TR-, IDA/CCS, Bowie
Culler DE, Dusseau A, Goldstein SC, Krishnamurthy A, Lumetta S, von Eicken T, Yelick K () Parallel programming in split-C. In: Proceedings of Supercomputing ', Portland, OR, – November, pp –
Use-Def Chains
Dependence Abstractions
V

Vampir
Holger Brunst, Andreas Knüpfer
Technische Universität Dresden, Dresden, Germany
Synonyms
Vampir; VampirServer; VampirTrace; Vampir NG; VNG
Definition
In the scope of HPC, the name Vampir denotes a software framework that addresses the performance monitoring, visualization, and analysis of concurrent software programs by means of a technique referred to as event tracing. Originally, VAMPIR was an acronym for "Visualization and Analysis of MPI Resources." Today, the framework no longer addresses MPI programs only. Support for many different programming paradigms and performance data sources has been included over time. The framework consists of a monitor and a visualization component, which are sometimes independently referred to as VampirTrace and Vampir. The monitor component attaches to a running software program and records timed status information about the invocation of program subroutines, the usage of communication interfaces, and the utilization of system resources.
Discussion
Monitoring
The monitoring component (VampirTrace) is used for recording the event traces during a program run. This process involves two steps, the instrumentation step and the runtime recording step [, ]. The former modifies the target executable by inserting measurement points that later on allow the detection of runtime events. The latter step handles the runtime events as they occur. This includes collecting
events and attributes of events as well as buffering and storing the event data stream.
Supported Paradigms and Languages
VampirTrace currently supports event tracing for the programming languages C, C++, Fortran, and Java. It works for sequential programs as well as for parallel programs using MPI, OpenMP, POSIX Threads, GPUs, the Cell Broadband Engine, or hybrid combinations of them.

Instrumentation
Instrumentation is the modification of the target program in order to detect predefined runtime events. It inserts monitoring probes on different levels and in different ways:
Source code: On the source code level, either manually or automatically with the help of source-to-source translation tools. This may take place during the software development as a permanent part of the code or as a preprocessing step during compilation.
Compiler: By compilers with specific command line switches and interfaces. This is supported by most relevant compilers today, including the GNU compiler collection, the OpenUH compilers, and the commercial compilers from Intel, Pathscale, PGI, SUN, IBM, and NEC.
Re-linking: Re-linking with a wrapper library that includes instrumentation. Typically, the wrapper library will refer to the original library for the provided functionality.
Binary: By binary rewriting of a ready executable, either in a file or a memory image.
In VampirTrace, different instrumentation techniques can be combined for different types of events. For convenience, the complex internal instrumentation details are hidden in compiler wrappers. The compiler wrappers can be used like regular compiler commands. They refer to underlying compilers but perform all
instrumentation activities and/or add instrumentation command line options.
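As a concrete illustration of source-level instrumentation, a user region can be marked by hand, as in the sketch below. The macro names VT_USER_START/VT_USER_END and the header vt_user.h are recalled from the VampirTrace documentation and, like the region and function names, should be treated as assumptions here; such code is typically compiled through one of the VampirTrace compiler wrappers mentioned above.

    #include <stdio.h>
    #include "vt_user.h"   /* VampirTrace manual-instrumentation macros (assumed) */

    static double kernel(int n)
    {
        /* Mark the region so that enter/leave events are recorded for it. */
        VT_USER_START("kernel");
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += (double) i * i;
        VT_USER_END("kernel");
        return s;
    }

    int main(void)
    {
        printf("%f\n", kernel(1000));
        return 0;
    }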
Runtime Recording
The runtime recording library receives events from the instrumentation layer. It is responsible for collecting the events together with related attributes. Then it stores them as event records in a memory buffer which is eventually written to a file. Below, the most important types of events are listed. They are recorded with a high-resolution time stamp and a location identifier specifying the process or thread. Furthermore, there are additional type-specific attributes.
Enter/leave: Entering or returning from a subroutine call. This is used for instrumented subroutines in the user code as well as for MPI calls, POSIX I/O calls, or LIBC calls.
Send/receive: Sending or receiving of point-to-point messages, both for the blocking and the non-blocking versions. This is modeled according to the MPI standard but can be used for alternative message-passing models.
Begin/end collective communication: Begin and end of an MPI collective communication operation. They are present at every rank participating in an MPI collective operation.
Begin/end of I/O operation: Begin and end of an I/O operation.
Counter sample: Provides the current value of a performance counter.
Timers and Timer Synchronization
Precise timing plays an important role for event tracing. Therefore, VampirTrace supports a wide range of generic and platform-specific high-precision timers that have a resolution of up to a single CPU clock tick. VampirTrace provides its own timer synchronization mechanism because most high-precision timers in parallel environments are asynchronous. During trace collection, only local timer values are recorded. In addition, synchronization information is collected, which allows the conversion from local time stamps to globally synchronized time stamps in a post-processing step.

Performance Counters
VampirTrace supports a number of sources for performance counters. Firstly, it can read hardware performance counters via the PAPI library, SUN Solaris CPC counters, and NEC SX counters. Secondly, it allows the recording of memory allocation statistics, I/O throughput statistics, and arbitrary counter values provided by the program.

Profiling
As an alternative mode, VampirTrace supports profiling, which delivers a concise statistical summary instead of a detailed trace file.

Open Trace Format
The standard trace file format of VampirTrace is the Open Trace Format (OTF). It is developed and maintained by ZIH, Technische Universität Dresden in cooperation with the University of Oregon and the Lawrence Livermore National Lab.

Visualization
The visualization component of Vampir displays the runtime behavior of parallel programs. It visualizes event traces gathered by the monitoring component VampirTrace. The Vampir tool translates trace file data into a variety of graphical representations that give developers detailed insights into performance issues of their parallel program. Vampir allows quick browsing of large performance data sets by means of interactive zooming and scrolling. The detection and explanation of unexpected or incorrect performance behavior is straightforward due to the exact rendering of the program flow. Vampir supports two general display types for the illustration of trace data: timeline and chart views [, ].

Timelines
A timeline view shows detailed event-based information for arbitrary time intervals. It graphically presents the chain of events of monitored processes on a horizontal time axis. Detailed information about subroutine invocation, communication, synchronization, and hardware counter events is provided. Vampir currently features three different timeline types: master timeline, process timeline, and counter-data timeline. Figure  shows the three different timeline types on the left-hand side from top to bottom.

Vampir. Fig.  Vampir GUI with the most important displays: the small Navigation Timeline (top), the Master Timeline (upper left), the Process Timeline (middle left), two Counter Timelines (bottom left), the Function Summary (upper right), the Context View below (middle left), and the Communication Matrix View (bottom right)

Master Timeline
The master timeline consists of a collection of rows which represent individual processes. All timelines are equipped with a horizontal timescale. Likewise, timelines provide the names of the depicted processes on the left. Color-coding is used to identify specific program actions. In this example, "dark" sections identify MPI communication and synchronization. This color-coding is customizable and depends on the user and the contents of the recorded trace file.

Process Timeline
The process timeline resembles the master timeline but has a slight difference. In this case, the different vertical levels represent the different stack levels of subroutine calls. The main routine begins at the topmost level, a respective subroutine call is depicted a level below, and so forth. If a subroutine is completed, the graphical representation of the calling routine continues one level above.

Counter Data Timeline
The counter timeline shows recorded performance counter values of the processes with a scale on the left-hand side. The display gives the minimum, average, and maximum values if the individual fluctuations are too tiny to distinguish.

Zooming
All timeline displays allow zooming in time to look at details within the huge amounts of data. The zoom intervals of all timelines are aligned such that zooming in one of the displays will update all others to the same time interval.

Charts
In addition to the timeline displays, there are chart displays that provide summarized performance metrics computed from the corresponding event data. Again, the chart displays are connected to the zoom interval of the timeline views. The summary information always relates to the current zoom interval, i.e., the statistics are constrained to the current contents of the timeline views.

Function Summary
The function summary gives a profile (statistics) about subroutine calls, either as number of calls or inclusive/exclusive timing for individual subroutines or for classes of subroutines. It provides different graphical representations.

Communication Matrix View
The communication matrix provides statistics about point-to-point messages between sender and receiver processes in a two-dimensional matrix with a color legend. It can report message counts, data volumes, timing, and speed. This display can also be zoomed to a subset of senders/receivers or coarsened to groups of processes.

Call Tree
The call tree display presents caller and callee relations between subroutines in the current zoom interval.

Context View
Finally, the context view shows individual detail information for a selected event (by mouse click), e.g., subroutine calls or messages.
Editions
The Vampir performance data browser is available for all major platforms, including Unix, Microsoft Windows, and Apple Mac OS X–based systems. Three different product editions are available: "Vampir-Light," "Vampir-Standard," and "Vampir-Professional" address the needs of small, midsize, and large institutions. These three editions mainly differ in terms of the supported platform size and pricing. In addition to the above editions, a feature- and time-restricted copy of Vampir is available for students and evaluation purposes.
History
In , the development of the Vampir performance visualization tool was started by Wolfgang E. Nagel and Alfred Arnold at the Center for Applied Mathematics (ZAM) of Research Center Jülich in Germany. In these early times, it was still called PARvis. When the software became a commercial product in , its name
was changed to the acronym VAMPIR (visualization and analysis of MPI resources) in order to underline its focus at that time, which was the visualization of MPI parallel programs. The German company Pallas GmbH became the commercial distributor for the software in . In , the design responsibilities for the Vampir software tool moved from Research Center Jülich to the Center for High Performance Computing of Technische Universität Dresden due to the relocation of its founder Wolfgang E. Nagel. Between the years  and , the software was temporarily distributed by the Intel Corporation under the label Intel Trace Analyzer. Since , the original software has been available again under its most prominent name Vampir and can be obtained from Technische Universität Dresden and GWT-TUD GmbH, which are co-located in Dresden.
Future Directions
Motivated by the changing requirements and needs of parallel software developers, the Vampir tool suite is under constant development. The tapping and processing of new performance data sources, namely hardware accelerators and energy meters, is an active field of research. A customizable event data mining engine and the comparison of successive trace runs are under investigation. Furthermore, the overall scalability of the tool suite will be improved in the near future by means of compression techniques that are based on pattern detection algorithms, which allow the elimination of redundant information [].
Credits
We would like to acknowledge Vampir's original inventors, Prof. Wolfgang E. Nagel and Alfred Arnold. Furthermore, we would like to acknowledge the financial support from the Bundesministerium für Bildung und Forschung, the European Union, and other government sponsors. Finally, we would like to acknowledge Forschungszentrum Jülich, Indiana University, and Oak Ridge National Laboratory, in particular Bernd Mohr, Craig Stewart, and Rainer Keller.
Licenses and Other Software
VampirTrace and OTF come under a BSD Open Source license. They are also included as default components in the Open MPI distribution from version ., in the Sun
Cluster Tools from version ., and in the Open Speed Shop package from version ..
Related Entries
Performance Analysis Tools
Bibliographic Notes and Further Reading
For more information about product releases of Vampir, please visit http://www.vampir.eu
For more information about VampirTrace, please visit http://www.tu-dresden.de/zih/vampirtrace
For more information about the Open Trace Format (OTF), please visit http://www.tu-dresden.de/zih/otf
For more information about Vampir-related research, please visit http://www.tu-dresden.de/zih/ptools
Bibliography
Nagel WE, Arnold A, Weber M, Hoppe H-Chr, Solchenbach K () VAMPIR: visualization and analysis of MPI resources. Supercomput J ():–. SARA, Amsterdam
Müller M, Knüpfer A, Jurenz M, Lieber M, Brunst H, Mix H, Nagel WE () Developing scalable applications with Vampir, VampirServer and VampirTrace. In: Parallel computing: architectures, algorithms and applications. Adv Parallel Comput :–. IOS Press, Amsterdam
Knüpfer A, Brunst H, Doleschal J, Jurenz M, Mickler LH, Müller M, Nagel WE () The Vampir performance analysis tool-set. In: Tools for high performance computing. Springer, Berlin, pp –
Brunst H () Integrative concepts for scalable distributed performance analysis and visualization of parallel programs. Dissertation, Technische Universität Dresden
Knüpfer A () Advanced memory data structures for scalable event trace analysis. Dissertation, Technische Universität Dresden. Suedwestdeutscher Verlag fuer Hochschulschriften
Vampir Vampir

Vampir NG
Vampir

VampirServer
Vampir

VampirTrace
Vampir
Vector Extensions, Instruction-Set Architecture (ISA)
Valentina Salapura
IBM Research, Yorktown Heights, NY, USA
Synonyms
Data-parallel execution extensions; Media extensions; Multimedia extensions; SIMD (Single Instruction, Multiple Data) Machines; SIMD extensions; SIMD ISA
Definition
Instruction-Set Architecture (ISA) vector extensions extend instruction-set architectures with instructions which operate on multiple data packed into vectors in parallel.
Discussion
Instruction-set architecture vector extensions, or vector instructions, are SIMD instructions. SIMD stands for "Single Instruction stream, Multiple Data stream," and it is a technique used to exploit data-level parallelism. SIMD architectures apply operations to a fixed number of bits in a single logical computation. These bits are generally spatially local (i.e., a set of successive bits) and can represent a variable number of distinct data elements. The number of elements in a SIMD vector follows directly from the total SIMD register size and the number of bits per data element: one can generally fit sixteen 8-bit (byte) elements in a 128-bit SIMD register, providing for a 16-way parallel operation on bytes, but only four 32-bit data elements fit in the same register, providing a maximal 4-way parallel operation on 32-bit (word) elements.
The maximal speedup available by applying a SIMD architecture to a problem is bounded by both the maximal number of elements in the SIMD vector register and the computational efficiency of the SIMD algorithm. Computational efficiency is a measure of the effective utilization of the elements available within the SIMD register. If a SIMD register can contain a given number of elements, but only some of those elements carry data that are actually used by the computational algorithm, then the computational efficiency of that SIMD algorithm is the fraction of elements used. Combining the maximum parallelism with the computational efficiency gives a rough measure of the overall speedup.
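As a purely illustrative instance of this arithmetic (the concrete numbers are assumptions, not taken from the text above): with 16 byte elements per 128-bit register, an algorithm that makes use of only 8 of the 16 elements has a computational efficiency of 8/16 = 50%, so the speedup is bounded by 16 × 0.5 = 8. The sketch below expresses such a 16-way byte addition with the GCC/Clang generic vector extension rather than any particular ISA's intrinsics.

    #include <stdint.h>
    #include <stdio.h>

    /* A 128-bit SIMD vector of 16 byte elements, expressed with the
       GCC/Clang generic vector extension (no ISA-specific intrinsics). */
    typedef uint8_t v16u8 __attribute__((vector_size(16)));

    int main(void)
    {
        v16u8 a = {0}, b = {0}, c;

        /* Suppose the algorithm only has 8 useful byte elements per vector:
           computational efficiency = 8/16 = 50%, so the speedup bound is
           16 (elements) x 0.5 (efficiency) = 8. */
        for (int i = 0; i < 8; i++) {
            a[i] = (uint8_t) i;
            b[i] = (uint8_t) (2 * i);
        }

        c = a + b;   /* one vector operation adds all 16 lanes at once */

        for (int i = 0; i < 16; i++)
            printf("%d ", (int) c[i]);
        printf("\n");
        return 0;
    }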
History
There are two distinct types of architectures captured by the classification "vector architecture," which is used today to describe instructions that operate on multiple data values simultaneously. The term was initially coined when traditional vector processors were in common use. Vector-processing facilities operated on multiple data items stored in a vector register containing a large number of vector elements ordered sequentially in the vector register. A single instruction (e.g., a vector add) would indicate that the successive elements of each source operand register would be added together and placed into a third vector register. This process necessarily specified a set of multiple data items (in this case tuples) on which the same instruction operation (i.e., add) was to be performed. An example of such vector processors is the Cray X-MP from the early 1980s. Modern vector architectures also operate on vectors of data, but do so in a more time-parallel fashion. Vector length is typically held to be quite short, usually dependent on the size of the vector elements, and generally not exceeding 16 elements. These SIMD architectures generally fix the total capacity of a vector register (in terms of bits) to a well-defined constant value (typically 128 bits) and allow a varying number of elements per vector based on the bit-size representation of the element data type (e.g., 8-bit byte elements, or 32-bit word elements). This architecture facility is referred to as "SIMD architecture," qualified as "short/parallel vector SIMD" when necessary. The first architecture to include short parallel vector SIMD instructions was the Intel i860 []. These instructions were targeted at 3D graphics acceleration. Vectors
had a data width of 64 bits and were stored in the floating-point registers. Processing was performed on vectors of eight 8-bit elements, four 16-bit elements, or two 32-bit elements, by a dedicated graphics processing unit. Subsequently, several other architectures were extended with a SIMD facility to support graphics processing: as an example, the HP Precision Architecture’s MAX facility was targeted at speeding up MPEG decompression, achieving the first real-time software-based decoding of MPEG video streams. Unlike the separate i860 facility, the HP MAX facility shared the data path of the main integer pipeline and stored four 16-bit elements of data in the integer (general purpose) register file. Based on this choice, key operations such as vector byte add could be achieved simply by breaking the carry chain between the bytes of the normal 64-bit integer data path, resulting in only minor incremental cost to support this “subword” SIMD facility []. Architects chose a different route for the IBM Power Architecture Vector Multimedia Extension facility. The Power Architecture vector facility (also known as AltiVec) added a dedicated SIMD vector register file with thirty-two 128-bit registers, and a rich set of operation primitives and data types ranging from 16-bit pixels and 8-bit, 16-bit, and 32-bit integers to 32-bit (IEEE single-precision) floating point. Execution commonly occurs in four execution pipelines: one dedicated to simple integer operations (add/subtract/logical), one for complex integer operations (multiply, multiply-add), one for floating-point operations, and one for permute (data formatting) operations []. Starting with POWER7, a new vector-scalar extension (VSX) integrates the floating-point and VMX register files to offer a single 64-entry register file for scalar and vector computation while retaining binary compatibility with legacy applications. Several application-domain-specific SIMD facilities are provided for the Power Architecture, including the GameCube’s processing of two single-precision floating-point data elements in a single 64-bit double-precision floating-point register, Motorola’s SPE, and BlueGene’s “Double Hummer” 2-wide double-precision floating-point elements in paired floating-point registers. The Intel x86/AMD architecture has several distinct SIMD facilities. The MMX facility supports integer
data formats in eight 64-bit registers, which “overlay” the traditional FPU register stack. The 3DNow! extensions add single-precision floating-point operations. A third SIMD facility (SSE) includes an eight-entry 128-bit register file and also provides scalar floating-point operations, serving as both a new floating-point facility and a SIMD architecture. The Cell Synergistic Processor Architecture is a new architecture built from the ground up around short parallel vector SIMD processing. A single 128-entry, 128-bit register file stores both scalar and vector data, and SIMD data paths provide the computation capabilities for both scalar and SIMD processing. The SPU supports a repertoire of 8-bit, 16-bit, and 32-bit integer operations, and 32-bit and 64-bit floating-point operations [].
Area Efficiency As processor designs have become increasingly complex, ISA vector extensions have become more attractive by delivering increased performance for a range of workloads. Unlike traditional ILP extraction in microprocessors, which requires fetching, decoding, issuing, and completing more instructions to increase compute performance, vector extensions increase the compute capability per instruction. ILP performance techniques require scaling up all parts of the processor: instruction fetch and issue bandwidth, global completion tables or reorder buffers, the number of register file and data cache ports (and hence register file size), as well as execution units. Vector instructions, in contrast, can increase performance by replicating execution units operating in lockstep and by ensuring adequate data delivery with wider register files and wide data cache and register file ports; instruction-processing capability, such as the number of instructions fetched, decoded, renamed, issued, and completed, remains independent of the number of operations completed by a single instruction. In addition to this attractive area efficiency, vector extensions are comparatively easier to verify than many other performance techniques, because such designs increase the dataflow content of a design by replicating execution units. The dataflow of these execution units can in turn be verified in a modular fashion at the individual dataflow level.
Power and Energy Efficiency Vector extensions have also become increasingly attractive in the power-constrained design space in which modern microprocessor designs must be implemented. From a power- and energy-efficiency perspective, reduced area consumption directly translates into a reduction in power and energy use. Beyond the power and energy benefits associated with area reduction, the performance increase offered by exploiting data parallelism with vector extensions can also be translated into more efficient operating points. By exploiting the increased performance potential offered by vector extensions to operate microprocessors in a more power-efficient regime, power and energy efficiency can be dramatically improved. Because performance perf = ops/cycle × f, any increase in ops/cycle can be offset by a commensurate reduction in operating frequency. Following the power equation P = CV²f (here P is power, C capacitance, V voltage, and f operating frequency), and the observation that f scales roughly with V, any increase in processing performance can be converted into a cubic reduction in power due to the V²f factor. On the other hand, the capacitance C will only rise modestly as the number of execution units is increased. Salapura et al. study the energy efficiency of the BlueGene system and discuss the power efficiency of SIMD architectures based on a power/performance characterization of BlueGene workloads [].
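As an illustrative calculation (not taken from the original text, and ignoring static power), suppose a vector extension doubles ops/cycle and the operating frequency is halved to keep delivered performance constant, with the voltage scaled roughly in proportion to the frequency. The dynamic-power equation then gives

\[
\frac{P_{\text{new}}}{P_{\text{old}}}
= \frac{C_{\text{new}}}{C_{\text{old}}}
  \left(\frac{V_{\text{new}}}{V_{\text{old}}}\right)^{2}
  \frac{f_{\text{new}}}{f_{\text{old}}}
\approx \frac{C_{\text{new}}}{C_{\text{old}}}\cdot\frac{1}{4}\cdot\frac{1}{2}
= \frac{1}{8}\,\frac{C_{\text{new}}}{C_{\text{old}}},
\]

so even if the wider data path raises the switched capacitance somewhat, power at equal throughput drops substantially.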
Software Enablement Programmers can exploit vector capabilities provided by vector ISA extensions in several different ways.
Compiler-Supported Exploitation Compiler support is the single most important enabler, and yet also the biggest obstacle, to extending the reach of vector architectures beyond the traditional graphics and high-performance compute kernels. In this approach, the application is written without any vector architecture in mind. The compiler detects
parts of the code with data parallelism and generates code that takes advantage of the SIMD capabilities of the target architecture. The advantages are that data parallelism in applications can be exploited to achieve higher performance, the application remains portable, and the use of the vector unit and vector instructions is transparent to the programmer. This approach requires compiler support, suitable libraries, and middleware for the target vector architecture to exploit the vector capabilities.
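A minimal sketch of this approach, under the assumption of a compiler with auto-vectorization support (for example, GCC or Clang at a high optimization level): the loop below is written with no particular vector architecture in mind, yet its independent iterations are exactly the pattern the compiler can map onto SIMD instructions for whatever target it is given. The function name is illustrative.

```c
#include <stddef.h>

/* Plain, portable C: every iteration is independent, so a
 * vectorizing compiler is free to process several elements of
 * a[], b[], and dst[] per SIMD instruction on its target. */
void scaled_add(float *dst, const float *a, const float *b,
                float alpha, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = alpha * a[i] + b[i];
    }
}
```

Compiled, for instance, with gcc -O3 (which enables GCC's tree vectorizer), such a loop is typically turned into packed loads, multiplies, adds, and stores, with a scalar epilogue handling any leftover iterations.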
Usage of Intrinsics To overcome the limitations of compilers commonly in use today, SIMD instructions are often generated directly using intrinsics, a form of inline assembly. This approach requires the programmer to map the algorithm to a sequence of assembly-level instructions, but it allows the compiler to perform register allocation and instruction scheduling. When using compiler intrinsics, the programmer directs the compiler to perform specified vector operations on specified variables for the target vector architecture. As a result, the code is optimized for that particular vector architecture, exploits its vector capabilities efficiently, and achieves high performance. As a disadvantage, the application developer has to be familiar with the vector instructions of the target architecture and be able to use them efficiently. The resulting code is also typically not portable to any other vector architecture. To port the code to another vector architecture, the intrinsics for the vector instructions specified in the original code have to be replaced to match the new target architecture. Alternatively, the vector intrinsics of the first architecture can be mapped to the equivalent vector operation, or sequence of vector operations, of the second target architecture; this task can be facilitated by using translation libraries. To date, widespread adoption of SIMD-based computing has been hampered by low programmer productivity due to the need for low-level programming to exploit the SIMD ISA.
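As a sketch of the intrinsics style, here written against the x86 SSE intrinsics declared in <xmmintrin.h> (the function itself and the assumption that n is a multiple of four are illustrative), an element-wise addition is expressed directly in terms of 128-bit vector operations:

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_*_ps */

/* Adds two float arrays four elements at a time using SSE.
 * n is assumed to be a multiple of 4 for brevity; a production
 * version would add a scalar loop for the remainder. */
void vec_add_sse(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* unaligned 128-bit load   */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* four parallel float adds */
        _mm_storeu_ps(&dst[i], vc);        /* unaligned 128-bit store  */
    }
}
```

The programmer chooses the vector operations and data layout, while the compiler still performs register allocation and instruction scheduling; porting the routine to AltiVec/VMX or the Cell SPE would mean replacing each intrinsic with its counterpart on that architecture.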
Limitations While intrinsics allow a more efficient exploitation of vector architectures, the speedup attainable with vector architectures is often limited. Some of the limitations that can prevent a program from achieving a performance increase when using vector processing are listed here.
Sequential Code Applications with data-level parallelism gain the largest performance improvement from a vector unit, but not all applications can benefit from vector processing. For example, a widely cited study performed at the University of California at Berkeley [] identified finite-state-machine (FSM)-based algorithms associated with text processing as particularly challenging for the parallelism that modern processors have made available. In these algorithms, a loop iterates over a pointer-based linked list; the current element has to be read in order to determine the memory location of the next element.
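A minimal sketch of the pattern just described (the types and names are illustrative): the address of each element becomes known only after the previous element has been loaded, so the iterations cannot be packed into vector operations.

```c
struct node {
    int          data;
    struct node *next;
};

/* Serial dependence: p->next must be loaded before the address
 * of the next iteration's element is known, so SIMD cannot
 * overlap the iterations of this loop. */
int sum_list(const struct node *head) {
    int sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        sum += p->data;
    }
    return sum;
}
```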
Data Formatting The layout of data in memory impacts the performance of vector architectures. If data are laid out in consecutive memory locations, they can be loaded directly into vector registers. Otherwise, data need to be rearranged into a format suitable for vector processing. This manipulation introduces overhead which diminishes or eliminates the performance increase of vector processing; applications that operate on noncontiguous memory locations will not experience the same speedup from SIMD. The formatting overhead requires additional formatting instructions and increases register pressure, because values must be kept in the register file while they are being formatted. The reduced number of available registers can in turn reduce the effectiveness of compiler optimizations. As of this writing, processors generally provide no hardware support for loading from or storing to noncontiguous memory locations by means of SIMD gather/scatter instructions. Another data-formatting factor which impacts the performance of vector architectures is data alignment. An aligned reference means that the desired data reside at an address that is a multiple of the vector register size. Contiguous data are unaligned when they are not physically located at addresses that are a multiple of the SIMD data width. For example, vector loads on Power or on the Cell PPU/SPU can only load data that start at addresses aligned at 16-byte boundaries. The compiler attempts to position data at aligned addresses, but the data address cannot always be determined at compile time, for example, when the address calculation uses a result from previous operations. To ensure that data are aligned, an alignment sequence can be used: it loads all necessary data, plus some additional data, and then formats the result to use only the data dictated by the application. This introduces overhead and, in turn, reduces the performance advantage of using vector instructions. In cases where there is a large number of misaligned streams, the speedup will be minimal due to increased instruction bandwidth and register pressure []. Recent vector architectures provide hardware support for loading contiguous memory locations that are not aligned at a multiple of the vector length of the vector unit. Frequently, such loads are inefficient and have significantly higher latency than vector loads of aligned data; an inefficient hardware implementation may simply shift the software overhead to the hardware.
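A small sketch of how software commonly sidesteps the alignment problem (the 16-byte figure matches the 128-bit facilities discussed in this entry; aligned_alloc is the C11 aligned allocator):

```c
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 16  /* bytes: one 128-bit vector register */

/* Returns nonzero if p can be used directly by aligned vector loads. */
static int is_vector_aligned(const void *p) {
    return ((uintptr_t)p % VEC_ALIGN) == 0;
}

/* Allocating data on a vector boundary up front avoids the
 * load-shift-merge realignment sequence described in the text. */
float *alloc_vector_array(size_t n) {
    /* aligned_alloc requires the size to be a multiple of the alignment. */
    size_t bytes = ((n * sizeof(float) + VEC_ALIGN - 1) / VEC_ALIGN) * VEC_ALIGN;
    return (float *)aligned_alloc(VEC_ALIGN, bytes);
}
```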
Exemplary ISA Vector Extensions
AltiVec and VMX
AltiVec is a floating-point and integer SIMD instruction-set architecture designed by IBM, Motorola (the Motorola semiconductor division now operates independently as Freescale Semiconductor), and Apple in the late 1990s. AltiVec is implemented in a number of PowerPC processors. The first implementation of AltiVec was in Motorola’s G4 PowerPC. IBM implements AltiVec in a number of its processors, including the PowerPC 970 (called G5 by Apple) and POWER processors. Since Freescale owns the AltiVec trademark, the facility is also referred to as the Vector Multimedia Extension (VMX) by IBM and as the Velocity Engine by Apple. AltiVec was the first SIMD instruction set to gain widespread acceptance by programmers; Apple was the primary customer, adopting AltiVec in PowerPC-based Mac PCs. AltiVec/VMX architects a dedicated vector register file with thirty-two 128-bit registers; each register can hold sixteen 8-bit signed or unsigned characters, eight 16-bit signed or unsigned shorts, four 32-bit integers, or four 32-bit single-precision floating-point variables. AltiVec
supports a special RGB “pixel” data type and provides cache-control instructions intended to minimize cache pollution when working on streams of data. Most AltiVec instructions take three register operands, and there is a small number of four-operand instructions. The four-operand instructions include merged floating-point and integer multiply-add instructions, which are very important in achieving high floating-point performance, and the vector permute instruction. The permute instruction takes any byte of either of two input vectors, as specified in a third vector, and places it in the resulting vector. This allows complex data manipulations in a single instruction and is very useful in a number of applications, ranging from image processing to cryptography. AltiVec is a standard part of the Power ISA v2.03 [] and later specifications. VMX128 is a modified version of VMX for the Xenon processor (used in Microsoft’s Xbox 360). It architects 128 registers and adds support for sum-across operations, where all the vector elements in a single vector register are added together to generate a single sum. These are application-specific operations for accelerating 3D graphics and game physics. The Blue Gene/L and Blue Gene/P supercomputers use a double-precision floating-point SIMD unit. This unit uses 128-bit vector registers to contain two double-precision floating-point values, and the scalar and vector registers share the same register file. MMX
MMX is a SIMD instruction set designed by Intel and implemented in the Pentium processor line in 1997. MMX used 64-bit vector data, which were stored in the floating-point registers. Processing was performed on vectors of eight 8-bit elements, four 16-bit elements, or two 32-bit elements. The mapping of MMX registers onto floating-point registers made it difficult to use floating-point and SIMD data in the same application, and required mode switching. MMX provided only integer operations []. AMD designed an extension to the x86 instruction set with the 3DNow! instruction set. 3DNow! added floating-point support to the MMX SIMD instructions. The first implementation of the 3DNow! instruction set was in AMD’s K6-2 processor in 1998.
SSE
Streaming SIMD Extensions (SSE) is a SIMD instruction-set extension of the x86 architecture. SSE was designed by Intel and introduced in 1999. The first SSE implementations were in Intel’s Pentium III processors and later in AMD’s Athlon XP and Duron processors. SSE uses a set of vector registers separate from the general-purpose registers. It has eight 128-bit registers, each of which can represent sixteen 8-bit bytes or characters, eight 16-bit short integers, four 32-bit integers, four 32-bit single-precision floating-point variables, two 64-bit integers, or two 64-bit double-precision floating-point numbers. SSE includes integer and floating-point instructions. Unlike the MMX instruction set, SSE does not reuse the floating-point registers but implements separate registers for SIMD. This allows interleaving of SSE and scalar floating-point operations without having to switch between the two modes. Since its introduction, SSE has gone through several revisions (SSE2, SSE3, SSSE3, SSE4). SSE2, introduced in Pentium 4 processors, adds new instructions for double-precision (64-bit) floating point. SSE3, introduced in later Pentium 4 chips, adds the capability to work horizontally within a register, such as instructions to add and subtract the multiple values stored within a single register. Further revisions (SSSE3, SSE4) were introduced with the Intel Core processor generations. Cell SPE
The Cell Synergistic Processor Element (SPE) instruction set was designed by IBM, Sony, and Toshiba. It is implemented in IBM’s Cell BE processor, used in Sony’s PlayStation 3 (2006), and in the PowerXCell 8i processor by IBM (2008). The Cell SPE architecture implements a single 128-entry, 128-bit register file for storing both scalar and vector data and supports both scalar and SIMD processing. The SPE can perform four single-precision or two double-precision floating-point operations, or sixteen 8-bit, eight 16-bit, or four 32-bit integer operations, per instruction. Many architectures separate scalar registers and execution units from SIMD registers and execution units. Some applications (for example, text processing) require frequent data movement between the two sets of registers. The movement of data between vector and general-purpose registers is achieved by storing data from one set of registers and loading it into the other set. This type of transfer incurs a significant delay, often wiping out any performance gains obtained from exploiting SIMD data parallelism. The Cell SPE architects a unified register file for storing both short vector and scalar values, and it reuses its execution units for both scalar and vector processing. Scalar values are stored in the leftmost slot of the vector register, the so-called “preferred slot.” The unified register file makes data sharing between scalar and SIMD vector operations straightforward and incurs no delay. VSX
VSX (Vector-Scalar Extension) is a SIMD instruction set designed by IBM. It was first implemented in IBM’s POWER7 processor in 2010 and is described in Power ISA v2.06 []. VSX provides 64 SIMD registers and includes instructions for double-precision floating point, decimal floating point, and vector execution. VSX implements a unified register file in which the floating-point registers and the VMX/AltiVec vector registers are mapped onto the VSX vector registers. VSX SIMD floating-point operations use 64 128-bit vector registers (32 of them overlaid on top of the VMX registers, and 32 overlaid on top of the scalar floating-point registers) []. VSX adopts the concept of the “preferred slot,” as introduced in the Cell SPE, where the leftmost element of a register can be used as both a scalar and a vector element. The data in the merged vector-scalar floating-point registers are used for scalar and vector floating-point processing, as well as for vector integer processing. Furthermore, VSX introduces double-precision floating-point vector operations to increase parallelism in double-precision floating-point processing.
Bibliography
1. Kohn L, Margulis N (1989) Introducing the Intel i860 64-bit microprocessor. IEEE Micro
2. Lee R (1997) Effectiveness of the MAX-2 multimedia extensions for PA-RISC 2.0 processors. Hot Chips IX, Palo Alto, August 1997
3. Diefendorff K, Dubey P, Hochsprung R, Scales H (2000) AltiVec extension to PowerPC accelerates media processing. IEEE Micro
4. Gschwind M, Hofstee P, Flachs B, Hopkins M, Watanabe Y, Yamazaki T (2006) Synergistic processing in Cell’s multicore architecture. IEEE Micro
5. Salapura V, Walkup R, Gara A (2006) Exploiting workload parallelism for performance and power optimization in BlueGene. IEEE Micro
6. Asanovic K, Bodik R, Catanzaro BC, Gebis JJ, Husbands P, Keutzer K, Patterson DA, Plishker WL, Shalf J, Williams SW, Yelick KA (2006) The landscape of parallel computing research: a view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley
7. Eichenberger A, Wu P, O’Brien K (2004) Vectorization for SIMD architectures with alignment constraints. In: Programming language design and implementation (PLDI). ACM, New York
8. Power ISA, Power.org. http://www.power.org/resources/downloads/PowerISA_.Public.pdf
9. Peleg A, Weiser U (1996) MMX technology extension to the Intel architecture. IEEE Micro
10. Power ISA Version 2.06. Power.org. http://www.power.org/resources/downloads/PowerISA_V._PUBLIC.pdf
11. Gschwind M, Olsson B. Multi-addressable register file. US patent application
Vectorization
See FORGE; Parallelization, Automatic; Parafrase

Verification of Parallel Shared-Memory Programs, Owicki-Gries Method of Axiomatic
See Owicki-Gries Method of Axiomatic Verification

View from Berkeley
See Green Flash: Climate Machine (LBNL)

Virtual Shared Memory
See Software Distributed Shared Memory
VLIW Processors
Joseph A. Fisher¹, Paolo Faraboschi², Cliff Young³
¹Miami Beach, FL, USA
²Hewlett Packard, Sant Cugat del Valles, Spain
³D. E. Shaw Research, New York, NY, USA
Definition VLIW (Very Long Instruction Word) is a CPU architectural style that offers large amounts of irregular instruction-level parallelism (ILP) by overlapping the execution of multiple machine-level operations within a single flow of control. In a VLIW, the instruction-level parallelism is visible in the machine-level program and must be exposed and arranged before programs run; this complex job is done using sophisticated compiler technology, with little, if any, help from the programmer. A classic organization of a VLIW instruction consists of many individual operations bundled together into a long instruction word, with one such word issued each processor cycle. VLIW processors are used extensively in high-performance embedded applications, and have found some success as high-performance servers.
Discussion VLIW architectures offer large amounts of instructionlevel parallelism by arranging a parallel execution pattern in advance of the running of the program. The parallelism is carried out among multiple machine-level operations, typically RISC-style, within a single flow of control. In a VLIW, a single program counter is used to determine the instruction stream, and instructions (each of which might itself include multiple operations) are fetched and dispatched one at a time with no rearrangements of their order done in the hardware. The execution arrangement is orchestrated in advance of the running of the program. The hardware simply fetches, decodes, and issues the wide instructions in an in-order fashion, preserving the order that the compiler had previously established. Unlike vector architectures which require code with a regular form of parallelism where the same operation is applied to parallel data, VLIW parallelism does not rely upon such regular patterns being identified and presented to special execution units. The identification and then choreography of “irregular” parallelism can
V
V
VLIW Processors
be extremely complex, and since VLIWs must look like ordinary processors to the user, sophisticated compiler technology is required for a VLIW to be effective. This effect is so strong that VLIW architectures are often co-designed with a matching compiler technology. The “very long” instructions are what inspired the name VLIW, which provides a good mnemonic device for the key architectural characteristics. There is, however, no requirement that a VLIW actually have “long instructions.” Rather, the key defining characteristic is that the single-stream ILP is exposed in the program, and thus planned in advance, rather than organized on-the-fly while the program runs. Operations might instead be statically scheduled to be issued rapid-fire, while several previous operations are still in flight. Such an architecture style, sometimes called superpipelining, has the same key characteristics as one with actual long instructions, differing only in implementation details, and can be considered a form of VLIW in which statically scheduled parallel instructions overlap over time. VLIWs are often juxtaposed with superscalar architectures, since both have the goal of rearranging the instruction stream to effect a speedup via the parallel execution of simple operations on multiple functional units. The two design styles differ fundamentally in when and by what mechanism the rearrangement is done: beforehand at compile time (VLIW), or via specialized hardware operating each cycle while the program runs (superscalar). Real implementations have interpreted the “Very” in VLIW in rather different ways, depending on the target domain and the available circuit and compiler technology. For example, in high-performance computing, the Multiflow Trace architecture (a minisupercomputer of the late 1980s) could issue up to 28 operations per cycle. In the embedded space, the vast majority of VLIW architectures to date issue between four and eight operations per cycle. Two techniques that combine architectural support with compiler transformations, speculative execution and predicated execution, offer complementary methods to increase instruction-level parallelism in ILP processors. In speculative execution, the compiler moves operations above conditional branches that dominate the original location of the operation. Because the branch is conditional, this code motion is a gamble: if the compiler guesses right, the program runs faster; if
not, the program will have wasted execution resources. Care must be taken to suppress side effects (e.g., memory stores or exceptions) that would change the semantics of the program if the speculated operation is incorrectly executed. In predicated execution, the instruction-set architecture of the machine supports additional input registers to operations that conditionally enable or disable (viz., predicate) the operation. Compiler optimizations can then choose to predicate operations from different control paths and fuse those paths into a single flow of control. Predication can reduce branch-related pipeline penalties and enable further scalar optimizations, but it comes at the cost of a more complicated implementation and potentially increased instruction fetch bandwidth when the operations from all fused paths must be fetched. These two techniques are covered by other articles in this encyclopedia, and will not be discussed in more detail in this article.
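As a brief illustration only (the details are left to the dedicated entries), the effect of predication can be rendered at the C level as if-conversion: a data-dependent branch is replaced by an unconditional select that a predicated or conditional-move instruction can implement.

```c
/* Branchy original: the if statement puts a hard-to-predict
 * branch in every iteration. */
void clamp_branchy(int *x, int n, int limit) {
    for (int i = 0; i < n; i++) {
        if (x[i] > limit)
            x[i] = limit;
    }
}

/* If-converted form: both outcomes are computed and a select,
 * guarded by the predicate p, picks the result, leaving a
 * straight-line loop body with no branch. */
void clamp_predicated(int *x, int n, int limit) {
    for (int i = 0; i < n; i++) {
        int p = (x[i] > limit);        /* predicate              */
        x[i] = p ? limit : x[i];       /* select under predicate */
    }
}
```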
VLIW Implementation A classic VLIW implementation might have instruction encodings that look like the following, from the Multiflow Trace 7/200 (Fig. ). The Trace 7/200 issues seven operations per clock cycle: one branch, two integer operations, two memory operations (either of which can optionally be an integer operation instead), and two floating-point operations. A machine-level program is a sequence of such instructions, as shown below, with one instruction issued per clock cycle, unless the CPU stalls (Fig. ). Classic VLIW execution-unit hardware usually consists of several functional units, several register banks, and paths to memory arranged in a crossbar for unrestricted access to all of them in any cycle, as shown below. Implementation realities often make such a rich connection not scalable beyond a certain number of fully connected units, because of the nonlinear relationship between the number of execution units and the area and complexity of the register file and the bypass (forwarding) logic. Sometimes functional units are arranged so that each is connected to only a subset of the register banks – for example, two identical integer ALUs might each address a different register bank. In this case, the partitioned-register architecture is referred to as a clustered VLIW – successfully dealing with clustered register banks is a compiler research topic of its own.
VLIW Processors. Fig. A possible encoding of a 256-bit wide VLIW instruction composed of 7 operations (one branch, two integer/memory, two integer, and two floating-point opcode slots)

VLIW Processors. Fig. A machine-level program with one instruction issued per clock cycle, unless the CPU stalls
The figure below shows the block diagrams of the unclustered and clustered VLIW architectures (Fig. ).
Benefits of VLIW The VLIW design style exists to provide effective high performance, with little if any programmer intervention, and without great hardware cost. A VLIW usually requires less hardware than other approaches to high ILP performance (such as superscalar or dataflow), since
the choreography of parallelism is done in advance and there is no need for hardware to rearrange the instruction execution sequence at run time. The absence of rearrangement hardware, running every cycle, has broad implications for performance, cost, and power: ●
A faster clock is possible (since there is less to do each cycle). ● There is less hardware, thus a lower silicon cost.
VLIW Processors. Fig. Block diagrams of (unclustered and clustered) VLIW architectures
● Less power is used. ● Far more ILP is practical, since the additional hardware needed to rearrange computations tends to grow exponentially with the amount of ILP available.
A result of this is that issue widths ranging from a few up to a few tens of operations per cycle are common in VLIWs. The widest Multiflow Trace offered 28 operations issued per cycle, and
for some heavily computational code was able to sustain a large fraction of complete utilization. These advantages exist primarily in comparison to superscalar architectures, the closest equivalent form of parallelism. When compared to other forms of parallelism, both VLIW and superscalar architectures have the advantage that the computing paradigm does not change, that is, parallel processes and regular structures do not have to be identified in the code. To the programmer, both are much like “normal” scalar processors.
Object-Code Compatibility The greatest shortcoming in VLIW’s use as “ordinary” computers lies in its defining characteristic. Old programs will not run correctly on new machines where either the set of hardware resources or the number of cycles to perform an operation (i.e., its latency) has changed vs. what the compiler assumed when it generated the original object code. The exact computation to be carried out, and the machine on which to execute it, is visible in a VLIW binary program. Thus, a new implementation of a VLIW architecture is not likely to be object-code compatible with an old one, which represents a major flaw in VLIW’s use as a general-purpose computer. One consequence of this shortcoming is that VLIWs have found far greater success in the embedded computing world, where object-code compatibility is far less of a factor than on general purpose computers such as desktops or servers. If an architecture is not changed dramatically, there are several techniques that can help solve this problem, individually or in combination: ●
If the new architecture is in some sense a superset of the old one, then the system can be designed to allow the old programs to run on a limited portion of the hardware, corresponding to the old architecture. Eventually, the code may be recompiled to get greater performance on the new architecture. This upward compatibility was used by Multiflow, where the wider systems could run code compiled for the narrower ones. It is also used, in a different manner, in the Intel IA-64, where the compiler generates code assuming a superset of the resources are available, and each implementation may choose to execute only some operations in parallel. ● If the incompatibility results largely from changed latencies, then programs can carry information about their assumptions, and mode switches built into the CPU can enforce the correct timings (delaying any instruction issue when a component operation cannot be guaranteed to have finished). ● The system can statically change old programs into new ones that follow the new architectural assumptions at the time the program is loaded. ● The system can dynamically change old programs into new ones that follow the new architectural
assumptions while the program is running. This only makes sense if the translations can be cached and reused repeatedly. For example, the x86-compatible Transmeta processors used a dynamic binary translation technology of this kind to generate VLIW instructions from a stream of x86 instructions on the fly.
Other Perceived Disadvantages of VLIW Dynamic Code Behavior Often compilers cannot predict the behavior of a program, most importantly in branch direction and in memory latency. A VLIW, having had its execution laid out in advance, can’t adjust to these dynamic changes. ●
To ameliorate the problem of unpredictable branch directions, a compiler uses static branch prediction to assume that the more likely branch direction will be taken, and inserts code to correct the program behavior when this guess is wrong. The development of these techniques (Trace Scheduling was the first) greatly increased the potential for ILP in ordinary programs. Short, hard-to-predict branches can also be eliminated through various forms of predication (also known as conditional, or guarded, execution). ● To ameliorate the problem of unpredictable memory latency, VLIWs often include decoupled register fetch mechanisms and other dynamic features. The designers of VLIW architectures have been careful to regard the VLIW design style not as a dogma, but rather as a design principle, to guide decisions but not mandate them. When dynamic features have been desirable, designers have felt free to use them. For example, the use of register scoreboarding reduces code size by removing the need for no-op operations. Dynamic branch prediction reduces the misprediction penalty by prefetching instructions along the most likely path. Stall-on-use memory techniques help ameliorate the L1 cache miss penalty when the compiler can schedule the consumer of a memory operation later than the shortest latency without impacting performance.
Code Size A VLIW can offer a great deal of parallelism, but most sections of code will only use a portion of what is offered. Code outside the dense inner loops will often
use only a very small part of what is offered. While performance improvements in such sections of code are very valuable, and VLIWs tend to do comparatively well there, most instructions still contain only a few operations to be carried out. That leaves many fields of the long instruction word blank, or filled with no-ops. Left untreated, this would cause programs to occupy an undesirably large amount of memory. Every modern VLIW, from the 1990s forward, has dealt with this problem in some way. Sophisticated code compression techniques have been developed that bring a VLIW’s code size down to the size of uncompressed RISC programs. Compression appears at every level of the instruction memory hierarchy, though typically the greatest degree of decompression is done as the instruction cache is filled.
Compiler Sophistication and Compile Time The job of arranging a VLIW’s parallelism, which falls to the compiler, is difficult to engineer and time-consuming to run. While this is true, it is a fallacy to think of these effects as an undesirable quality of the VLIW design style, especially when it is compared to superscalars. The actual scheduling phase of a VLIW compiler, while complex, is fairly well understood and need not take a lot of compile time. The hard part of the compiling process is doing the code rearrangements to expose the parallelism in the first place, including the branch prediction mentioned above. All of this is a function of the available instruction-level parallelism, not of the style of architecture that exploits it. The reason VLIW architectures require “heroic compilers” is that they offer so much ILP. If it were practical to build a superscalar architecture with that much ILP, it would require an equivalently heroic compiler.
Dealing with Exceptions Any system that does as much rearranging of a program as a VLIW does must deal with preserving the original program semantics in the face of exceptions. Even worse, rearranging programs by speculating the execution of an instruction before a controlling branch may cause excepting operations to execute, when they would not have executed at all in the original program’s sequential order. A vast body of technology, too much
to do justice to here, has grown to address this problem, including the concepts of “dismissible loads” and “delayed exceptions.”
Important VLIW Systems Startup Computer Companies in the 1980s The first commercial VLIW systems were small supercomputers: ●
The Trace 7/200, and its successors, introduced in 1987 by the startup Multiflow Computer ● The Cydra-5, introduced in 1988 by the startup Cydrome Multiflow sold its Traces mostly into mechanical, chemical, and electronic simulation environments. Despite Multiflow’s initial success, neither company survived beyond 1990.
First VLIW Microprocessor After years of design, the Philips LIFE-1 (later renamed the TriMedia) was introduced in the mid-1990s and was the first commercially successful VLIW microprocessor. Eventually, the division of Philips that produced the TriMedia became the independent company NXP Semiconductors, which still designs and builds VLIW microprocessors, the latest being the PNX series, used in digital TVs and other digital audio and video products, as well as in cellular handsets and other applications. Other Modern VLIW Microprocessors Today’s most important VLIW microprocessors, and their main uses, include: ●
The STMicroelectronics ST200 family (used extensively in digital video and in printing and scanning) ● The Texas Instruments C6x family (used extensively in cellular basestations and smartphones, and in home entertainment) ● The Fujitsu FR-V (used in digital cameras) Although companies do not generally release volumes for individual parts, ST has reported that many millions of ST200 cores have been shipped, and Texas Instruments reports that a large share of all cellular base stations contain VLIW processors. VLIW processors have been used successfully in products such as hearing aids, graphics boards, and network communications devices. There are also numerous “fabless” (intellectual-property-based) VLIW designs commercially available
from CEVA, Silicon Hive, Tensilica, and others. These processors are generally found as the high-performance core or cores in a larger system-on-a-chip.
VLIW in Processors for General-Purpose Use During the decade of the 2000s, two prominent efforts were made to use VLIW in desktops, notebooks, and servers. ●
Transmeta used a VLIW processor as the underlying engine to run an emulator of the Intel x86. However, Intel’s and AMD’s strength in the marketplace, and the complexity of setting up a high-volume manufacturing operation, were too much of a barrier for the excellent cost/performance characteristics of this product to overcome. ● Intel’s Itanium processor family borrows many characteristics from VLIWs, while adding many new complex features of its own. Intel refers to the design style of that family as “EPIC” (explicitly parallel instruction computing). The Itanium borrows many design artifacts from the Cydra-5, and founders of both Cydrome and Multiflow were involved in the initial design at HP Labs. While the Itanium is the fourth most popular server architecture today, it has not met industry projections for its volume, as its lack of compatibility and its ineffective early implementations have greatly limited its growth.
VLIW was Motivated by Compiler Techniques The Trace Scheduling compiler algorithm was put forward by Joseph Fisher [] at the Courant Institute of New York University in 1979. Trace scheduling was designed to solve the “horizontal microcode compaction problem” by eliminating the restriction that overlapping operations must originate from the same basic block of straight-line code. No practical algorithm to do so had been put forward before that. The basic idea in trace scheduling has two steps. First, select code originating in a large region (typically many basic blocks). Second, compact operations from the entire region as if it were one big basic block (which often contains a lot of ILP), possibly adding extra operations to maintain the semantics of the program, despite the violence done to the flow of control. Given that Trace Scheduling made far more ILP available, Fisher predicted that general-purpose systems
like today’s VLIWs made sense, coining the term instruction-level parallelism for this kind of parallel processing []. While at Yale University, Fisher then defined and named the VLIW design style in 1983, publishing his thoughts in []. By January of 1987, Multiflow Computer, a company started by Fisher and members of his Yale University group, delivered high-performance commercial VLIWs to customers. During that same period, software pipelining began to mature. Software pipelining is a compiler technique (originally, it was sometimes done by hand) to speed up inner loops. It does this by rearranging the loop so that operations from multiple iterations are computed within the same revised inner loop. Code is placed before (prologue) and after (epilogue) the new loop to preserve the program semantics, by starting the first iteration(s) and completing the last one(s). Software pipelining was done by hand on early computers containing some ILP, and was carried out via simple pattern matching by Fortran compilers for CDC machines as early as the 1970s. By the early 1980s, the Floating Point Systems Fortran compiler could software pipeline a loop consisting of a single basic block [, ]. In 1981, Bob Rau [] described a more scientific version of software pipelining, using the term “modulo scheduling.” His work helped lead to an early interest in VLIWs, and Rau went on to start Cydrome, which, along with Multiflow, pioneered VLIWs. Modulo scheduling researchers all favored “rotating registers,” as found in the FPS systems. Rotating registers change their addresses to match the iterations of a modulo-scheduled loop. These have been shown not to be necessary, and whether they are beneficial is still controversial. They are found in the Itanium and (in a more restricted form) in the TI C6x.
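As a sketch of the idea, written by hand here for exposition (a modulo scheduler performs the equivalent transformation on machine operations, and the code assumes n >= 2), a loop that loads, multiplies, and stores can be rewritten so that the load, multiply, and store belonging to different iterations are in flight at the same time and can be packed into the same long instruction:

```c
/* Original loop: load a[i], multiply by k, store to b[i]. */
void scale(float *b, const float *a, float k, int n) {
    for (int i = 0; i < n; i++)
        b[i] = a[i] * k;
}

/* Software-pipelined form (assumes n >= 2): while iteration i's
 * product is stored, iteration i+1 is being multiplied and
 * iteration i+2's operand is being loaded. */
void scale_pipelined(float *b, const float *a, float k, int n) {
    float loaded  = a[0];              /* prologue: start iteration 0 */
    float product = loaded * k;
    loaded = a[1];
    for (int i = 0; i < n - 2; i++) {  /* kernel: three iterations in flight */
        b[i]    = product;             /* store for iteration i       */
        product = loaded * k;          /* multiply for iteration i+1  */
        loaded  = a[i + 2];            /* load for iteration i+2      */
    }
    b[n - 2] = product;                /* epilogue: drain the pipeline */
    b[n - 1] = loaded * k;
}
```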
Precursors of VLIWs During the 1960s and 1970s, three different architectural trends led the way toward VLIWs. 1. High-performance general-purpose scientific computers. These were the supercomputers of their day. The best examples are the CDC 6600 and the IBM 360/91, which was preceded by the IBM research processor Stretch. They derived ILP by performing several arithmetic operations simultaneously, in a manner arranged by the hardware (like today’s
superscalars). Although the hardware could identify independent operations and schedule their issuance, practitioners quickly realized that a compiler could place operations close enough to each other in the machine-level program that the hardware, with its limited visibility, could overlap them more effectively. Because these systems typically overlapped only two or three operations, the compiler’s rearrangement only needed to span short sections of straight-line code in the source program. These systems were general-purpose ILP CPUs. Primitive software pipelining and basic-block instruction scheduling both originated in this environment. Researchers working on these systems, convinced by flawed experiments that available ILP was limited to a factor of two or three times speedup, never pushed the superscalar aspects of their technology farther. 2. Attached signal processors, or “array processors.” In the 1970s, and perhaps earlier, engineers and scientists built hardware that resembled VLIW, called attached signal processors, or “array processors.” The most famous of these were the Floating Point Systems AP-120B (introduced in 1976) and FPS-164 (introduced in 1981), though there were many others, offered by Numerix, Mercury, CDC, and Texas Instruments, among others. Although they evoke today’s VLIWs, they were typically built to compute a particular inner loop very effectively, and were thus very idiosyncratic, and their parallelism was almost always specified by hand, rather than via compiling. Circuits are still built in this style, and are called DSPs (digital signal processors). 3. Horizontal microcode. Microcode is a RISC-like level of code, sometimes found hard-wired in a CPU to emulate the true, more complex, machine-level architecture. Since the emulator is a single program, running all the time, it was straightforward to build “horizontal microcode,” in which several of the pieces of hardware run in parallel. This again resembles the hardware structure of VLIW.
In the latter two domains, compiling was regarded as desirable, but only as an afterthought. It was never very effective, but research into compiling in those domains led to the compiler breakthroughs that enabled VLIW.

Related Entries
Cydra; Dependence Analysis; Instruction-Level Parallelism; Modulo Scheduling and Loop Pipelining; Multiflow Computer; Speculation; Trace Scheduling

Bibliography
1. Fisher JA (1979) The optimization of horizontal microcode within and beyond basic blocks: an application of processor scheduling with resources. PhD dissertation, technical report, Courant Mathematics and Computing Laboratory, New York University, New York
2. Fisher JA (1981) Trace scheduling: a technique for global microcode compaction. IEEE Trans Comput
3. Fisher JA (1983) Very long instruction word architectures and the ELI-512. In: Proceedings of the 10th annual international symposium on computer architecture, Stockholm, Sweden, June 1983
4. Charlesworth A (1981) An approach to scientific array processing: the architectural design of the AP-120B/FPS-164 family. IEEE Computer
5. Touzeau RF (1984) A FORTRAN compiler for the FPS-164 scientific computer. In: Proceedings of the ACM SIGPLAN ’84 symposium on compiler construction, Montreal
6. Rau BR, Glaeser CD (1981) Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In: Proceedings of the 14th annual microprogramming workshop, IEEE Press, Piscataway
7. Fisher JA, Faraboschi P, Young C (2005) Embedded computing – a VLIW approach to architecture, compilers, and tools. Morgan Kaufmann, San Francisco

VLSI Algorithmics
See VLSI Computation

VLSI Computation
Franco P. Preparata
Brown University, Providence, RI, USA
Synonyms Application-specific integrated circuits; VLSI algorithmics
Definition VLSI computation (computation within the Very-Large-Scale-Integration technology) concerns the analysis of the computations realized by large integrated networks, whereby the traditional distinction between networks and computations disappears (each network is “dedicated” to the execution of a particular algorithm). Specific algorithms realized by such circuits are evaluated in terms of their efficient use of the integrated technology.
Discussion Historical Background Switching networks, the basic constituents of digital systems, were traditionally realized by interconnecting discrete electrical components, such as transistors, resistors, capacitors, etc., by means of conducting wires. Transistors, the fundamental switching elements, were initially realized by appropriately modifying (“doping”) different portions of the surface of a piece of semiconductor material. A key innovation occurred in the late 1950s of the past century, when it was realized that conducting wires could also be implemented on the same surface, either by deposition of metal or by modification of the semiconductor. The consequence was that several distinct transistors could be fabricated on the same piece of material, along with appropriate interconnecting wires. This discovery ushered in the era of integrated circuits, initially as interconnections of a few gates, called micrologics. The technology progressed rapidly by reducing the dimensions of the individual transistor and therefore increasing the number of transistors realizable within a fixed area of material. In those days (1965) Moore’s law was formulated, the empirical observation that the number of transistors that could be placed on an individual integrated circuit would double approximately every two years. In a few years it was possible to bring to market extremely simple processors, entirely realized on a single plate of semiconductor (the chip) and acting on operands of very few bits (4 or 8). The progression, however, was extremely rapid: by the late 1970s, the computer science and engineering community became fully aware of a mature technology of large-scale integrated circuits, and for the theoretical component of this community a new model of computation emerged:
the discipline of VLSI computation was therefore established within the scope of computer science research and became the theoretical underpinning of massively parallel computation. Although today parallel computation research pursues different directions, VLSI computation remains a rich theoretical acquisition.
A Model of VLSI Computation A paramount theoretical issue is the formulation of the computation model to be adopted. The computation model defines the framework within which performance is to be evaluated; it defines the nature of the usable resources and their respective costs. As is typical of all models, we expect the VLSI model to be reasonably simple to facilitate mathematical treatment, but sufficiently faithful to the physical reality to provide an adequate level of confidence in predictions based on the model. This goal calls for a review, albeit extremely sketchy, of the technology. The fundamental element of a widely used, but by no means exclusive, VLSI technology is the Metal-Oxide-Semiconductor (MOS) transistor (other technologies can be treated analogously). Its geometry is illustrated in Fig. . In Fig. a, a thin silicon wafer is selectively doped, by diffusion of appropriate atoms, into a p-type substrate and two n-type electrodes called source and drain, separated by a gap called the channel. On top of this gap a third electrode, called the gate, is created by depositing an (insulating) oxide layer and a conducting layer on top of it. A voltage applied to the gate creates a conducting channel between source and drain. The control exerted by the gate is the hallmark of the transistor. Revealing is the top view of the device (Fig. b). Here the transistor may be viewed simply as a (horizontal) strip comprising source and drain “crossed” by a (vertical) strip containing the gate. Similarly, a wire joining different electrodes appears as a strip of conducting material on the surface of the chip. The conclusion is that a VLSI circuit is described by the geometry of an interconnection of strips, representing devices and wires. We can now elucidate the abstraction process leading from technology to model. The model assumptions can be subdivided into three groups: 1. Network layout on the semiconductor chip.
VLSI Computation. Fig. Geometry of the MOS transistor: (a) cross section showing substrate, source, channel, gate, and drain; (b) top view
The physical layout of a VLSI circuit must be as compact as possible. However, manufacturing must guarantee the integrity of the circuit, that is, the layout must comply with design rules for transistors and wires (to avoid accidental interruption a wire cannot be too narrow; to avoid shorts two wires cannot be laid out too close to each other). Therefore, the design rules for the layout strips mainly concern widths and separations, and are expressed in terms of a technology-independent parameter λ called the feature size, typically expressing the sum of the width of a wire and its separation from adjacent layout items (the decrease of λ is a measure of technological progress: in fact λ has decreased by nearly two orders of magnitude since the late 1970s). Another important parameter is the number ν of usable layers, a parameter that over the years has grown appreciably from its originally very small value. However, VLSI technology is essentially two-dimensional. The fundamental reason, beside ease of manufacturing, is heat dissipation: as λ decreases, higher clock speed and shorter transistor switching times are attainable. This results in higher production of heat, which must be dissipated in a medium in contact with the chip surface. The area of an integrated circuit is obviously the sum of the areas of its transistors, its wires, and its input/output pads. We shall later see the dominant role of wire area in fast circuits. 2. Timing of signal transmission from device to device. The elementary action involved in timing considerations is the switching of a transistor and the propagation of its output signal to another transistor. In a simplified view, this action is the cascade of switching time and propagation time. The latter, due
to the finiteness of the speed of light, clearly grows with the length of the wire. However, in compact circuits switching time dominates propagation time, and this circumstantial evidence was used originally to axiomatize that the time of an elementary action is constant (independent of the wire layout length). 3. Input/output protocols. To be honest with respect to the use of area, each input is assumed to be read just once and in just one pad, so that it remains stored within the chip. In addition, each input and output is uniquely identified by its time of use and by its place of occurrence (the I/O is time- and place-determinate). This provision excludes trivial algorithms, where one could claim sorting just by the order in which randomly permuted inputs are read out as outputs. We can therefore summarize the model discussed above as follows: ●
VLSI networks are synchronous boolean machines with the following features, presented here as the axioms of the model: – Axiom 1. Wires have minimum width λ, transistors have minimum area CT λ², and I/O pads have minimum area CP λ² (CT, CP are technological parameters). – Axiom 2. The time of an elementary action is constant (synchronicity). – Axiom 3. Inputs are read once and in one place. Input/outputs are time- and place-determinate.
For its simplicity, this VLSI model belongs to the best computer science tradition of machine models. Its salient features dispense with details and aim to
capture the essentials of the technology. As we shall discuss later, its soft spot is the timing assumption (Point 2 above).
Area-Time Trade-offs The primary concern of the design of algorithms is to establish bounds on the resources used by a computational task. In the context of VLSI the crucial resources are computation time (number of time units) and circuit size (area). Here, as in any other area of algorithmic research, we try to find intrinsic relationships between resources (in the form of lower bounds), and then design algorithms that possibly achieve these bounds. The relation between time and area is based on the evaluation of the flow of information required for the correct execution of a computation. Specifically, we wish to identify sections (lines) in the chip through which a given flow of information is required by the computation. Two types of sections can be identified: . Input/output boundary . Internal sections The boundary trivially defines its required flow, as information proportional to the problem size, conventionally n. More subtle is the analysis of the internal flow. Assume, for simplicity, that the input bits are stored uniformly within the two-dimensional chip domain. A balanced bisection is a curve that divides the domain, and correspondingly the input set, into two parts of about the same size. Assume also that a computational argument shows that I bits must be exchanged through the bisection to correctly complete the computation. Then, if d is the length of the bisection, the number w of wires crossing the bisection satisfies w ≤ νd/λ; in clocked operation, at least I/w clock cycles (time units) are required. All we need is to relate length d to the chip area A. Simple arguments show that A ≥ d /, so that, denoting T the number of time units, we conclude Iλ T≥ √ ν A
or AT² ≥ cI²
for some technological constant c. This important relationship exhibits a trade-off between area and time: it gives formal substance to the intuitive expectation that, for a given problem, faster VLSI circuits have larger size.
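To make the chain of inequalities explicit, the following worked derivation (with illustrative constant factors; only the asymptotic form matters) combines the wire bound, the clocking bound, and the geometric bound:

T ≥ I/w ≥ Iλ/(νd) ≥ Iλ/(2ν√A)   ⇒   √A · T ≥ Iλ/(2ν)   ⇒   AT² ≥ (λ²/4ν²) I² = cI²

In particular, for problems where a computational argument shows that I = Ω(n) bits must cross every balanced bisection, the bound takes the familiar form AT² = Ω(n²).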
As we shall see in the next section, several important problems are covered by this analysis. It is worth mentioning that, according to the preceding analysis, as we increase T the area A correspondingly decreases. There are problems, like the sorting of very long keys, where the area may become too small to store the data, that is, it becomes saturated. When saturated information is the mechanism limiting performance, optimal circuits have an area growing with 1/T rather than 1/T². Finally, the outlined bisection-width analysis applies to a class of computations (a large and significant one) where the internal information flow depends upon the entire input set. However, there are other significant problems that are not covered by this rubric. These are the computations where subsets of the output variables depend upon subsets of the input variables, so that, for any such pair of (input, output) subsets, no variable can be output before all variables of the corresponding input subset have been stored in the network. Moreover, the boolean circuit computing the output variables has depth at least logarithmic in the size of its input set. In other words, each input must remain stored for some time within the chip, that is, the flow of information is being slowed down. This behavior, dubbed computational friction, is characterized by the following area-time trade-off:
AT / log A ≥ c′′I.
VLSI Architectures and Algorithms
The preceding section outlines, in a very succinct form, the intrinsic limitations placed by the VLSI technology on computations realized by networks. A crucial feature of VLSI computation is the potential of being able to select, for any given application, the targeted computation time and to attempt to design a network realizing it. So, at one end of the spectrum we have sequential circuits, and at the other the fastest parallelizations. The objective of this section is to consider classes of important computations, to describe their VLSI implementations, and to compare their performance with the lower bounds developed in the previous section. Preliminarily, we must complete the definition of the VLSI machine model with the notions of "architecture" and "algorithm."
1. Architecture
● An architecture is a graph G = (V, E), whose nodes V are modules and whose edges E are (bundles of) wires.
● The diameter of an architecture is the maximum of the lengths of the shortest paths between any two of its modules. (A large diameter is Θ(√A), a small diameter is Θ(log A).)
● A module is the elementary component of an architecture. It could be as simple as a transistor or as complex as a microprocessor. The structure of the module is assumed independent of the size of the network.
● Operation is synchronous, that is, clocked.
2. Algorithm
● X = {x1, . . . , xn} are variables and K = {k1, . . . , km} are constants.
● A parallel step σ is a collection of assignments σ = {xi ← fj(X ∪ K) | i = 1, . . . , n, fj ∈ F}, where F is a set of functions for which hardware is duly provided in the network. The semantics is that all assignments in the set σ are simultaneously executed.
● A parallel computation is a time sequence σ1 σ2 . . . σp.
● A parallel algorithm is a procedure executed as a parallel computation.
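A tiny sketch may make the "simultaneous assignment" semantics of a parallel step concrete. The Python below is purely illustrative and not part of the model: it evaluates every right-hand side against the old state before any variable is written.

# Illustrative sketch: a parallel step sigma = {xi <- fj(X u K)} is executed by
# evaluating all right-hand sides on the old state, then writing all results at once.
def parallel_step(state, assignments):
    # state: dict mapping variable names to values
    # assignments: dict mapping a variable name to a function of the whole state
    new_values = {var: f(state) for var, f in assignments.items()}   # read phase
    state.update(new_values)                                         # write phase
    return state

def parallel_computation(state, steps):
    # A parallel computation is a time sequence of parallel steps.
    for sigma in steps:
        state = parallel_step(state, sigma)
    return state

# Example: swapping x1 and x2 takes a single parallel step.
s = parallel_computation({"x1": 1, "x2": 2},
                         [{"x1": lambda st: st["x2"], "x2": lambda st: st["x1"]}])
assert s == {"x1": 2, "x2": 1}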
A most significant achievement of research in VLSI computation has been the identification of a few fundamental interconnections (networks) and of their natural association with classes of computations. Most interesting architectures result from the composition of such interconnections. Even more revealing is the identification of the classes of computations matched to each such architecture. Such classes are unified not by superficial similarity of applications, but by the communication patterns, or data exchanges, implied by their algorithms. These fundamental interconnections are discussed next.
Meshes
Meshes are the simplest architectures. Although higher-dimensional constructions are conceivable, dimensions 1 (array), 2 (grids or meshes, for short), and 3 (three-dimensional grids) have practical value. The array is a chain of modules, where each module communicates,
typically, with its adjacent modules. The mesh is a network whose modules are conventionally placed at the vertices of a regular tiling (triangular, square, etc.) of the plane. The most natural mesh, and the one discussed hereafter, is the rectangular n × m planar grid, whose modules are placed at points of integer coordinates and each of which communicates with its four adjacent modules (see Fig. a). A typical use of meshes is as systolic networks. The qualifier "systolic" suggests that data is fed to the network in waves to be processed in pipeline fashion. A useful illustration is provided by matrix multiplication, that is, the computation C = AB (A, B, and C are square matrices of dimension n). The network is an n × n square mesh, whose simple modules (inner-product modules, see Fig. b) keep stored one operand z (initialized to 0), receive inputs x and y from their upstream neighbors, perform the update z ← z + x ⋅ y, and supply the updated z to their downstream neighbors. Inputs enter the network as waves from the left (A) and from the top (B): these waves, however, are not the conventional rows and columns, but cross-diagonals, that is, vectors whose components have identical sums of their two indices. The product appears stored in the mesh at the completion of the systolic flow. The network has O(n²) area and the computation is completed in O(n) steps, so that AT² = O(n⁴). Since the corresponding area-time lower bound is Ω(n⁴), this systolic network provides an optimal implementation. Note that the diameter of this network is O(√N), where N is the number of its modules. Systolic implementations have been proposed for a variety of other problems, such as matrix-vector multiplication, open convolution, integer multiplication, solution of lower-triangular linear systems, etc.
VLSI Computation. Fig. (a) A two-dimensional mesh; (b) inner-product module
VLSI Computation. Fig. A circuit for prefix computation
VLSI Computation. Fig. The H-tree layout of a tree network
The resulting algorithms are, in general, simple and elegant; however, none of them appears to be area-time optimal. We shall return to meshes as simulators of other, more complex architectures.
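The systolic matrix product described above is easy to check with a small sequential simulation. The sketch below is illustrative Python modeling one standard variant, in which the partial products stay resident while the skewed waves of A and B stream through the mesh; it reproduces the O(n) step count mentioned in the text.

def systolic_matmul(A, B):
    # Simulate an n x n mesh of inner-product cells computing C = A * B.
    # Each cell keeps an accumulator z, takes x from the left and y from above,
    # performs z += x*y, and forwards x right and y down on the next step.
    n = len(A)
    z = [[0.0] * n for _ in range(n)]            # resident partial products
    x = [[0.0] * n for _ in range(n)]            # values moving rightward
    y = [[0.0] * n for _ in range(n)]            # values moving downward
    for t in range(3 * n - 2):                   # 3n - 2 steps flush the pipeline
        new_x = [[0.0] * n for _ in range(n)]
        new_y = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                # inputs come from the left/top neighbor, or from the skewed input waves
                xin = x[i][j - 1] if j > 0 else (A[i][t - i] if 0 <= t - i < n else 0.0)
                yin = y[i - 1][j] if i > 0 else (B[t - j][j] if 0 <= t - j < n else 0.0)
                z[i][j] += xin * yin             # inner-product update
                new_x[i][j], new_y[i][j] = xin, yin
        x, y = new_x, new_y
    return z

# sanity check against the definition of the matrix product
A = [[1, 2], [3, 4]]; B = [[5, 6], [7, 8]]
assert systolic_matmul(A, B) == [[19.0, 22.0], [43.0, 50.0]]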
Tree-Based Networks
The basic interconnection is represented by a rooted full binary tree. The standard planar layout of a tree with n = 2^j leaves (j even) has the shape of a square of side-length Lj and is denoted H-tree, from its typical appearance (see Fig.). Denoting by L0 the side-length of a module, we have the simple recurrence Lj = 2Lj−2 + L0, yielding Lj = O(L0 2^(j/2)), that is, A = O(n); as is intuitive, a bound of the same order is achievable by deploying larger modules as we proceed from the leaves to the root. To be noted is that a tree network has small (logarithmic) diameter. An important application of tree networks is the implementation of semigroup prefix computations, frequently referred to as "scans." The operations are, typically, AND, OR, MAX, MIN, ADD, etc., generically denoted here as "∗". We are given a sequence a1, a2, . . . , an and wish to compute the prefixes a1, a1 ∗ a2, a1 ∗ a2 ∗ a3, . . . , a1 ∗ a2 ∗ . . . ∗ an (n a power of 2, for simplicity). Sequentially, this is trivially doable in time O(n) with constant area. What is the performance of a fastest (logarithmic-time) implementation? The figure supplies the answer: the recursive box is a prefix computer for the sequence b1, b2, . . . , bn/2 where bi = a2i−1 ∗ a2i. We recognize that the explicit part of the diagram consists of the leafward two levels of two binary trees traversed in opposite directions (in addition, the bottom one misses its leftmost edge), so that we conclude that prefixes can be computed by two back-to-back tree networks. Such a network can be realized with the previously described H-tree layout, and we reach the conclusion that the network area is O(n) and the number of steps is O(log n). Lower-bound considerations show that prefix computations are governed by
AT / log A = Ω(n).
Since AT/log A = O(n log n / log n) = O(n), the described implementation is area-time optimal. (Note that the sequential implementation is also optimal.) Prefix computations include binary addition, comparison-exchange, sorting of long keys, etc.
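The recursive construction (pair adjacent elements, recurse on the half-length sequence bi = a2i−1 ∗ a2i, then interleave) can be written directly. The following Python sketch is a sequential illustration of the circuit's structure, not of its O(log n)-step parallel schedule.

def parallel_prefix(a, op):
    # Mirrors the recursive prefix circuit: combine adjacent pairs, recurse on
    # the half-length sequence b_i = a_{2i-1} * a_{2i}, then interleave results.
    # len(a) is assumed to be a power of 2 and op to be associative.
    n = len(a)
    if n == 1:
        return list(a)
    b = [op(a[2 * i], a[2 * i + 1]) for i in range(n // 2)]   # leafward level
    pb = parallel_prefix(b, op)                               # the recursive box
    out = [a[0]]
    for i in range(1, n):
        out.append(pb[i // 2] if i % 2 == 1 else op(pb[i // 2 - 1], a[i]))
    return out

# example: prefix sums of 1..8
assert parallel_prefix(list(range(1, 9)), lambda x, y: x + y) == [1, 3, 6, 10, 15, 21, 28, 36]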
Hypercubic Networks
A hypercubic network is generated by an iterative construction technique called dimensional duplication, where the basic network H0 is a single module (indexed by the integer 0) and Hj is obtained by duplicating Hj−1 and connecting pairs of homologous modules indexed (i, 2^(j−1) + i) for i = 0, . . . , 2^(j−1) − 1. Obviously, Hj is a j-dimensional hypercube, and each of its modules is connected to j other modules, that is, the module type depends on the parameter j. This feature violates an assumption of the model (modules independent of network size), so that the hypercube cannot be viewed as a VLSI architecture in the narrow sense.
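The dimensional-duplication construction can be stated in a few lines; the sketch below (illustrative Python) builds the edge set of Hj exactly as described: duplicate Hj−1, shift the copy's indices, and add the homologous-pair edges.

def hypercube_edges(j):
    # Dimensional duplication: H_0 is a single module (index 0); H_j is obtained
    # by duplicating H_{j-1} (shifting the copy's indices by 2^(j-1)) and adding
    # the edges (i, 2^(j-1) + i) between homologous modules, i = 0..2^(j-1)-1.
    edges = set()                                                # H_0 has no edges
    for dim in range(1, j + 1):
        half = 2 ** (dim - 1)
        copy = {(u + half, v + half) for (u, v) in edges}        # duplicated copy
        matching = {(i, half + i) for i in range(half)}          # homologous pairs
        edges = edges | copy | matching
    return edges

# H_3 is the 3-dimensional cube: 8 modules, 12 edges, every module has degree 3
e = hypercube_edges(3)
assert len(e) == 12
assert all(bin(u ^ v).count("1") == 1 for (u, v) in e)           # neighbors differ in one bit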
Module j stores operand T[j] and is provided with arithmetic capabilities. Hypercubic networks support the recursive combination paradigm, executed by the call ALGORITHM(0, 2^m − 1) of the following procedure:
ALGORITHM(l, l + 2^h − 1)
  if (h > 1) then ALGORITHM(l, l + 2^(h−1) − 1), ALGORITHM(l + 2^(h−1), l + 2^h − 1)
  foreach l ≤ j ≤ l + 2^(h−1) − 1 pardo
    (T[j], T[j + 2^(h−1)]) ← OPER(T[j], T[j + 2^(h−1)])
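With the exponents restored, the procedure is the classic recursive-combination (ASCEND) scheme. A sequential Python rendering, with OPER passed in as a parameter, may help; it is an illustration of the paradigm, not of any particular hypercube algorithm.

def recursive_combination(T, l, h, oper):
    # Executes ALGORITHM(l, l + 2^h - 1) on the operand array T in place:
    # recurse on the two halves, then combine homologous pairs across the
    # highest dimension with OPER (which returns the new pair of values).
    if h > 1:
        recursive_combination(T, l, h - 1, oper)
        recursive_combination(T, l + 2 ** (h - 1), h - 1, oper)
    half = 2 ** (h - 1)
    for j in range(l, l + half):       # the 'foreach ... pardo' is done serially here
        T[j], T[j + half] = oper(T[j], T[j + half])

# Example specialization of OPER: the butterfly (a+b, a-b) applied over all three
# dimensions yields a Walsh-Hadamard-style transform.
T = [1, 0, 0, 0, 0, 0, 0, 0]
recursive_combination(T, 0, 3, lambda a, b: (a + b, a - b))
assert T == [1] * 8                    # transform of the unit impulse is all ones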
Specialization of the operation OPER yields algorithms for such diverse applications as merging, Fast Fourier Transform, convolution, permutations, integer multiplication, etc. The paradigm handles one hypercube dimension at a time (parameter h) in a fixed order, and is also referred to as "normal" or "ASCEND-DESCEND." Due to its nonstandard modules, the hypercube does not lend itself to scalable VLSI layouts, because for very large size n the outdegree of the modules may be unrealizable. However, there exist effective emulators of the hypercube which have standard VLSI layouts. Significantly, since the computations considered obey the bound AT² = Ω(n²), these emulators are found to be area-time optimal. Two of these emulators, the linear array and the mesh, have large diameters (respectively, O(n) and O(√n)); two more emulators, the shuffle-exchange and the cube-connected cycles, have O(log n) diameter. The diameter, of course, determines the computation time.

Composite Architectures
The previous basic architectures have different diameters and are therefore targeted to different computation times. Interestingly enough, these architectures can be combined in a variety of ways, both to solve different classes of problems and to achieve a full range of computation times. The typical composite structure is a network of modules that are themselves complex VLSI architectures. To illustrate this philosophy we consider an area-time optimal emulator of the hypercube. We begin with the emulation of normal algorithms by means of a linear array. The key activity to be emulated is
foreach (0 ≤ j ≤ 2^(h−1) − 1) pardo
  (T[j], T[j + 2^(h−1)]) ← OPER(T[j], T[j + 2^(h−1)])
The operands are stored in the array T[0..2^h − 1]. The above operation can be executed in the array if operands T[j] and T[j + 2^(h−1)] are brought to reside in adjacent modules. This is achieved by the following algorithm executed on the array:
for s = 0 to 2^(h−1) − 1
  foreach 0 ≤ i ≤ s pardo
    U ← 2^(h−1) − 1 − s
    (T[U + 2i], T[U + 2i + 1]) ← EXCHANGE(T[U + 2i], T[U + 2i + 1])
The overall time required by these rearrangements in processing the h dimensions is clearly O(n), so the array per se is not area-time optimal. However, suppose we partition the n operands into n/k sets of size k and assign each set to a distinct linear array to process the lowest log k dimensions as outlined above. In order to process the remaining log n − log k dimensions, these arrays are closed by an end-around edge, and are interconnected as a hypercube (where in each array different dimensions are handled by different modules). The resulting modules have outdegree 3 regardless of n. The resulting composite architecture, called the cube-connected cycles, is an emulator of the hypercube. It runs in time T = O(k) and can be laid out in area O(n²/k²), so that it is area-time optimal. Since k can be chosen between Ω(log n) and O(√n), this is an explicit example of area-time trade-off.
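Plugging the stated parameters back into the lower bound makes the optimality claim explicit (constants suppressed): with cycles of length k,

T = O(k),   A = O(n²/k²),   hence   AT² = O((n²/k²) · k²) = O(n²),

which matches the AT² = Ω(n²) lower bound for every admissible choice of k between Ω(log n) and O(√n).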
A Critical Comment
Research on VLSI computation was prompted by a technological revolution and developed in the 1980s under the shadow of a funding policy that promoted massive parallel computing. A sufficiently simple model became established, unleashing a volume of research that elucidated important relationships between computations and parallel architectures. As mentioned before, the Achilles' heel of the VLSI model is its second axiom, concerning timing. It was
soon realized that, with reference to the scalability of VLSI architectures, the finiteness of the speed of light and the inability to indefinitely reduce the size of the devices invalidate the timing axiom. It happens that small-diameter networks (notably trees and hypercubes) have maximum wire length growing nearly as the square root of the network area. In a technology where the physical limits are approached (limiting technology), only meshes are truly scalable architectures. However, it must be mentioned that, as this entry is written, the approaching physical limits have revived interest in parallel computation, although along directions different from massive parallel computing.
Bibliographic Notes and Further Reading
The origins of the theory of VLSI computation can be roughly traced back to the appearance in the late 1970s of a book by Mead and Conway [], a somewhat simplified presentation of the technology, which enormously stimulated the algorithmic community. At about the same time, C. Thompson [] presented the VLSI computation model and the area-time approach, and shortly thereafter J. D. Ullman [] framed the emerging research area in the language of the theoretical community. The literature on VLSI computation is enormous. A comprehensive reference is the authoritative text by F. T. Leighton [], which presents a scholarly compendium of the computational/architectural aspects of VLSI at a time (1992) when research interest was
beginning to wane. The interested reader is encouraged to consult this source. Suffice it to mention here: systolic networks were introduced by H.T. Kung and C.E. Leiserson [], the shuffle-exchange network was discussed in [], and the cube-connected-cycles in []. The notion of computational friction was proposed in []. The implications of the physical limitations are analyzed in [].
Bibliography
1. Mead C, Conway L () Introduction to VLSI systems. Addison-Wesley, Reading, MA
2. Thompson C () A complexity theory for VLSI. Ph.D. thesis, C.S. Department, Carnegie-Mellon University
3. Ullman JD () Computational aspects of VLSI. Computer Science Press, Rockville, MD
4. Leighton FT () Introduction to parallel algorithms and architectures: arrays-trees-hypercubes. Morgan Kaufmann Publishers, San Mateo, CA
5. Kung HT, Leiserson CE () Systolic arrays for VLSI. Sparse Matrix Proc, SIAM, pp –
6. Preparata FP, Vuillemin J () The cube-connected-cycles: a versatile network for parallel computation. Commun ACM ():–
7. Bilardi G, Preparata FP () Area-time lower-bound techniques with applications to sorting. Algorithmica ():–
8. Bilardi G, Preparata FP () Horizons of parallel computation. J Parallel Distrib Comput :–
VNG Vampir
W Warp and iWarp
James R. Reinders Intel Corporation, Hillsboro, OR, USA
Definition
The Warp project was a series of increasingly general-purpose programmable systolic array systems and related software. The project was created by Carnegie Mellon University (CMU) and developed in conjunction with industrial partners G.E., Honeywell, and Intel, with funding from the U.S. Defense Advanced Research Projects Agency (DARPA). There were three distinct machine designs, known as the WW-Warp, PC-Warp, and iWarp. Each successive design was made increasingly general-purpose by increasing memory capacity and relaxing the tight synchronization between processors. A two-cell prototype of WW-Warp was completed in June , the first WW-Warp in February , the first PC-Warp in April , and the first iWarp system in March . iWarp systems were produced and sold by Intel in and to universities, government agencies, and industrial research laboratories. Signal and image processing were the target applications for the Warp program, a natural fit for systolic arrays. The Warp program produced pioneering work on software pipelining support in compilers to utilize the LIW capabilities of the cells (processors). The name Warp referred to the warp factors used in the Star Trek TV series to denote very high velocities; it was chosen in anticipation of the new Star Trek episodes, "Star Trek: The Next Generation," which appeared in the fall of the year that PC-Warp became operational. The "i" in iWarp stood for "integrated," for the VLSI version, and was adopted before Intel was selected as the recipient of the contract to design and manufacture it. Intel, at the time, used "i" as a prefix on many product names.
Discussion

Introduction
Systolic arrays, first described in 1978 by H. T. Kung and Charles E. Leiserson, were originally conceived as a systematic approach to utilizing rapid advances in electronics, known as Very-Large-Scale-Integrated (VLSI) technology, and to coping with the inherent difficulties of designing VLSI systems. The systolic array concept proved successful at taking advantage of this evolution and also led to advancements in software models for utilizing hardware parallelism. At the same time, instruction-level parallelism was attracting considerable interest as a way of utilizing the rapidly expanding capabilities of VLSI. The use of explicitly scheduled instructions in the form of long-instruction-word (LIW) designs was of interest to H. T. Kung and his team. In , H. T. Kung, at Carnegie Mellon University, started the Warp program with a desire to investigate the many hardware, software, and application aspects of systolic array processing. Over the course of the following decade, the program yielded research results, publications, and advancements in general-purpose systolic hardware design, compiler design, and systolic software algorithms. At first, Warp was conceived as a limited prototype to be used by the vision group at CMU for low-level vision operations that were simple and highly regular (1-D or 2-D convolutions, edge-detection operators, etc.). As work progressed, opportunities for more complex uses of Warp machines were proposed. As the Warp design came together, the Strategic Computing Initiative (SCI) was looking for high-performance computing platforms to tackle specific problems, one of which was the development of an Autonomous Land Vehicle (ALV). It was decided to place the CMU prototype Warp machine (two cells) onboard the CMU ALV. The ALV program would need more than one Warp, and H. T. Kung decided to pursue
funding from the SCI program. Through the support of DARPA, and working with industrial partners, a series of three Warp machines was produced over the span of a decade. Since using VLSI was the inspiration for the systolic array concept, it was natural for H. T. Kung to guide the Warp project from an idea, to a wire-wrapped prototype, to a printed-circuit-card implementation, and finally to a VLSI implementation. In , while the first wire-wrapped prototype of a 10-cell machine was being delivered and the printed-circuit version was being designed, Intel was selected to be the industrial partner for the integrated-circuit implementation of Warp. Consistent with the desire to harness leaps in VLSI, and heavily influenced by a deep interest in signal and image processing, the use of LIW in the hardware and software was a key area of investigation as well. The Warp program focused specifically on developing the compiler scheduling technique of software pipelining as a key technique to utilize LIW. This contrasted with work by Joseph (Josh) Fisher at Yale University and Multiflow Computer Inc., which focused on developing the compiler scheduling technique of trace scheduling to utilize LIW. Neither effort resulted in hardware that benefited significantly from employing both compiler scheduling techniques together. Later, the Itanium project at HP and Intel included engineers formerly of both the Multiflow and Warp projects. The Itanium design included hardware designed to be able to utilize both trace scheduling and software pipelining. Complementing the development of the machines, software research and development was a big part of the Warp program. Advances in compiler techniques for software pipelining, as well as advances in systolic programming and programming models, were made as part of the Warp program. All Warp and iWarp machines were systolic arrays, so they consisted of a set of interconnected cells, each capable of performing at least simple computational operations and communicating only with cells in close proximity. Warp machines were simple linear configurations, while iWarp machines were organized as 2-D arrays with the edges wrapping around to form a torus. In both cases, programming could generally be described in terms of streaming data through the array for processing as it passed through each cell. In the simplest example, every cell could be programmed to read a data element, add one to it, and send the result along
to the next cell. Such a program on any of the Warp systems would be written roughly as sendf(receivef() + 1.0) and compile to a single instruction that executed in a single cycle, thereby highlighting the tight coupling of computation and communication. A stream of data sent through a machine thus programmed would emerge with every element of the stream incremented by N, where N was the number of cells in the machine. Many systolic algorithms have been developed to take advantage of the parallelism of such a system. The Warp program focused on practical realizations of systolic array concepts in order to facilitate advanced software research on real-world applications of the machine. This started when the first hand-built two-cell Warp machine, which was designed to solve vision problems, was quickly pressed into service in a real-world challenge in the CMU ALV. This early focus on vision and signal processing, coupled with a desire to solve real problems for researchers outside computer architecture research, led to four defining characteristics of the Warp program.
1. Systolic communications: efficient fine-grained communication. Internode communication is the most fundamental and key issue in the design of a parallel computer, because the communication system directly impacts the mapping of an application onto the computer. The exploration of "balanced algorithms" at CMU, led by H. T. Kung starting in the mid-1980s, looked to find algorithms that scale without bounds. Communication with little to no overhead is needed to efficiently deal with the fine-grained parallelism that such algorithms tend to utilize as machine sizes grow. Systolic arrays accomplish this goal by providing a method to directly couple computation to communication. This was true in the Warp and iWarp machines. This principle is an important consideration in "balanced" system design to allow high-performance systems to have applicability on a broader set of algorithms.
2. Systolic design advantages coupled with general programmability. The systolic array approach initially led to very rigidly synchronous hardware designs that were elegant yet proved overly constraining for many programming solutions. Each successive generation of
the Warp program offered increased programmability by relaxing prior rigid synchronization design constraints and increasing cell autonomy through changes that included notable increases in cell memory capacity.
3. Long-instruction-word (LIW) designs with optimizing compilers. The cells of all Warp machines utilized LIW (or VLIW, Very Long Instruction Word) designs, and the program invested in compiler technology to utilize the LIW capabilities, including extensive research and publishing focused on developing software pipelining techniques.
4. Early prototypes for software developers. While inspired by VLSI opportunities, the Warp project focused on developing early prototype systems using a low-risk approach of combining MSI and LSI parts in a relatively conservative design instead of an immediate leap to VLSI. This led to functional hardware that provided a reliable platform for rich software investigations early in the program. Learnings from the prototypes were then implemented in a VLSI version that in turn could draw on the benefits of several years of software development on the early prototypes.
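The one-instruction cell program mentioned earlier — roughly sendf(receivef() + 1.0) — can be mimicked by a toy simulation. The Python below is purely illustrative (the cell structure and function names are assumptions, not Warp's W or C dialect); it chains N identical cells so that a stream emerges incremented by N.

# Illustrative simulation of a linear systolic array in which every cell reads a
# value from its left neighbor, adds 1.0, and forwards it -- the analogue of the
# one-instruction cell program sendf(receivef() + 1.0) described above.
def run_linear_array(stream, num_cells):
    def cell(inputs):                      # one cell: receive, compute, send
        for x in inputs:
            yield x + 1.0
    pipeline = stream
    for _ in range(num_cells):
        pipeline = cell(pipeline)          # chain the cells together
    return list(pipeline)

# A 10-cell array increments every element of the stream by 10.
assert run_linear_array(iter([0.0, 1.0, 2.0]), 10) == [10.0, 11.0, 12.0]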
The Warp Machines
Warp and iWarp machines were a series of increasingly general-purpose systolic array processors, created by Carnegie Mellon University (CMU) in conjunction with industrial partners G.E. (Syracuse, New York), Honeywell (Minneapolis, Minnesota), and Intel (Hillsboro, Oregon), and funded by the U.S. Defense Advanced Research Projects Agency (DARPA). There were three distinct machine designs, known as the WW-Warp (Wire-Wrap Warp), PC-Warp (Printed-Circuit Warp), and iWarp (integrated-circuit Warp, conveniently also a play on the "i" for Intel). Each successive generation became increasingly general-purpose by increasing memory capacity and loosening the coupling between processors. Only the original WW-Warp forced a truly lockstep sequencing of stages, which severely restricted its programmability but was in a sense the purest "systolic array" design. Warp machines were attached to Sun workstations (UNIX based). Software development work for all models of Warp machines was done on Sun workstations (see the system and cell architecture figures below).
The WW-Warp and PC-Warp machines were systolic array computers with a linear array generally of ten or more cells, each of which was a programmable processor capable of performing ten million single-precision floating-point operations per second (10 MFLOPS). A ten-cell machine had a peak performance of 100 MFLOPS. The iWarp machines doubled this per-cell performance, delivering 20 MFLOPS of single-precision floating point and supporting double-precision floating point at half that rate. A two-cell prototype of WW-Warp was completed at Carnegie Mellon in June . Two essentially identical WW-Warp machines, each with ten identical boards (nodes) plus an address-generator I/O board, were produced in . The system from G.E. was delivered in February ; the system from Honeywell was delivered in June . G.E. became the sole industrial partner for the PC-Warp machine construction. Significant redesign work was done to generalize the systolic array before production of the PC-Warp. The first PC-Warp was delivered by G.E. in April . About twenty production machines of the PC-Warp were produced and sold by G.E. during –. The PC-Warp machine sold for $, without the Sun workstation. Each cell (node board) in a PC-Warp consumed about W of power. In , Intel was selected, as a result of competitive bidding, to be the industrial partner for the integrated-circuit implementation of Warp. The first iWarp system, a twelve-node system, became operational in March . Additional steppings of the chip resulted in the final C-step part, which improved the clock rate beyond the MHz project target to MHz and worked out most errata. About machines, with more than , processors combined, were produced and sold by Intel in and to universities, government agencies, and industrial research laboratories. The largest system assembled was a -processor machine, and one -processor machine was made for CMU. Typical production systems were -processor systems. The high-speed static memory and the high-performance, low-latency communication system proved to be a well-suited target for research efforts and many "proof-of-concept" applications. The iWarp machines were based on a full-custom VLSI component integrating a ,-transistor LIW microprocessor, a network interface, and a switching
Warp and iWarp. Fig. PC-Warp system architecture
Warp and iWarp. Fig. PC-Warp cell architecture
Warp and iWarp. Fig. iWarp system architecture
node, into one single chip of . cm × . cm silicon. The processor dissipated up to W and was packaged in a ceramic pin-grid array with pins. Intel marketed the iWarp with the tagline "Building Blocks for GigaFLOPs." The iWarp processor was designed specifically for the Warp project, utilized LIW-format instructions, and featured communications tightly integrated with the computational processor. The standard iWarp machine configuration arranged iWarp nodes in an m × n torus. All iWarp machines included the "back edges" and, therefore, were tori (see the iWarp figures). Programmability increased in each successive generation within the Warp program. In the WW-Warp design, all addresses for data accesses were created on the interface cell, which preceded the first computational cell, and passed along as data on the communication channels. Each WW-Warp cell was technically an independent SIMD machine with local instructions, but the whole design had an underlying assumption that cells would be identically programmed in a very rigid Single-Instruction-Multiple-Data (SIMD) fashion. The lack of ability to locally create data-access addresses,
Warp and iWarp. Fig. iWarp cell architecture
in particular, helped ensure such rigid programming. The SIMD program proceeded as a wave through the machine, in which cells were skewed in their execution, as opposed to operating in a global lockstep, as data and addresses streamed through the machine. The tight coupling of the cells meant that if a cell sent data into a full input queue, the data was lost. This tight coupling made it impossible for the compiler to support a pipeline programming mode in practice.
The very rigid clocking of the WW-Warp was relaxed in PC-Warp by increasing intracell buffering and introducing intracell flow control, as well as by the addition of address-generation units on individual cells for local address generation. With added hardware flow control and a local controller, cells were no longer tightly coupled and could operate independently. Direct coupling of neighboring cells on PC-Warp still had no allowance for data to bypass the computational unit of a cell: only a programmed operation could move data through a PC-Warp cell, because of the direct coupling of communication to the functional units in the cell. iWarp introduced a communication agent on the cell to allow data to pass through a cell without the need to interact with the computational units. The design of iWarp was oriented toward preserving the benefits of systolic communication while generalizing the framework to allow for more application diversity. iWarp maintained the systolic feature of connections for low-overhead communication, while providing for a broad range of applications by making those connections programmable. In iWarp, physical communication channels were augmented with virtual communication channels that allowed logical communication channels to be established over the same physical hardware, adding great flexibility to the programming models. The iWarp processor was a 32-bit Reduced-Instruction-Set-Computing (RISC) design with a 96-bit LIW instruction that, unlike the Intel i860 processor, had no performance penalty for transitioning between long and short instructions in an instruction stream. The final C-stepping of the processor was clocked at MHz. The processor had MIPS integer performance and supported full IEEE floating-point compliance for 32- and 64-bit operations, with performance of 20 MFLOPS and 10 MFLOPS, respectively. Processors were grouped four to a double-height Eurocard PC board ( cm × cm). Sixteen of these quad-processor boards fit in a 19-inch rack along with a clock board, and four racks could be placed in a single modified iPSC cabinet (. cm × . cm × cm deep). Up to four cabinets could be connected together. WW-Warp and PC-Warp utilized their long instructions to route communication inputs directly to arithmetic units, and outputs of arithmetic units directly to
the communication outputs for a node. In iWarp, part of the register file was mapped to communication, so that a communication input from a neighbor was represented as a register read, and a register write sent data to a neighbor. In all these machines, the communications were matched precisely to the computational capabilities of the machine. The interconnect capabilities of an iWarp processor were provided by four MB/s full-duplex links. The hardware supported a fixed number of logical (virtual) channels that were freely configurable across all four directions and used two pools. Communications were clocked at MHz. Each successive generation of the Warp program also increased the available memory on each cell. The original WW-Warp design had a very limited K program memory but also had little use for large memory modules, because the rigidly synchronous design did not encourage use of local storage or large programs. The relaxation of the rigid connections led to algorithms that benefited from more local storage, so the PC-Warp design had several times the program store and more memory on each cell. Further relaxation in turn demanded more memory, and the iWarp design initially offered high performance with very limited static memory modules. The creation of a daughter card supporting Mb of dynamic memory greatly increased available memory, for both the application and the resident operating code, and proved very popular. Memory access latency was ns static and ns dynamic. Memory access bandwidth was MB/s. The iWarp processor included eight programmable direct-memory-access (DMA) agents for spooling data to communication channels. Systems built from iWarp processors could range from four iWarp processors to , processors in an n × m torus. The typical system was cells in an × torus for a total of . GFlop/s peak performance. Measured sustained performance of cells for dense matrix multiply was MFlop/s, sparse matrix multiply was MFlop/s, and FFT was MFlop/s (see Photos).
Compilers
Multiple compilers associated with the Warp program explored exploitation of both cell-level programming and array-level programming.
Warp and iWarp. Photo Sixty-four iWarp cells mounted in a -inch rack (. GFLOPs)
Warp and iWarp. Photo The iWarp processor, die exposed
Warp and iWarp. Photo The iWarp quad-cell board (Four processor, memory, communication)
The cells of all Warp machines utilized LIW (or VLIW, very long instruction word) designs. The WW-Warp and PC-Warp used VLIW to directly send commands to every unit on the board in every cycle, using very wide VLIW instructions. The iWarp processor utilized a 96-bit LIW instruction with several possible encoding formats to control integer and floating-point operations in parallel. For instruction compaction, iWarp allowed
32-bit RISC instructions to be freely mixed with the 96-bit LIW instructions. The Warp program invested in compiler technology to utilize the LIW capabilities, including extensive research and publishing on software pipelining techniques. A research compiler, for a language known as "W," targeted all three machines and was the only compiler for the WW-Warp and PC-Warp, while it also served as an early compiler during development of the iWarp. The production compiler for iWarp was a C and Fortran compiler based on the AT&T pcc compiler for UNIX, with extensive modifications and extensions for systolic programming and exploitation of LIW via software pipelining. The W compiler was written in Common Lisp and ran on Sun workstations. The W language was a restrictive procedural language, similar in syntax to Pascal or C, designed to specify a kernel of computation to be spread across the computational array, as well as to specify the data input and output sufficiently for the compiler to generate the code to feed the input stream and to recover the output stream. The initial focus was on producing code for both the interface cells and the computational cells of the WW-Warp machine. The PC-Warp allowed more general programming by providing for local address generation, which permitted more general programming of each cell, so the W language and compiler evolved to support
Warp and iWarp. Photo The iWarp system cabinet containing four racks ( cells)
more general programming. Once it was operational at creating the proper control code for the machine, the W compiler work took on software pipelining to fully utilize the combination of tightly coupled communication capabilities and the LIW instruction design. Monica Lam's Ph.D. work, "A Systolic Array Optimizing Compiler," covers the results in detail, including methods to software-pipeline loops to any depth of nesting. The W compiler for PC-Warp was taken on by G.E. in ,
and CMU focused work on a port of the W compiler to iWarp. It was initially believed that programs for the W compiler would typically be "half a page" in length; this was consistent with systolic arrays up to that point. The generalization of PC-Warp saw programs many times the originally anticipated norm, and automatic program-generation systems stretched this even further. Developers of applications for the PC-Warp often requested a more robust compiler for a full and standard language. The production compiler for iWarp was a C and Fortran compiler based on the AT&T pcc compiler for UNIX. HCR Corporation of Toronto, under contract to Intel, provided the initial port of pcc for iWarp along with HCR's own global-optimizer technology for pcc. The compilers went on to be extensively modified and extended by Intel and CMU, and were mostly used by developers directly. However, they were also targeted by special frontends, including cfront (for C++), the oxygen compiler (for a variant of Fortran), and the Fx compiler (for a variant of HPF). Several research compilers also produced code for iWarp, including the retargeted W compiler, a code generator for cmcc (a C compiler for research into optimizations at CMU), and a backend for the Stanford SUIF compiler (under the guidance of Prof. Monica Lam). With the advent of High Performance Fortran (HPF) research, CMU's Fx project was born to explore HPF-style investigations on the iWarp platform. Significant energy at CMU in compiler research went into the Fx compiler. Domain-specific or "application" compilers were an area of interest at CMU. Parallel program generators for iWarp were directed toward specific application domains and included the Apply project (for image processing) and the AL project (for scientific computing).
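Software pipelining, the scheduling technique these compilers pioneered for LIW targets, can be illustrated schematically. The Python sketch below shows only the shape of the transformation — prologue, steady-state kernel overlapping stages of different iterations, epilogue — and is not the output of the W or pcc-based compilers.

# Schematic illustration of software pipelining for a loop whose body has three
# stages: load(i); compute(i); store(i). The pipelined version overlaps stage k
# of iteration i with stage k-1 of iteration i+1, as an LIW machine would.
def pipelined_loop(n, load, compute, store):
    if n == 0:
        return
    a = load(0)                               # prologue: fill the pipeline
    if n == 1:
        store(compute(a)); return
    b, a = compute(a), load(1)
    for i in range(2, n):                     # kernel: one "wide instruction" per line
        store(b); b, a = compute(a), load(i)  # store(i-2) | compute(i-1) | load(i)
    store(b); store(compute(a))               # epilogue: drain the pipeline

# Equivalent to: for i in range(n): store(compute(load(i)))
out = []
pipelined_loop(5, load=lambda i: i, compute=lambda x: x * x, store=out.append)
assert out == [0, 1, 4, 9, 16]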
Applications
One Warp system was installed in the NavLab, a GM van converted by researchers in the Robotics Institute at Carnegie Mellon University. The NavLab was a laboratory for research in autonomous road following, and a Warp machine (as well as a number of workstations) was installed in the van. The Warp machine turned out to be a good engine for neural net learning. It also performed a variety of other vision and planning
tasks. A Warp machine remained in the NavLab until , when the van caught fire after the air conditioning system leaked liquid onto the computers. Several NSF grand challenge problems were investigated. An iWarp machine had successful sea trials on a U.S. Navy submarine.
Warp's Influence on the Industry
The W compiler pioneered software pipelining for LIW. The Warp compiler projects provided extensive investigations into software pipelining, including nested software pipelining, and published solutions. These have become an important and popular compiler optimization for most modern computers. After iWarp, Intel continued work on interconnect chips derived from the iWarp design. The world's first TeraFLOP computer, ASCI Red, built for Sandia National Labs by Intel, utilized a variant of the logical channels developed first for iWarp. iWarp was the first Intel project that utilized workstations for simulations and mask generation, and worked to move Intel's design tools into this environment for the first time. Future generations of Intel chip designs moved away from mainframes following the lead of the iWarp project.
Related Entries
Systolic Arrays
VLIW Processors
Bibliography
1. Annaratone M, Arnould E, Gross T, Kung HT, Lam M, Menzilcioglu O, Webb J () The Warp computer: architecture, implementation and performance. IEEE Trans Comput C-():–
2. Gross T, Lam M () Retrospective: a retrospective on the Warp machines. In: ISCA ': years of the international symposia on computer architecture (selected papers). IEEE, Los Alamitos, pp –
3. Lam MS () A systolic array optimizing compiler. Kluwer Academic, Dordrecht
4. Borkar S, Cohn R, Cox G, Gleason S, Gross T () iWarp: an integrated solution to high-speed parallel computing. In: Proceedings of the ACM/IEEE conference on supercomputing, Orlando, Florida, pp –
5. Adl-Tabatabai A-R, Gross T, Lueh G-Y, Reinders J () Modeling instruction-level parallelism for software pipelining. In: Proceedings of the IFIP WG working conference on architectures and compilation techniques for fine and medium grain parallelism, Orlando, pp –
6. Borkar S, Cohn R, Cox G, Gross T, Kung HT, Lam M, Levine M, Moore B, Moore W, Peterson C, Susman J, Sutton J, Urbanski J, Webb J () Supporting systolic and memory communication in iWarp. In: Proceedings of the annual international symposium on computer architecture, Seattle, Washington, pp –
7. Intel Corp () iWarp microprocessor (Part Number ). Technical information, Order Number , Hillsboro, OR
8. Gross T, O'Hallaron DR () iWarp: anatomy of a parallel computing system. MIT Press, Cambridge, MA
Wavefront Arrays Systolic Arrays
Bibliographic Notes and Further Reading
The first papers mentioning Warp that appeared in focused on machine vision and not the machine. Papers specifically about the Warp machines appeared in as the first machines were delivered, with a fairly complete overview in [1]. A retrospective on Warp provides insights into the key lessons of the project [2]. Monica Lam's May 1987 Ph.D. dissertation focused largely on the pioneering work on software pipelining and was later published as a book [3]. A detailed book on iWarp and the experiences of the project, from the perspective of two CMU researchers, was published in [8].
Weak Scaling Gustafson’s Law
W Whole Program Analysis FORGE
Work-Depth Model Models of Computation, Theoretical
Workflow Scheduling Task Graph Scheduling
Z Z-Level Programming Language ZPL
ZPL
Bradford Chamberlain
Cray Inc., Seattle, WA, USA
Synonyms
Z-level programming language
Definition
ZPL is a parallel programming language that was developed at the University of Washington between and . ZPL was a contemporary of High Performance Fortran (HPF), targeting a similar class of applications by supporting data parallel computations via operations on global-view arrays distributed between a set of distinct processor memories. ZPL distinguished itself from HPF by providing a less ambiguous execution model, support for first-class index sets, and a syntactic performance model that supported the ability to trivially identify and reason about communication.
Discussion

Foundations
ZPL was initially developed by Lawrence Snyder and Calvin Lin at the University of Washington, who strove to design a parallel programming language from first principles. To this end, the ZPL effort began by identifying abstract machine and programming models that would serve as the parallel equivalents of the von Neumann machine model and the imperative-procedural programming model used in traditional sequential computing. For a machine model, ZPL built upon Snyder's CTA, or Candidate Type Architecture, defined as "a finite set
of sequential computers connected in a fixed, bounded degree graph, with a global controller.” []. In contrast to theoretical models like the PRAM, the CTA was designed to serve as a more practical machine model for realistic parallel architectures. At the same time, the CTA was intentionally vague about the parameterization of the target architecture’s capabilities due to the high degree of variation from one parallel system to another, combined with the fact that the complexity in such models often failed to add value to an end-user. The main point of the CTA was that locality matters: remote accesses are more expensive than local accesses and therefore need to be avoided whenever possible. In terms of its programming model, ZPL built on Lin et al.’s Phase Abstractions programming model [] which defined three levels of specification for parallel programming, named the X, Y, and Z levels. The phase abstractions define the X level as representing MultipleInstruction Multiple-Data (MIMD) programming in which the programmer is writing code from the point of view of a single sequential processor. The Y level deals with logical topology and specifies the virtual communication channels across which the processors communicate and interact with one another. Finally the Z level describes the high-level data parallel computations in which all of the processors execute similar instructions to cooperatively implement a single logical operation. It is this final level that gave ZPL its full – though in practice, rarely used – name: the Z-level Programming Language. ZPL was originally designed to be the data parallel component of a broader language named Orca C that spanned the X, Y, and Z levels defined by the phase abstractions. However, as ZPL matured, rather than making it part of a larger sublanguage, it was extended to support increasingly rich modes of parallel computation, ultimately including “X level” per-processor computation in its final years []. Theoretical foundations aside, from the programmer’s view, traditional ZPL computations provided a conceptual single-threaded model of computation in
which each statement could be thought of as executing to completion before the next began. In reality, the data parallel statements resulted in concurrent execution across all processors in a loosely coupled manner, but the typical programmer could be completely unaware of such details apart from the performance benefits that resulted. In terms of its implementation, the ZPL compiler generated a single-threaded Single-Program Multiple-Data (SPMD) executable that implemented the computation and communication required to realize the user's data parallel program. The ZPL compiler took a source-to-source compilation approach for the purposes of portability, translating the user's source down to C code that would then be compiled using a platform-specific C compiler. Communication was implemented in terms of a novel API called the Ironman interface [] that separated the semantics of what data needed to be communicated, and when, from architecture-specific capabilities like shared memory, two-sided message passing, one-sided puts and gets, or even distinct communication threads potentially running on coprocessors. The result was a highly portable compiler that could often match the performance of a hand-coded MPI program and even outperform it on architectures that preferred a different style of communication than two-sided message passing [, , ].
Primary Concepts: Regions, Arrays, and Directions
Perhaps the key feature of ZPL was its choice to support the notion of a distributed index set using a first-class language concept known as the region. Index sets are important for data parallel programming due to their role in defining data structures and iteration spaces. The distribution of these index sets between distinct processors plays a crucial role in dividing parallel work between compute resources while also establishing clear realms of locality. Most traditional languages have failed to provide abstractions for index sets, requiring the programmer to express them using collections of scalar values for each processor and array dimension. In contrast, ZPL's regions support distributed index sets as a high-level language concept. From ZPL's early stages, regions could be named and initialized. Over time they were extended to support assignment and to serve as function arguments while also being made more
compatible with other types – for instance, by supporting the ability to create arrays of regions or region fields within data structures. As a sample use of regions, the following code declares an m × n region and then uses it to declare an array of floating-point values:

region R = [1..m, 1..n];
var A : [R] float;
In addition to declaring arrays, regions also serve the purpose of scoping statements or groups of statements to provide the indices over which array expressions should be evaluated. For example, the following assignments to A are prefixed by region scopes that control which elements of arrays A and Index2 are referenced. The first region scope causes the first assignment to zero out the whole array A, since A was declared over region R; the next restricts the second assignment to A's jth row, causing those elements to take on the value of their column index, represented by the built-in virtual array Index2:

[R]       A := 0;
[j, 1..n] A := Index2;
ZPL
nearest neighbors. This continues until the largest difference between elements across any two iterations is less than a convergence value, epsilon:

program jacobi;

config var n       : integer = 1000;     -- problem size
           epsilon : float   = 0.0001;   -- convergence factor

region R    = [1..n, 1..n];              -- index set for problem
       BigR = [0..n+1, 0..n+1];          -- same, but with borders

direction north = [-1,  0];              -- cardinal directions
          east  = [ 0,  1];
          south = [ 1,  0];
          west  = [ 0, -1];

procedure jacobi();
var A, Temp : [BigR] float;              -- compute arrays
    delta   : float;
[R] begin                                -- default region scope
      A := 0.0;                          -- initialization of array interior
      [north of R] A := 0.0;             -- and boundary conditions
      [east  of R] A := 0.0;
      [west  of R] A := 0.0;
      [south of R] A := 1.0;

      repeat                             -- main loop
        Temp  := (A@north + A@east + A@south + A@west) / 4.0;
        delta := max<< abs(A - Temp);    -- compute convergence
        A     := Temp;
      until delta < epsilon;

      writeln(A);
    end;
Array Operators and the WYSIWYG Performance Model
In order to support more interesting computations than purely element-wise operations, ZPL supports a number of array operators that alter the way an array reference's enclosing region scope provides its indices. For example, the @-operator, seen in the main loop of the Jacobi example above, translates the region's index set by the given offset before applying it to the array. Thus, A@north refers to the index set [0..n-1, 1..n] rather than the default [1..n, 1..n] specified by the enclosing region scope [R]. Similarly, the reduction operator (<<), used to compute delta, collapses the values specified by that same region scope [R] by generating the maximum value computed by the promotion of the scalar abs() function and subtraction operator across array elements. Other primary ZPL operators include the flood operator (>>), which replicates subarrays of values; the scan operator (||), which computes parallel prefix operations; and the remap operator (#), which supports bulk random access to an array's elements using whole-array indexing expressions. In addition, ZPL supports wrap and reflect operators that support common boundary
conditions on regular grids and a variation on the @-operator to support loop-carried array references []. ZPL’s semantics require that any two array references within a single statement must be distributed identically, ensuring that like-indexed elements are colocated within a single processor’s memory (the remap operator is the one exception to this rule for reasons that we will return to in a few paragraphs). The impact of this rule is that all communication within a ZPL program can be identified and classified by the presence of its array operators. In particular, if a statement contains no array operators, since all of its array expressions are described by the same region scope and aligned, it will be implemented without communication, resulting in an embarrassingly parallel execution (or perhaps “pleasingly parallel”). In contrast, the application of an array operator implies that the enclosing region scope’s index set is transformed in some way for that array reference, requiring communication to bring the array’s elements into alignment with the enclosing region scope’s index set. For example, since the @-operator implies a translation of the region scope’s indices, a translation of array elements in the opposite direction is required to align the indices with those described by the enclosing region scope. In addition to signaling the need for communication, each array operator also implies a particular style of communication for its array’s elements. For example, given ZPL’s default block-distributed regions, the @-operator implies nearest neighbor point-to-point communication, while the flood and reduction operators imply broadcasts or reductions of array values across the dimensions of the virtual processor grid for which the operation is being computed. Because the remap operator can express an arbitrary shuffling of array values, it potentially requires all-to-all communication to bring elements into alignment with the enclosing region scope []. This is the reason that an array expression indexed by the remap operator is exempt from the requirement that it be aligned with other arrays in the statement: since remaps can result in arbitrary communication patterns, it hardly matters whether the array argument is aligned with the statement’s region scope or not. To this end, remap serves as the operator for bringing distinctly distributed arrays into alignment or for performing operations across arrays of differing rank.
The impact of these semantics was profound, since it permitted ZPL programmers to trivially and visually identify communication within their programs, to classify that communication, and to approximate its cost for a given architecture. As a result, the development team called this property ZPL's What-You-See-Is-What-You-Get, or WYSIWYG, performance model []. Equally important, this syntactic identification of communication also simplified the job of implementing ZPL by making the compiler analysis required to identify and insert communication trivial. For this reason, the WYSIWYG performance model probably served as the biggest differentiator between ZPL and HPF, whose language definition made communication impossible for the programmer to reason about in a portable manner, either syntactically or semantically. Moreover, HPF compilers tended to require complex index analyses to identify common patterns like the vectorizable nearest-neighbor communications that were clearly identified in ZPL via the presence of an @-operator.
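As a rough illustration of the semantics being described — region scoping plus the index translation performed by the @-operator — the following Python sketch emulates one Jacobi-style statement on ordinary arrays. It is an invented, purely sequential model (the helper names assign and at are assumptions), not ZPL's implementation.

# Illustrative emulation of ZPL's region scoping and @-operator on plain Python
# lists; the helpers below are invented for this sketch.
def assign(region, dest, expr):
    # Evaluate expr at every index in the region against the old values,
    # then write dest only at those indices (mirroring [R] dest := expr).
    updates = {(i, j): expr(i, j) for (i, j) in region}
    for (i, j), v in updates.items():
        dest[i][j] = v

def at(arr, direction):
    # arr@direction: reading index (i, j) actually reads (i + di, j + dj),
    # which is why the compiler must shift (communicate) arr's elements.
    di, dj = direction
    return lambda i, j: arr[i + di][j + dj]

north, east, south, west = (-1, 0), (0, 1), (1, 0), (0, -1)
n = 4
R = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
A = [[0.0] * (n + 2) for _ in range(n + 2)]
Temp = [[0.0] * (n + 2) for _ in range(n + 2)]
for j in range(1, n + 1):
    A[n + 1][j] = 1.0                     # analogue of [south of R] A := 1.0
# Temp := (A@north + A@east + A@south + A@west) / 4.0, scoped to R
assign(R, Temp, lambda i, j: (at(A, north)(i, j) + at(A, east)(i, j) +
                              at(A, south)(i, j) + at(A, west)(i, j)) / 4.0)
assert Temp[n][1] == 0.25                 # interior cell adjacent to the hot boundary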
Advanced ZPL
By the late 1990s, the original scope of ZPL had largely been implemented, yet the language remained too restricted to take on more interesting parallel computations, such as those represented by the NAS Parallel Benchmarks (NPB) or various scientific applications being undertaken in other departments at the University of Washington. To this end, the ZPL team began evolving the language, referring to the original feature set as classic ZPL and the new extensions as advanced ZPL. Over time, the classic-versus-advanced distinction faded, and the name ZPL was ultimately used to describe the full set of features developed within the project. Several of the features attributed to advanced ZPL stemmed from the evolution of the region from a simple named constant to a first-class language concept, as mentioned previously. In particular, the ability to modify a region’s index set and to create arrays of regions supported the description of sparse and hierarchical index sets and arrays [, , ]. Other changes added the concept of processor-specific values and array dimensions to the language, permitting the user to break out of ZPL’s traditional single-threaded data-parallel programming model and write computations on a processor-by-processor basis. Another extension promoted the representation of the virtual processor
grid to a first-class concept, permitting arrays to be redistributed by reshaping the set of processors to which they were mapped. This led to a proposed taxonomy of distributions to extend ZPL’s traditional support for block-decomposed regions [].
Status, Evaluation, and Influence
Several versions of ZPL were released to the public in binary form during the late 1990s. Thereafter, ZPL was released as open source under the MIT license, with the final release occurring in April of . Shortly after that, Steve Deitz – the last member of the ZPL team – graduated, at which point the project went on permanent hiatus. At the time of this writing, ZPL remains available for download but is not actively being developed or supported.
ZPL had a number of limitations that prevented it from ever being broadly adopted. Perhaps chief among these was its support for only a single level of data parallelism (and, toward the end, a form of SPMD task parallelism). While this represents a very important common case in parallel computing, real-world applications typically compose several different types of parallelism in ways richer than ZPL’s restricted data parallelism could express. Another drawback was ZPL’s use of distinct array types and operations for data-parallel computations versus local, serial array computations. This distinction was crucial to its WYSIWYG performance model, but it ultimately resulted in a fair amount of frustration and confusion among users, particularly when the same logical computation had to be written multiple times using distinct concepts to support both distributed and non-distributed arrays. The resulting lack of code reuse was well motivated, but frustrating in practice. Finally, ZPL’s decision to focus on data parallelism to the exclusion of modern language concepts like object-oriented programming, or even modern syntax, presented challenges to adoption by a broader or more mainstream programming community.
None of this is to suggest that ZPL was an unsuccessful language. On the contrary, in the (admittedly biased) opinion of this author, it represents the model of a successful academic project in that it picked its target – namely, data parallelism supporting the ability to reason about communication and locality – and focused on it to the exclusion of non-research distractions that were clearly orthogonal and could be dealt
with later, such as the aforementioned lack of object-oriented capabilities or modern syntactic constructs. In doing so, ZPL made a number of significant contributions, including the introduction of the region concept, the WYSIWYG performance model, and the identification of useful abstractions for representing processor sets and distributed arrays and index sets. Perhaps the biggest indicator of ZPL’s success was its influence on the next generation of parallel languages, represented most notably by the High Productivity Computing Systems (HPCS) phase III languages – Cray’s Chapel and IBM’s X10. Both of these languages adopted the notion of a distributed, first-class index set (Chapel’s domain and X10’s region) while adapting it to address some of ZPL’s limitations and generalizing it to support a broader class of parallel computations.
Related Entries
Array Languages
Chapel (Cray Inc. HPCS Language)
HPF (High Performance Fortran)
PRAM (Parallel Random Access Machines)
Bibliographic Notes and Further Reading
For an introduction to ZPL and the ZPL project that is both more technical and more colorful than this one, the reader is referred to Lawrence Snyder’s The Design and Development of ZPL, published at the History of Programming Languages Conference []. Another reasonable technical overview is a workshop paper published toward the end of the ZPL project, which served as something of a “greatest hits” summary of the language at a time when the High Performance Computing community was starting to focus increasingly on end-user productivity []. For a reference to the language, the most definitive guide is Snyder’s A Programmer’s Guide to ZPL []. However, it comes with the important caveat that the language continued to evolve after the book’s publication; as a result, many of the advanced ZPL directions were never captured in a comprehensive book or specification. The dissertations by Chamberlain and Deitz fill in some of the gaps, each beginning with an overview of ZPL’s capabilities in its second chapter [, ]. These descriptions were intentionally incomplete, yet they provide a good sense of where
the language ended up. These dissertations also describe a number of important design and implementation details of ZPL. Chamberlain’s thesis describes the WYSIWYG communication model, the Ironman interface, the runtime descriptors used for regions and arrays, and the extensions supporting sparse and hierarchical regions. Deitz’s dissertation covers support for processor-local data and computation, the first-class representation of processor grids and the resulting redistribution machinery, and the proposed distribution taxonomy mentioned previously. The most important historical foundations of the language come in the form of descriptions of the CTA and phase abstractions [, ]. While the dissertations cited above, and those of other members of the ZPL project, are often the best reference for the final work covered in this entry, stand-alone conference papers that may be more approachable due to their shorter length exist for a number of the topics described here, including the Ironman interface [], the WYSIWYG performance model [], wavefront computations [], hierarchical regions [], and sparse regions []. Though not the focus of this entry, compiler optimizations were also a crucial part of ZPL’s ability to achieve good performance, and a number of papers detail optimizations to improve the performance of communication [, ] and computation [].
Bibliography
1. Chamberlain BL () The design and implementation of a region-based parallel language. Ph.D. thesis, University of Washington, Nov . http://www.cs.washington.edu/homes/brad/cv/pubs/degree/thesis.html
2. Chamberlain BL, Choi S-E, Deitz SJ, Snyder L () The high-level parallel language ZPL improves productivity and performance. In: Proceedings of the IEEE international workshop on productivity and performance in high-end computing, Madrid
3. Chamberlain BL, Choi S-E, Lewis EC, Lin C, Snyder L, Weathersby WD () ZPL’s WYSIWYG performance model. In: Proceedings of the IEEE workshop on high-level parallel programming models and supportive environments, Orlando
4. Chamberlain BL, Choi S-E, Snyder L () A compiler abstraction for machine independent parallel communication generation. In: Proceedings of the workshop on languages and compilers for parallel computing, Minneapolis
5. Chamberlain BL, Deitz SJ, Snyder L () A comparative study of the NAS MG benchmark across parallel languages and architectures. In: Proceedings of the ACM international conference on supercomputing, Santa Fe
6. Chamberlain BL, Lewis EC, Snyder L () Array language support for wavefront and pipelined computations. In: Proceedings of the workshop on languages and compilers for parallel computing, La Jolla
7. Chamberlain BL, Snyder L () Array language support for parallel sparse computation. In: Proceedings of the ACM international conference on supercomputing, Sorrento
8. Choi S-E, Snyder L () Quantifying the effects of communication optimizations. In: Proceedings of the IEEE international conference on parallel processing, Bloomington
9. Deitz SJ () High-level programming language abstractions for advanced and dynamic parallel computations. Ph.D. thesis, University of Washington, Feb . http://www.cs.washington.edu/projects/zpl/papers/data/DeitzThesis.pdf
10. Deitz SJ, Chamberlain BL, Choi S-E, Snyder L () The design and implementation of a parallel array operator for the arbitrary remapping of data. In: Proceedings of the ACM conference on principles and practice of parallel programming, San Diego
11. Deitz SJ, Chamberlain BL, Snyder L () Eliminating redundancies in sum-of-product array computations. In: Proceedings of the ACM international conference on supercomputing, Sorrento
12. Lin C () The portability of parallel programs across MIMD computers. Ph.D. thesis, University of Washington, Department of Computer Science and Engineering
13. Snyder L () Experimental validation of models of parallel computation. In: Hofmann A, van Leeuwen J (eds) Computer science today. Lecture notes in computer science, vol . Springer, Berlin, pp –
14. Snyder L () A programmer’s guide to ZPL. MIT Press, Cambridge
15. Snyder L () The design and development of ZPL. In: HOPL III: Proceedings of the third ACM SIGPLAN conference on history of programming languages, San Diego. ACM, New York, pp –
List of Entries Ab Initio Molecular Dynamics Access Anomaly Actors Affinity Scheduling Ajtai–Komlós–Szemerédi Sorting Network AKS Network AKS Sorting Network Algebraic Multigrid Algorithm Engineering Algorithmic Skeletons All Prefix Sums Allen and Kennedy Algorithm Allgather All-to-All All-to-All Broadcast Altivec AMD Opteron Processor Barcelona Amdahl’s Argument Amdahl’s Law AMG Analytics, Massive-Scale Anomaly Detection Anton, A Special-Purpose Molecular Simulation Machine Application-Specific Integrated Circuits Applications and Parallelism Architecture Independence Area-Universal Networks Array Languages Array Languages, Compiler Techniques for Asynchronous Iterations Asynchronous Iterative Algorithms Asynchronous Iterative Computations ATLAS (Automatically Tuned Linear Algebra Software) Atomic Operations Automated Empirical Optimization Automated Empirical Tuning Automated Performance Tuning Automated Tuning
Automatically Tuned Linear Algebra Software (ATLAS) Autotuning Backpressure Bandwidth-Latency Models (BSP, Logp) Banerjee’s Dependence Test Barnes-Hut Barriers Basic Linear Algebra Subprograms (BLAS) Behavioral Equivalences Behavioral Relations Benchmarks Beowulf Clusters Beowulf-Class Clusters Bernstein’s Conditions Bioinformatics Bisimilarity Bisimulation Bisimulation Equivalence Bitonic Sort Bitonic Sorting Network Bitonic Sorting, Adaptive BLAS (Basic Linear Algebra Subprograms) Blocking Blue CHiP Blue CHiP Project Blue Gene/L Blue Gene/P Blue Gene/Q Branch Predictors Brent’s Law Brent’s Theorem Broadcast BSP BSP (Bulk Synchronous Parallelism) Bulk Synchronous Parallelism (BSP) Bus: Shared Channel Buses and Crossbars Butterfly C* Cache Affinity Scheduling
Cache Coherence Cache-Only Memory Architecture (COMA) Caches, NUMA Calculus of Mobile Processes Carbon Cycle Research Car-Parrinello Method CDC Cedar Multiprocessor CELL Cell Broadband Engine Processor Cell Processor Cell/B.E. Cellular Automata Chaco Chapel (Cray Inc. HPCS Language) Charm++ Checkpoint/Restart Checkpointing Checkpoint-Recovery CHiP Architecture CHiP Computer Cholesky Factorization Cilk Cilk Plus Cilk++ Cilk- Cilk- Cilkscreen Cluster File Systems Cluster of Workstations Clusters CM Fortran Cm* - The First Non-Uniform Memory Access Architecture CML CM-Lisp CnC Coarray Fortran Code Generation Collect Collective Communication Collective Communication, Network Support For COMA (Cache-Only Memory Architecture) Combinatorial Search Commodity Clusters Communicating Sequential Processes (CSP) Community Atmosphere Model (CAM)
Community Climate Model (CCM) Community Climate System Model Community Climate System Model (CCSM) Community Earth System Model (CESM) Community Ice Code (CICE) Community Land Model (CLM) Compiler Optimizations for Array Languages Compilers Complete Exchange Complex Event Processing Computational Biology Computational Chemistry Computational Models Computational Sciences Computer Graphics Computing Surface Concatenation Concurrency Control Concurrent Collections Programming Model Concurrent Logic Languages Concurrent ML Concurrent Prolog Configurable, Highly Parallel Computer Congestion Control Congestion Management Connected Components Algorithm Connection Machine Connection Machine Fortran Connection Machine Lisp Consistent Hashing Control Data Coordination Copy Core-Duo / Core-Quad Processors COW Crash Simulation Cray MTA Cray Red Storm Cray SeaStar Interconnect CRAY TE Cray Vector Computers Cray XMT Cray XT Series Cray XT Cray XT and Cray XT Series of Supercomputers Cray XT Cray XT and Seastar -D Torus Interconnect
Cray XT Cray XT Critical Race Critical Sections Crossbar CS- CSP (Communicating Sequential Processes) Cyclops Cyclops- Cydra DAG Scheduling Data Analytics Data Centers Data Distribution Data Flow Computer Architecture Data Flow Graphs Data Mining Data Race Detection Data Starvation Crisis Dataflow Supercomputer Data-Parallel Execution Extensions Deadlock Detection Deadlocks Debugging DEC Alpha Decentralization Decomposition Deep Analytics Denelcor HEP Dense Linear System Solvers Dependence Abstractions Dependence Accuracy Dependence Analysis Dependence Approximation Dependence Cone Dependence Direction Vector Dependence Level Dependence Polyhedron Dependences Detection of DOALL Loops Determinacy Determinacy Race Determinism Deterministic Parallel Java Direct Schemes Distributed Computer Distributed Hash Table (DHT)
Distributed Logic Languages Distributed Memory Computers Distributed Process Management Distributed Switched Networks Distributed-Memory Multiprocessor Ditonic Sorting DLPAR Doall Loops Domain Decomposition DPJ DR Dynamic Logical Partitioning for POWER Systems Dynamic LPAR Dynamic Reconfiguration Earth Simulator Eden Eigenvalue and Singular-Value Problems EPIC Processors Erlangen General Purpose Array (EGPA) ES Ethernet Event Stream Processing Eventual Values Exact Dependence Exaflop Computing Exaop Computing Exascale Computing Execution Ordering Experimental Parallel Algorithmics Extensional Equivalences Fast Fourier Transform (FFT) Fast Multipole Method (FMM) Fast Poisson Solvers Fat Tree Fault Tolerance Fences FFT (Fast Fourier Transform) Fast Algorithm for the Discrete Fourier Transform (DFT) FFTW File Systems Fill-Reducing Orderings First-Principles Molecular Dynamics Fixed-Size Speedup FLAME Floating Point Systems FPS-B and Derivatives Flow Control
Flynn’s Taxonomy Forall Loops FORGE Formal Methods–Based Tools for Race, Deadlock, and Other Errors Fortran and Its Successors Fortran, Connection Machine Fortress (Sun HPCS Language) Forwarding Fujitsu Vector Computers Fujitsu Vector Processors Fujitsu VPP Systems Functional Decomposition Functional Languages Futures GA Gather Gather-to-All Gaussian Elimination GCD Test Gene Networks Reconstruction Gene Networks Reverse-Engineering Generalized Meshes and Tori Genome Assembly Genome Sequencing GIO Glasgow Parallel Haskell (GpH) Global Arrays Global Arrays Parallel Programming Toolkit Gossiping GpH (Glasgow Parallel Haskell) GRAPE Graph Algorithms Graph Analysis Software Graph Partitioning Graph Partitioning Software Graphics Processing Unit Green Flash: Climate Machine (LBNL) Grid Partitioning Gridlock Group Communication Gustafson’s Law Gustafson–Barsis Law Half Vector Length Hang Harmful Shared-Memory Access Haskell
Hazard (in Hardware) HDF HEP, Denelcor Heterogeneous Element Processor Hierarchical Data Format High Performance Fortran (HPF) High-Level I/O Library High-Performance I/O Homology to Sequence Alignment, From Horizon HPC Challenge Benchmark HPF (High Performance Fortran) HPS Microarchitecture HT HT. Hybrid Programming With SIMPLE Hypercube Hypercubes and Meshes Hypergraph Partitioning Hyperplane Partitioning HyperTransport IBM Blue Gene Supercomputer IBM Power IBM Power Architecture IBM PowerPC IBM RS/ SP IBM SP IBM SP IBM SP IBM SP IBM System/ Model IEEE . Illegal Memory Access Illiac IV ILUPACK Impass Implementations of Shared Memory in Software Index InfiniBand Instant Replay Instruction-Level Parallelism Instruction Systolic Arrays Intel Celeron Intel Core Microarchitecture, x Processor Family Intel® Parallel Inspector Intel® Parallel Studio Intel® Thread Profiler
Intel® Threading Building Blocks (TBB) Interactive Parallelization Interconnection Network Interconnection Networks Internet Data Centers Inter-Process Communication I/O iPSC Isoefficiency JANUS FPGA-Based Machine Java JavaParty Job Scheduling k-ary n-cube k-ary n-fly k-ary n-tree Knowledge Discovery KSR LANai Languages LAPACK Large-Scale Analytics Latency Hiding Law of Diminishing Returns Laws Layout, Array LBNL Climate Computer libflame Libraries, Numerical Linda Linear Algebra Software Linear Algebra, Numerical Linear Equations Solvers Linear Least Squares and Orthogonal Factorization Linear Regression LINPACK Benchmark Linux Clusters *Lisp Lisp, Connection Machine Little’s Law Little’s Lemma Little’s Principle Little’s Result Little’s Theorem Livermore Loops Load Balancing Load Balancing, Distributed Memory
Locality of Reference and Parallel Processing Lock-Free Algorithms Locks Logarithmic-Depth Sorting Network Logic Languages LogP Bandwidth-Latency Model Loop Blocking Loop Nest Parallelization Loop Tiling Loops, Parallel LU Factorization Manycore MapReduce MasPar Massively Parallel Processor (MPP) Massive-Scale Analytics Matrix Computations Maude Media Extensions Meiko Memory Consistency Models Memory Models Memory Ordering Memory Wall MEMSY Mesh Mesh Partitioning Message Passing Message Passing Interface (MPI) Message-Passing Performance Models METIS and ParMETIS Metrics Microprocessors MILC MIMD (Multiple Instruction, Multiple Data) Machines MIMD Lattice Computation MIN ML MMX Model Coupling Toolkit (MCT) Models for Algorithm Design and Analysis Models of Computation, Theoretical Modulo Scheduling and Loop Pipelining Molecular Evolution Monitors Monitors, Axiomatic Verification of Moore’s Law
MPI (Message Passing Interface) MPI- I/O MPI-IO MPP Mul-T Multicomputers Multicore Networks Multiflow Computer Multifrontal Method Multi-Level Transactions Multilisp Multimedia Extensions Multiple-Instruction Issue Multiprocessor Networks Multiprocessor Synchronization Multiprocessors Multiprocessors, Symmetric MultiScheme Multistage Interconnection Networks Multi-Streamed Processors Multi-Threaded Processors Mumps MUMPS Mutual Exclusion Myri-G Myricom Myrinet NAMD (NAnoscale Molecular Dynamics) NAnoscale Molecular Dynamics (NAMD) NAS Parallel Benchmarks N-Body Computational Methods nCUBE NEC SX Series Vector Computers NESL Nested Loops Scheduling Nested Spheres of Control NetCDF I/O Library, Parallel Network Adapter Network Architecture Network Interfaces Network Obliviousness Network of Workstations Network offload Networks, Direct Networks, Fault-Tolerant Networks, Multistage NI (Network Interface)
NIC (Network Interface Controller or Network Interface Card) Node Allocation Non-Blocking Algorithms Nondeterminator Nonuniform Memory Access (NUMA) Machines NOW NUMA Caches Numerical Algorithms Numerical Libraries Numerical Linear Algebra NVIDIA GPU NWChem Omega Calculator Omega Library Omega Project Omega Test One-to-All Broadcast Open Distributed Systems OpenMP OpenMP Profiling with OmpP OpenSHMEM - Toward a Unified RMA Model Operating System Strategies Optimistic Loop Parallelization Orthogonal Factorization OS Jitter OS, Light-Weight Out-of-Order Execution Processors Overdetermined Systems Overlay Network Owicki-Gries Method of Axiomatic Verification Parafrase Parallel Communication Models Parallel Computing Parallel I/O Library (PIO) Parallel Ocean Program (POP) Parallel Operating System Parallel Prefix Algorithms Parallel Prefix Sums Parallel Random Access Machines (PRAM) Parallel Skeletons Parallel Tools Platform Parallelism Detection in Nested Loops, Optimal Parallelization Parallelization, Automatic Parallelization, Basic Block Parallelization, Loop Nest
ParaMETIS PARDISO PARSEC Benchmarks Partial Computation Particle Dynamics Particle Methods Partitioned Global Address Space (PGAS) Languages PASM Parallel Processing System Path Expressions PaToH (Partitioning Tool for Hypergraphs) Partitioning Tool for Hypergraphs (PaToH) PC Clusters PCI Express PCIe PCI-E PCI-Express Peer-to-Peer Pentium PERCS System Architecture Perfect Benchmarks Performance Analysis Tools Performance Measurement Performance Metrics Periscope Personalized All-to-All Exchange Petaflop Barrier Petascale Computer Petri Nets PETSc (Portable, Extensible Toolkit for Scientific Computation) PGAS (Partitioned Global Address Space) Languages Phylogenetic Inference Phylogenetics Pi-Calculus Pipelining Place-Transition Nets PLAPACK PLASMA PMPI Tools Pnetcdf Point-to-Point Switch Polaris Polyhedra Scanning Polyhedron Model Polytope Model Position Tree POSIX Threads (Pthreads)
Power Wall PRAM (Parallel Random Access Machines) Preconditioners for Sparse Iterative Methods Prefix Prefix Reduction Problem Architectures Process Algebras Process Calculi Process Description Languages Process Synchronization Processes, Tasks, and Threads Processor Allocation Processor Arrays Processors-in-Memory Profiling Profiling with OmpP, OpenMP Program Graphs Programmable Interconnect Computer Programming Languages Programming Models Prolog Prolog Machines Promises Protein Docking Pthreads (POSIX Threads) PVM (Parallel Virtual Machine) QCD apeNEXT Machines QCD (Quantum Chromodynamics) Computations QCD Machines QCDSP and QCDOC Computers QsNet Quadrics Quantum Chemistry Quantum Chromodynamics (QCD) Computations Quicksort Race Race Conditions Race Detection Techniques Race Detectors for Cilk and Cilk++ Programs Race Hazard Radix Sort Rapid Elliptic Solvers Reconfigurable Computer Reconfigurable Computers Reconstruction of Evolutionary Trees Reduce and Scan Relaxed Memory Consistency Models
List of Entries
Reliable Networks Rendezvous Reordering Resource Affinity Scheduling Resource Management for Parallel Computers Rewriting Logic Ring Roadrunner Project, Los Alamos Router Architecture Router-Based Networks Routing (Including Deadlock Avoidance) R-Stream Compiler Run Time Parallelization Runtime System Scalability Scalable Coherent Interface (SCI) ScaLAPACK Scalasca Scaled Speedup Scan for Distributed Memory, Message-Passing Systems Scan, Reduce and Scatter Scheduling Scheduling Algorithms SCI (Scalable Coherent Interface) Semantic Independence Semaphores Sequential Consistency Server Farm Shared Interconnect Shared Virtual Memory Shared-Medium Network Shared-Memory Multiprocessors SHMEM SIGMA- SIMD (Single Instruction, Multiple Data) Machines SIMD Extensions SIMD ISA Single System Image Singular-Value Decomposition (SVD) Sisal Small-World Network Analysis and Partitioning (SNAP) Framework SNAP (Small-World Network Analysis and Partitioning) Framework
SoC (System on Chip) Social Networks Software Autotuning Software Distributed Shared Memory Sorting Space-Filling Curves SPAI (SParse Approximate Inverse) Spanning Tree, Minimum Weight Sparse Approximate Inverse Matrix Sparse Direct Methods Sparse Gaussian Elimination Sparse Iterative Methods, Preconditioners for SPEC Benchmarks SPEC HPC SPEC HPC SPEC MPI SPEC OMP Special-Purpose Machines Speculation Speculation, Thread-Level Speculative Multithreading (SM) Speculative Parallelization Speculative Parallelization of Loops Speculative Run-Time Parallelization Speculative Threading Speculative Thread-Level Parallelization Speedup SPIKE Spiral SPMD Computational Model SSE Stalemate State Space Search Stream Processing Stream Programming Languages Strong Scaling Suffix Trees Superlinear Speedup SuperLU Supernode Partitioning Superscalar Processors SWARM: A Parallel Programming Framework for Multicore Processors Switch Architecture Switched-Medium Network Switching Techniques Symmetric Multiprocessors
Synchronization System Integration System on Chip (SoC) Systems Biology, Network Inference in Systolic Architecture Systolic Arrays Task Graph Scheduling Task Mapping, Topology Aware Tasks TAU TAU Performance System TBB (Intel Threading Building Blocks) Tensilica Tera MTA Terrestrial Ecosystem Carbon Modeling Terrestrial Ecosystem Modeling The High Performance Substrate Theory of Mazurkiewicz-Traces Thick Ethernet Thin Ethernet Thread Level Speculation (TLS) Parallelization Thread-Level Data Speculation (TLDS) Thread-Level Speculation Threads Tiling Titanium TLS TOP Topology Aware Task Mapping Torus Total Exchange Trace Scheduling Trace Theory Tracing Transactional Memories
Transactions, Nested Transpose TStreams Tuning and Analysis Utilities Ultracomputer, NYU Uncertainty Quantification Underdetermined Systems Unified Parallel C Unimodular Transformations Universal VLSI Circuits Universality in VLSI Computation UPC Use-Def Chains Vampir Vampir Vampir NG VampirServer VampirTrace Vector Extensions, Instruction-Set Architecture (ISA) Vectorization Verification of Parallel Shared-Memory Programs, Owicki-Gries Method of Axiomatic View from Berkeley Virtual Shared Memory VLIW Processors VLSI Algorithmics VLSI Computation VNG Warp and iWarp Wavefront Arrays Weak Scaling Whole Program Analysis Work-Depth Model Workflow Scheduling Z-Level Programming Language ZPL